Abstract

Background

   microRNAs (miRNAs) are short regulatory RNAs that are involved in
   several diseases, including cancers. Identifying miRNA functions is
   very important in understanding disease mechanisms and determining the
   efficacy of drugs. An increasing number of computational methods have
   been developed to explore miRNA functions by inferring the miRNA-mRNA
   regulatory relationships from data. Each of the methods is developed
   based on some assumptions and constraints, for instance, assuming
   linear relationships between variables. For such reasons, computational
   methods are often subject to the problem of inconsistent performance
   across different datasets. On the other hand, ensemble methods
   integrate the results from individual methods and have been proved to
   outperform each of their individual component methods in theory.

Results

   In this paper, we investigate the performance of some ensemble methods
   over the commonly used miRNA target prediction methods. We apply eight
   different popular miRNA target prediction methods to three cancer
   datasets, and compare their performance with the ensemble methods which
   integrate the results from each combination of the individual methods.
   The validation results using experimentally confirmed databases show
   that the results of the ensemble methods complement those obtained by
   the individual methods and the ensemble methods perform better than the
   individual methods across different datasets. The ensemble method,
   Pearson+IDA+Lasso, which combines methods in different approaches,
   including a correlation method, a causal inference method, and a
   regression method, is the best performed ensemble method in this study.
   Further analysis of the results of this ensemble method shows that the
   ensemble method can obtain more targets which could not be found by any
   of the single methods, and the discovered targets are more
   statistically significant and functionally enriched. The source codes,
   datasets, miRNA target predictions by all methods, and the ground truth
   for validation are available in the Supplementary materials.

Introduction

   miRNAs are important endogenous non-coding RNAs that regulate gene
   expression by promoting mRNA degradation and repressing translation
   [[29]1]. They regulate target mRNAs post-transcriptionally by base
   pairing to complementary sequences in the 3’-untranslated region
   (3’UTR) of the mRNAs [[30]1–[31]4]. miRNAs have also been observed to
   target genes through sites in the 5’UTR and sometimes in the open
   reading frames (ORFs) [[32]5]. They are involved in several
   cancer-related processes such as proliferation [[33]6, [34]7],
   metabolism [[35]8], differentiation [[36]9], development [[37]10],
   apoptosis [[38]11], cellular signaling [[39]12], and even cancer
   development and progression [[40]1, [41]13]. Identifying miRNA
   functions helps elucidate the complex mechanisms of fatal diseases and
   assists with the design of drugs for treatments.

   Several methods have been proposed to identify miRNA targets using
   sequence data [[42]14–[43]16], which are based on sequence
   complementarity and/or structural stability of the putative duplex.
   Although these methods can produce a list of potential target genes,
   the results may contain a high-rate of false positives and false
   negatives [[44]17]. Moreover, the results predicted using different
   methods are often inconsistent [[45]17]. A major reason for this is
   that the miRNA binding sites are small, thus they do not support
   statistically significant predictions, and a small difference in the
   algorithms can lead to a great diversity in the results [[46]17,
   [47]18]. In addition, the predictions are static, and the predicted
   targets of a certain miRNA might not be expressed at all in a specific
   condition [[48]19].

   In recent years, various computational methods have been devised to
   incorporate expression profiles into the study of miRNA-mRNA regulatory
   relationships. The principle of these methods is that if a gene is
   regulated by a miRNA, a correlation should show between the expression
   levels of the gene and the miRNA. The first stream of the research saw
   various methods developed to identify co-expressed miRNAs and mRNAs.
   Some highlights of those methods are correlation analysis [[49]20,
   [50]21], regression models [[51]22, [52]23], population-based
   probabilistic learning model [[53]24], and rule based methods [[54]25].
   In the second stream, attempts were made to predict the regulatory
   networks of miRNAs and mRNAs for specific biological processes, using
   methods such as Bayesian network learning [[55]26–[56]28] and causal
   inference techniques [[57]29–[58]31].

   It has been shown that computational methods may perform inconsistently
   in inferring gene regulatory networks with different datasets [[59]32].
   Each method usually has a set of assumptions on data when building the
   model. These assumptions may be suitable for some datasets but may be
   violated in the others. For example, correlation based methods assume
   linear relationships between variables. In reality, however, many
   relationships are not linear, or even cannot be approximated using
   linear relationships. Therefore, these methods may not perform well
   when the underlying relationships are not linear. As shown in the
   Results section, our experiments on different matched miRNA and mRNA
   expression datasets confirm the inconsistent performance of the target
   prediction methods. Therefore, it is important to design more reliable
   methods for miRNA target prediction.

   On the other hand, ensemble methods which integrate different methods
   have been shown to perform better than all the individual component
   methods [[60]32]. Different miRNA target prediction methods have their
   own pros and cons, and their results may complement each other.
   Therefore, combining the predictions of multiple methods would improve
   the prediction power.

   In this paper, we firstly compare the performance of eight popular
   miRNA target prediction methods with three different cancer gene
   expression datasets. We then construct the ensemble methods such that
   each of the ensembles integrates a different combination of the top
   five individual methods, and investigate their performance against the
   top five individual performers. The experimental results show that the
   top performed ensemble methods are better than all individual methods.
   Especially, the method that combines a correlation method (Pearson), a
   causal inference method (IDA), and a regression method (Lasso) is the
   best method in the study. Importantly, our experimental results show
   that different individual methods infer complementary results,
   suggesting their own merits, and the top ensemble methods outperform
   the individual methods by including those complementary results.
   Although there is no complete ground truth of miRNA target prediction,
   the results show that ensemble methods can find some confirmed
   interactions in the incomplete ground truth that existing individual
   methods fail to discover. Therefore, using ensemble methods, we would
   receive more reliable results than existing individual methods.

   Related to our work, there have been several methods [[61]33–[62]35] of
   integrating the results of multiple sequence-based prediction methods.
   For examples, Zhang and Verbeek [[63]33] presented a comparison study
   of integrating three sequence-based prediction methods, miRanda
   [[64]14], TargetScanS [[65]15], and RNAhybrid [[66]36], using different
   integrating approaches, including the “low level” approaches which deal
   with raw data features directly and the “high level” approaches which
   combine predictions from different methods. The “high level” approaches
   in this study, e.g. cross entropy aggregation [[67]37], share the same
   purpose with our study and can be adapted for integrating predicted
   results from expression-based methods. Our experiment results, however,
   show that the ensemble methods that are based on Borda count election
   are better than the cross entropy approach for the datasets in this
   study. Shirdel et al. [[68]35] collated commonly used sequence-based
   prediction databases and created a database called mirDIP
   ([69]http://ophid.utoronto.ca/mirDIP/). The database allows users to
   explore miRNA targets using one or the combination of multiple
   prediction databases. They also used mirDIP to explore miRNA-target
   interactions and the signaling pathway networks. This is a good
   resource for miRNA research, but the aim is different from that of our
   study as we explore the prediction power of the methods using gene
   expression data. Moreover, our approach not only seeks to outperform
   the individual methods, but also aims to discover the interactions that
   individual methods may fail to identify. In another direction, Pio et
   al. [[70]34] proposed a semi-supervised method that was based on
   support vector machine to predict the miRNA targets. They used an
   experimentally confirmed database, miRTarBase [[71]38], together with
   the results of ten sequence-based prediction methods in mirDIP as the
   training data, and used another experimentally confirmed database,
   Tarbase [[72]39], as the ground truth to evaluate the model. This is
   different from our approach since we integrate unsupervised methods to
   form a more accurate model. In our work, we utilise both miRTarBase and
   Tarbase for validation due to the sparseness in the number of confirmed
   interactions, and expression data are used for training the individual
   methods.

   The datasets we work on are the matched miRNA and mRNA expression
   profiles in the NCI-60 cancer cell lines datasets (EMT), the
   multi-class cancer (MCC) datasets, and the 51 human breast cancer cell
   lines datasets (BR51). We use the experimentally confirmed miRNA target
   databases to validate the effectiveness of the methods. The results
   indicate that integrating different miRNA target prediction methods
   would generate reliable prediction results. We report miRNA target
   predictions and validated targets for all the individual and the
   ensemble methods discussed in the paper. We provide the R scripts for
   constructing ensemble methods from individual methods which can be used
   to integrate other miRNA target prediction methods.

Materials and Methods

Datasets

NCI-60 data for Epithelial to Mesenchymal Transition (EMT)

   EMT is part of the process of tissue remodelling during embryonic
   development and wound healing [[73]40], and during cancer progression
   [[74]41]. The NCI-60 data were profiled from the 60 cancer cell lines
   of the drug screening panel of human cancer cell lines at the National
   Cancer Institute. The miRNA expression profiles are from Søkilde et al.
   [[75]42]. They are available at
   [76]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26375. The
   mRNA expression profiles for NCI-60 were obtained from ArrayExpress
   [77]http://www.ebi.ac.uk/arrayexpress, accession number E-GEOD-5720. We
   use the work in [[78]43] to categorise the samples into epithelial (11
   samples) and mesenchymal (36 samples).

   We perform differential gene expression analysis on the gene expression
   profiles to identify differentially expressed miRNAs and mRNAs between
   the epithelial and mesenchymal samples. In this work, the limma package
   [[79]44] of Bioconductor is used for the analysis. As a result of the
   analysis, 35 probes of miRNAs and 1154 probes of mRNAs are identified
   to be differentially expressed at significant level (adjusted p-value <
   0.05, adjusted by Benjamini–Hochberg (BH) method). The detailed result
   can be found in [80]S1 File.

Multi-class cancer (MCC) data

   The MCC dataset includes normal and cancerous samples from bladder,
   breast, colon, kidney, lung, pancreas, prostate and uterus. As for the
   EMT data, we use the differential gene expression analysis to find the
   differentially expressed miRNAs and mRNAs between normal and cancerous
   samples. The miRNA expression profiles are from [[81]45], and 108
   probes are found to display differential expression at significant
   level (p-value < 0.05, adjusted by BH method). The mRNA data are from
   [[82]46], with 1860 mRNA probes being identified to display significant
   differential expression. The detailed result can be found in [83]S1
   File.

Data of the 51 human breast cancer cell lines (BR51)

   Breast cancer is a complex disease which involves several different
   sub-types. In this work, we use the BR51 data from [[84]47]. The miRNA
   data were downloaded from [[85]47] and the mRNA gene expression data
   are available at:
   [86]http://www.ncbi.nlm.nih.govgeoqueryacc.cgiacc=GSE41313.

   We use the samples categorised results from [[87]47] to divide the
   samples into luminal and basal groups. There are 27 samples in the
   luminal group and 23 samples in the basal group. In these datasets, the
   changes in expression levels of the mRNAs in the two groups are very
   significant, while the changes in the case of miRNAs are small. In
   total, 92 miRNAs (adjusted p-value < 0.2) and 1500 mRNAs (adjusted
   p-value < 0.0001) are identified to be differentially expressed. To
   cover the important miRNAs mentioned in the analysis of [[88]47] and to
   have a manageable number of mRNAs for the computational methods, we
   choose different thresholds of adjusted p-values in the differential
   gene expression analysis. Details of the datasets are available in
   [89]S1 File.

Methods of miRNA target prediction

   In this section, we introduce the commonly used methods for miRNA
   target prediction using gene expression data.

   Pearson’s correlation coefficient is the commonly used measure for the
   strength of the association between a pair of variables. For the miRNA
   target prediction problem, the correlations in the expression levels
   between pairs of miRNA and mRNA are calculated. The miRNA-mRNA pairs
   are ranked based on the correlation coefficient values. As miRNAs
   mainly down-regulate mRNAs, the negative correlations are ranked at the
   top (according to their absolute values).

   However, Pearson correlation is designed to capture only linear
   associations, and its usefulness is greatly reduced when the
   associations are non-linear [[90]48]. To deal with this problem, the
   maximal information coefficient (MIC) [[91]49] has been proposed to
   measure a wide range of associations, including linear and non-linear
   relationships, with comparable scores. Therefore, both Pearson and MIC
   are selected in our comparison study.

   Meanwhile, Lasso and Elastic-net are high-dimensional regression
   techniques which can be used to infer the associations between
   variables. In our problem, the expression level of each mRNA is
   presented as a linear expression of the expression levels of all
   miRNAs, and the coefficient of a miRNA in the regression model is used
   as the association strength between the miRNA and the mRNA. For each
   mRNA, we can perform Lasso and Elastic-net regression on all the
   miRNAs, using the R package glmnet [[92]50]. Similar to the Pearson
   correlation coefficient method, the miRNA-mRNA pairs that have negative
   effects are ranked at the top of the ranking list to favour the down
   regulation.

   In another direction, Maathuis et al. [[93]51, [94]52] proposed a
   causal inference method, called IDA, which estimates the causal effect
   that a variable has on the other. The estimated causal effects simulate
   the effects of randomised controlled experiments. Recently, Le et al.
   [[95]29] applied this method to gene expression datasets to infer the
   causal (regulatory) relationships between miRNAs and mRNAs. The
   discovered miRNA-mRNA causal regulatory relationships were found to
   have a large portion of overlap with the results of the follow-up gene
   knockdown experiments. Therefore, we include this method in our study.

   Z-score [[96]53] is a network inference method to estimate the effects
   of gene knockout experiment. When we knockout a miRNA, Z-score
   describes the normalised deviation of the expression level of each gene
   from the average expression of all genes. However, in our study, we use
   observational data, i.e. gene expression data, as the input dataset.
   Therefore, the Z-score method is modified where we assume that the
   minimum expression value of the miRNA in a profile indicates the
   knockout experiment. In other words, we firstly locate the sample that
   contain the minimum value of the miRNA, then we use the expression
   value of the mRNA in the sample to calculate Z-score.

   ProMISe [[97]54] assumes that there is a competition between miRNAs to
   be attracted by a mRNA and between mRNAs to attract a miRNA for
   interactions. The method estimates the probability of a mRNA to be a
   target of a miRNA by taking the competition between mRNAs and between
   miRNAs into account. The tool was downloaded from Bioconductor at:
   [98]www.bioconductor.org/packages/release/bioc/html/Roleswitch.html.

   GenMiR++ [[99]26] is a well-cited and commonly used method for miRNA
   target prediction. GenMiR++ is a Bayesian network parameter learning
   method to infer the probabilities of targets of being true by
   approximating the posterior with variational inference techniques. In
   the model, the changes in mRNA expression level are encoded by the
   summation of negatively weighted changes of the expression level of
   their regulators miRNAs. The Matlab code of GenMiR++ was obtained from
   [100]http://www.psi.toronto.edu/genmir/.

Ensemble methods for miRNA target prediction

   Ensemble methods are designed to take the advantages of the individual
   methods, and to compensate for their drawbacks. Specifically,
   individual methods rank the miRNA-mRNA interactions based on some
   criteria, e.g the strength of the correlation coefficients, and
   ensemble methods integrate the rankings from different methods to form
   a new ranking. There are different strategies for integrating different
   rankings, such as taking the intersection, the union, or the average of
   the top k results predicted by individual methods. Marbach et al.
   [[101]32] have proved that the Borda count election method is a simple
   yet effective method for integrating the results from different
   individual methods. Borda count election is presented by Jean-Charles
   de Borda as a method to select candidates in a democratic election by
   choosing the candidate with the best average rank. In this paper, we
   apply the Borda count election method for forming ensemble methods.
   Please note that using Borda average rank of multiple methods does not
   result in a method which averages the performance of the individual
   methods. In fact, the aim is to design a method that outperforms the
   individual methods, and the method is inspired by the following theorem
   which was proved in [[102]32].

   Theorem 1 [[103]32]: Let M[i], i = 1…n, be the prediction methods. If
   each M[i], 1 ≤ i ≤ n, has better than random predictive power, then the
   community (ensemble) method of applying Borda count election method can
   outperform the best individual method.

   We summarise the procedure of the ensemble methods in the following,
   and the R source code for the ensemble methods which use Borda count
   election is available in [104]S2 File.

   Step 1: With a dataset, identify differentially expressed miRNAs and
   mRNAs between different biological conditions (this step is common for
   all methods in the comparison).

   Step 2: For each differentially expressed miRNA, we use each of the
   individual methods (Pearson, IDA, MIC, Lasso, Elastic, Z-score,
   ProMISe, and GenMiR++) respectively to produce the rankings of the
   differentially expressed mRNAs (as the predicted targets of the miRNA).
   Then, we select top five performers in identifying miRNA targets based
   on the validation results using experimentally confirmed targets.

   Step 3: Apply Borda rank election to the rankings from Step 2 for each
   miRNA to produce a single ranking list of elected mRNAs with respect to
   the miRNA. We then extract the top-ranked genes from the list as the
   final output, i.e. the potential target genes for the given miRNA.

   We have implemented the Borda count election method for integrating all
   combinations of the selected top individual miRNA target prediction
   methods. In our implementation, we do not use the whole ranking list
   produced by each method to produce the ranking for the ensemble
   methods, but only use the top k (users defined) predicted targets of a
   miRNA produced by individual methods. The ensemble methods are designed
   to appreciate the genes that are predicted within the top k predicted
   targets across each individual component method. Genes that are not in
   the top k predicted targets of a method is assigned the maximal ranking
   for that method, i.e. the maximum number of genes in the dataset.

Ground truth for validation

   It is difficult to validate computation results, as the number of
   experimentally confirmed targets of miRNAs are still limited and there
   is no complete ground-truth for evaluating and comparing different
   computational methods [[105]55]. In this paper, we use the union of
   four popular databases, Tarbase v6.0 [[106]39], miRecords v2013
   [[107]56], miRWalk v2.0 [[108]57], and miRTarBase v4.5 [[109]38] to
   validate the predictions of the methods for all miRNAs in the EMT, MCC
   and BR51 datasets. Tarbase, miRecords, and miRTarBase include verified
   interactions that are manually curated from the literature, and miRWalk
   contains both the predicted and the experimentally validated
   interactions. Respectively for Tarbase, miRecords, miRWalk, and
   miRTarBase, we have 20095 interactions with 228 miRNAs, 21590
   interactions with 195 miRNAs, 1710 interactions with 226 miRNAs, and
   37372 interactions with 576 miRNAs. [110]Table 1 shows the number of
   overlapping interactions between each pair of the databases. After
   removing the duplicates, we have in total 62858 unique interactions to
   be used in the validation (details in [111]S3 File). We refer to those
   interactions as “confirmed interactions” in this paper.

Table 1. Number of overlapping interactions between databases.

              Tarbase miRecords miRWalk miRTarBase
   Tarbase    20095   593       646     15100
   miRecords          1710      537     1147
   miRWalk                      21590   1482
   miRTarBase                           37372
   [112]Open in a new tab

Results

Individual methods perform inconsistently across datasets

   In this section, we compare the performance of eight individual methods
   on the three datasets. For each miRNA, we extract the top 100, 200,
   300, and 400 target genes ranked by each method for validation. We only
   keep in the validations the miRNAs that have at least one confirmed
   target predicted by all the methods. We then compare the performance of
   the methods for each miRNA based on the number of confirmed targets
   using defined ground truth. As we have eight methods, with respect to
   each miRNA, we score each method using a number (called ranking score)
   in the range of 1 to 8, with 8 indicating the best method and 1 the
   worst method. Finally, we calculate the ranking score of each method
   for a dataset by summing up its scores for all miRNAs. The higher the
   ranking score of a method, the better the method is. [113]Fig 1 shows
   the comparison in terms of their ranking scores in each of the datasets
   ([114]Fig 1(a), 1(b) and 1(c)), and the overall ranking score in all
   three datasets ([115]Fig 1(d)). The overall ranking score is calculated
   by summing up the scores in all miRNAs in all three datasets.

Fig 1. Comparison of eight individual miRNA target prediction methods.

   [116]Fig 1
   [117]Open in a new tab

   From the figure, we can see that there is no superior method
   outperforming the other methods in all datasets. Pearson correlation
   and MIC perform very well in the MCC dataset, but their performance is
   not very good in the other two datasets. Meanwhile, IDA, Lasso, and
   Elastic-net perform consistently in the three datasets, and therefore
   they are the best performers overall as shown in [118]Fig 1(d). ProMISe
   and GenMiR++ perform similarly to the top performers (IDA, Lasso, and
   Elastic-net) in the BR51 dataset, but they are the bottom two methods
   in EMT and MCC. The possible reason for GenemiR++ and ProMISe having
   performed badly is that they are originally designed to also use the
   target information predicted based on sequence data. When they are
   applied to gene expression data alone, their performance is not as good
   as the other methods.

   Overall, IDA is the best method and ProMISe is the worst method in
   terms of finding confirmed miRNA targets from expression data. However,
   note that each method has its own merits as they may discover different
   sets of miRNA targets [[119]55]. In the following section, we
   investigate the performance of the ensemble methods with the hypothesis
   that they may outperform all the individual methods.

Top ensemble methods are better than individual methods

   In this section, we investigate the performance of the ensemble methods
   constructed based on different combinations of the individual miRNA
   target prediction methods. As our aim is to utilise the strength of
   each individual method across different datasets, we remove ProMISe and
   GenMiR++ from the candidates of forming ensemble methods, due to their
   poor performance in all the gene expression datasets. Therefore, we do
   not discuss the results of the ensemble methods that include ProMiSe
   and GenMiR++ in this study. However, we report the comparison results
   of the ensemble method that includes ProMise, GenMiR++ and other
   methods in [120]S4 File. We have also observed that Lasso and
   Elastic-net perform similarly to each other, therefore we remove
   Elastic-net from the combinations for forming the ensemble methods. In
   the end, we have five methods (Pearson, MIC, IDA, Lasso, and Z-score)
   for forming ensemble methods.

   We firstly combine k individual methods (k = 2, 3, 4, 5) using the
   Borda count election method (described in [121]Materials and Methods)
   to create a level k ensemble method. We then select the three best
   performed level k methods (except one method for level five) and
   compare their performance across the three datasets. [122]Table 2 shows
   the performance of the ensemble methods in terms of their ranking
   scores. We can see that all level 3 ensemble methods are better than
   any individual methods and the ensemble methods at other levels.
   Particularly, Pearson+IDA+Lasso is the best method in this comparison
   study.

Table 2. Rankings of top ensemble methods and individual methods.

   Rank Ranking scores Level Method
   1    3783.5         3     Pearson+IDA+Lasso
   2    3667           3     Pearson+IDA+Z-score
   3    3636           3     IDA+MIC+Lasso
   4    3578           4     Pearson+IDA+MIC+Lasso
   5    3530           1     IDA
   6    3497.5         2     IDA+Lasso
   7    3489.5         4     IDA+MIC+Lasso+Z-score
   8    3484.5         2     IDA+MIC
   9    3459           2     Pearson+IDA
   10   3432           4     Pearson+IDA+Lasso+Z-score
   11   3341           1     Lasso
   12   3289           5     Pearson+IDA+MIC+Lasso+Z-score
   13   3218.5         1     Pearson
   14   3165.5         1     MIC
   15   3029           1     Z-score
   [123]Open in a new tab

   We conduct the Wilcoxon signed-rank test based on the ranking scores to
   compare the performance of Pearson+IDA+Lasso and each of the individual
   methods (Pearson, IDA, Lasso). The results show that the ensemble
   method is significantly better than the individual methods (the
   p-values are 0.008, 0.026, and 0.002 for Pearson, IDA, and Lasso,
   respectively), showing that an ensemble method of different approaches
   (correlation, causality and regression in this case) results in the
   better-than-individual method.

   The ensemble method that combines all individual methods (i.e. the
   level 5 ensemble method) performs better than three individual methods
   (Pearson, MIC and Z-score), but not better than IDA, Lasso. However,
   this result does not violate the conclusion in Theorem 1, which stated
   that the ensemble methods outperform the individual methods. The reason
   is that we are using an incomplete ground truth for validating the
   methods, and the rankings in [124]Table 2 indicate the performance of
   the methods only based on the current knowledge about miRNA-mRNA
   interactions. However, as we will see in the next section, an ensemble
   method is able to obtain complementary results, i.e. an ensemble method
   can identify the targets that are discovered by one individual method
   only but not by the other individual methods. On another note, our
   experimental results show that the method combining all individual
   methods may not be better than other ensemble methods. This observation
   is consistent with the results in [[125]32]. In the following section,
   we will show that the ensemble method that combines methods from
   different approaches performs better than other ensemble methods.

   The Borda count applied for top predicted targets (Borda-Topk)
   presented in this paper has the same principle as the Original Borda
   count method [[126]32]. However, Borda-Topk provides a way for
   flexibility in the cut-off targets, i.e. users may only be interested
   in the top 200 targets of a particular miRNA. If k is set to the number
   of mRNAs in the dataset, the two methods are identical. [127]Fig 2
   shows the comparison results of the best ensemble method
   (Pearson+IDA+Lasso) using three different ranking integration
   approaches, including Borda-Topk, Original Borda, and the popular rank
   aggregation approach using cross-entropy Monte Carlo algorithm
   [[128]37] in the R package called RankAggreg [[129]58]. In [130]Fig 2,
   we compare the ranking scores of the three approaches in each dataset
   and in all datasets. The ranking scores are calculated based on the
   number of confirmed interactions in the top 100, 200, 300 and 400
   predicted targets of all miRNAs in each dataset. We can see from
   [131]Fig 2 that, the Borda-Topk is the best approach in the EMT and
   BR51 datasets, but overall Borda-Topk and Original Borda are similar.
   Meanwhile, Cross-Entropy Monte Carlo is the worst approach in all three
   datasets. In the following sections, we discuss the results of the
   Borda-Topk approach.

Fig 2. Pearson+IDA+Lasso with different integration approaches.

   [132]Fig 2
   [133]Open in a new tab

   Borda-Topk and Borda (Original Borda from [[134]32]) are better than
   cross-entropy Monte Carlo in all three datasets.

The ensemble methods find complementary results

   We hypothesise that the ensemble methods are better than individual
   methods, because the ensemble methods can produce results which are
   complementary to the results of different individual methods. To have a
   closer look at this hypothesis, we extract the results predicted by the
   best ensemble method (Pearson+IDA+Lasso) and compare those results with
   the predictions by each individual method (Pearson, IDA and Lasso). The
   complementary characteristic is shown in most of the miRNA cases,
   especially there are many cases in which the results of the ensemble
   method include confirmed interactions that are not all discovered by a
   single individual method and therefore the ensemble method performs
   better than the individual methods. Such cases are shown in 4 miRNAs in
   EMT (miR-1180, miR-141, miR-18a and miR-96), 7 miRNAs in MCC (miR-197,
   miR-19a, miR-23a, miR-30a, miR-32, miR-98 and miR-9), and 6 miRNAs in
   BR51 (miR-125b, miR-196a, miR-21*, miR-27a, miR-30a and miR-342-5p).
   [135]Fig 3 shows the comparison of the methods in terms of the number
   of confirmed miRNA-mRNA interactions in the top 200 targets of the
   miRNAs in each dataset ([136]Fig 3(a), 3(b) and 3(c)), and the overall
   number in all three datasets ([137]Fig 3(d)).

Fig 3. Venn diagram of the number of confirmed miRNA-mRNA interactions for
the best ensemble method (Pearson+IDA+Lasso) and each individual method
(Pearson, IDA and Lasso).

   [138]Fig 3
   [139]Open in a new tab

   For each miRNA, we extract top 200 target genes ranked by each method
   for validation.

   From [140]Fig 3, the ensemble method has the largest number of
   confirmed miRNA-mRNA interactions in each dataset, and include most of
   the confirmed interactions that an individual method can discover. The
   advantage of the ensemble method would enhance the results of miRNA
   target predictions. The details of confirmed miRNA-mRNA interactions
   identified by each method in the three datasets can be found in [141]S4
   File.

Targets identified by Pearson+IDA+Lasso are statistically and functionally
enriched

   In this section, we firstly conduct the statistical analysis of the
   predictions by Pearson+IDA+Lasso based on z-score (standard score). The
   z-score reflects the performance of a prediction method in finding
   confirmed targets compared to the expected rate in the ground truth, it
   is defined as follows:
   [MATH:
   <mrow><mi>z</mi><mn>-</mn><mtext>score</mtext><mo>=</mo><mfrac><mrow><m
   i>x</mi><mo>−</mo><mi>μ</mi></mrow><mrow><mi>σ</mi><mo>/</mo><msqrt><mi
   >n</mi></msqrt></mrow></mfrac><mo>,</mo></mrow> :MATH]

   where x is the probability of finding confirmed interactions by a
   method, i.e. the ratio of number of confirmed interactions and number
   of all possible miRNA-mRNA interactions in a dataset (n); μ is the
   expected probability of confirmed interactions in the ground truth,
   i.e. the ratio of confirmed interactions in the ground truth and all
   possible miRNA-mRNA interactions in the ground truth; and σ is the
   standard deviation of the population (the ground truth), and is
   calculated assuming Bernoulli distribution:
   [MATH: <mrow><mi>σ</mi><mo>=</mo><msqrt><mrow><mi>μ</mi><mo
   stretchy="false">(</mo><mn>1</mn><mo>−</mo><mi>μ</mi><mo
   stretchy="false">)</mo></mrow></msqrt></mrow> :MATH]

   Based on the z-score formula, the higher the z-score is, the more
   significant the prediction results are. As illustrated in [142]Fig 4,
   the enrichment results of the ensemble method in the three datasets are
   high (z-score>10), suggesting that the results of the ensemble method
   are not random and they are significant, and the method is effective in
   recovering confirmed interactions.

Fig 4. Statistical enrichment analysis of the best ensemble method
(Pearson+IDA+Lasso) in EMT, MCC and BR51 datsets.

   [143]Fig 4
   [144]Open in a new tab

   For each miRNA, we extract top 100, 200, 300 and 400 target genes for
   the analysis.

   We also use the GeneCodis [[145]59] (the online tool at
   [146]http://genecodis.cnb.csic.es/analysis) to conduct pathway
   enrichment analysis of target genes with the focus on significant KEGG
   (Kyoto Encyclopedia of Genes and Genomes) [[147]60] pathways (adjusted
   p-value<0.05). We select the genes in the top 1000 miRNA-mRNA
   interactions identified by the method in each dataset for the analysis.
   As shown in [148]Table 3, the top 10 enrichment KEGG pathways in the
   three datasets are all significantly associated with the KEGG pathway:
   Pathways in cancer. Since the three datasets are all cancer datasets,
   the results demonstrate that the predicted targets are closely related
   to the biological conditions of the three datasets. The detailed
   information of significant KEGG pathways in the three datasets can be
   found in [149]S5 File.

Table 3. Top 10 enrichment KEGG pathways in the EMT, MCC and BR51 datasets.

   The p-values have been adjusted by Benjamini-Hochberg (BH) method.
   Datasets Top 10 enrichment KEGG pathways                 Adj-p-value
   EMT      Tight junction                                  3.43E-05
            Arrhythmogenic right ventricular cardiomyopathy 6.56E-05
            Leukocyte transendothelial migration            0.000602
            Cell adhesion molecules (CAMs)                  0.000941
            ECM-receptor interaction                        0.004284
            Pathways in cancer                              0.006845
            Focal adhesion                                  0.009677
            Adherens junction                               0.010311
            Endocytosis                                     0.010872
            Pathogenic Escherichia coli infection           0.020611
   MCC      Vascular smooth muscle contraction              2.18E-05
            PPAR signaling pathway                          2.62E-05
            Prostate cancer                                 7.71E-05
            Focal adhesion                                  0.00192
            Cytokine-cytokine receptor interaction          0.002144
            Oocyte meiosis                                  0.002232
            Alzheimer’s disease                             0.002648
            Adherens junction                               0.002803
            Mineral absorption                              0.002821
            Pathways in cancer                              0.003993
   BR51     Cytokine-cytokine receptor interaction          1.49E-05
            Pathways in cancer                              2.52E-05
            Chemokine signaling pathway                     0.000217
            Focal adhesion                                  0.000234
            NOD-like receptor signaling pathway             0.000267
            Complement and coagulation cascades             0.00043
            Chagas disease (American trypanosomiasis)       0.000741
            Amoebiasis                                      0.000741
            Endocytosis                                     0.000741
            Leukocyte transendothelial migration            0.001382
   [150]Open in a new tab

Highly-confident novel interactions

   As experimentally confirmed databases are still sparse, we report a set
   of highly-confident novel interactions for further experiment
   validations. The novel interactions are those predicted by
   Pearson+IDA+Lasso and included in TargetScan v6.2 [[151]61], but not
   yet confirmed by the experimentally confirmed databases. While the
   ensemble method provides reliable results based on expression data,
   TargetScan v6.2 provides predictions on direct binding information
   using sequence data. Integrating the effects measured by expression
   levels and the sequence binding information would provide
   highly-confident candidates for followed-up experiments. [152]Fig 5
   shows the novel interactions in the three datasets, where red nodes are
   miRNAs and white nodes are novel target genes. Details of the
   highly-confident novel interactions are in [153]S6 File.

Fig 5. Highly-confident novel interactions in the EMT, MCC, and BR51
datasets.

   [154]Fig 5
   [155]Open in a new tab

   Red nodes are miRNAs and white nodes are mRNAs. The interactions are
   predicted by both the ensemble method, Pearson+IDA+Lasso, and
   TargetScan.

Discussion and conclusion

   The huge amount of high-throughput expression data calls for novel
   computational methods for predicting miRNA targets. Recently, a wide
   range of computational methods have been proposed to identify miRNA
   targets from expression data. However, each method has its advantages
   and limitations, and the performance of a method is not consistent
   across different datasets. Thus, how to obtain reliable and
   comprehensive results is a long-standing challenge for miRNA target
   prediction.

   In this paper, we are inspired by the concept of The wisdom of crowds
   in [[156]32] to propose ensemble methods. We firstly collect 8 popular
   miRNA target prediction methods (Pearson, IDA, MIC, Lasso, Elastic,
   Z-score, ProMISe and GenMiR++) to compare the performance of them using
   the real world gene expression datasets. We then develop ensemble
   methods using Borda rank election to combine the rankings from
   different methods. As shown in the Results section, the ensemble
   methods are better or as good as the individual methods. While the
   results reported in [[157]32] showing the outperformance of the
   ensemble method over individual ones for predicting TF-gene
   interactions, here we also show similar results that can be achieved
   for the problem of miRNA target prediction.

   Ensemble methods help with obtaining more comprehensive results than
   individual methods. However, combining more individual methods does not
   guarantee better performance of the ensemble method. For example, the
   level 3 ensemble method (Pearson+IDA+Lasso) performs the best in the
   EMT, MCC and BR51 datasets rather than the level 5 ensemble method
   (Pearson+IDA+MIC+Lasso+Z-score). The experimental results also suggest
   that combining the methods taking different approaches may result in a
   better ensemble method than combining methods in the same category,
   e.g. combining methods using correlation analysis. A possible reason is
   that the complementary behavior of the ensemble methods can be best
   utilised in those cases. The software tool we provided in this paper
   includes source codes of eight prediction methods, and the ensemble
   method whose procedure is described in the Methods section. The tool
   will help researchers apply the ensemble method to new datasets, e.g.
   using Pearson+IDA+Lasso, which is the best method suggested by the
   results of this paper. The tool also allows researchers to create new
   ensemble methods when the novel single computational methods arise.

   In summary, we propose a framework based on Borda count election to
   generate ensemble methods from existing individual methods. The results
   show that the ensemble method Pearson+IDA+Lasso is a good method for
   miRNA target prediction. We report high-confident novel miRNA targets
   for further experimental validations. The R codes are provided, so that
   future methods can be integrated to form novel ensemble methods.

Supporting Information

   S1 File. Differential expression profiles of miRNAs and mRNAs for the
   EMT, MCC and BR51 datasets.

   The p-values are adjusted by Benjamini-Hochberg (BH) method.

   (XLSX)
   [158]Click here for additional data file.^ (2.9MB, xlsx)
   S2 File. R source codes for eight individual miRNA target prediction
   methods and three ensemble approaches.

   (ZIP)
   [159]Click here for additional data file.^ (5.7MB, zip)
   S3 File. The union of four databases including TarBase, miRecords,
   miRWalk and miRTarBase.

   The total number of miRNA-mRNA interactions is 62858.

   (XLSX)
   [160]Click here for additional data file.^ (913KB, xlsx)
   S4 File. Comparison between ensemble methods and part of confirmed
   miRNA-mRNA interactions by the ensemble method (Pearson+IDA+Lasso) and
   three individual methods (Pearson, IDA and Lasso) in EMT, MCC and BR51
   datasets.

   Pearson+IDA+Lasso finds complementary results.

   (XLSX)
   [161]Click here for additional data file.^ (21.2KB, xlsx)
   S5 File. The significant KEGG pathways of the prediction results using
   the ensemble method (Pearson+IDA+Lasso) in the EMT, MCC and BR51
   datasets.

   The p-values are adjusted by Benjamini-Hochberg (BH) method.

   (XLSX)
   [162]Click here for additional data file.^ (23.4KB, xlsx)
   S6 File. Highly-confident novel miRNA-mRNA interactions in the EMT, MCC
   and BR51 datasets.

   (XLSX)
   [163]Click here for additional data file.^ (18.8KB, xlsx)

Data Availability

   All relevant data are within the paper and its Supporting Information
   files.

Funding Statement

   This work has been partially supported by Australian Research Council
   Discovery Project DP130104090, (JL and LL), and the Applied Basic
   Research Foundation of Science and Technology of Yunnan Province, No.
   2013FD038 (JZ). The funders had no role in study design, data
   collection and analysis, decision to publish, or preparation of the
   manuscript.

References