Abstract miRNAs are small RNA molecules (′ 22nt) that interact with their corresponding target mRNAs inhibiting the translation of the mRNA into proteins and cleaving the target mRNA. This second effect diminishes the overall expression of the target mRNA. Several miRNA-mRNA relationship databases have been deployed, most of them based on sequence complementarities. However, the number of false positives in these databases is large and they do not overlap completely. Recently, it has been proposed to combine expression measurement from both miRNA and mRNA and sequence based predictions to achieve more accurate relationships. In our work, we use LASSO regression with non-positive constraints to integrate both sources of information. LASSO enforces the sparseness of the solution and the non-positive constraints restrict the search of miRNA targets to those with down-regulation effects on the mRNA expression. We named this method TaLasso (miRNA-Target LASSO). We used TaLasso on two public datasets that have paired expression levels of human miRNAs and mRNAs. The top ranked interactions recovered by TaLasso are especially enriched (more than using any other algorithm) in experimentally validated targets. The functions of the genes with mRNA transcripts in the top-ranked interactions are meaningful. This is not the case using other algorithms. TaLasso is available as Matlab or R code. There is also a web-based tool for human miRNAs at [39]http://talasso.cnb.csic.es/. Introduction miRNAs are small RNA molecules (′ 22nt) that regulate the expression of their corresponding mRNA targets in many eukaryotes. The imperfect base pairing of the miRNAs with the 3′-untranslated region (3′-UTR) of their targets can either inhibit translation or cause mRNA cleavage [40][1]. Some authors have recently stated that “miRNAs predominantly act to decrease target mRNA levels” [41][2]. Although low translational repression with no mRNA destabilization is also possible, the overall effect is the down-regulation of both protein and mRNA concentrations. This deregulation is key in a wide range of biological processes and human diseases [42][3]. MiRBase [43][4] is the de facto standard database used to retrieve data related to miRNAs. It currently contains near 1400 human miRNAs (release 17). Considering that each miRNA has a sequence compatible with around 200 target mRNAs, the number of putative interactions is very large. Most of the computational methods developed to identify mRNA-miRNA interactions are based on sequence complementarities of miRNA and its mRNA targets. These algorithms have been used to create databases of interactions, such as miRBase [44][4], TargetScan [45][5]–[46][7], PicTar [47][8], miRanda [48][9] or miRGen [49][10]. Unfortunately, the number of false predictions using these computational methods is still high [50][11]. Although there are experimental tools for miRNA target validation, several involved steps must be performed to verify the authenticity of a miRNA target. Therefore, the number of experimentally-validated targets is still very low. For instance, TarBase [51][12] includes around 1300 experimentally-validated interactions. This number is small if compared with the number of predicted interactions in miRBase (>500000). In recent years, different computational methods for miRNA-mRNA interaction prediction that use expression data to decipher miRNA targets have emerged. Some of them combine this information with sequence based predictions to obtain more reliable miRNA targets. Among them are GenMiR++ [52][13], [53][14], HOCTAR [54][15] or MAGIA [55][16]. The key assumption is that due to the down-regulation effect of miRNAs on its targets, their expression levels must be inversely related, i.e. if a particular miRNA expression increases, the expression of its mRNA targets must decrease. The difference between these methods relies on the way they use this information. For example, GenMiR++ uses a Bayesian framework and states the probabilities of having an interaction between a miRNA and its putative targets. HOCTAR [56][15] assumes that the expression of intronic miRNAs is strongly correlated with the expression of the genes where they are located. Therefore, gene expression can be used as an estimator of intronic miRNA expression. This approach uses the advantage of a huge number of samples, for which gene expression is available, to obtain statistically significant results. In the end, HOCTAR uses simply the correlation. This measure has been also used by other authors for miRNA-mRNA target prediction [57][17]–[58][19]. On the other hand, MAGIA uses the mutual information. A possible drawback is that it is not possible to distinguish between positive and negative regulations. Recently, other methods based on expression data analysis for miRNA-mRNA interaction prediction have been published [59][20], [60][21]. Jayaswal et al. [61][20] use a two step process that consists of clustering each expression data for miRNA and mRNA followed by a t-test to find significant miRNA-mRNA relationships. Li et al. [62][21] apply Partial Least Squares Regression to a preselected set of differentially expressed mRNAs and miRNAs. Adding expression data to sequence based predictions has been shown to reduce false positive rate [63][22]. We have developed the TaLasso algorithm that combines the information of sequence databases with mRNA and miRNA expression to quantify the down-regulation between miRNAs and their putative targets. It can be considered as a filter to sequence-based databases. Our method is based on the LASSO regression. The norm-1 penalty term of LASSO ensures the number of predicted interactions is small. In addition, we included non-positive constraints to the LASSO regression to restrict the predicted mRNA targets to those with down-regulation effects from miRNAs. TaLasso uses as initial putative targets the union of several sequence databases (miRGen [64][10], miRBase [65][4], miRanda (microRNA.org) [66][9], TarBase [67][12], miRecords [68][23], miRWalk [69][24] release of 2010). TaLasso was applied to two datasets with matched samples of mRNA and miRNA expression data. TaLasso was validated by measuring its ability to predict experimentally-validated targets and by analysing the biological relevancy of the predicted interactions. These results were compared with other methods (GenMiR++ and Pearson correlation). We illustrate that the top interactions predicted by TaLasso are significantly enriched in validated targets and that this enrichment is larger than using GenMiR++ or Pearson correlation. Furthermore, the functions of the genes with mRNA transcripts in the top-ranked interactions are biologically sound in the context of the experiments being studied. We also show that including the non-positive constraints improves the specificity of the prediction using LASSO regression. Materials and Methods Given a set of miRNA-mRNA putative interactions and given their expression values, TaLasso quantifies the down-regulation effect of each miRNA on its mRNA targets. Due to the biological complexity underlying mRNA regulation, this quantification is made by considering several simplifications. Assumptions First of all, it will be assumed that the miRNAs are the only regulators of mRNA expression, considering other possible effects as part of the noise. Therefore, TaLassso –as other methods– is able to detect only the most prominent interactions. Secondly, the model assumes that the miRNAs down-regulate their corresponding mRNA targets. Although the existence of miRNAs that act as transcription factors [70][1] have been shown, they will not be considered in the model. Thirdly, TaLasso will only quantify the down-regulation effect on those miRNA-mRNA interactions from an initial set of putative miRNA-mRNA pairs (i.e. predicted from sequence based algorithms). Consequently, TaLasso will not be able to recover those interactions not included in this initial set of putative targets. Finally, the logarithm of the mRNA expression will be assumed to be linearly dependent on the logarithm of the expression of its putative miRNA regulators. Mathematical model Let vectors x[j] = [x[j1], x[j2], … , x[jI]][1xI] and z[k] = [z[k1], z[k2], … , z[kI]][1xI] be the logarithms of the expression levels of mRNA j and miRNA k, in samples 1 to I. Consider also that there are J mRNAs and K miRNAs. Let c[jk] be an indicator variable whose value is 1 if mRNA j is a potential target of miRNA k predicted in sequence databases and 0 otherwise. Then, a linear relationship between the logarithm of the expressions of each mRNA j and the K miRNAs is assumed and it is represented by the following linear model, graphic file with name pone.0030766.e001.jpg (1) where ε[j] is an error term and x[j]^0 = x[j]^0 ⋅ [1, 1, … , 1][1xI] is the intercept term, i.e. the expected value of the logarithm of the expression of mRNA j in the absence of regulation from miRNAs in samples 1 to I. β[jk] is the amount of regulation for each miRNA - mRNA pair, i.e. how much the miRNA down-regulates the target mRNA. β[jk] and x[j]^0 are unknown. Resolution of the linear model In many cases, the number of miRNAs that putatively regulate an mRNA is larger than the number of samples. Therefore equation (1) is an undetermined system of equations. We propose to solve Eq (1) for each mRNA using l[1]-regularized least squares and adding non-positivity constraints: graphic file with name pone.0030766.e002.jpg (2) The l[1]-norm of the penalty term ( Inline graphic ) enforces the sparsity of the solution. It is adjusted with the tuning parameter λ[j]. The larger the value of this parameter the sparser the solution is. Non-positivity constraints are added to ensure that the solution includes only negative relationships between mRNA and miRNA expressions. This is a convex problem and thus, if an algorithm is able to find a local minimum, this minimum is guaranteed to be also global. Equation (2) is very similar to a LASSO regression problem. The only differences are that Eq (2) includes a term that must not be penalized (x[j]^0) and that there are non-positivity constraints. Some LASSO packages that include inequality restrictions are available and they can be accommodated to solve this optimization problem [71][25]–[72][27]. We have adapted two solvers for this particular problem, one for Matlab and the other one for R ([73]http://cran.r-project.org/). TaLasso can be run in either Matlab or R. Matlab code is adapted from the l1_ls software ([74]http://www.stanford.edu/~boyd/l1_ls/) [75][26]. Some minor changes in the initial Matlab code boosted the speed two-fold. R code was developed by using the package Rcplex ([76]http://CRAN.R-project.org/package=Rcplex) [77][27]. In this case, the l[1]-regularized least square problem was converted into a quadratic problem by expanding the norm-2 factor. Both packages have comparable performance. Specifically, Matlab package is 30% faster. There is no method that provides statistical significance in the context of regularized Least Squares with constraints. Although bootstrap could be an alternative, the estimation is inappropriate due to the bias and is time consuming. We have instead opted to use the statistics of multiple linear regression to assign to the solutions of TaLasso their p-vales pstatistical significance. In the software, we have included a function called significance_beta, in both R and Matlab codes (see [78]texts S5 and [79]S6 for further information). Selection of the l[1]-penalty LASSO regression theory states [80][26] that the possible values of the tuning parameter λ[j] must lie within the interval [0, λ[j]^max], being λ[j]^max equal to, graphic file with name pone.0030766.e004.jpg (3) If λ[j] is zero, the standard minimum squares solution is obtained. If λ[j] is λ[j]^max then the optimum solution is the null vector. The value of λ[j] can be selected by cross validation. It would be possible to compute λ[j] independently for each mRNA. However, this approach requires many additional parameters in the model and it is prone to overfitting. Instead of this, we select λ[j] for every mRNA as a fraction of a maximum value Λ[j]. In the following, the variable κ = λ[j]/Λ[j], same for all mRNAs, will be used to denote these fractions. TaLasso provides two possible ways to select Λ[j]: the global and the local methods. In the global method, Λ[j], is selected as, graphic file with name pone.0030766.e005.jpg (4) And therefore, graphic file with name pone.0030766.e006.jpg (5) The estimation of λ^G is called global tuning parameter in the rest of the paper and provides a single penalty term for every mRNA. In the local method Λ[j] is selected as, graphic file with name pone.0030766.e007.jpg (6) and therefore, graphic file with name pone.0030766.e008.jpg (7) The estimation for λ[j]^L in Eq (7) will be referred to as the local tuning parameter. In either case we used a single κ^L or for κ^G for all mRNAs. Notice that, although κ^L is the same for all mRNAs, λ[j]^max is not and thus, λ[j]^L will be different for each mRNA. In our case studies, global selection outperforms local selection. Results are included in the [81]text S1. The values of κ^G (and κ^L) were determined by testing the factors {1/10000, 1/100, 1/50, 1/20, 1/10, 1/5, 1/3, 1/2} using Leave One Out Cross Validation (LOOCV). This method performs the optimization using all samples but one and validates the results with the sample left out. This procedure is repeated for every sample. The selected value of κ is the one that provides the minimum square error (MSE) in the samples left out training the algorithm with the other samples, i.e, graphic file with name pone.0030766.e009.jpg (8) where Inline graphic and Inline graphic ^0[j,] [≠i] are the estimations of the down-regulation and the intercept obtained with TaLasso for the mRNA and miRNA expression of all samples but i. Inline graphic [j,i] is the predicted expression value of mRNA j in sample i determined by the values Inline graphic . Finally, x[j,i] and z[k,i] are the expressions of mRNA j and miRNA k on sample i. Results We have implemented the solvers for Eq (2) in R and Matlab. Putative and experimental interactions are included in the downloadable software. We have also developed a convenient web page that performs this computation. We named them “TaLasso”. Therefore, TaLasso is both the web page and the algorithm adaptations that solve the problem described in Eq. (2). The result is the amount of down-regulation of a miRNA for their targets in a particular experiment. We run TaLasso for two real datasets and validate the results by studying the enrichment on experimentally-validated targets. We also analyze the biological relevance of the predicted interactions. In this section we present these results and include a comparison between TaLasso, GenMiR++ and correlation methods. Expression datasets TaLasso has been tested with two datasets with matched miRNA and mRNA expression data. The first dataset, which will be referred as Multi Class Cancer (MCC) dataset, is composed of miRNA expression data from [82][28] and mRNA expression data from [83][29]. This dataset was used in GenMiR++. The second one, which will be referred to as acute lymphoblastic Leukemia DataSet (LDS), is publicly available in GEO ([84]GSE14834). It corresponds to expression profiles published by [85][30] and is part of a case study used in MAGIA. In both datasets, the expression values were further normalized by subtracting from each mRNA or miRNA their median value. Enrichment on experimentally-validated targets We compared TaLasso results with GenMiR++ and the Pearson correlation. We were not able to reproduce mutual information results obtained in MAGIA. Nevertheless, it has been stated that for normal random variables, Mutual Information (MI) can be estimated as a function of the Spearman or the Pearson correlation coefficient [86][31]. Therefore, if normal conditions are met, the ranking of MI results and the ranking of the absolute value of correlation are expected to be similar. We have checked that correlation performs better than the absolute value of the correlation. The comparison of the scoring algorithms is based on measuring the enrichment of their results on experimentally-validated interactions. If the top ranked interactions of an algorithm have more experimentally-validated targets, this algorithm is expected to perform better as more predicted interactions are validated. The union of miRBase, miRanda, miRGen, miRecords, TarBase and miRWalk were used as the initial set of interactions. Among them, TaRBase (v5), miRecords (May 2010) and miRWalk databases were used as reference of experimentally-validated targets. Whereas the interactions included in TaRBase and miRecords have been manually curated, those in MiRWalk were automatically extracted from the literature using text mining techniques. Therefore, this last database includes less reliable but more interactions than the other two. [87]Figure 1 shows the Venn diagram of the number of interactions provided by each database and the intersections among them. The number of interactions in any particular experiment is usually smaller since a particular experiment includes only a subset of the mRNAs and miRNAs in the databases. Nevertheless, the initial set of interactions comprises more than half a million putative interactions. From them, only a few (∼1000 or 10000 depending on the database) correspond to the experimentally-validated targets. Figure 1. Shared interactions among the different databases of human miRNA targets that have been used as initial set of putative interactions. [88]Figure 1 [89]Open in a new tab The overlap among the different databases is small. With reference to databases with experimentally-validated targets, the union of miRecords and TarBase includes 623 interactions that are also cited in any of the computationally predicted databases. This number rises to 4372 in case miRWalk is also considered. Applying TaLasso to the MCC and the LDS datasets, β, the downregulation effect of each miRNA on its targets was obtained. For both datasets we performed an enrichment analysis using the results provided by TaLasso, GenMiR++ and Correlation to mRNAs with at least one experimentally-validated miRNA target. The aim of this analysis is to compare each of these algorithms in terms of the number of experimentally-validated interactions. The enrichment analysis was developed as follows. Firstly, the interactions were ranked according to the scores provided by each method. Secondly, taking the top-ranked n interactions, we counted the number of experimentally-validated targets within them. With this data, we compute the corresponding p-value using the hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the probability of obtaining p successes (experimentally-validated interactions) when n elements (selected interactions) from a finite population without replacement (the union of all the putative interactions) are drawn. These p-values help to compare the algorithms: for a given n (number of selected interactions) the smaller the p-value, the more enriched the solution in experimentally-validated interactions is. We compared the algorithms using two scores: the number of experimentally validated interactions in the top-500 predicted interactions and the minimum p-value on the enrichment curves. In the second case we also provide the number of experimentally-validated targets (N[E]) and the amount of predicted interactions drawn (N[T]) on that point. The following subsection shows the enrichment results for the union of TarBase and miRecords. The enrichment for miRWalk is included in [90]text S1 and shows a similar trend. Multi Class Cancer (MCC) expression data The dataset used in GenMiR++ is a set of 88 paired cancer and normal samples with mRNA data from [91][28] and miRNA expression data from [92][28]. The mRNA expression data was measured with oligonucleotide microarrays Hu6800 and Hu35KsubA GeneChips (Affymetrix, Santa Clara, CA). On the other hand, the miRNA expression was obtained using bead-based flow cytometry. The dataset consists of normal and cancerous counterparts from bladder, breast, colon, kidney, lung, pancreas, prostate and uterus samples, ovary cancer, melanoma and mesothelioma samples with no normal references. We compiled the expression data from