Abstract

   miRNAs are small RNA molecules (′ 22nt) that interact with their
   corresponding target mRNAs inhibiting the translation of the mRNA into
   proteins and cleaving the target mRNA. This second effect diminishes
   the overall expression of the target mRNA. Several miRNA-mRNA
   relationship databases have been deployed, most of them based on
   sequence complementarities. However, the number of false positives in
   these databases is large and they do not overlap completely. Recently,
   it has been proposed to combine expression measurement from both miRNA
   and mRNA and sequence based predictions to achieve more accurate
   relationships. In our work, we use LASSO regression with non-positive
   constraints to integrate both sources of information. LASSO enforces
   the sparseness of the solution and the non-positive constraints
   restrict the search of miRNA targets to those with down-regulation
   effects on the mRNA expression. We named this method TaLasso
   (miRNA-Target LASSO).

   We used TaLasso on two public datasets that have paired expression
   levels of human miRNAs and mRNAs. The top ranked interactions recovered
   by TaLasso are especially enriched (more than using any other
   algorithm) in experimentally validated targets. The functions of the
   genes with mRNA transcripts in the top-ranked interactions are
   meaningful. This is not the case using other algorithms.

   TaLasso is available as Matlab or R code. There is also a web-based
   tool for human miRNAs at [39]http://talasso.cnb.csic.es/.

Introduction

   miRNAs are small RNA molecules (′ 22nt) that regulate the expression of
   their corresponding mRNA targets in many eukaryotes. The imperfect base
   pairing of the miRNAs with the 3′-untranslated region (3′-UTR) of their
   targets can either inhibit translation or cause mRNA cleavage [40][1].
   Some authors have recently stated that “miRNAs predominantly act to
   decrease target mRNA levels” [41][2]. Although low translational
   repression with no mRNA destabilization is also possible, the overall
   effect is the down-regulation of both protein and mRNA concentrations.
   This deregulation is key in a wide range of biological processes and
   human diseases [42][3].

   MiRBase [43][4] is the de facto standard database used to retrieve data
   related to miRNAs. It currently contains near 1400 human miRNAs
   (release 17). Considering that each miRNA has a sequence compatible
   with around 200 target mRNAs, the number of putative interactions is
   very large. Most of the computational methods developed to identify
   mRNA-miRNA interactions are based on sequence complementarities of
   miRNA and its mRNA targets. These algorithms have been used to create
   databases of interactions, such as miRBase [44][4], TargetScan
   [45][5]–[46][7], PicTar [47][8], miRanda [48][9] or miRGen [49][10].
   Unfortunately, the number of false predictions using these
   computational methods is still high [50][11]. Although there are
   experimental tools for miRNA target validation, several involved steps
   must be performed to verify the authenticity of a miRNA target.
   Therefore, the number of experimentally-validated targets is still very
   low. For instance, TarBase [51][12] includes around 1300
   experimentally-validated interactions. This number is small if compared
   with the number of predicted interactions in miRBase (>500000).

   In recent years, different computational methods for miRNA-mRNA
   interaction prediction that use expression data to decipher miRNA
   targets have emerged. Some of them combine this information with
   sequence based predictions to obtain more reliable miRNA targets. Among
   them are GenMiR++ [52][13], [53][14], HOCTAR [54][15] or MAGIA
   [55][16]. The key assumption is that due to the down-regulation effect
   of miRNAs on its targets, their expression levels must be inversely
   related, i.e. if a particular miRNA expression increases, the
   expression of its mRNA targets must decrease.

   The difference between these methods relies on the way they use this
   information. For example, GenMiR++ uses a Bayesian framework and states
   the probabilities of having an interaction between a miRNA and its
   putative targets. HOCTAR [56][15] assumes that the expression of
   intronic miRNAs is strongly correlated with the expression of the genes
   where they are located. Therefore, gene expression can be used as an
   estimator of intronic miRNA expression. This approach uses the
   advantage of a huge number of samples, for which gene expression is
   available, to obtain statistically significant results. In the end,
   HOCTAR uses simply the correlation. This measure has been also used by
   other authors for miRNA-mRNA target prediction [57][17]–[58][19]. On
   the other hand, MAGIA uses the mutual information. A possible drawback
   is that it is not possible to distinguish between positive and negative
   regulations.

   Recently, other methods based on expression data analysis for
   miRNA-mRNA interaction prediction have been published [59][20],
   [60][21]. Jayaswal et al. [61][20] use a two step process that consists
   of clustering each expression data for miRNA and mRNA followed by a
   t-test to find significant miRNA-mRNA relationships. Li et al. [62][21]
   apply Partial Least Squares Regression to a preselected set of
   differentially expressed mRNAs and miRNAs.

   Adding expression data to sequence based predictions has been shown to
   reduce false positive rate [63][22].

   We have developed the TaLasso algorithm that combines the information
   of sequence databases with mRNA and miRNA expression to quantify the
   down-regulation between miRNAs and their putative targets. It can be
   considered as a filter to sequence-based databases.

   Our method is based on the LASSO regression. The norm-1 penalty term of
   LASSO ensures the number of predicted interactions is small. In
   addition, we included non-positive constraints to the LASSO regression
   to restrict the predicted mRNA targets to those with down-regulation
   effects from miRNAs. TaLasso uses as initial putative targets the union
   of several sequence databases (miRGen [64][10], miRBase [65][4],
   miRanda (microRNA.org) [66][9], TarBase [67][12], miRecords [68][23],
   miRWalk [69][24] release of 2010). TaLasso was applied to two datasets
   with matched samples of mRNA and miRNA expression data.

   TaLasso was validated by measuring its ability to predict
   experimentally-validated targets and by analysing the biological
   relevancy of the predicted interactions. These results were compared
   with other methods (GenMiR++ and Pearson correlation). We illustrate
   that the top interactions predicted by TaLasso are significantly
   enriched in validated targets and that this enrichment is larger than
   using GenMiR++ or Pearson correlation. Furthermore, the functions of
   the genes with mRNA transcripts in the top-ranked interactions are
   biologically sound in the context of the experiments being studied. We
   also show that including the non-positive constraints improves the
   specificity of the prediction using LASSO regression.

Materials and Methods

   Given a set of miRNA-mRNA putative interactions and given their
   expression values, TaLasso quantifies the down-regulation effect of
   each miRNA on its mRNA targets. Due to the biological complexity
   underlying mRNA regulation, this quantification is made by considering
   several simplifications.

Assumptions

   First of all, it will be assumed that the miRNAs are the only
   regulators of mRNA expression, considering other possible effects as
   part of the noise. Therefore, TaLassso –as other methods– is able to
   detect only the most prominent interactions.

   Secondly, the model assumes that the miRNAs down-regulate their
   corresponding mRNA targets. Although the existence of miRNAs that act
   as transcription factors [70][1] have been shown, they will not be
   considered in the model.

   Thirdly, TaLasso will only quantify the down-regulation effect on those
   miRNA-mRNA interactions from an initial set of putative miRNA-mRNA
   pairs (i.e. predicted from sequence based algorithms). Consequently,
   TaLasso will not be able to recover those interactions not included in
   this initial set of putative targets.

   Finally, the logarithm of the mRNA expression will be assumed to be
   linearly dependent on the logarithm of the expression of its putative
   miRNA regulators.

Mathematical model

   Let vectors x[j] = [x[j1], x[j2], … , x[jI]][1xI] and z[k] = [z[k1],
   z[k2], … , z[kI]][1xI] be the logarithms of the expression levels of
   mRNA j and miRNA k, in samples 1 to I. Consider also that there are J
   mRNAs and K miRNAs. Let c[jk] be an indicator variable whose value is 1
   if mRNA j is a potential target of miRNA k predicted in sequence
   databases and 0 otherwise. Then, a linear relationship between the
   logarithm of the expressions of each mRNA j and the K miRNAs is assumed
   and it is represented by the following linear model,
   graphic file with name pone.0030766.e001.jpg (1)

   where ε[j] is an error term and x[j]^0 = x[j]^0 ⋅ [1, 1, … , 1][1xI] is
   the intercept term, i.e. the expected value of the logarithm of the
   expression of mRNA j in the absence of regulation from miRNAs in
   samples 1 to I. β[jk] is the amount of regulation for each miRNA - mRNA
   pair, i.e. how much the miRNA down-regulates the target mRNA. β[jk] and
   x[j]^0 are unknown.

Resolution of the linear model

   In many cases, the number of miRNAs that putatively regulate an mRNA is
   larger than the number of samples. Therefore equation (1) is an
   undetermined system of equations. We propose to solve Eq (1) for each
   mRNA using l[1]-regularized least squares and adding non-positivity
   constraints:
   graphic file with name pone.0030766.e002.jpg (2)

   The l[1]-norm of the penalty term ( Inline graphic ) enforces the
   sparsity of the solution. It is adjusted with the tuning parameter
   λ[j]. The larger the value of this parameter the sparser the solution
   is. Non-positivity constraints are added to ensure that the solution
   includes only negative relationships between mRNA and miRNA
   expressions. This is a convex problem and thus, if an algorithm is able
   to find a local minimum, this minimum is guaranteed to be also global.

   Equation (2) is very similar to a LASSO regression problem. The only
   differences are that Eq (2) includes a term that must not be penalized
   (x[j]^0) and that there are non-positivity constraints. Some LASSO
   packages that include inequality restrictions are available and they
   can be accommodated to solve this optimization problem
   [71][25]–[72][27].

   We have adapted two solvers for this particular problem, one for Matlab
   and the other one for R ([73]http://cran.r-project.org/). TaLasso can
   be run in either Matlab or R. Matlab code is adapted from the l1_ls
   software ([74]http://www.stanford.edu/~boyd/l1_ls/) [75][26]. Some
   minor changes in the initial Matlab code boosted the speed two-fold. R
   code was developed by using the package Rcplex
   ([76]http://CRAN.R-project.org/package=Rcplex) [77][27]. In this case,
   the l[1]-regularized least square problem was converted into a
   quadratic problem by expanding the norm-2 factor. Both packages have
   comparable performance. Specifically, Matlab package is 30% faster.

   There is no method that provides statistical significance in the
   context of regularized Least Squares with constraints. Although
   bootstrap could be an alternative, the estimation is inappropriate due
   to the bias and is time consuming. We have instead opted to use the
   statistics of multiple linear regression to assign to the solutions of
   TaLasso their p-vales pstatistical significance. In the software, we
   have included a function called significance_beta, in both R and Matlab
   codes (see [78]texts S5 and [79]S6 for further information).

Selection of the l[1]-penalty

   LASSO regression theory states [80][26] that the possible values of the
   tuning parameter λ[j] must lie within the interval [0, λ[j]^max], being
   λ[j]^max equal to,
   graphic file with name pone.0030766.e004.jpg (3)

   If λ[j] is zero, the standard minimum squares solution is obtained. If
   λ[j] is λ[j]^max then the optimum solution is the null vector. The
   value of λ[j] can be selected by cross validation. It would be possible
   to compute λ[j] independently for each mRNA. However, this approach
   requires many additional parameters in the model and it is prone to
   overfitting. Instead of this, we select λ[j] for every mRNA as a
   fraction of a maximum value Λ[j]. In the following, the variable
   κ = λ[j]/Λ[j], same for all mRNAs, will be used to denote these
   fractions.

   TaLasso provides two possible ways to select Λ[j]: the global and the
   local methods. In the global method, Λ[j], is selected as,
   graphic file with name pone.0030766.e005.jpg (4)

   And therefore,
   graphic file with name pone.0030766.e006.jpg (5)

   The estimation of λ^G is called global tuning parameter in the rest of
   the paper and provides a single penalty term for every mRNA.

   In the local method Λ[j] is selected as,
   graphic file with name pone.0030766.e007.jpg (6)

   and therefore,
   graphic file with name pone.0030766.e008.jpg (7)

   The estimation for λ[j]^L in Eq (7) will be referred to as the local
   tuning parameter.

   In either case we used a single κ^L or for κ^G for all mRNAs. Notice
   that, although κ^L is the same for all mRNAs, λ[j]^max is not and thus,
   λ[j]^L will be different for each mRNA. In our case studies, global
   selection outperforms local selection. Results are included in the
   [81]text S1.

   The values of κ^G (and κ^L) were determined by testing the factors
   {1/10000, 1/100, 1/50, 1/20, 1/10, 1/5, 1/3, 1/2} using Leave One Out
   Cross Validation (LOOCV). This method performs the optimization using
   all samples but one and validates the results with the sample left out.
   This procedure is repeated for every sample. The selected value of κ is
   the one that provides the minimum square error (MSE) in the samples
   left out training the algorithm with the other samples, i.e,
   graphic file with name pone.0030766.e009.jpg (8)

   where Inline graphic and Inline graphic ^0[j,] [≠i] are the estimations
   of the down-regulation and the intercept obtained with TaLasso for the
   mRNA and miRNA expression of all samples but i. Inline graphic [j,i] is
   the predicted expression value of mRNA j in sample i determined by the
   values Inline graphic . Finally, x[j,i] and z[k,i] are the expressions
   of mRNA j and miRNA k on sample i.

Results

   We have implemented the solvers for Eq (2) in R and Matlab. Putative
   and experimental interactions are included in the downloadable
   software. We have also developed a convenient web page that performs
   this computation. We named them “TaLasso”. Therefore, TaLasso is both
   the web page and the algorithm adaptations that solve the problem
   described in Eq. (2). The result is the amount of down-regulation of a
   miRNA for their targets in a particular experiment.

   We run TaLasso for two real datasets and validate the results by
   studying the enrichment on experimentally-validated targets. We also
   analyze the biological relevance of the predicted interactions. In this
   section we present these results and include a comparison between
   TaLasso, GenMiR++ and correlation methods.

Expression datasets

   TaLasso has been tested with two datasets with matched miRNA and mRNA
   expression data. The first dataset, which will be referred as Multi
   Class Cancer (MCC) dataset, is composed of miRNA expression data from
   [82][28] and mRNA expression data from [83][29]. This dataset was used
   in GenMiR++. The second one, which will be referred to as acute
   lymphoblastic Leukemia DataSet (LDS), is publicly available in GEO
   ([84]GSE14834). It corresponds to expression profiles published by
   [85][30] and is part of a case study used in MAGIA. In both datasets,
   the expression values were further normalized by subtracting from each
   mRNA or miRNA their median value.

Enrichment on experimentally-validated targets

   We compared TaLasso results with GenMiR++ and the Pearson correlation.
   We were not able to reproduce mutual information results obtained in
   MAGIA. Nevertheless, it has been stated that for normal random
   variables, Mutual Information (MI) can be estimated as a function of
   the Spearman or the Pearson correlation coefficient [86][31].
   Therefore, if normal conditions are met, the ranking of MI results and
   the ranking of the absolute value of correlation are expected to be
   similar. We have checked that correlation performs better than the
   absolute value of the correlation.

   The comparison of the scoring algorithms is based on measuring the
   enrichment of their results on experimentally-validated interactions.
   If the top ranked interactions of an algorithm have more
   experimentally-validated targets, this algorithm is expected to perform
   better as more predicted interactions are validated.

   The union of miRBase, miRanda, miRGen, miRecords, TarBase and miRWalk
   were used as the initial set of interactions. Among them, TaRBase (v5),
   miRecords (May 2010) and miRWalk databases were used as reference of
   experimentally-validated targets. Whereas the interactions included in
   TaRBase and miRecords have been manually curated, those in MiRWalk were
   automatically extracted from the literature using text mining
   techniques. Therefore, this last database includes less reliable but
   more interactions than the other two.

   [87]Figure 1 shows the Venn diagram of the number of interactions
   provided by each database and the intersections among them. The number
   of interactions in any particular experiment is usually smaller since a
   particular experiment includes only a subset of the mRNAs and miRNAs in
   the databases. Nevertheless, the initial set of interactions comprises
   more than half a million putative interactions. From them, only a few
   (∼1000 or 10000 depending on the database) correspond to the
   experimentally-validated targets.

Figure 1. Shared interactions among the different databases of human miRNA
targets that have been used as initial set of putative interactions.

   [88]Figure 1
   [89]Open in a new tab

   The overlap among the different databases is small. With reference to
   databases with experimentally-validated targets, the union of miRecords
   and TarBase includes 623 interactions that are also cited in any of the
   computationally predicted databases. This number rises to 4372 in case
   miRWalk is also considered.

   Applying TaLasso to the MCC and the LDS datasets, β, the downregulation
   effect of each miRNA on its targets was obtained. For both datasets we
   performed an enrichment analysis using the results provided by TaLasso,
   GenMiR++ and Correlation to mRNAs with at least one
   experimentally-validated miRNA target. The aim of this analysis is to
   compare each of these algorithms in terms of the number of
   experimentally-validated interactions.

   The enrichment analysis was developed as follows. Firstly, the
   interactions were ranked according to the scores provided by each
   method. Secondly, taking the top-ranked n interactions, we counted the
   number of experimentally-validated targets within them. With this data,
   we compute the corresponding p-value using the hypergeometric
   distribution. The hypergeometric distribution is a discrete probability
   distribution that describes the probability of obtaining p successes
   (experimentally-validated interactions) when n elements (selected
   interactions) from a finite population without replacement (the union
   of all the putative interactions) are drawn.

   These p-values help to compare the algorithms: for a given n (number of
   selected interactions) the smaller the p-value, the more enriched the
   solution in experimentally-validated interactions is.

   We compared the algorithms using two scores: the number of
   experimentally validated interactions in the top-500 predicted
   interactions and the minimum p-value on the enrichment curves. In the
   second case we also provide the number of experimentally-validated
   targets (N[E]) and the amount of predicted interactions drawn (N[T]) on
   that point.

   The following subsection shows the enrichment results for the union of
   TarBase and miRecords. The enrichment for miRWalk is included in
   [90]text S1 and shows a similar trend.

Multi Class Cancer (MCC) expression data

   The dataset used in GenMiR++ is a set of 88 paired cancer and normal
   samples with mRNA data from [91][28] and miRNA expression data from
   [92][28]. The mRNA expression data was measured with oligonucleotide
   microarrays Hu6800 and Hu35KsubA GeneChips (Affymetrix, Santa Clara,
   CA). On the other hand, the miRNA expression was obtained using
   bead-based flow cytometry. The dataset consists of normal and cancerous
   counterparts from bladder, breast, colon, kidney, lung, pancreas,
   prostate and uterus samples, ovary cancer, melanoma and mesothelioma
   samples with no normal references. We compiled the expression data from