Abstract miRNAs are small non-coding RNAs that regulate gene expression by binding to the 3′-UTR of genes. Many recent studies have reported that miRNAs play important biological roles by regulating specific mRNAs or genes. Many sequence-based target prediction algorithms have been developed to predict miRNA targets. However, these methods are not designed for condition-specific target predictions and produce many false positives; thus, expression-based target prediction algorithms have been developed for condition-specific target predictions. A typical strategy to utilize expression data is to leverage the negative control roles of miRNAs on genes. To control false positives, a stringent cutoff value is typically set, but in this case, these methods tend to reject many true target relationships, i.e., false negatives. To overcome these limitations, additional information should be utilized. The literature is probably the best resource that we can utilize. Recent literature mining systems compile millions of articles with experiments designed for specific biological questions, and the systems provide a function to search for specific information. To utilize the literature information, we used a literature mining system, BEST, that automatically extracts information from the literature in PubMed and that allows the user to perform searches of the literature with any English words. By integrating omics data analysis methods and BEST, we developed Context-MMIA, a miRNA-mRNA target prediction method that combines expression data analysis results and the literature information extracted based on the user-specified context. In the pathway enrichment analysis using genes included in the top 200 miRNA-targets, Context-MMIA outperformed the four existing target prediction methods that we tested. In another test on whether prediction methods can re-produce experimentally validated target relationships, Context-MMIA outperformed the four existing target prediction methods. In summary, Context-MMIA allows the user to specify a context of the experimental data to predict miRNA targets, and we believe that Context-MMIA is very useful for predicting condition-specific miRNA targets. Introduction MicroRNAs (miRNAs) are small non-coding RNAs that are 19-24 nucleotides in length. These RNAs regulate gene expression at the post-transcriptional level by binding to the 3′-UTR of mRNAs [[36]1, [37]2]; thus, miRNAs are functionally important. There are numerous scientific findings on the functional roles of miRNAs by regulating specific genes. For example, it is reported that miR-15 and miR-16-1 bind to BCL2 [[38]3] and that apoptosis is induced. Another example is that miR-125b, miR-145, miR-21 and miR-155 are dysregulated in breast cancer cells, and different expression levels of these miRNAs have significant correlations with breast cancer phenotypes, such as tumor stages and status of estrogen and progesterone receptors [[39]4]. Moreover, it is well known that miRNAs are related to proliferation, differentiation, and cell death [[40]5]. The functional roles of miRNAs differ in different contexts. In other words, the relationship between miRNA and target genes is dynamic in different conditions. Thus, it is very important to identify which genes are targeted by miRNAs in a given context. There are more than 1000 miRNAs, and approximately 60% of protein-coding genes are regulated by miRNAs [[41]6]. Since it is not possible to perform biological experiments for such a large number of miRNAs and genes, computational prediction is very important, and numerous computational methods have been developed for predicting targets of miRNAs. The first generation of computational tools leverage sequence complementary information and binding energy potentials. These prediction methods include TargetScan [[42]7], PITA [[43]8], mirSVR [[44]9], miRanda [[45]10] and PicTar [[46]11]. These tools generally come with corresponding databases that compile miRNA-target information. In addition to sequence complementary information, there are different approaches used in each of these methods. miRanda estimates the energy on sequence matching of miRNA and mRNA pairs to predict targets [[47]10]. PicTar first finds candidate 3′-UTR sites and uses a hidden Markov model (HMM) to filter out target sites [[48]11]. TargetScan considers a conservation seed match and then considers regions outside seed matches [[49]7]. The mirSVR algorithm uses a support vector regression method to compute scores on candidate target sites that are identified by miRanda [[50]9]. PITA uses the accessibility of target sites as a main feature to predict targets [[51]8]. Target prediction methods based on the sequence similarity score rely on the existence of target sites, and these methods are accompanied by target databases. However, such target information is not condition specific without considering which miRNAs and which genes are expressed; thus, there are many false positives even if the target information is accurate, which is not the case since many target databases do not agree on the miRNA-target relationship. To make the target information condition specific, many expression-based target prediction methods have been developed. These methods take miRNA-mRNA expression data and several sequence-based target databases as input data and filter out miRNA-mRNA targets using statistical significance or computational algorithms. We briefly summarize the previous expression-based algorithms. GenMiR++ used a Bayesian model and expectation maximization algorithm to predict the posterior probability of a miRNA target for mRNA [[52]12]. MMIA employs a two-step method, where the first step is to select differentially expressed miRNA, and the second step is to select negatively correlated differentially expressed mRNA [[53]13] only for the differentially expressed miRNAs. MMIA also supports sequence data analysis on a cloud environment, which enables the user to utilize both microarray data and NGS data [[54]14]. MAGIA2 is a web-based tool that considers the correlation among miRNA and mRNA and transcription factor (TF) regulation [[55]15]. CoSMic extracts the significant target mRNA cluster for each miRNA [[56]16]. CoSMic employs methods similar to gene set enrichment analysis (GSEA) to identify miRNA targets [[57]17]. miRNAmRNA is a target prediction algorithm based on the global test of a linear regression model [[58]18]. To extract condition-specific miRNA activity, identifying causal relationships using intervention calculus when the DAG is absent was proposed [[59]19]. A recent tool, PlantMirnaT, was designed as a plant-specific miRNA-mRNA sequencing data analysis algorithm [[60]20]. The unique feature of PlantMirnaT is using the expression quantity information from sequencing data and employing a split ratio model to identify the relationship of target pairs. Motivation There are approximately 1,500 known miRNAs in the human genome. The number of possible miRNA-gene pairs exceeds 30 million when more than 20,000 protein-coding genes are considered. Among these pairs, only a fraction of the relationships are significant in terms of biological functions, e.g., phenotypes or cancer subtypes. Computational methods for predicting the miRNA target employ various techniques to identify phenotype-specific miRNA targets. Because this is a typical prediction problem, the challenges can be summarized in terms of false positives and false negatives. * Target databases have high false positive rates: Sequence-based target prediction algorithms, such as TargetScan, mirSVR, and PITA, and their corresponding databases generally produce high false positives. There are two major reasons for these high false positives. First, these databases contain all known targets; thus, the target information is not condition specific. For this reason, when transcriptome data measured in a specific condition are analyzed, many targets are false positives. Second, sequence-based prediction methods do not consider the regulatory role of miRNA, which generally results in a negative correlation between miRNA and the target gene. In addition, sequence-based prediction methods do not consider sample-specific sequence information. For example, sequence variations in the target regions can affect the target relationship, but the current algorithms do not consider minor but subtle sequence variations. * Expression-based methods may have false negative rates: Expression-based methods utilize negative correlation information between miRNA and targets or similar approaches. For these methods, there is always an issue of establishing a cutoff threshold value, e.g., for a negative correlation. If the cutoff value is not stringent, then there are too many miRNA-target relationships. Thus, in general, it is a common practice to set a quite stringent cutoff value. In this case, many true miRNA-target relationships can be rejected, i.e., the false negative issue. Addressing the false positive and false negative issues is a very challenging problem unless we fully understand how miRNAs regulate target genes. Using sequence pairing information and gene expression information is very useful because such methods have already produced many biologically meaningful results. However, one important information source, the literature, is not utilized in current methods. The scientific literature is currently growing exponentially. As shown in [61]Fig 1, more than 100,000 papers related to ‘cancer’ are published every year. Thus, if we combine sequence pairing information and gene expression information with the literature information, we can certainly make a good improvement in predicting miRNA targets, reducing both false positives and false negatives. In particular, as with the use of gene expression information, the use of the literature information should be condition specific. The main issues are how to handle the vast amount of studies in the literature, how to allow the user to specify the experimental conditions, and finally, how to combine sequence pairing information, gene expression information and the literature information in a single computational framework. Fig 1. The number of published papers related to the keyword ‘cancer’ since 2010. [62]Fig 1 [63]Open in a new tab More than 100,000 papers have been published every year. Toward this goal, two research groups are working together to design and implement a novel human-specific miRNA-target prediction method. First, we compute the omics score by utilizing sequence pairing information and gene expression information to produce candidate miRNA-target pairs. Then, we compute the literature-based context score to evaluate each candidate miRNA-target pair using the Biomedical Entity Search Tool (BEST) [[64]21]. Using BEST, the user can specify the experimental condition using a set of any keywords, which will automatically be translated to a set of genes and related miRNAs. Subsequently, the two scores, the omics score and the context score, are combined into a single score in a conditional probabilistic form. The remainder of this paper is organized as follow. In the Methods section, we explain how to compute the omics score based on the expression data and miRNA-gene relationship and the context score from the literature according to user-provided keywords. In the Results section, we show how our proposed method performs compared with four existing methods in experiments with omics datasets in the public domain. Methods In this section, we explain how our method, Context-MMIA, predicts human miRNA targets by combining the literature information and gene expression data. Context-MMIA takes two-class (control vs. treated) human miRNA-mRNA expression data as input. Then, with user-specified keywords as the context of the experiment, it computes the probabilities of miRNA-gene pairs relevant to the phenotype differences by combining gene/miRNA expression data and the literature data. [65]Fig 2 illustrates the workflow of Context-MMIA. First, differentially expressed miRNAs (DEmiRNAs) and differentially expressed mRNAs or genes (DEmRNAs) are determined with a cutoff value at the relaxed level such that most of the true positives can be retained in this step. Note that we use negative correlation information and the literature information to filter out and re-weight candidates for interaction pairs in the following steps. In the second step of processing omics data, human miRNA-mRNA pairs are predicted using miRNA target databases such as TargetScan, mirSVR, and PITA. These miRNA-mRNA pairs are further screened by negative correlation information between miRNA and mRNA. In the third step, for each pair of miRNA and mRNA, Context-MMIA calculates the omics score based on expression data and the context score based on the literature information compiled based on the user-provided keywords. Finally, target pairs are ranked by combining the omics score and context score. For each miRNA-mRNA pair, Context-MMIA computes alignments of human miRNA and the 3′-UTR of mRNA and generates the visualization of the miRNA-mRNA alignment on the website. Fig 2. Schematic workflow for Context-MMIA. [66]Fig 2 [67]Open in a new tab The system accepts expression information of miRNA and mRNA as inputs. In the MMIA step, DEmiRNAs and DEmRNAs are extracted based on their expression level difference, and their negative correlation is computed. In the Context-MMIA step, the system computes omics and context scores based on user-provided keywords by utilizing the BEST system. Finally, the system ranks miRNA-mRNA pairs using the scores. Identifying genes and miRNAs based on the user-provided context Context-MMIA takes a set of keywords from the user to specify the context of the experiment. Currently, the most widely used biomedical literature database, PubMed, contains over 26 millions records. When we perform a search with the keyword ‘cancer’, over 3 million records are retrieved. Thus, we believe that this literature database contains enough articles to rank miRNA-gene pairs in terms of the user-provided context. However, there are two major issues in ranking miRNA-gene pairs: given the keywords, relevant papers should be identified and relevant gene names and miRNA names should also be identified. Since not all papers contain the user-provided keywords, it is necessary to infer the relevance of the words to extract genes and miRNAs in the relevant articles. To address this issue, we use BEST to identify relevant words and genes/miRNAs [[68]21]. BEST has predefined biomedical entities for each category, such as drug, pathway, gene, and disease, and then it identifies relevant entities extracted from PubMed articles from the user query. For example, it returns entities such as ‘ERBB2’, ‘wnt signaling pathway’, and ‘tamoxifen’ with the keyword ‘breast cancer’ as an input. BEST has its own scoring system for entities, which is very useful in ranking gene-miRNA pairs with respect to the user-provided keywords. For example, there are keywords ‘breast cancer’ and entities ‘cell cycle’, ‘mir-200c’, ‘BRCA1’, and ‘ESR’. At the beginning, BEST compiles PubMed articles containing ‘breast cancer’ and the four entities in the abstract. Then, it measures the score and the rank for each entity and lists entities ordered by score. After compiling articles containing ‘BRCA1’ and ‘breast cancer’, BEST calculates a document score for each article and sums the score to measure the entity score, which is denoted as BEST(BreastCancer, BRCA1). In this paper, we use BEST to measure the relevance of each miRNA and mRNA for a given user query. Omics score The omics score (OS) is the probability of a gene-miRNA contributing to the class difference when expression data are analyzed. The OS is based on the general principle that differentially expressed miRNA targets genes differentially, resulting in negative correlations between genes and miRNA; then, differentially expressed gene explains the phenotype differences. Context-MMIA computes the omics score based on a strategy similar to MMIA. It measures miRNA differential scores, mRNA differential scores, and then correlation scores. The DEmiRNAs and DEmRNAs can be determined by MMIA. After the DEmRNAs and DEmiRNAs are determined, the probability of miRNA-mRNA contributing to the class difference is calculated. Let the p-values of miRNA and mRNA be [MATH: pmi :MATH] and [MATH: pgj :MATH] , respectively. For miRNA m[i], m[i]’s differential score diff(m[i]) is defined by [69]Eq 1, and its normalization diff[n](m[i]) is defined by [70]Eq 2. [MATH: diff(mi)=- log2(pmi) :MATH] (1) [MATH: diffn(mi )=diff(mi< /mi>)-min(diff)max(diff)-min(diff) :MATH] (2) The calculation of diff[n] for mRNA is similar to that of miRNA. The range of diff[n] is between 0 and 1 by [71]Eq 2. If miRNA is significantly differentially expressed in a given condition, then the value of diff[n] will be close to 1. Correlation score is defined by measuring the Pearson’s correlation coefficient of the miRNA-mRNA pair’s logarithmic expression as in [[72]22]. Context-MMIA considers only negatively correlated miRNA-mRNA pairs; thus, a negative value of the coefficient is defined as the correlation score as in [73]Eq 3. [MATH: corr(mi,gj )=-pearson_correlatio n(mi,gj) :MATH] (3) The omics score of miRNA-mRNA OS(m[i], g[j]) is defined in [74]Eq 4. [MATH: OS(mi,gj)=diffn(mi )*corr(mi,gj)*diffn(gj ) :MATH] (4) By definition, OS(m[i], g[j]) ∈ [0, 1]; thus, a value of OS close to 1 means that the miRNA and mRNA are both significantly differentially expressed and anticorrelated. Thus, we predict that the pair is related to the phenotype difference with a high confidence in terms of expression data. Context score We defined the context score (CS) to measure the probability of a miRNA-mRNA pair contributing to the phenotype difference in terms of the literature information. As described in the previous section, BEST estimates a score between predefined entities and keywords. We denoted the user-input keyword as k, which is context specified by the user (e.g., disease, gene, pathway, and so forth). As shown in [75]Eq 5, CS(m[i], g[j]|k) measures the significance of the m[i]-g[j] pair for k in terms of the literature information. [MATH: CS(mi,gj|k )=P(mi< /mi>|k)*P(gj|k) :MATH] (5) To compute P(m[i]|k), we used Bayes’ rule and transformed P(m[i]|k) into [76]Eq 6 because BEST only measures the score for predefined entities and does not support undefined keywords (e.g., broad keyword, new drug or pathway, and so on) [[77]23]. [MATH: P(mi|< /mo>k)=P< mi>n(k|mi)*Pn(mi)l=1pPn(k|ml< /msub>)*Pn(ml) :MATH] (6) By converting P(m[i]|k) using Bayes’ rule, our method provides the user with a freeform keyword environment, which allows the user to easily utilize our system even when the user is not familiar with biological terms. [MATH: P(k|mi)=lo g2(BEST(k,mi< /msub>)+1) :MATH] (7) The literature significance of miRNA (m[i]) for a given keyword k, P(k|m[i]), is computed as shown in [78]Eq 7. BEST(k, m[i]) is the score of m[i] for k computed by BEST, and we converted the scale of the score by taking the logarithm of the BEST score. For example, assume that the keyword ‘immune system’ and the miRNA ‘miR-155’ are used in an analysis. If the relation between ‘miR-155’ and ‘immune system’ is well studied, then P(immune system | miR155) and BEST(immune system, miR155) will have a high score. [MATH: P(mi)< /mo>=log2(BEST(< /mo>mi,mi)+1) :MATH] (8) [79]Eq 8 describes how to compute P(m[i]), which denotes how much literature information exists for m[i]; the more that papers report m[i], the higher the value it will have. After computing P(m[i]) and P(k|m[i]), normalization terms P[n](m[i]) and P[n](k|m[i]) are defined by the min-max normalization. P(m[i]|k) is computed using Bayes’ rule and specifies the significance of m[i] given the literature domain k, and the value of P(m[i]|k) has a correlation with the amount of studies, i.e., the number of papers about m[i] in domain k. For mRNA g[j], P(g[j]|k) is computed in a similar way, and we measured the significance of the m[i]-g[j] pair in k by computing CS(m[i], g[j]|k) using P(m[i]|k) and P(g[j]|k). Pair score The pair score of m[i], g[j] and k is denoted as Score(m[i], g[j], k), which is a confidence value of target prediction in terms of both expression and literature data. [MATH: Score(mi,gj< /msub>,k)=OS(mi, gj)*CS(mi,gj|k) :MATH] (9) [80]Eq 9 can be interpreted as a weighted omics score, where the weight is determined by a probability of a m[i], g[j] pair being true in terms of the user-provided context given keywords k. Results To evaluate Context-MMIA, we performed three experiments in comparison with four existing tools: MMIA, MAGIA2, CoSMic and GenMiR++. The three experiments were pathway analysis, reproducibility of validated miRNA targets in human, and sensitivity tests when different keywords were used for specifying the experimental context. We used 2-class microarray datasets containing miRNA and mRNA expression profiles in humans. [81]GSE21411 [[82]24], [83]GSE40059 [[84]25], and [85]GSE53482 [[86]26] from human disease studies were used. Each study reports experimentally validated miRNA and the correlated target mRNA pair, which was used to evaluate the miRNA target prediction methods in this section. A detailed description of each dataset is listed in [87]Table 1. Table 1. Dataset summary. Each GEO study comes with an experimentally validated miRNA-mRNA target (the second column) to affect their disease domain (the third column). Disease information was used to test performances when different contexts are specified. Data Experimentally validated target Disease [88]GSE21411 hsa-miR-23a—NEDD4L Interstitial Lung Diseases [89]GSE40059 hsa-miR-200c—CFL2 Breast Cancer [90]GSE53482 hsa-miR-155—JARID2 Primary Myelofibrosis [91]Open in a new tab [92]Table 1 summarizes the validated target pair and the domain of the experimental design in each dataset. In the interstitial lung diseases (ILD) study, it was reported that ZEB-1 affects the persistence of disease in ILD through suppression of NEDD4L by miR-23a. In the [93]GSE40059 breast cancer study, the authors investigated differences between aggressive breast cancer cell lines and less-aggressive cell lines and reported that CFL2 was up-regulated by miR-200c. The authors also reported that CFL2 expression was correlated with tumor grade. In the primary myelofibrosis (PMF) study, the authors revealed that overexpressed miR-155-5p regulates JARID2, and they suggested that regulated JARID2 may be related to MK hyperplasia in PMF. Disease information was used to test performances when different contexts are specified for Context-MMIA. It is necessary to choose keywords to specify contexts. ‘Interstitial lung disease’ and ‘primary myelofibrosis’ are too specific to use literature data; thus, we used the more general words ‘lung disease’ and ‘myelofibrosis’ as the keywords for Context-MMIA. Pathway analysis To evaluate the effectiveness of the approach used in Context-MMIA, we compared it with four expression-based methods: MMIA, MAGIA2, GenMiR++, and CoSMic. GenMiR++ computes probabilities for target pairs using an EM algorithm. MMIA extracts DEmiRNA to reduce the search space by a user-defined cutoff and finds negatively expressed target DEmRNAs. MAGIA2 provides several methods for the integrated analysis, and we chose Pearson’s correlation method from among these methods. After measuring the correlation, MAGIA2 calculates the false discovery rate (FDR) for each target. CoSMic extracts an mRNA cluster for each miRNA and computes the significance of a cluster using permutation tests. Likewise, each algorithm uses a different strategy to predict the miRNA target and to reduce the search space. We used these four algorithms to compare performances in terms of the predictive power. The methods compute confidence values for the predicted miRNA and mRNA targets, typically probability or p-value. We ranked the prediction results in terms of the confidence values. In the experiments, we used a p-value cutoff of 0.1 for Context-MMIA. For MMIA, a p-value of 0.05 was used for both DEmiRNA and DEmRNA selection. For the performance evaluation, we used the top 200 predicted miRNA-mRNA pairs predicted by each method. Then, we mapped genes included in the interacting pairs to human pathways using DAVID [[94]27, [95]28] to determine which pathways were significantly enriched. Among these pathways, we carefully selected pathways that are most likely related to the disease through the literature study as shown in [96]Table 1. We set evaluation criteria as how these literature-guided pathways were predicted by each method. [97]Table 2 shows the ratios of the number of genes that are mapped to significantly enriched pathways to the number of genes included in the top 200 miRNA-target edges. The number of genes is less than 200 because the same gene was multiply targeted, e.g., miR-200c-BRCA1 and miR-23a-BRCA1. Table 2. The ratio of the mapped genes and the number of the genes in the top 200 miRNA-target pairs. From each method, we extracted the top 200 target pairs using each method and performed pathway analysis using DAVID. The numerator is the number of genes mapped to the enriched pathways, and the denominator is the genes in the top 200 edges. The ratio of Context-MMIA is the largest for each dataset. Methods [98]GSE21411 [99]GSE40059 [100]GSE53482 Context-MMIA 37 / 79 45 / 157 42 / 127 MMIA 12 / 157 20 / 179 11 / 124 GenMiR++ 0 / 194 18 / 197 26 / 200 MAGIA2 18 / 182 12 / 191 19 / 193 CoSMic 24 / 196 9 / 195 X [101]Open in a new tab As shown in [102]Table 2, the number of genes mapped to the significantly enriched pathways is quite different for each method even though the number of genes does not considerably differ for each method. In terms of the ratio of mapped genes to predicted genes, Context-MMIA outperforms the existing methods 2 to 4 times. A gene set in a pathway means that genes have similar biological functions in terms of regulating molecular processes. Thus, the ratios in [103]Table 2 indicate that Context-MMIA produces more functionally coherent gene sets. [104]Table 3 lists pathways related to ‘breast cancer’ and enriched pathways predicted by each method for the [105]GSE40059 dataset. The enriched pathway analysis for the data from all three experiments is presented in [106]S1 File. The circles in [107]Table 3 mean an enriched pathway when DAVID pathway analysis was performed by using genes in the top 200 edges. For example, if the ECM-receptor interaction is enriched in the Context-MMIA and GenMiR++ results, circles are marked in the context column and the second column for the corresponding tools. As shown in [108]Table 3, more pathways related to ‘breast cancer’ were enriched in the gene sets produced by Context-MMIA than in the gene sets produced by the competing methods. In addition, several important pathways were enriched only in Context-MMIA. For example, it is well known that approximately half of breast tumors have stronger MAP kinase activity than the surrounding benign tissues [[109]32]. Inflammation plays a pivotal role in tumor initiation, promotion, angiogenesis and metastasis. Cytokines are important in all the phenomena, and it has been reported that cytokines participate in regulating both induction and protection in breast cancer [[110]33]. In addition, many studies have reported that TGF-beta signaling is critically important in the regulation of breast cancer [[111]38]. High focal adhesion kinase expression is known to be related to aggressive breast cancer phenotypes [[112]47]. Furthermore, cell adhesion molecules (CAMs) have a strong relationship with the process of metastasis, which is an important feature in predicting breast cancer prognosis [[113]42]. Moreover, a study revealed that activated leukocyte cell adhesion molecule (ALCAM) expression has a correlation with clinical outcomes such as grade, TNM stage, and NPI [[114]48]. Table 3. Enriched pathway analysis on [115]GSE40059 breast cancer data. Breast-cancer-related pathways are selected by the literature search. A circle in a cell means that the pathway is enriched by the gene set predicted by each method (A: Context-MMIA, B: MMIA, C: GenMiR++, D: MAGIA2, and E: CoSMic). More pathways are enriched by the gene set in the Context-MMIA result. Breast-Cancer-Related Pathway A B C D E Purine metabolism [[116]29] O Pyrimidine metabolism [[117]30] O ABC transporters [[118]31] O MAPK signaling pathway [[119]32] O Cytokine-cytokine receptor interaction [[120]33] O Neuroactive ligand-receptor interaction [[121]34] O p53 signaling pathway [[122]35] O O Apoptosis [[123]36] O Notch signaling pathway [[124]37] O TGF-beta signaling pathway [[125]38] O Axon guidance [[126]39] O Focal adhesion [[127]40] O O O ECM-receptor interaction [[128]41] O Cell adhesion molecules (CAMs) [[129]42] O O Adherens junction [[130]43] O Regulation of actin cytoskeleton [[131]44] O Glioma [[132]45] O Melanoma [[133]46] O [134]Open in a new tab Reproducibility of validated targets in humans [135]Table 4 shows the rankings of experimentally validated targets among the targets predicted by each method. Because Context-MMIA computes the context score using the literature data for given keywords, there is a possibility that the original papers of the datasets can affect the context score. Thus, we penalized the validated targets to compute P(k|m[i]) by excluding each paper when the BEST tool measures a score BEST(k, m[i]). Table 4. Reproducibility of validated targets. This table contains the rankings of validated target pairs in three datasets. The validated targets are listed in the second column of Table I. Context-MMIA outperformed existing tools in predicting the validated targets. MAGIA2 and CoSMic failed to reproduce the validated targets. Data [136]GSE21411 [137]GSE40059 [138]GSE53482 Context-MMIA 481 338 21 MMIA 1411 387 1465 GenMiR++ 8625 1673 95492 MAGIA2 X X X CoSMic X X X (Not Work) [139]Open in a new tab As shown in [140]Table 4, Context-MMIA outperformed the other expression-based methods even though the penalized score is used. MMIA took the second place in reproducing the validated targets, but it ranked validated targets much lower than Context-MMIA. Although not rejecting the validated targets, GenmiR++ ranked validated targets very low. This result shows that GenmiR++ produced too many false positives for the three datasets. MAGIA2 failed to identify the validated targets as positive target pairs in any datasets because none of the validated target pairs satisfied the statistical cutoff. CoSMic also failed to identify the validated target pairs for two datasets, [141]GSE21411 and [142]GSE40059. In addition, CoSMic did not run successfully for dataset [143]GSE53482 due to an input error issue. Many tools were not successful in reproducing validated targets, which can be an indication of false negatives. To further confirm the reproducibility of our algorithm, we investigated how many experimentally verified targets in humans are detected in the top 200 miRNA-mRNA pairs by each of the methods. Experimentally validated human miRNA-mRNA pairs were extracted from miRTarBase [[144]49], which curated experimentally validated miRNA-target interactions (MTI) by reporter assay, western blot, microarray, and next-generation sequencing experiments. We used human functional MTIs with strong evidence for functionality in humans as true interacting pairs. [145]Table 5 summarizes the number of validated targets in the top 200 miRNA-mRNA pairs predicted by each method. Table 5. Detection of human-specific validated targets. This table contains the number of validated target pairs in three datasets. The validated targets are extracted from miRTarBase target pairs filtered by human functional miRNA target interaction (MTI). Data [146]GSE21411 [147]GSE40059 [148]GSE53482 Context-MMIA 27 38 24 MMIA 5 4 12 GenMiR++ 3 4 3 MAGIA2 0 0 0 CoSMic 7 0 X (Not Work) [149]Open in a new tab As shown in 5, Context-MMIA predicted two to five times more validated targets compared to the existing methods. Context-MMIA predicted more than 10% of the experimentally validated MTIs in humans, with is a considerably higher prediction accuracy than existing methods; thus, we believe that Context-MMIA suggests good candidates for further experimental validation. Sensitivity tests when different keywords are used The performance of Context-MMIA depends on how the keywords to specify context are related to the goal of the experiment. In addition to disease-related keywords, we performed experiments using less-relevant keywords such as insulin resistance, influenzas, HIV and hepatocellular carcinoma. The results of Context-MMIA using less-relevant keywords are presented in [150]Table 6. The relevant keywords for the three datasets are listed in the third column of [151]Table 1. As shown in [152]Table 6, the rankings of the validated pairs were considerably higher when the keywords that reflect experimental designs were used. This result indicates that our method is able to reflect the degree of relevance to the experimental design and capture the different miRNA-mRNA pairs when different keywords were used. In summary, the experiments with irrelevant keywords showed that our method can capture the miRNA-mRNA pairs, reflecting the user-specified biological context. Table 6. Sensitivity tests when different keywords are used. Rankings of validated targets are shown when different keywords are used. The validated targets had high ranks when disease-related keywords were used. Keyword [153]GSE21411 [154]GSE40059 [155]GSE53482 Correct keyword 481 338 21 Insulin resistance 12479 2036 4250 Influenzas 6826 1169 1623 HIV 5865 4002 3238 Hepatocellular carcinoma 5278 3265 7180 [156]Open in a new tab Conclusion We presented Context-MMIA, a human-specific miRNA-mRNA target pair prediction system that utilizes both expression profiles and the literature information from the user-specified experimental design goals. A major contribution of our system is that we handled the false positives and false negatives, which are an inherent issue in expression-based prediction tools, by incorporating the user-specified context information from the literature. Analyses on three independent human datasets showed that Context-MMIA can capture the true positive miRNA-mRNA target pairs that are specific to a biological context. Context-MMIA outperformed existing tools in a series of experiments, such as pathway analysis, validated target ranking, and irrelevant keyword experiments. We emphasize that computational predictions of miRNA-mRNA target pairs should be further validated in biological experiments and that our system is intended to provide good candidates for experimental validation. Context-MMIA is available at [157]http://biohealth.snu.ac.kr/software/contextMMIA Supporting information S1 File. Pathway analysis results. S1 File contains pathway results for the other two datasets. (PDF) [158]Click here for additional data file.^ (128.6KB, pdf) Data Availability GSE21411, GSE40059 and GSE53482 are available from the GEO database(accession numbers GSE21411, GSE40059, GSE53482). Funding Statement This work was supported by grant numbers 2012M3A9D1054622, 2014M3C9A3063541, and 2012M3C4A7033341, National Research Foundation of Korea (URL: [159]http://www.nrf.re.kr/nrf_tot_cms/index.jsp?pmi-sso-return2=none). The authors who received the funding are: Minsik, Sungmin, Ji Hwan, Heejoon, Sunwon, Jaewoo, Sun. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. References