Abstract

Motivation

   microRNAs (miRNAs) play crucial roles in post-transcriptional gene
   regulation of both plants and mammals, and dysfunctions of miRNAs are
   often associated with tumorigenesis and development through the effects
   on their target messenger RNAs (mRNAs). Identifying miRNA functions is
   critical for understanding cancer mechanisms and determining the
   efficacy of drugs. Computational methods analyzing high-throughput data
   offer great assistance in understanding the diverse and complex
   relationships between miRNAs and mRNAs. However, most of the existing
   methods do not fully utilise the available knowledge in biology to
   reduce the uncertainty in the modeling process. Therefore it is
   desirable to develop a method that can seamlessly integrate existing
   biological knowledge and high-throughput data into the process of
   discovering miRNA regulation mechanisms.

Results

   In this article we present an integrative framework, CIDER (Causal
   miRNA target Discovery with Expression profile and Regulatory
   knowledge), to predict miRNA targets. CIDER is able to utilise a
   variety of gene regulation knowledge, including transcriptional and
   post-transcriptional knowledge, and to exploit gene expression data for
   the discovery of miRNA-mRNA regulatory relationships. The benefits of
   our framework are demonstrated by both a simulation study and the analysis
   of the epithelial-to-mesenchymal transition (EMT) and the breast cancer
   (BRCA) datasets. Our results reveal that even a limited amount of
   either Transcription Factor (TF)-miRNA or miRNA-mRNA regulatory
   knowledge improves the performance of miRNA target prediction, and the
   combination of the two types of knowledge enhances the improvement
   further. Another useful property of the framework is that its
   performance increases monotonically with the increase of regulatory
   knowledge.

Introduction

   miRNAs are short non-protein coding RNAs that regulate gene expression
   by either marking their target mRNAs for degradation or repressing
   translation. miRNAs mainly identify their target mRNAs by binding to
   the 3’-untranslated region (3’ UTR) or 5’ UTR. Studies have shown that
   miRNAs play important roles in a broad range of biological processes,
   such as differentiation [[32]1], development [[33]2], apoptosis [[34]3]
   and cellular signaling [[35]4]. Because of their biological importance,
   miRNAs are related to a variety of diseases, such as cancer and
   cardiovascular diseases [[36]5]. Therefore, precise identification of
   miRNA targets is critical to the understanding of the functions of
   miRNAs in both healthy and diseased tissues [[37]6, [38]7].

   Computational approaches are a necessary and promising way to help
   unveil the complete picture of miRNA regulatory relationships.
   Significant progress has been made in elucidating the relationships
   between miRNAs and their targets using wet-lab biological experiments
   [[39]8–[40]11]. However, it is unrealistic to hope for a complete
   picture of miRNA regulation mechanisms by relying solely on wet-lab
   experiments due to the huge number of possible relationships and high
   expenses of the experiments [[41]12]. Therefore, dry-lab approaches
   have been considered as a cost-effective and promising alternative and
   have shown great promise in identifying putative miRNA targets
   [[42]13–[43]16].

   Because of the large number of miRNAs and mRNAs involved in gene
   regulation, providing reliable predictions has always been a
   significant challenge for computational biology approaches. This
   problem is further exacerbated by the small number of available
   samples. Therefore researchers have to rely on the integration of
   biological knowledge and data-driven discovery processes to obtain a
   complete understanding of miRNA regulation mechanisms.

   Bayesian networks (BNs) [[44]17–[45]22] provide an excellent platform
   for seamless integration of prior knowledge and data in the process of
   causal structure learning. Furthermore, the causal semantics of a BN
   makes it a preferred model for representing gene regulatory networks
   since the interactions among genes are causal relationships rather than
   statistical associations.

   Valuable wet-lab validated knowledge cannot be effectively utilised
   with the existing methods [[46]23–[47]26]. These algorithms use prior
   knowledge to restrict their search space in the way that the knowledge
   is used to initialise the structure of a BN and the learning process is
   aimed at removing false positives from the initial structure
   [[48]27–[49]31]. Therefore the final structure is a sub-graph of the
   initial one and a miRNA-mRNA interaction will not be predicted if it is
   not included in the prior knowledge. Consequently such methods usually
   require users to have a large amount of knowledge which covers the
   complete or nearly complete knowledge of the network structure, and are
   not able to utilise the sparse and limited validated knowledge.

   In this paper, we propose the CIDER framework to effectively utilise
   sparse wet-lab validated knowledge, including transcriptional TF-miRNA
   and post-transcriptional miRNA-mRNA regulatory knowledge [[50]32]. Our
   method differs from the existing work in two respects. First, instead
   of using the regulatory knowledge to initialise the network structure
   and then removing false positive edges, we force the learning process
   to maintain the experimentally confirmed relationships without
   restricting the search space. Second, the regulatory knowledge is used
   to obtain a more accurate estimate of the causal effect of miRNAs on
   mRNAs, whereas existing methods use prior knowledge to learn the causal
   regulatory structure.

   Our results on both real-world and simulated datasets demonstrate that
   a very small amount of validated regulatory knowledge improves the
   accuracy of predicted miRNA targets significantly, and the performance
   of CIDER increases monotonically with the increase of regulatory
   knowledge.

   We show that when wet-lab validated knowledge is analysed together with
   expression profiles, CIDER discovers significantly more validated miRNA
   targets than using expression profiles alone. It is also shown that
   either TF-miRNA or miRNA-mRNA regulatory knowledge improves the
   performance, and the combination of the two types of knowledge enhances
   the performance further.

   An important property of the framework is that the performance of miRNA
   target prediction improves monotonically with the amount of regulatory
   knowledge used. In other words CIDER makes more reliable discoveries
   from the data when the knowledge integrated into the framework
   increases. In [51]Fig 1, we illustrate a promising knowledge discovery
   process based on this property. With the incorporation of regulatory
   knowledge in CIDER, the process becomes a feedback loop for the
   discovery of new biological hypotheses and it naturally combines
   dry-lab predictions with wet-lab experiments.

Fig 1. An iterative process of integrating and discovering miRNA regulatory
relationships.


   Our proposed framework is one iteration of the above knowledge and data
   integrated discovery process. In the long run, wet-lab and dry-lab
   discoveries become an integrated feedback process for uncovering new
   biological insights. Bayesian network based causal reasoning provides
   an excellent platform for a seamless integration.

Materials

Matched expression profiles

NCI-60 data for Epithelial to Mesenchymal Transition (EMT)

   The EMT [[54]33] dataset includes the miRNA expression profiles for the
   NCI-60 panel cell lines from [[55]34], and the dataset is available at
   [56]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26375. The
   mRNA expression profiles for NCI-60 were downloaded from ArrayExpress
   available at [57]http://www.ebi.ac.uk/arrayexpress, accession number
   E-GEOD-5720. We use the cell lines categorized as epithelial (11
   samples) and mesenchymal (36 samples) in this study.

Data of the 51 human breast cancer cell lines (BRCA)

   The BRCA dataset includes miRNA expression profiles from the breast
   cancer cell lines data provided by [[58]35]. The mRNA expression
   profiles for these cell lines can be downloaded from
   [59]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41313. 27
   samples in the luminal group and 23 samples in the basal group are
   used.

Gene regulation databases

TF-miRNA interaction database

   For transcriptional regulatory knowledge, we use TransmiR [[60]36], a
   database of TF-miRNA regulatory relationships including approximately
   700 entries manually collected from the relevant literature. This
   database is
   available online at [61]http://www.cuilab.cn/transmir.

Experimentally validated miRNA-mRNA interaction databases

   The post-transcriptional regulatory knowledge is obtained from miRNA
   target databases Tarbase v6.0 [[62]37], miRTarbase v4.5 [[63]13] and
   miRWalk [[64]38]. Tarbase and miRTarbase contain experimentally
   confirmed miRNA target information manually collected from the related
   literature. miRWalk contains both predicted and validated miRNA
   targets, but we only utilise the experimentally validated targets in
   our experiments. The detailed information of experimentally validated
   miRNA-mRNA interactions retrieved from all these databases can be found
   in [65]S3 File.

Predicted miRNA-mRNA interaction database

   We also utilise TargetScan v6.2 [[66]39], a commonly used miRNA target
   prediction database. TargetScan predicts miRNA targets by searching for
   binding sites that match the seed region of each miRNA. This database
   is available online at [67]http://www.targetscan.org.

Methods

Notation

   Let $\mathcal{G} = (\mathbf{V}, \mathbf{E})$ denote a graph where
   V = {X[1], …, X[l]} is a set of vertices and E ⊆ V × V is a set of
   edges. In our framework, the vertex set V represents a set of random
   variables corresponding to the expression levels of miRNAs and mRNAs
   (including TF coding mRNAs), and the edges represent the causal
   relationships between the variables.

   We use X[i] → X[j] or X[i] ← X[j] to represent a directed edge between
   X[i] and X[j]. X[i] − X[j] is used to represent an undirected edge
   between X[i] and X[j]. The set of all parent nodes of X[j] is denoted
   as pa[j]. A directed graph is a graph in which all edges are directed.
   An undirected graph is a graph in which all edges are undirected. We
   say that a graph $\mathcal{G}$ is acyclic if and only if its directed
   edges do not form any cycle in $\mathcal{G}$. In this article, we
   always assume the graph is acyclic.

The proposed CIDER framework

   As illustrated in [68]Fig 2, the CIDER framework consists of three
   steps. In the first step we perform differential gene expression
   analysis and query the databases for gene regulation knowledge. To
   identify the targets of a miRNA, we use do-calculus [[69]18] to
   estimate the causal effects the miRNA has on all the mRNAs. In other
   words, do-calculus estimates how the expression values of the mRNAs
   change when the expression of the miRNA is intervened on [[70]41]. In
   order to apply do-calculus, we need to know the causal relationships
   between the variables. Therefore in Step 2 we construct the causal
   structure with the incorporation of regulatory knowledge, then we
   identify the miRNA targets using do-calculus in Step 3.

Fig 2. The proposed CIDER framework. First the differentially expressed
miRNAs and mRNAs are selected in the expression profiles [[71]40], then we
query the regulatory databases for gene regulation knowledge.


   After that we build the causal structure according to the expression
   profiles and the knowledge, followed by the causal inference to
   identify miRNA-mRNA interaction pairs.

Step 1 (Data preparation)

   The differential expression analysis is performed as described in
   [[74]42]. As a result for the EMT dataset, 35 miRNA probes and 1154
   probes of mRNAs are identified as significantly differentially
   expressed. For the BRCA dataset, 92 miRNA probes and 1500 mRNA probes
   are identified. The detailed result can be found in [75]S1 File.

   After differential expression analysis, we extract the regulatory
   knowledge (i.e. TF-miRNA and miRNA-mRNA interactions) relevant to the
   differentially expressed miRNAs and mRNAs from the regulatory
   knowledge databases described previously.

Step 2 (Causal structure construction)

   Using both the gene regulation knowledge and gene expression data, we
   learn a causal Bayesian network (CBN) which models the structure of the
   gene regulatory network. A CBN consists of a pair
   $\langle \mathcal{G}, P \rangle$, where $\mathcal{G}$ is a directed
   acyclic graph with the differentially expressed miRNAs and mRNAs as its
   vertices, and P is the joint probability function of the vertices. An
   edge in $\mathcal{G}$ indicates a causal relationship between the two
   vertices. For example, an edge directed from a miRNA to a mRNA means
   that the miRNA regulates the mRNA; and an edge directed from a TF
   coding mRNA to a miRNA indicates that the TF regulates the miRNA.

   A common way to learn the causal structure is to start from a complete
   graph, then update the graph according to the gene expression data. In
   order to integrate the regulatory knowledge, in CIDER we label all the
   edges given in the regulatory knowledge as constant edges, which are
   never to be removed or altered (in terms of their directions) during
   the entire structure construction step.

Step 3 (Causal effect estimation)

   We estimate the causal effect that each miRNA has on all the mRNAs
   according to the causal structure and expression profiles. The causal
   effect measures how the expression level of a mRNA changes when the
   expression level of a given miRNA changes. For each miRNA, we choose
   the mRNAs with the largest causal effects as the predicted targets.

   In the rest of this section, we discuss the details and intuitions of
   Step 2 and Step 3.

Causal structure construction

   There are two steps involved in constructing a causal structure:
   determining the existence of edges between the nodes, and orienting the
   direction of the edges.

   A common way [[76]43, [77]44] to determine whether an edge exists
   between two nodes is to use conditional independence (CI) tests. More
   specifically, starting from a fully connected graph, we use CI tests to
   determine the dependency between all connected node pairs. If two
   nodes become independent when conditioned on some subset of their
   neighbours, the edge between them is removed from the graph. Otherwise,
   the edge remains in the causal structure.
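
   The edge-removal test described above can be illustrated with a small
   sketch. The source does not name the CI test it uses; the example
   below assumes a partial-correlation test with Fisher's z-transform,
   which is a common choice for Gaussian data. The function names and the
   toy chain A → B → C are hypothetical and not taken from the paper.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(data, i, j, cond):
    """Partial correlation of variables i and j given the tuple cond,
    computed with the standard recursive formula."""
    if not cond:
        return pearson(data[i], data[j])
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


def ci_pvalue(data, i, j, cond=()):
    """Two-sided p-value of the Fisher z-test (assumed test, see lead-in)
    for the hypothesis that i and j are independent given cond."""
    n = len(data[i])
    r = partial_corr(data, i, j, tuple(cond))
    r = max(min(r, 0.999999), -0.999999)  # keep the logarithm finite
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))


# Toy chain A -> B -> C (hypothetical data, not from the paper).
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(500)]
b = [0.8 * v + rng.gauss(0, 1) for v in a]
c = [0.8 * v + rng.gauss(0, 1) for v in b]
data = {"A": a, "B": b, "C": c}

p_marginal = ci_pvalue(data, "A", "C")         # A and C are marginally dependent
p_given_b = ci_pvalue(data, "A", "C", ("B",))  # B screens off A from C in this model
print(p_marginal, p_given_b)
```

   In this toy model the marginal test rejects independence of A and C,
   while conditioning on B removes the association in expectation, so an
   edge A − C would be a candidate for removal.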

   During this procedure, edges may be incorrectly removed or maintained.
   Because the number of available samples is small compared with the
   large number of variables in expression profiles, CI tests may declare
   two nodes independent even if a dependency exists, so the edge between
   them is removed incorrectly. Furthermore, the incorrectly removed edges
   will not appear in the conditioning sets of later CI tests, which may
   lead to false positives (i.e. two nodes would have been found
   independent, and their edge removed, if the incorrectly removed edges
   had been kept and included in the conditioning set of the CI test).

   In order to determine the orientation of edges we need to identify the
   v-structures in the causal structure defined as follows:

   Definition 1 ([[78]18]) A triple (X[i], X[j], X[k]) forms a v-structure
   in graph $\mathcal{G}$ if and only if it satisfies both of the
   following conditions:
    1. X[i] and X[j] as well as X[j] and X[k] are adjacent, X[i] and X[k]
       are not adjacent,
    2. X[i] and X[k] are not independent when conditioned on X[j].

   When a v-structure (X[i], X[j], X[k]) is identified, the edges can then
   be oriented as X[i] → X[j] ← X[k] [[79]18]. After all the v-structures
   have been identified and oriented, we can orient the remaining edges
   according to the principle of avoiding the creation of cycles and new
   v-structures [[80]44].
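
   As a rough illustration of the v-structure search, the sketch below
   scans all unshielded triples of an undirected skeleton. It assumes,
   as constraint-based algorithms commonly do, that condition 2 of
   Definition 1 is checked through the separating set recorded when the
   edge between X[i] and X[k] was removed: if X[j] is not in that set,
   the triple is treated as a v-structure. The function name and the toy
   node names are hypothetical.

```python
from itertools import combinations


def find_v_structures(adjacency, sepset):
    """Return triples (i, j, k) to be oriented as i -> j <- k.

    adjacency: dict mapping each node to the set of its neighbours.
    sepset: dict mapping frozenset({i, k}) to the conditioning set that
            made i and k independent (assumed bookkeeping, see lead-in).
    """
    found = []
    for j, neighbours in adjacency.items():
        for i, k in combinations(sorted(neighbours), 2):
            if k in adjacency[i]:
                continue  # i and k are adjacent, so the triple is shielded
            if j not in sepset.get(frozenset((i, k)), set()):
                found.append((i, j, k))
    return found


# Toy skeleton: miR-A - gene-X - miR-B - gene-Y (hypothetical names).
adjacency = {
    "miR-A": {"gene-X"},
    "miR-B": {"gene-X", "gene-Y"},
    "gene-X": {"miR-A", "miR-B"},
    "gene-Y": {"miR-B"},
}
sepset = {
    frozenset(("miR-A", "miR-B")): set(),        # independent marginally
    frozenset(("gene-X", "gene-Y")): {"miR-B"},  # independent given miR-B
}
v_structures = find_v_structures(adjacency, sepset)
print(v_structures)  # [('miR-A', 'gene-X', 'miR-B')]
```

   Only the triple centred on gene-X qualifies here; the triple centred
   on miR-B does not, because miR-B belongs to the separating set of its
   two neighbours.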

   Unfortunately, under most circumstances the above strategy can only
   orient some of the edges, leaving many undirected. Undirected edges
   introduce uncertainty in the next step, since the causal effects have
   to be estimated under all possible orientations of the undirected edges
   and the lower bounds taken as the inferred causal effects [[81]31].

   In our framework, we utilise regulatory knowledge to alleviate both the
   false edges and the undirected edges problems. We introduce the concept
   of constant edge. A constant edge is an edge between two nodes whose
   relationship has already been validated via biological experiments, so
   the edge is never removed regardless of the results of the CI tests,
   and the direction of the edge can be correctly determined according to
   the knowledge. The following example illustrates the benefits of
   introducing constant edges.

   With the introduction of constant edges, we are able to recover
   incorrectly removed edges and also remove some falsely discovered
   edges. [82]Fig 3A shows a causal structure learned with CI tests only,
   which includes one falsely identified regulatory relationship (miR-200a
   to miR-200b) and two missed regulatory relationships (miR-200a to ZEB1
   and miR-200b to ZEB1). Since it has been experimentally confirmed
   that ZEB1 is a target of miR-200a, we mark the edge from miR-200a to
   ZEB1 as a constant edge and do not remove it when using CI tests (see
   [83]Fig 3B). Because of the introduction of the edge from miR-200a to
   ZEB1, the falsely discovered edge from miR-200a to miR-200b is removed
   (see [84]Fig 3C) as the result of the conditional independence test
   with ZEB1 being added to the conditioning set.

Fig 3. An illustration of how the prior knowledge helps the causal
structure construction.


   Solid/dashed black lines indicate the edges correctly/incorrectly
   detected during the causal structure construction without the prior
   knowledge; Dotted brown lines indicate the edges added based on prior
   knowledge.

   Constant edges can also help to orient more undirected edges. For
   example, in [87]Fig 3C, although we have removed the false edge between
   miR-200a and miR-200b, the directions of the two edges (miR-200b/QKI
   and miR-429/ZEB1) still cannot be determined. However, when we have
   another constant edge from the regulatory knowledge stating that
   miR-200b regulates ZEB1, we can orient the two edges as in [88]Fig 3D;
   otherwise a new v-structure (at ZEB1) or a cycle (miR-200b → ZEB1 →
   miR-429 → QKI → miR-200b) would be introduced, neither of which is
   allowed under the orientation rules and the acyclicity assumption
   [[89]43].

   As shown above, even when only one or two constant edges are introduced,
   the uncertainty in the causal structure can be significantly reduced.
   We briefly summarise the procedure of constructing the causal structure
   in Algorithm 1 (The details of the algorithm can be found in [90]S6
   File).

   Algorithm 1 Construct the causal structure $\mathcal{G}$

   Input: Gene expression profile, regulatory knowledge matrix.

   Output: Constructed causal structure $\mathcal{G}$

    Initialise $\mathcal{G}$ as a fully connected graph

    //Mark constant edges

    Mark all constant edges in $\mathcal{G}$ according to the regulatory
   knowledge matrix.

    //Remove edges from $\mathcal{G}$ using CI tests

    Test conditional dependence among non-constant edges, remove an edge
   between two vertices if they are found independent.

    //Orient constant edges

    Orient constant edges according to regulatory knowledge

    //Orient remaining edges

    Identify and orient all v-structures

    Orient remaining edges without creating new v-structures or cycles

   return $\mathcal{G}$
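
   The first half of Algorithm 1 (marking constant edges, then pruning
   the remaining edges with CI tests) might be sketched as follows. This
   is a simplified illustration rather than the authors' implementation,
   which is described in the S6 File. The CI test is passed in as a
   callable, and `build_skeleton`, `max_cond` and the always-independent
   toy test are assumptions made for the example.

```python
from itertools import combinations


def build_skeleton(nodes, constant_edges, is_independent, max_cond=2):
    """Prune a fully connected graph with CI tests, never touching constant edges.

    constant_edges: set of (regulator, target) pairs from regulatory knowledge.
    is_independent: callable (x, y, cond) -> bool, the CI test decision.
    Returns the undirected adjacency and the directed constant edges.
    """
    adj = {v: set(nodes) - {v} for v in nodes}
    protected = {frozenset(edge) for edge in constant_edges}

    def separated(x, y, size):
        # Try conditioning sets drawn from the neighbours of x, then of y.
        for side in (adj[x] - {y}, adj[y] - {x}):
            for cond in combinations(sorted(side), size):
                if is_independent(x, y, cond):
                    return True
        return False

    for size in range(max_cond + 1):
        for x, y in combinations(nodes, 2):
            if y not in adj[x] or frozenset((x, y)) in protected:
                continue  # already removed, or a constant edge
            if separated(x, y, size):
                adj[x].discard(y)
                adj[y].discard(x)
    return adj, set(constant_edges)


# Toy run: a CI test with no power that declares every pair independent.
nodes = ["miR-200a", "miR-200b", "ZEB1"]
constant = {("miR-200a", "ZEB1")}  # validated interaction mentioned in the text
adj, directed = build_skeleton(nodes, constant, lambda x, y, cond: True)
print(adj)
print(directed)
```

   Even with a test that removes everything it is allowed to remove, the
   constant edge between miR-200a and ZEB1 survives and keeps its known
   direction, which is the behaviour the constant-edge concept is meant
   to guarantee.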

Causal Effect Estimation

   With the expression data and the causal structure among its variables,
   we need to infer the causal effects that a miRNA has on a mRNA. By
   assuming all variables in the expression profiles follow the
   multivariate Gaussian distribution, we can calculate the causal effects
   as follows:

   Theorem 1 ([[91]45]) Let X[1], …, X[p], X[p + 1], …, X[p + q] be
   jointly normally distributed. The causal effect of X[i] (i = 1, …, p)
   on X[j] (j = p + 1, …, p + q), ce(X[i], X[j]), can be calculated as:
   \[
   ce(X_i, X_j) = \beta_{ij \mid pa_j} =
   \begin{cases}
   0, & X_j \in pa_i \\
   \beta_{ij} \text{ in } X_j \sim \beta_{ij} X_i + pa_j, & X_j \notin pa_i
   \end{cases}
   \tag{1}
   \]

   where X[j] ∼ β[ij] X[i] + pa[j] is the shorthand for the linear
   regression of X[j] on X[i] and pa[j], and β[ij] is the coefficient for
   X[i] in the regression.
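
   The regression in Theorem 1 can be made concrete with a short sketch.
   It fits an ordinary least squares model with an intercept and returns
   the coefficient of the candidate cause. The adjustment set is passed
   explicitly as `covariates`, so the caller decides which parent set to
   plug in; the helper names and the noiseless toy data are hypothetical
   and chosen only so that the expected coefficient is known.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (a is a small square matrix)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[r][n] / m[r][r] for r in range(n)]


def regression_coefficient(y, x, covariates=()):
    """Coefficient of x in the least squares regression of y on
    an intercept, x and the given covariates (normal equations)."""
    cols = [[1.0] * len(y), list(x)] + [list(c) for c in covariates]
    xtx = [[sum(p * q for p, q in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(p * q for p, q in zip(c, y)) for c in cols]
    return solve(xtx, xty)[1]


# Noiseless toy data (hypothetical): mrna = 1 - 0.5 * mirna + 0.3 * tf.
tf = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
mirna = [1.0, 0.0, 2.0, 1.0, 3.0, 1.0]
mrna = [1.0 - 0.5 * m + 0.3 * t for m, t in zip(mirna, tf)]

effect = regression_coefficient(mrna, mirna, covariates=[tf])
print(round(effect, 6))  # -0.5
```

   A negative coefficient, as in this toy case, corresponds to the miRNA
   repressing the mRNA once the adjustment variable is accounted for.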

   Given the above theorem, we are able to estimate the regulatory effect
   of each miRNA on all mRNAs in a dataset, and use the mRNAs with top
   ranked causal effects as the targets of the corresponding miRNA. Note
   that because the available regulatory knowledge is very sparse, some
   edges in the causal structure may still remain undirected. Therefore we
   estimate the causal effect under every possible orientation of these
   edges and use the estimate with the minimum absolute value as a lower
   bound of the causal effect. We briefly summarise this procedure in
   Algorithm 2.
   For more details, please refer to the [92]S6 File.

   Algorithm 2 Causal effects estimation

   Input: Gene expression data X[s × n], causal structure $\mathcal{G}$.

   Output: Causal effects matrix C where C(i, j) is the causal effect of
   miRNA[i] on mRNA[j].

    Initialize C as a zero matrix

    for All pairs of miRNA[i] and mRNA[j] do

     for All possible orientations of $\mathcal{G}$ do

      Calculate the causal effect with Theorem 1

     end for

     Let C(i, j) be the causal effect with lowest absolute value

    end for

    return C
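
   The aggregation step of Algorithm 2 can be sketched as below, assuming
   the per-orientation causal effects have already been computed. The
   function names, the miRNA and gene labels and the numbers are all
   hypothetical; only the rule of keeping the estimate with the lowest
   absolute value, and ranking targets by it, comes from the text.

```python
def lower_bound_effects(effects_by_pair):
    """For each (miRNA, mRNA) pair keep the estimate with the smallest
    absolute value among all orientations of the graph."""
    return {pair: min(values, key=abs) for pair, values in effects_by_pair.items()}


def top_targets(effects, mirna, k):
    """Rank the mRNAs of one miRNA by the absolute bounded effect."""
    scored = [(mrna, ce) for (m, mrna), ce in effects.items() if m == mirna]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return [mrna for mrna, _ in scored[:k]]


# Hypothetical estimates, one value per admissible orientation.
effects_by_pair = {
    ("miR-a", "gene-1"): [-0.8, -0.3, -0.5],
    ("miR-a", "gene-2"): [0.2, 0.0],
    ("miR-a", "gene-3"): [0.6, 0.7],
}
bounded = lower_bound_effects(effects_by_pair)
ranking = top_targets(bounded, "miR-a", 2)
print(bounded)
print(ranking)  # ['gene-3', 'gene-1']
```

   Taking the minimum absolute value is conservative: a pair is ranked
   highly only if its effect is large under every orientation considered.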

Evaluation methods

   Evaluating miRNA target prediction methods is not an easy task. This is
   mainly because the current understanding of miRNA regulation mechanisms
   is still limited and experimentally validated target databases only
   contain information about frequently studied miRNAs. Therefore to
   evaluate the effectiveness of the CIDER framework, we use a number of
   different evaluation approaches described in the following:
    1. We compare the predicted results to wet-lab validated miRNA target
       databases. Since CIDER needs access to regulatory knowledge, we
       reserve a part of the known regulatory relationships as the ground
       truth for evaluation. Specifically, when studying the performance
       of CIDER using TF-miRNA regulatory knowledge, we utilise the
       TF-miRNA interactions retrieved from TransmiR as the prior
       knowledge in constructing the causal structure and reserve the
       miRNA-mRNA interactions obtained from the miRNA target databases as
       the ground truth; when studying the effect of miRNA-mRNA regulatory
       knowledge, we utilise miRNA-mRNA interactions retrieved from
       TargetScan for causal structure construction and reserve the
       miRNA-mRNA interactions obtained from the experimentally validated
       miRNA target databases as the ground truth. In addition, if an
       interaction appears in both the prior knowledge and the ground
       truth, we remove this entry from the knowledge and only use it for
       evaluation.
    2. We compare the predicted targets to the results of miRNA
       transfection experiments. miRNA transfection is a technique that
       actively transfects a particular miRNA into cells. By comparing
       the transfected expression profile to the control sample (same
       cell but without miRNA transfection), differences in mRNA
       expression level can be measured, and mRNAs with top ranked
       logarithm fold change values can be considered as ground-truth
       miRNA targets [[93]46].
    3. We use gene pathway enrichment tools to analyse the functionality
       of predicted miRNA targets. It is often hypothesized that the
       predicted miRNA targets based on the expression profile should be
       closely related to the biological condition of the expression
       profiles. For example, the mRNAs targeted by miRNAs in the EMT
       dataset should be closely related to the epithelial to mesenchymal
       transition process. Therefore pathway functional analysis can be
       used to demonstrate the effectiveness of miRNA target prediction
       methods.
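
   The hold-out rule in the first evaluation approach can be sketched in
   a few lines: interactions that appear in both the prior knowledge and
   the ground truth are dropped from the knowledge, and predictions are
   then counted against the ground truth. The function names and the
   interaction pairs below are hypothetical.

```python
def remove_overlap(prior_knowledge, ground_truth):
    """Drop from the prior knowledge every interaction reserved for evaluation."""
    return set(prior_knowledge) - set(ground_truth)


def count_validated(predictions, ground_truth):
    """Count predicted (miRNA, mRNA) pairs that appear in the ground truth.

    predictions: dict mapping a miRNA to its list of predicted target mRNAs.
    """
    return sum(
        1
        for mirna, targets in predictions.items()
        for target in targets
        if (mirna, target) in ground_truth
    )


# Hypothetical interaction sets.
prior = {("miR-a", "gene-1"), ("miR-a", "gene-2")}
truth = {("miR-a", "gene-2"), ("miR-a", "gene-3"), ("miR-b", "gene-4")}

usable_prior = remove_overlap(prior, truth)
predictions = {"miR-a": ["gene-3", "gene-5"], "miR-b": ["gene-4"]}
hits = count_validated(predictions, truth)
print(usable_prior, hits)
```

   Removing the overlap first prevents an interaction supplied as a
   constant edge from also being counted as a successful prediction.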

   The above evaluations are used to demonstrate the effectiveness of
   CIDER for finding biologically relevant miRNA targets. To further
   demonstrate the performance of CIDER when used with different amounts
   of regulatory knowledge, we use the following simulation.

   We simulate a gene regulatory network and the corresponding gene
   expression profiles based on the linear structural equation model
   [[94]47]. First we construct a directed graph where each node
   represents a miRNA or mRNA (including TF coding mRNA) in the regulatory
   network and the direction of an edge indicates that the parent node
   regulates the child node. Then we assign to each edge a weight w[i]
   ($w_i \sim \mathcal{U}([-1, -0.1] \cup [0.1, 1])$) which measures the
   amount of regulatory effect that the parent node has on the child node.
   Starting from the nodes without parents, we generate the expression
   value for each node following a Gaussian distribution, with a
   non-Gaussian error term added. Specifically the expression value of
   each gene is defined as follows:
   \[
   x_i = b_i + \sum_{j \in pa(x_i)} w_j \cdot x_j + \epsilon_i, \tag{2}
   \]

   where pa(x[i]) denotes the parent nodes of x[i], w[j] ⋅ x[j] is the
   regulatory effect that the j-th node has on the i-th one, ϵ[i]
   represents the non-Gaussian error term of the i-th node, and b[i]
   represents the intercept term. To alleviate the effect of randomness in
   the simulated data, 50 networks in total (each with approximately 1000
   nodes) are generated and the average results over these 50 networks
   are reported. For each network we generate two sets
   of expression profiles, containing 250 and 500 samples, respectively.
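
   A scaled-down version of this simulation might look like the sketch
   below. The edge weights follow the distribution stated in the text.
   Everything else is an assumption made for illustration: the random
   graph is made acyclic by only allowing edges from lower to higher node
   indices, the intercepts are drawn uniformly from [-1, 1], the
   non-Gaussian error is uniform on [-0.5, 0.5], and weights are stored
   per edge rather than per parent.

```python
import random


def sample_weight(rng):
    """Draw a weight from U([-1, -0.1] U [0.1, 1])."""
    return rng.choice((-1, 1)) * rng.uniform(0.1, 1.0)


def simulate(n_nodes, n_samples, edge_prob, seed=0):
    """Generate a random DAG and expression samples from Equation (2)."""
    rng = random.Random(seed)
    # Edges only go from a lower index to a higher one, so the graph is acyclic.
    parents = {
        i: [j for j in range(i) if rng.random() < edge_prob] for i in range(n_nodes)
    }
    weights = {(j, i): sample_weight(rng) for i in parents for j in parents[i]}
    intercepts = [rng.uniform(-1.0, 1.0) for _ in range(n_nodes)]  # assumed range

    samples = []
    for _ in range(n_samples):
        x = [0.0] * n_nodes
        for i in range(n_nodes):  # index order is a topological order
            noise = rng.uniform(-0.5, 0.5)  # assumed non-Gaussian error
            x[i] = intercepts[i] + sum(weights[(j, i)] * x[j] for j in parents[i]) + noise
        samples.append(x)
    return parents, weights, samples


parents, weights, samples = simulate(n_nodes=20, n_samples=250, edge_prob=0.2, seed=1)
print(len(weights), len(samples), len(samples[0]))
```

   The true edge set `weights.keys()` plays the role of the ground truth
   against which predicted regulatory relationships can later be scored.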

   To evaluate the performance on simulated datasets, we use F-Score (the
   harmonic mean of precision and recall) to measure the performance of
   all methods, which is formulated as follows:
   \[
   F = 2 \cdot \frac{\mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}}.
   \]
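
   As a small worked example of the F-Score, the sketch below compares a
   set of predicted interactions with a set of true interactions. The
   function name and the interaction pairs are hypothetical; returning 0
   when there is no true positive is a convention chosen here to avoid
   division by zero.

```python
def f_score(predicted, truth):
    """Harmonic mean of precision and recall for two sets of interactions."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0  # convention chosen for this sketch
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical interactions: 4 predicted, 3 true, 2 in common.
predicted = {("miR-a", "g1"), ("miR-a", "g2"), ("miR-b", "g3"), ("miR-b", "g4")}
truth = {("miR-a", "g1"), ("miR-b", "g3"), ("miR-b", "g5")}
score = f_score(predicted, truth)
print(round(score, 4))  # 0.5714
```

   Here precision is 2/4 and recall is 2/3, so F = 2 · (1/3) / (7/6) = 4/7.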

   We use F-Score to compare CIDER with a variety of popular miRNA target
   prediction methods, including Pearson correlation [[95]42], Lasso
   [[96]48] and Z-Score [[97]49]. Pearson correlation calculates the
   correlation coefficients between pairs of miRNAs and mRNAs, and uses
   the strength of the correlations to measure the regulatory effect.
   Lasso is a popular regression method which also measures linear
   association, but uses an L1-norm penalty to cope with the sparseness of
   the high dimensional expression profiles. Z-Score is a method
   specifically designed to infer gene regulatory networks using data
   from gene knock-out experiments. Since only observational data is used
   in our study, we use the lowest expression value of each gene among all
   samples as the value of knocked-out gene expression.

Results and Discussions

Transcriptional knowledge improves miRNA-mRNA target prediction

   In this section, we investigate the effect of transcriptional TF-miRNA
   regulatory knowledge on miRNA target prediction. We first apply CIDER
   to analyse only the expression profiles, then we allow CIDER to access
   both TF-miRNA regulatory knowledge and the expression data and compare
   the performance of these two settings. For each miRNA, we consider the
   mRNAs with Top 50 and Top 100 ranked causal effects as its targets and
   compare them with those in the combination of three experimentally
   confirmed miRNA-mRNA interaction databases: Tarbase, miRWalk and
   miRTarbase.

   Although for both datasets fewer than 20 TF-miRNA interactions are
   integrated (the total number of possible edges is around 10^6), the
   benefit of TF-miRNA knowledge for predicting miRNA targets is evident.
   As shown in [98]Fig 4, with the help of TF-miRNA
   regulation knowledge, CIDER predicts more validated miRNA targets than
   using expression profiles alone.

Fig 4. Number of experimentally validated miRNA targets (total number for all
miRNAs) identified by CIDER when utilizing expression profiles (EP) only, EP
+ transcriptional regulatory knowledge, EP + post-transcriptional knowledge.

   (Left) Results for Top 100 predicted targets for each miRNA. (Right)
   Results for Top 150 predicted targets.

   [101]Fig 5 illustrates a comparison of the miRNA targets predicted by
   CIDER with and without TF-miRNA knowledge from both datasets. For
   example, without the TF-miRNA knowledge of BMP2→miR-31, only three
   predicted targets of miR-31 agree with the experimentally validated
   databases. However, when the TF-miRNA regulation between BMP2 and
   miR-31 is incorporated, CIDER not only successfully uncovers the
   up-regulation effect between BMP2 and miR-31, but also identifies 9
   experimentally validated targets.

Fig 5. Comparison of miRNA targets identified by CIDER with and without
TF-miRNA regulatory knowledge.

   Gray dashed lines indicate the TF-miRNA regulatory knowledge introduced
   from TransmiR. Black solid lines indicate miRNA-mRNA regulations found
   without knowledge. Brown dotted lines represent the additional
   miRNA-mRNA regulations found when TF-miRNA knowledge is utilised.

   We conduct pathway enrichment analysis of the predicted target genes
   with the focus on KEGG pathways (adjusted p-value<0.05). To determine
   whether the top predicted miRNA targets are related to the respective
   biological processes (EMT and BRCA), we select the top 5 predicted
   targets for each miRNA. As shown in [104]Table 1, the KEGG pathways are
   highly associated with the relevant biological process. For instance,
   epithelial tight junctions are closely related to the EMT process, and
   focal adhesion has been shown to be related to breast cancer in
   previous research [[105]50].

Table 1. Top 10 enriched KEGG pathways in the EMT and BRCA datasets.

   The p-values were obtained by hypergeometric tests and corrected by the
   FDR method.
   Datasets Top 10 enriched KEGG pathways                   Adj-p-value
     EMT    Epithelial tight junctions                      5.95e-06
            Leukocyte transendothelial migration            1.82e-05
            Cell adhesion molecules                         2.38e-04
            Arrhythmogenic right ventricular cardiomyopathy 3.23e-04
            Cell adhesion molecules                         2.06e-03
            Melanogenesis                                   8.40e-03
            Regulation of actin cytoskeleton                9.74e-03
            Huntington’s disease                            3.30e-02
            Pathways in cancer                              1.07e-02
            Amoebiasis                                      1.07e-02
     BRCA   Pancreatic secretion                            1.20e-03
            Leukocyte transendothelial migration            1.83e-03
            Focal adhesion                                  2.32e-03
            Amoebiasis                                      4.94e-03
            Purine metabolism                               5.19e-03
            Regulation of actin cytoskeleton                5.30e-03
            Salivary secretion                              5.58e-03
            Adherens junction                               5.58e-03
            Pathways in cancer                              6.03e-03
            Tight junction                                  6.09e-03
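
   The enrichment p-values in Table 1 come from a hypergeometric test
   followed by FDR correction. A minimal Python sketch of that
   calculation is given here, assuming a one-sided (over-representation)
   test and the Benjamini-Hochberg procedure as the FDR method; the
   function names and example numbers are hypothetical and the sketch is
   not the pipeline used to produce the table.

```python
from math import comb

def hypergeom_pvalue(k, K, n, N):
    """P(X >= k) when drawing n genes from N, of which K are in the pathway."""
    upper = min(K, n)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 4 predicted targets, 2 of them in a 5-gene
# pathway, drawn from a background of 20 genes.
p = hypergeom_pvalue(k=2, K=5, n=4, N=20)
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

   A pathway would be reported when its adjusted p-value falls below the
   chosen threshold (0.05 in this study).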

Post-transcriptional knowledge improves miRNA target prediction

   In this section we show that post-transcriptional miRNA-mRNA knowledge
   improves the performance of CIDER. Similar to the previous section, we
   first apply CIDER to analyse the expression profiles alone, then
   compare it to the results obtained by allowing CIDER to access both the
   regulatory knowledge and the expression profiles.

   Since we need to keep the experimentally validated target databases to
   evaluate the performance, miRNA-mRNA regulatory relationships predicted
   by TargetScan are used as the regulatory knowledge.

   We depict the number of experimentally validated miRNA targets found by
   CIDER using expression profiles only and using both
   post-transcriptional regulatory knowledge and expression profiles in
   [107]Fig 4. CIDER is able to successfully utilise the
   post-transcriptional knowledge and find significantly more validated
   targets than when using expression profiles alone, even though the
   regulatory knowledge in TargetScan contains false positives. The
   results not only demonstrate that CIDER is able to utilise
   post-transcriptional regulatory knowledge, but also indicate that CIDER
   can benefit from sequence-based prediction knowledge with false
   positives.

   The reason behind the robustness of CIDER lies in the causal inference
   step, in which the causal structure and expression profiles are
   analysed together to infer the size of the causal effects. If the false
   edges between miRNAs and mRNAs are not supported by the inference
   results, the noise introduced by false positive regulatory knowledge
   is mitigated by the causal inference step.

   When accessing all the experimentally validated miRNA target databases
   together with expression profiles, CIDER discovers more targets than
   accessing expression profiles alone. Since we use the databases as
   knowledge, other means are needed for evaluation. Therefore we compare
   the predicted targets for the EMT dataset to the transfection
   experiment on the MDA-MB-231 human cell line [[108]41]. In this
   experiment, the gene expression levels in the MDA-MB-231 samples
   transfected with hsa-miR-200a-3p/hsa-miR-200b-3p and the
   expression levels in those samples without hsa-miR-200a-3p and
   hsa-miR-200b-3p (control) were measured. (Please refer to [109]S4 File
   for the detailed transfection experiment results.) The differentially
   expressed genes between the control and transfected samples are used to
   validate our computational predictions. Specifically, 345 and 533
   genes are identified to be regulated by hsa-miR-200a-3p and
   hsa-miR-200b-3p, respectively.

   The results demonstrate that with the help of post-transcriptional
   regulatory knowledge, CIDER identifies significantly more validated
   miRNA targets than when the targets are predicted based only on
   expression profiles. [110]Fig 6 shows that when equipped with the
   post-transcriptional miRNA-mRNA regulatory knowledge (brown dotted
   lines), CIDER is able to discover many novel miRNA-mRNA regulatory
   relationships that are missed by using expression data alone.

Fig 6. Comparison of validated regulatory relationships with/without
regulatory knowledge on the EMT dataset.

   Black solid lines indicate validated interactions found with expression
   profiles; grey dashed lines indicate interactions provided by the
   regulatory knowledge; brown dotted lines indicate new interactions
   discovered by CIDER utilizing both expression profiles and regulatory
   knowledge; yellow shaded nodes are known oncogenes and oncomiRs
   according to [[113]51].

More prior knowledge leads to better predictions

   It is important to know how the framework works with different
   amounts and types of regulatory knowledge. In this section we study the
   performance of CIDER when utilizing different amounts and types of
   knowledge. Since currently the wet-lab validated knowledge is very
   sparse, we generate the simulated networks and expression profiles as
   described in the Evaluation Methods section for our analysis.

   Even without knowledge, CIDER achieves performance comparable to that
   of state-of-the-art miRNA target prediction methods. As shown in
   [114]Fig 7, when only the expression data are utilised, CIDER
   performs much better than Z-Score. Lasso and Pearson show similar
   performance despite the sparsity constraint added in Lasso. Even
   without using regulatory knowledge, CIDER shows slightly better
   performance than both Pearson and Lasso because CIDER utilises
   causation instead of correlation.

Fig 7. Comparing CIDER with Pearson, Lasso and Z-Score when only accessing
expression profiles.

   Left: 250 samples; Right: 500 samples.

   The performance of CIDER increases monotonically with the amount of
   knowledge. Combining post-transcriptional and transcriptional knowledge
   significantly boosts the performance of CIDER. To demonstrate this, we
   evaluate CIDER with three types of knowledge: miRNA-mRNA interactions,
   TF-miRNA interactions and the combination of these two. For each type
   of regulatory knowledge, starting from expression data only, we
   gradually increase the amount of knowledge available to CIDER from 0%
   to 50% (of the total amount of available knowledge of the type) in 5%
   increments. As shown in [117]Fig 8, transcriptional and
   post-transcriptional knowledge each improve the performance of
   CIDER significantly, and the combined knowledge leads to further
   improvement. For every type of regulatory knowledge, as the amount of
   utilised knowledge increases the performance of CIDER improves
   monotonically. With 50% of the combined knowledge, CIDER achieves very
   high accuracy.
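
   The knowledge levels used in this experiment can be sketched in Python
   as below. The sketch assumes that the known edges at each level are a
   random subset of the simulated ground-truth edges and that the subsets
   are nested, so every level contains the previous one; whether the
   original experiment used nested subsets is not stated, and the
   function name and edge labels are hypothetical.

```python
import random

def knowledge_schedule(edges, max_fraction=0.5, step=0.05, seed=0):
    """Yield (fraction, known_edges) pairs with nested, growing edge subsets."""
    shuffled = sorted(edges)
    random.Random(seed).shuffle(shuffled)
    n_steps = round(max_fraction / step)
    for s in range(n_steps + 1):
        fraction = s * step
        count = round(fraction * len(shuffled))
        # A prefix of one fixed shuffle guarantees each level contains the last.
        yield fraction, set(shuffled[:count])

# Hypothetical ground-truth edges of a simulated network.
true_edges = [(f"miR-{i}", f"gene-{j}") for i in range(4) for j in range(5)]
levels = list(knowledge_schedule(true_edges))
```

   Each yielded edge set would be passed to the prediction method as
   prior knowledge, and the accuracy recorded per level gives one curve
   of the kind plotted in Fig 8.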

Fig 8. Performance of CIDER when utilizing different amounts and types of
regulation knowledge.

   Sample size: 250 (left), 500 (right).

   In summary, CIDER is not only able to utilise either transcriptional or
   post-transcriptional regulatory knowledge to improve the performance of
   miRNA target prediction, but also able to utilise the combination of
   the two types of regulatory knowledge to further increase prediction
   accuracy. As the amount of regulatory knowledge increases, the
   performance of CIDER continuously improves. With this monotonic
   improvement, the miRNA targets predicted by CIDER will become more
   accurate and reliable as our understanding of miRNA regulation
   improves and more knowledge becomes available to CIDER.

   In return, CIDER can provide more precise guidance for selecting miRNA
   targets for wet-lab validation. Iteratively, as shown in [120]Fig 1,
   CIDER will help to build a more and more complete gene regulation
   network.

Methods utilizing sequence binding information are not suitable for
integrating experimentally validated knowledge

   Methods designed to utilise sequence based predictions are not suitable
   for utilizing validated regulatory knowledge. In this section we
   compare CIDER with ProMISe [[121]30], a recently proposed method
   designed to utilise sequence binding information and expression
   profiles.

   We compare the two algorithms on the EMT and BRCA datasets. Both
   algorithms have access to the expression profiles and to exactly the
   same regulatory knowledge, which contains the sequence binding
   interactions predicted by TargetScan and the experimentally validated
   post-transcriptional knowledge in miRWalk and miRTarbase. Specifically,
   ProMISe uses the knowledge as sequence binding information, while
   CIDER uses it to initialise constant edges.

   As can be seen in [122]Fig 9, regardless of what threshold is selected
   for the miRNA targets, CIDER discovers more validated targets than
   ProMISe. These results indicate that the top miRNA targets predicted by
   CIDER are consistently better than the ones predicted by ProMISe.

Fig 9. Performance comparison of CIDER and ProMISe when utilizing
post-transcriptional regulation knowledge.

   Left: EMT dataset; right: BRCA dataset.

   The reason is that instead of considering all possible miRNA-mRNA
   pairs, ProMISe (and other similar algorithms) uses sequence
   information to constrain its search space. In other words, a
   miRNA-mRNA interaction is not considered unless the pair is
   included in the knowledge. Therefore, when utilizing sequence
   information, these algorithms are misled by the false negatives;
   when utilizing experimentally validated knowledge, they can only
   predict interactions that are already included in the knowledge.
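
   The effect of constraining the search space can be shown with a small
   Python sketch. It is a deliberately simplified illustration of the
   argument above and does not reproduce ProMISe or CIDER; all names and
   pairs are hypothetical.

```python
def restricted_candidates(mirnas, mrnas, knowledge):
    """Pairs a knowledge-constrained method may score: only those in `knowledge`."""
    return {(m, g) for m in mirnas for g in mrnas if (m, g) in knowledge}

def unrestricted_candidates(mirnas, mrnas):
    """Pairs considered when knowledge informs, but does not limit, the search."""
    return {(m, g) for m in mirnas for g in mrnas}

mirnas = ["miR-A", "miR-B"]
mrnas = ["gene-1", "gene-2", "gene-3"]
knowledge = {("miR-A", "gene-1"), ("miR-B", "gene-3")}
true_targets = {("miR-A", "gene-1"), ("miR-A", "gene-2")}

# True targets that the constrained method can never report.
reachable = restricted_candidates(mirnas, mrnas, knowledge)
missed = true_targets - reachable

# Pairs outside the knowledge that remain open to an unconstrained method.
novel_possible = unrestricted_candidates(mirnas, mrnas) - knowledge
```

   In this toy case the pair (miR-A, gene-2) is a true target that the
   constrained search cannot reach, whereas it remains a candidate when
   all pairs are considered.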

Putative miRNA targets

   In this section, we report the high-confidence miRNA targets predicted
   by CIDER in the EMT and BRCA datasets for biological researchers to
   explore. These predictions utilise expression profiles with both
   transcriptional and post-transcriptional regulatory knowledge. As
   shown in the previous sections, CIDER performs better when
   utilizing the combined knowledge than when using either type of
   regulatory knowledge separately. Therefore, we expect that the miRNA
   targets predicted by CIDER utilizing TF-miRNA interactions from
   TransmiR and miRNA-mRNA knowledge from Tarbase, miRTarbase and miRWalk
   should provide valuable putative candidates for further biological
   wet-lab evaluation. To utilise sequence binding information to increase
   the confidence of the predicted targets, we intersect our predictions
   with the miRNA targets predicted by TargetScan.
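
   The intersection step amounts to a set operation, sketched here in
   Python. The function name and the example pairs are hypothetical, and
   the sketch assumes that both prediction lists use the same miRNA and
   gene identifiers.

```python
def high_confidence(cider_targets, targetscan_targets):
    """Keep only pairs predicted by both sources, sorted for stable output."""
    return sorted(set(cider_targets) & set(targetscan_targets))

# Hypothetical (miRNA, mRNA) predictions from the two sources.
cider_targets = [("miR-200a", "ZEB1"), ("miR-200a", "ACTB"), ("miR-31", "RHOA")]
targetscan_targets = [("miR-200a", "ZEB1"), ("miR-31", "RHOA"), ("miR-31", "TP53")]
shared = high_confidence(cider_targets, targetscan_targets)
```

   In practice, identifier harmonisation (for example, miRNA name
   versions) would be needed before the two lists can be intersected.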

   These high-confidence predicted miRNA targets are presented in [125]Fig
   10, and we hope that a significant number of them will be validated by
   experiments in the future.

Fig 10. High confidence miRNA targets predicted by CIDER utilizing expression
profiles, transcriptional and post-transcriptional knowledge.

   Only part of the interactions are shown for clarity of illustration;
   please refer to [128]S5 File for the full results.

Conclusion

   The future of biology is neither based on wet-lab experiments nor
   computational predictions alone, but on their combination. The progress
   of wet-lab experiments would be hampered without the help of quality
   computational predictions, and the power of computational methods would
   be limited if accumulated biological knowledge were not integrated with
   the modeling process.

   In this article, we present the CIDER framework that seamlessly
   integrates biological knowledge with high-throughput expression
   profiles for miRNA target prediction. We use a causal Bayesian network
   based method to explicitly exploit experimentally validated gene
   regulatory knowledge to improve the prediction of miRNA-mRNA
   interactions. Our results demonstrate that when utilizing
   transcriptional or post-transcriptional knowledge, CIDER discovers
   significantly more validated miRNA targets than when using expression
   profiles alone. Furthermore, when the amount of available regulatory
   knowledge increases, the performance of CIDER increases monotonically.

   With the capability to improve prediction accuracy with the increment
   of gene regulatory knowledge, our causal discovery framework can serve
   as a promising tool for uncovering new biological insights using ever
   increasing regulatory knowledge and new high-throughput data.

Supporting Information

   S1 File. Differential expression profiles of miRNAs and mRNAs for the
   EMT and BRCA datasets.

   The p-values are adjusted by the Benjamini-Hochberg (BH) method.

   (XLSX)
   S2 File. R source code for the proposed CIDER framework.

   (ZIP)
   S3 File. Experimentally validated miRNA-mRNA regulatory knowledge.

   This file includes the miRNA-mRNA regulatory knowledge obtained from
   the following databases: TarBase, miRecords, miRWalk and miRTarBase.

   (XLSX)
   S4 File. miRNA transfection result on MDA-MB-231 samples.

   This file includes the transfection results for hsa-miR-200a and
   hsa-miR-200b, and the control sample.

   (XLS)
   S5 File. High-confidence miRNA targets predicted by CIDER.

   This file includes the miRNA targets predicted by CIDER when utilizing
   post-transcriptional and transcriptional regulatory knowledge and
   expression profiles; these interactions are also predicted by
   TargetScan v7.0.

   (XLSX)
   S6 File. Detailed descriptions of Algorithm 1 and Algorithm 2, and
   additional validation results.

   (PDF)

Acknowledgments