Abstract

Background

   Decoding the temporal control of gene expression patterns is key to
   understanding the complex mechanisms that govern developmental
   decisions during heart development. High-throughput methods have been
   employed to systematically study the dynamic and coordinated nature of
   cardiac differentiation at the global level and across multiple data
   dimensions. There is therefore a pressing need to develop a systems
   approach to integrate these data from individual studies and infer the
   dynamic regulatory networks in an unbiased fashion.

Results

   We developed a two-step strategy to integrate data from (1) temporal
   RNA-seq, (2) temporal histone modification ChIP-seq, (3) transcription
   factor (TF) ChIP-seq and (4) gene perturbation experiments to
   reconstruct the dynamic network during heart development. First, we
   trained a logistic regression model to predict the probability (LR
   score) of any base being bound by 543 TFs with known positional weight
   matrices. Second, four dimensions of data were combined using a
   time-varying dynamic Bayesian network model to infer the dynamic
   networks at four developmental stages in the mouse [mouse embryonic
   stem cells (ESCs), mesoderm (MES), cardiac progenitors (CP) and
   cardiomyocytes (CM)]. Our method not only infers the time-varying
   networks between different stages of heart development, but it also
   identifies the TF binding sites associated with promoters or enhancers
   of downstream genes.

   The LR scores of experimentally verified ESC and heart enhancers were
   significantly higher than those of random regions (p < 10^−100),
   suggesting that a high LR score is a reliable indicator of functional
   TF binding sites. Our network inference model identified a region with
   an elevated LR score approximately −9400 bp upstream of the
   transcriptional start site of Nkx2-5, which overlapped with a
   previously reported enhancer
   region (−9435 to −8922 bp). TFs such as Tead1, Gata4, Msx2, and Tgif1
   were predicted to bind to this region and participate in the regulation
   of Nkx2-5 gene expression. Our model also predicted the key regulatory
   networks for the ESC-MES, MES-CP and CP-CM transitions.

Conclusion

   We report a novel method to systematically integrate multi-dimensional
   -omics data and reconstruct the gene regulatory networks. This method
   will allow one to rapidly determine the cis-modules that regulate key
   genes during cardiac differentiation.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12859-015-0460-0)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Cardiac differentiation, Network inference, Logistic
   regression, Time-varying dynamic Bayesian model, Data integration, Gene
   regulatory network

Background

   Decoding the temporal control of gene expression patterns is essential
   to understanding the complex mechanisms of the regulatory events that
   occur during heart development. High-throughput methods have been
   employed to systematically study the dynamic and coordinated nature of
   cardiac differentiation at the global level and across multiple data
   dimensions [1-6]. For example, in several studies, RNA-seq and histone
   modification ChIP-seq experiments were performed to profile the changes
   in global gene expression and the chromatin state at distinct stages of
   cardiac differentiation from ESCs to cardiomyocytes in human and mouse
   [1,3]. The authors of these studies reported changes in chromatin
   modification patterns associated with gene activation and identified
   stage-specific distal enhancer elements. He et al. outlined the
   candidate binding sites of five known cardiac transcription factors
   (TFs) (Gata4, Nkx2-5, Tbx5, Srf and Mef2a), which were identified using
   ChIP-seq [2]. Moreover, Schlesinger et al. knocked down each of four
   key cardiac transcription factors (Gata4, Mef2a, Nkx2-5 and Srf) in
   HL-1 cells using RNA interference and then profiled the changes in
   global gene expression [4]. Although these studies presented a novel
   and global perspective for the examination of the chromatin status and
   the prediction of transcriptional regulation, they were limited in the
   types of data that were integrated [1,3] and they based their initial
   screening on a small set of candidate TFs [2,4]. As large-scale
   multi-dimensional data are accumulating at an unprecedented pace, there
   is a pressing need to develop systematic methods to integrate these
   data from individual studies and infer the dynamic gene regulatory
   networks (GRN) during cardiac differentiation in an unbiased manner.

   Time series expression profiles based on microarray and, more
   recently, RNA-seq data have been widely used to reconstruct static
   networks, that is, networks with invariant topology over a given set of
   genes [7-11]. However, because the GRN at a particular time point
   depends on a specific biological context, it can undergo systematic
   rewiring rather than remaining invariant over time. Therefore, recent
   research has focused on inferring dynamic (time-varying) networks over
   the time course [1-4,12-15]. A key technical hurdle to precisely
   reconstructing dynamic networks based solely on temporal expression
   data is that there are too many unknown variables to be estimated
   (i.e. (T−1)p^2 network edges). Some attempts have been made to
   circumvent this difficulty, including factorizing gene-gene regulatory
   relationships into modular effects [1,3,11,14], deconvolving the
   observed indirect effects into direct effects [2,16], and smoothing the
   edge weights between the networks of neighboring time points
   [4,13,17]. However, the overall performance of reconstructing GRN
   based solely on temporal expression profiles is still limited
   [1,3,18].
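
   To give a sense of scale for the edge count quoted above, the short
   sketch below evaluates it for a toy case. The passage does not define
   the symbols, so reading T as the number of time points and p as the
   number of genes is an assumption, as are the function name and the
   example gene count of 1,000.

```python
def count_dynamic_edges(num_time_points: int, num_genes: int) -> int:
    """Candidate directed edges, (T - 1) * p^2, in a time-varying network.

    Assumes one p-by-p network per transition between consecutive time
    points, with self-regulation allowed (an interpretation, not a
    definition taken from the text).
    """
    if num_time_points < 2 or num_genes < 1:
        raise ValueError("need at least two time points and one gene")
    return (num_time_points - 1) * num_genes ** 2


# Four time points and a hypothetical 1,000 candidate genes.
edges = count_dynamic_edges(num_time_points=4, num_genes=1000)
print(edges)  # 3000000
```

   Even this modest toy setting yields three million candidate edges,
   far more than a handful of expression snapshots could constrain.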

   One widely used strategy to infer causal relationships in a GRN is to
   over-express or repress the key TFs and measure the change in global
   expression. The significantly up- or down-regulated genes may be either
   directly or indirectly regulated by the perturbed TFs. This strategy
   has been applied successfully to, for example, the GRN in sea urchin
   embryonic development [2,4,19], the early response of the GRN in
   embryonic stem cells (ESC) [7-11,20], and the cardiac GRN involving
   several key cardiac genes [4]. Perturbation-based methods can, in
   theory, greatly improve the prediction accuracy for downstream targets
   compared with methods based solely on temporal expression profiles
   [18]. The limitations of this strategy are that it is unrealistic to
   perturb all TFs in the mammalian genome in a specific context and that
   it is not easy to distinguish direct effects from indirect effects in
   the readout.
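
   As a rough illustration of the perturbation readout described above,
   the sketch below flags genes whose expression changes by at least
   two-fold between a control and a TF knockdown. The gene names,
   expression values, pseudocount and threshold are all hypothetical,
   and a real analysis would use replicates and a statistical test
   rather than a bare fold-change cutoff.

```python
import math


def differential_genes(control, perturbed, min_abs_log2fc=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes passing the cutoff.

    Fold change is perturbed over control; the pseudocount avoids
    division by zero for unexpressed genes.
    """
    changed = {}
    for gene, ctrl_value in control.items():
        if gene not in perturbed:
            continue  # gene not measured in the perturbed sample
        ratio = (perturbed[gene] + pseudocount) / (ctrl_value + pseudocount)
        log2fc = math.log2(ratio)
        if abs(log2fc) >= min_abs_log2fc:
            changed[gene] = log2fc
    return changed


# Hypothetical expression values (arbitrary units), not real data.
control = {"GeneA": 99, "GeneB": 49, "GeneC": 10}
knockdown = {"GeneA": 24, "GeneB": 51, "GeneC": 43}
changed = differential_genes(control, knockdown)
print(changed)  # {'GeneA': -2.0, 'GeneC': 2.0}
```

   Note that the output cannot say whether GeneA and GeneC are direct or
   indirect targets, which is exactly the limitation raised above.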

   The most common strategy used to discover direct regulatory
   relationships is to combine TF binding information with temporal
   expression profiles [2,12,21-23]. The general assumption is that a
   gene can be regulated by a TF if its promoter or enhancer regions are
   occupied by the TF. The TF binding sites (TFBS) within the putative
   regulatory region of a gene are identified either by scanning with the
   known positional weight matrix (PWM), which represents a relatively
   short (5–20 nucleotides) degenerate sequence motif recognized by a TF,
   or by TF ChIP-seq experiments. Although PWMs have been defined for the
   TFBSs of more than 500 TFs in vertebrates by various techniques
   [24-32], their sensitivity and specificity are generally low when they
   are used to predict putative binding sites [33]. Alternatively, TF
   ChIP followed by sequencing or microarray analysis has emerged as the
   standard approach to directly determine the bona fide TFBS. However,
   the number of available TF ChIP-seq datasets is still limited, because
   ChIP-seq experiments remain relatively expensive and labor-intensive
   and because TFBSs tend to vary between biological contexts; for
   example, only 7.14% of the enhancers identified in ESCs overlap with
   the enhancers in heart [34-37]. Moreover, for most TFs in the genome,
   there are no ChIP-seq datasets available. For example, in ChIPBase,
   only 12 and 5 TFs have corresponding ChIP-seq data in ESCs and
   cardiomyocyte HL-1 cells, respectively [38]. At present, there is no
   consensus regarding whether ChIP-seq data obtained in one cell type can
   be readily applied to predict TFBS in another cell type. Moreover, it
   is unclear whether we can adapt the information from the available
   ChIP-seq results to predict, in a specific biological context (e.g.
   cell types or developmental stages), the binding sites of TFs for
   which only PWM information is available.
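
   The PWM scanning mentioned above can be sketched as a sliding-window
   log-odds score. This is a minimal illustration under stated
   assumptions: the 4-position matrix is made up rather than taken from
   any real TF, the background is assumed uniform, the threshold is
   arbitrary, and only the forward strand is scanned.

```python
import math

# Hypothetical PWM: per-position base probabilities (consensus AGAT).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def window_score(window, pwm, background=BACKGROUND):
    """Log2-odds score of one window against the PWM."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(pwm, window))


def scan_sequence(sequence, pwm, threshold):
    """Return (start, score) for each forward-strand window >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = window_score(window, pwm)
        if score >= threshold:
            hits.append((start, score))
    return hits


hits = scan_sequence("CCAGATCCAGTT", TOY_PWM, threshold=4.0)
print([start for start, _ in hits])  # [2]
```

   With such a short motif, near-matches are frequent in any long
   sequence, which is one intuition for the low specificity noted above.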

   In lieu of profiling the binding sites of individual TFs, the general
   enhancers or regulatory regions have also been mapped by DNase I
   hypersensitivity sequencing experiments as well as by ChIP-seq for
   p300, histone H3 Lys4 mono-methylation (H3K4me1) and histone H3 Lys27
   acetylation (H3K27ac) in a wide range of cell types [39-44],
   including mouse ESCs and the heart [34-36,45]. The genomic loci defined
   by these marks, however, typically span several hundred or thousand
   bases, and are generally too broad to define the specific DNA sequences
   mediating promoter or enhancer functions. It has been proposed that a
   local depletion in the ChIP signal intensity (a dip) is indicative of
   TF binding sites [41]. Thus, several studies have used the structural
   change of these active marks to discover the functional TFBS among the
   enhancer regions, either by heuristic methods [1] or by more
   sophisticated approaches, such as an integrated hidden Markov model
   [46], logistic regression [47,48], or a hierarchical mixture model
   [49]. However, these studies usually focused on individual cell types.
   Moreover, they addressed static regulatory relations and do not fall
   under the framework of inferring dynamic gene regulatory networks.
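
   The idea of a dip in the ChIP signal can be illustrated with a simple
   heuristic: a bin counts as a dip if it is a local minimum and its
   signal is well below the signal on both flanks. This is only a sketch
   of the general concept and not the method of any of the cited studies;
   the bin values, flank width and ratio cutoff are hypothetical.

```python
def find_dips(signal, flank=2, min_ratio=0.5):
    """Return indices of bins that look like local signal depletions.

    A bin qualifies if it is no higher than its immediate neighbours and
    is at most min_ratio times the lower of the two flanking maxima.
    """
    dips = []
    for i in range(flank, len(signal) - flank):
        left = max(signal[i - flank:i])
        right = max(signal[i + 1:i + 1 + flank])
        shoulder = min(left, right)  # weaker of the two flanking peaks
        is_local_min = signal[i] <= min(signal[i - 1], signal[i + 1])
        if shoulder > 0 and is_local_min and signal[i] <= min_ratio * shoulder:
            dips.append(i)
    return dips


# Hypothetical binned histone-mark signal with a depletion at bin 4.
toy_signal = [1, 2, 8, 9, 3, 9, 8, 2, 1]
print(find_dips(toy_signal))  # [4]
```

   A fixed-ratio rule like this is sensitive to noise and bin size, which
   is part of the motivation for the model-based approaches cited above.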

   While each of the aforementioned strategies has its own merits, each
   is also limited in its ability to capture the dynamic networks. An
   integrated approach for network inference that combines the strengths
   of all these methods is highly desirable. In this study, we present a
   framework to integrate four dimensions of available data, (1) temporal
   RNA-seq, (2) temporal histone ChIP-seq, (3) TF ChIP-seq and (4)
   perturbation studies, to reconstruct the dynamic networks during
   cardiac differentiation. Our method not only infers the time-varying
   networks between distinct stages of heart development, but also
   identifies the TF binding sites on the promoter or enhancer of the
   genes being regulated.

Results

Overview

   We developed a two-step strategy to infer the dynamic GRN during
   cardiac differentiation (Figure 1). In the first step, based on 17
   TFs whose ChIP-seq data are available for either mouse ESCs or
   cardiomyocyte HL-1 cells (Table 1), we trained a logistic
   regression model to predict the probability of any base being bound by
   any TF with a known PWM at a specific differentiation stage. The model
   included context-independent features that do not change during
   differentiation (e.g. base conservation) and context-dependent features
   such as the expression levels of nearby genes, the intensity of histone
   modifications within defined distances, and histone modification
   changes between adjacent time points. This concept was modified from
   the work by Ernst et al., which infers a score quantifying the general
   binding preferences of TFBS [48]. However, it should be noted that,