Abstract
Background
Recent studies have suggested that combinations of multiple epigenetic
modifications are essential for controlling gene expression. Despite
numerous computational approaches have been developed to decipher the
combinatorial epigenetic patterns or “epigenetic code”, none of them
has explicitly addressed the relationship between a specific
transcription factor (TF) and the patterns.
Methods
Here, we developed a novel computational method, T-cep, for annotating
chromatin states associated with a specific TF. T-cep is composed of
three key consecutive modules: (i) Data preprocessing, (ii) HMM
training, and (iii) Potential TF-states calling.
Results
We evaluated T-cep on a TCF7L2-omics data. Unexpectedly, our method has
uncovered a novel set of TCF7L2-regulated intragenic enhancers missed
by other software tools, where the associated genes exert the highest
gene expression. We further used siRNA knockdown, Co-transfection,
RT-qPCR and Luciferase Reporter Assay not only to validate the accuracy
and efficiency of prediction by T-cep, but also to confirm the
functionality of TCF7L2-regulated enhancers in both MCF7 and PANC1
cells respectively.
Conclusions
Our study for the first time at a genome-wide scale reveals the
enhanced transcriptional activity of cell-type-specific TCF7L2
intragenic enhancers in regulating gene expression.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-017-3764-9)
contains supplementary material, which is available to authorized
users.
Keywords: T-cep, TCF7L2, Intragenic enhancer, PANC1, MCF7
Background
Numerous public data resources, including the ENCODE and Epigenomics
Roadmap, have generated thousands of genome-wide data sets, and
provided us with substantial quantities of data to study
transcriptional and epigenetic patterns in different cell types at a
genome-wide scale [[31]1, [32]2]. Many studies have shown that
epigenetic modifications play a central role in regulating gene
expression, and are involved in a diversity of biological processes in
the human organism [[33]3, [34]4]. However, the “epigenetic code”,
referring to the transcription of genetic information encoded in DNA is
in part regulated by DNA methylation and histone modifications, has not
yet been fully elucidated. Substantial efforts in cracking this code
and uncovering its biological functions have been made, utilizing both
in vitro and in silico methods [[35]5–[36]7]. One innovative and
practical way is to segment the epigenome into various chromatin states
[[37]8–[38]10], where each state is decoded as a specific combinatorial
pattern of multiple epigenetic modifications. Such states or
combinatorial patterns can reflect a variety of sets of genes’
expression levels, which are essential to establish and maintain
distinct functions in chromatin [[39]11, [40]12]. For instance, novel
classes of enhancers have been identified through genome-wide pattern
analyses [[41]13–[42]15]. Therefore, identification of epigenetic
regulatory combinations and networks is increasingly important for
understanding genome functions, chromatin components and molecular
mechanisms.
Computational approaches and models have been a major force to uncover
these complex epigenetic patterns. These contain probabilistic methods,
such as ChromaSig, an unbiased clustering and alignment approach that
finds over-represented epigenetic signatures [[43]16]. More advanced
machine learning approaches have been applied in recent work, such as
hidden Markov models (HMM) [[44]10] and Bayesian network methods
[[45]17]. The HMM has proved to be a good model in training models with
numerous inputs, and that generate thousands of outputs
[[46]18–[47]20]. For example, ChromHMM utilized a multivariate HMM to
learn chromatin states and to output emission probabilities for each
mark in each state and then to infer the number of combinatorial marks
in each of all states [[48]21]. Segway [[49]17] applied dynamic
Bayesian networks (DBN) to segregate the genome with a higher segment
resolution, and derive different chromatin states from chromatin marks
[[50]22]. Although these computational methods have successfully been
used to annotate chromatin elements, many of them suffer from
significant limitations. For instance, some supervised learning methods
cannot find de novo information. Some unsupervised learning methods
train on small genomic regions such as Segway [[51]17] and HMMSeg
[[52]23], or output a single mark’s probability such as ChomHMM
[[53]21]. None of these algorithms explicitly compare targets of a
specific TF with different histone modification marks to qualitatively
assess the association of that TF with active, repressive and
elongation regions. Most TFs have narrow binding patterns compared to
most histone modification marks, but are major regulators of gene
expression [[54]24, [55]25]. Therefore, by ignoring TF binding in the
beginning of algorithm design or training, a method may erroneously
consider it as less meaningful due to the lower probability of TF than
chromatin states, or not incorporate TF binding information at all. As
such, it is critical to develop an algorithm that considers a specific
TF together with many different epigenetic marks.
In order to address TF-dependent epigenetic regulatory events, we have
developed a novel algorithm and software tool, T-cep (Transcription
factor-associated combinatorial epigenetic patterns), which applies a
univariate HMM to identify combinatorial epigenetic patterns associated
with cell type-specific TF targets in different cell types. To evaluate
T-cep, we applied it on TCF7L2-omics data, including ChIP-seq data of
TCF7L2, Pol-II, active chromatin marks (H3K4me1,3 and H3K27ac),
repressive chromatin marks (H3K27me3 and H3K9me3), a mark of
transcriptional elongation (H3K36me3) as well as DNase-seq for open
chromatin regions in five cancer cell types.
TCF7L2 (transcription factor 7-like 2), an important component of WNT
pathway, has been implicated in several human diseases including
carcinogenesis, type 2 diabetes and bipolar disorder [[56]26–[57]29].
The WNT pathway is often constitutively activated in human cancers,
such as colon, liver, breast, and pancreatic cancer [[58]30], with high
upregulation of TCF7L2. Several studies have shown tissue-specific
alternative splicing of TCF7L2, suggesting that TCF7L2 may have
different functional properties in different cells [[59]31, [60]32]. In
our previous studies, we have mapped genome-wide binding of TCF7L2 in
six cell lines [[61]33, [62]34]. However, the combinatorial epigenetic
profile of TCF7L2 in these cancer types has not been well studied. One
hypothesis is that TCF7L2 regulates its downstream target genes in a
cell type-specific manner, i.e., utilizing different combinatorial
patterns with various epigenetic environment cues in each cell type.
Therefore, one way to test this hypothesis is to identify
TCF7L2-assoicated combinatorial epigenetic patterns in a diverse set of
cell types.
In this paper, we first describe the workflow of T-cep, then present
training results on TCF7L2-omics data in five cancer cell types, and
compare it with ChromHMM [[63]21]. Finally, we perform functional
validations using siRNA, RT-qPCR, co-transfections and luciferase
reporter assays on selected gene loci. To the best of our knowledge,
this is the first genome-wide combinational epigenetic pattern
discovery study for cell type-specific TCF7L2 regulation.
Results
The workflow of T-cep
To identify TF regulated chromatin states through a variety of
epigenetic marks, we have developed a novel computational method,
T-cep, for annotating chromatin states associated with a specific TF.
T-cep is composed of three key consecutive modules: (i) Data
pre-processing, (ii) HMM training, and (iii) Potential TF-states
calling (Fig. [64]1a). The detail of the algorithm is described in
[65]Methods.
Fig. 1.
Fig. 1
[66]Open in a new tab
T-cep algorithm and training results for TCF7L2-omics data. a The
workflow of T-cep algorithm. Shown is a schematic summary of the four
steps needed for de novo identifying transcription factor associated
chromatin state (TF-state). The approach begins with preprocessing the
ChIP-seq data into an alphabet of 2^n observations, builds a HMM to
generate a best selected model without infrequent state, re-estimate
transition and emission probabilities and then annotate biological
meaningful TF-states. b Result of T-cep approach. The emission
probabilities of each mark are independent of others for the selected
18-state HMM. The mark probability of greater than 0.1 is considered to
be associated with a chromatin state. c A screenshot of a genomic
region in Chr8 of MCF-7 annotated by T-cep in UCSC genome browser.
ChIP-seq of TCF7L2 and histone marks are in the first nine lines, while
the tenth track represents the main output of the T-cep, annotated
different states represented by different colors. RefSeq gene (HG19)
genomic positions are shown in the last line
Step 1: Data pre-processing
T-cep uses a univariate HMM, where each bin can only emit one
probability value, which corresponds to a combination of different
marks. An alphabet of 2^n (n = the number of epigenetic marks)
observation symbols is constructed by enumerating each possible
combination of marks (including no marks). We combine all ChIP-seq data
in multiple cells as observations and translate them to such an
alphabet of 2^n observations as an input for the next step. If
necessary for computational efficiency, this alphabet can be simplified
by removing symbols that correspond to mark combinations not present in
the input data.
Step 2: HMM training
We initially train multiple HMMs for 300 iterations with T-cep, and
select the best model based on the lowest BIC score. If desired, this
model may be simplified by removing infrequent states. This is
performed by eliminating the states which are called in very few bins
(lower than 5% of the average number) and uniformly redistributing
their transition probabilities to other states. We then re-train this
derived HMM for another 100 iterations to produce the final model.
Step 3: Potential TF-states calling
The Viterbi decoding algorithm is used to output called states for all
bins in the genome. The probabilities of each mark are futher
calculated by marginalization for each of different cell types
respectively. We chose states with a cutoff of probability greater than
0.1 for any individual mark for further investigation, as these are
most likely to yield meaningful biological insights. Finally, the HMM
states can be classified as potential TF-associated chromatin states
(TF-states) and potential non-TF-associated chromatin states
(non-TF-States).
Identification of TCF7L2-associated chromatin states by T-cep
To demonstrate the accuracy of T-cep and evaluate its performance, we
chose TCF7L2-omics data as a study case. TCF7L2 is an important
component of the WNT pathway, and has previously been studied by our
laboratory. Additionally, 45 datasets of omics-seq data are available
for the training (Additional file [67]1: Table S1-S2, all are available
in the ENCODE Consortium), including ChIP-seq of TCF7L2, six histone
marks (H3K4me1, H3K4me3, H3K9me3, H3K27ac, H3K27me3, H3K36me3), Pol-II
and DNase-seq in five cancer cell types, HCT116, HeLa, HepG2, MCF7 and
PANC1, with a total of ~2.02 billion reads. For each of these cell
types, we also obtained RNA-seq data from our previous study or
publicly available sources [[68]35, [69]36].
We initially trained five 25-state HMMs over the data with bin sizes of
750 bp for 300 iterations, and selected the best HMM with the lowest
BIC score of 4.3247 • 10^7 (Additional file [70]1: Table S3). The
states of every bin in each genome were called after the first HMM
training. Seven infrequent states (lower than 5% of the average) were
then eliminated (Additional file [71]1: Table S4) to produce an
18-state HMM model with BIC 4.40718 * 10^7. We then performed a
secondary HMM training for 100 iterations, and the log-likelihood was
calculated at each iteration to verify model convergence (Additional
file [72]1: Figure S1). This produced the final model, with a BIC score
of 4.31487 *10^7. The transition probabilities and the emission
probabilities of each final state are shown in Additional file [73]1:
Figures S2-S3. The Pearson correlation between the HMM emission
probabilities and the actual observation frequencies under each state
was R ^2 ≥ 0.97 for all cell types, except HCT116 with R ^2 = 0.9545
(Additional file [74]1: Table S5 and Figure S4). We then divided
emission probabilities for each mark independently by marginalization
among combinations of marks probabilities and potential TF-states
(Fig. [75]1b and Additional file [76]1: Figure S5). A comparison of a
genomic region for the states annotated by T-cep and the actual
ChIP-seq signals (Fig. [77]1c) clearly demonstrated the ability of
T-cep to accurately capture epigenetic elements.
We further examined the genomic characterization of potential TF-states
using the annotated genomic regions defined in Additional file [78]1:
Table S6, as well as the expression levels of their associated genes
(Fig. [79]2a and Additional file [80]1: Figure S6). We were able to
classify four states, 1, 3, 4 and 5, as TCF7L2-associated states, and
the others as non-TCF7L2-associated states (Fig. [81]2b). State 3 was
assigned as TCF7L2-associated promoter because of its higher frequency
in 5’TSS regions (44.2% relative to other states) and its high emission
probabilities for Pol-II and H3K4me3 (Fig. [82]2a). State 1 was
classified as TCF7L2 binding, non-combinatorial TF-state, as its
emission probability is only high for TCF7L2. We classified states 4
and 5 as TCF7L2-associated enhancer states due to their higher emission
probabilities for the enhancer marks H3K27ac and H3K4me1. Location
distribution showed most bins (75.9%) of state 5 were in gene body
regions while the associated genes have a higher average gene
expression as well as its high emission probabilities for enhancer
marks H3K4me1, H3K27ac and gene body mark, H3K36me3 (Fig. [83]1b).
Thus, state 5 was annotated as TCF7L2 intragenic enhancer. State 4 was
classified as a TCF7L2 distal enhancer since 67.9% bins of state 14
were outside intragenic regions. Except those bins in Gene Desert
regions, there were clearly more bins in 5’Distal regions than in
5’Proximal and 3’Distal regions. State 2 was classified as mapping
bias/CNV due to its high emission probabilities for all active and
repressive marks, where marks such as Pol-II and H3K9me3 were not
expected to co-occur, as well as due to the relatively high proportion
of its bins in known amplified regions.
Fig. 2.
Fig. 2
[84]Open in a new tab
The annotation of chromatin states for TCF7L2-omics data. a The
correlation with other genomic features, the distribution of each state
bins as well as the average of gene expression for each of 18 states.
The average gene expression is normalized by z-score and % total bin
shows the percentage of each state in the human genome. b The
biological interpretations of the HMM states and identification of four
TCF7L2-assoicated States according to the emission probability and
other genomic information. c The distribution of four TF-associated
States on the gene structure. All genes’ length is normalized,
representing the 5’–3’ region as gene body region and extending 90 kb
of up/down steam for gene surrounding distribution. d Summary of bin
distribution of three TCF7L2-associated states in five cancer cell
types
We also identified some non-TCF7L2-associated states (non-TF-states).
States 7 and 9 were classified as non-TCF7L2-associated enhancers.
State 9 with a high Pol-II was assigned as an active
non-TCF7L2-associated enhancer, while state 7 was likely to be an
inactive non-TCF7L2-associated enhancer. Interestingly, the average
gene expression levels of both states are less than those of the
TCF7L2-associated states. State 6 was classified as a
non-TCF7L2-associated bivalent state or poised enhancer due to its high
emission probability of both repressive mark H3K27me3 and enhancer mark
H3K4me1, and its lower average gene expression. All the other
non-TCF7L2-associated states were assigned for their possible functions
as well (Fig. [85]2b).
The annotation of chromatin states is further supported by the location
analysis of bins within each state relative to genes (Fig. [86]2c). All
genes’ length is normalized by Virtual bin Creation and the 5’-3’
region represents the whole gene body region and the distal region is
extended to each to 90 kb upstream of 5’TSS or downstream of 3’TSS.
Clearly, it is showed more than 60% of state 3 bins were located at or
near the 5’TSS of each gene, (the green line), in which we interpreted
it as promoter. State 5 is represented with the red line, in which its
bins were distributed mainly in gene body region, thus called as
intragenic enhancer. State 4 as distal enhancer shown in the blue line
was mainly located in distal regions, and state 1 was infrequently in
any of gene body or intragenic regions (Fig. [87]2c and Additional file
[88]1: Figure S7).
In the other aspect, we wanted to re-examine whether the distribution
in proportion of numbers of three annotated functional states is
actually in their expected genomic regions. Indeed, State 3 bins are
only in 5’TSS regions (131,086 bins), state 5 bins in intragenic
regions (206,644 bins) and state 4 bins in 5’Distal regions (70,812
bins) for further analysis. Moreover, we observed in each cell type the
distribution of these three TF-States in order 3/5/4 is the following:
239/32/26 for HCT116, 1061/559/30 for HepG2, 737/1109/29 for HeLa,
802/1265/327 for MCF7 and 3093/1972/1581 for PANC1 (Fig. [89]2d). Our
results indicate the distribution in proportion of numbers of each of
three functional states may be cell type specific.
Gene enrichment and pathway analyses of TCF7L2-associated chromatin states
Next, we wanted to examine in silico the biological functions of
TCF7L2-associated chromatin states. We chose MCF7 and PANC1 cell types
for further analysis as they had the highest number of genes associated
with states 3, 4, and 5, where the number of genes associated with
states 3/4/5 is 742/230/573 for MCF7 and 2718/677/672 for PANC1
respectively (see the list in Additional file [90]2). Firstly, we
checked the cell type specificity of all of the genes. Interestingly,
when examined in the National Cancer Institute’s NCI-60 cell lines
[[91]37], MCF7 cell line was the top category for genes selected in
MCF7 cells, and PANC1 cell line was the top category for its selected
genes (Fig. [92]3a). A Venn diagram of examining these same genes
within each cell line showed a little among genes associated with each
state, with only 22 genes in MCF7 and 77 in PANC1, respectively,
associated with all three states (Fig. [93]3b). We also found that the
genes associated with these three different TCF7L2 states in MCF7 and
PANC1 cells are different, suggesting cell type specificity of TCF7L2
activity. Moreover, when we examined the average expression levels of
all genes associated with each state, we found that the overall gene
expression level of state 5 genes was the highest in all selected cell
types (p-value < 0.0001) and much higher than genes associated with
state 4 or state 3. Our result suggested that TCF7L2-regulated
intragenic enhancers may play a prominent role in upregulating gene
expression than TCF7L2-regulated distal enhancers (Fig. [94]3c).
Fig. 3.
Fig. 3
[95]Open in a new tab
Genes with TCF7L2-associated states in MCF7 and PANC1 cells. a
Identification of gene associated with TCF7L2-states in NCI-60 cell
showing cancer type specificity. b A Venn diagram showing an
overlapping set of genes among TCFL2-associated States in two cancer
cell lines. c Boxplot of gene expression level showing
TCF7L2-associated intragenic enhancers with the highest expression
among three TCF7L2-associated states. Expression levels were
log2-transformed, with genes having expression level <1 considered to
have an expression level of 1. d The KEGG pathways for TCF7L2
promoters, TCF7L2 distal enhancers and TCF7L2 intragenic enhancers
We further examined the biological function of the TCF7L2-regulated
genes by KEGG pathway enrichment analysis [[96]38, [97]39]. Our results
demonstrated that the enriched pathways are different between these
three categories with cell type specificity (Fig. [98]3d). In overall,
pathway enrichments in MCF7 cells were more involved in cancer
initiation and cell cycle pathways, while those in PANC1 cells were
more related to the metabolic pathway and cell adhesion. Interestingly,
TCF7L2-regulated promoters are more enriched with well-known specific
cancer initial development such as Epstein-Barr virus infection in
breast cancer and metabolic in pancreatic cancer [[99]20, [100]40,
[101]41]. However, TCF7L2-regulated intragenic enhancers included
advanced stage markers in cancers such as Thyroid hormone signaling and
Proteoglycan in cell adhesion [[102]42, [103]43], whereas
TCF7L2-regulated distal enhancers are enriched with the general
signaling in cancer development such as PI3K-AKT pathway and Rap
signaling [[104]44]. These results suggested that genes associated with
TCF7L2 regulated intragenic enhancers may be more relevant to
metastasis and cancer progression dependent on a specific cancer type.
Comparison with ChromHMM
To compare T-cep with other published software tools, we chose ChromHMM
[[105]45] since it is a widely used software tool, is powerful in
genomic segmentation with a user-friendly application, and can take
different types of epigenetic marks (histone modification,
transcription factor, DNA methylation and et al) as combinatorial
inputs. However, when coming to the question of segmentation involved
in a specific TF, the method has some limitations. 1) Given that the
number of binding sites for most TFs are much less than those of
histone modification sites at a whole genome, thus treating TF same as
histone marks may lose the specific TF characteristics. 2) In ChromHMM,
the independence between marks is transformed into a combination of
probabilities in a given state. This would indicate that, for example,
those regions showing low probability of TF binding are combined with
the regions without TF binding because both have similar probabilities
of other marks. Such output is contradicted to the actual combination.
3) Since there is no specific rule for determining the final threshold
of segmentation, this may result in various outputs for the same data
samples. In contrast, T-cep is designed to consider all possible events
without ignoring any low number of marks. In addition, TF-state is not
combined with other similar combinations since T-cep focuses on TF
associated combinatorial states.
We trained our TCF7L2-omics data on ChromHMM with both 25-state model
and 18-state HMM. We found that ChromHMM output similar patterns for
two models, indicating the 18-state model is optimized for segmentation
(Additional file [106]1: Figure S9). We matched the output of 18 states
for both tools in order to show a direct side-to-side comparison. We
also adopt the following metrics for the evaluation in order to show
the TF-specific segmentation: 1) whether the tool is able to identify
potential TF-states; 2) how many TF-state bins are able to be recovered
in the annotated functional state; and 3) what is the quantitative
correlation between annotated functional TF-state and gene expression
level.
Despite that both methods can identify TCF7L2-associated states. T-cep
were able to annotate four TF-states, states 1, 3, 4 and 5 and ChromHMM
only annotated two TF-states, states 3 and 4 (Fig. [107]4a). Both
methods identified a TCF7L2-associated promoter state (state 3) with a
high probability of H3K4me3, and a TCF7L2-associated distal enhancer
state (state 4) with highly enriched H3K27ac and H3K4me1. Although
ChromHMM assigned state 5 as an intragenic enhancer with a high
probability of H3K27ac, H3K4me1 and H3K36me3, the intragenic enhancer
state is not correlated well with TCF7L2 binding. This clearly showed
that ChromHMM missed an intragenic TF-enhancer state for TCF7L2-omics
data. We also observed that a non-combinatorial TCF7L2-assoaited state,
state 1, identified by T-cep was missed by ChromHMM. We then examed the
number of bins recovered within each of annotated functional states and
found T-cep had more bins for state 3 (296,248 vs 234,750 bins) and
state 4 (354,573 vs 190,683 bins) respectively than ChomHMM did, while
ChromHMM got more bins for state 5 (534,683 vs 272,409 bins)
(Fig. [108]4b). For promoter regions related to state 3, we observed
that a majority of the bins were common between T-cep (71.9% of all
state 3 bins) and ChromHMM (90.8% of all state 3 bins), and only 83,175
and 21,677 bins were unique annotate in each of two methods
respectively. For intragenic enhancer regions related to state 5, these
two methods shared much less common bins with only 49,382 bins (18.1%
for T-cep and 9.2% for ChromHMM) despite there are more bins overall in
state 5 (Fig. [109]4b). The gene expression associated with those
unique bins showed that the genes associated with promoters have a
slightly higher for ChromHMM while those associated distal and
intragenic enhancers are much higher for T-cep. Interestingly, genes
associated with TCF7L2-regulated intragenic enhancers (state 5 in
T-cep) showed the highest expression, indicating only T-cep is capable
of identifying a specific set of TCF7L2-regulated intragenic enhancers
(Fig. [110]4c). Further, we compared the overlapping and distinction of
PANC1 cell segmentation between two methods. Clearly, promoter
segmentation is similar between two methods, but enhancer segmentations
are dramatically different. We found that state 4 in T-cep was mainly
aligned to state 4 and state 6 in ChromHMM while state 5 in T-cep is
split into several states in ChromHMM (Fig. [111]4d). These comparisons
clearly showed the advantage of T-cep that can identify specific
TF-associated states due to its consideration of a TF within the
algorithmic design. Our results also pointed out a major difference
between two methods such that ChromHMM is focused on annotating more
functional combination of marks, while T-cep considers more influences
of a specific TF on functional chromatin states.
Fig. 4.
Fig. 4
[112]Open in a new tab
A comparison of T-cep with ChromHMM. a The emission probabilities of 18
states trained by T-cep and ChromHMM respectively. b The common and
unique number of TF-state bins in each of three TCF7L2-associated
functional states between T-cep and ChromHMM. c Boxplot of gene
expression associated with those unique bins in each of three TF-states
for T-cep and ChromHMM respectively. d Three TF-states’ segmentation
(regions) identified by T-cep corresponding to states identified by
ChromHMM in PANC1 cells
Functional validations of TCF7L2-associated intragenic enhancers
Using our T-cep method, we were able to identify a set of
TCF7L2-associated distal and intragenic enhancers in MCF7 and PANC1
cells respectively. In order to compare the activity between intragenic
and distal enhancer, we chose the genes having both intragenic and
distal enhancer regions, nine in MCF7 and eight in PANC1 cells,
respectively (Additional file [113]1: Table S7). We first performed the
knockdown TCF7L2 experiment using small interfering RNA (siRNA) in
these two cell lines and then used quantitative PCR assays to confirm
these genes are actually regulated by TCF7L2 (Additional file [114]1:
Figure S8). Our results showed that 94.1% (16/17 genes except SMAD3) of
gene expressional levels were significantly decreased (p < 0.0001)
after TCF7L2 knockdown, demonstrating that 1) the de novo prediction of
TF-associated chromatin states (TF-states) by T-cep is valid; and 2)
TCF7L2 are involved in regulating these genes enhancers (Fig. [115]5a).
Out of the most significantly decreased expressed genes, we further
chose KRT8, PTK2, ERRFI1 and ZNF217 for enhancer activity validations.
We cloned specific TCF7L2-associated enhancer loci (TFE) into pGL3
promoter vector for enhancer efficiency detection and the cloned
regions are listed in Additional file [116]1: Table S8-S9. We applied
Luciferase Reporter Assay to further compare the activity between
different types of TF-enhancers. In control cells, both cell types
clearly illustrate TF-Intragenic enhancer shows more activity than
TF-distal enhancers and ZNF217 TF-intragenic enhancer shows almost 4
times than ZNF217 TF-distal enhancer (Fig. [117]5b). Moreover, we
applied TCF7L2 knockdown and TCF7L2 overexpression in both cell lines
to characterize TCF7L2-associated enhancers. In TCF7L2 knockdown cell
lines, we firstly wanted to confirm if TCF7L2 regulates the predicted
intragenic and distal enhancers. The activities of all enhancers are
decreased after TCF7L2 knockdown, notably the enhancers in PANC1 cells
decreased more than those in MCF7 cells and both of ZNF217 and ERRFI1
intragenic enhancer decrease more than 70% in PANC1 cells, illustrating
TCF7L2 indeed regulates those enhancer activities (Fig. [118]5b). In
the other aspects, in TCF7L2 overexpression cells, clearly, both types
of enhancers for four tested genes showed higher activities with TCF7L2
co-transfection. Strikingly, we found that TCF7L2 intragenic enhancers
showed much higher activity than the TCF7L2 distal enhancers. The
activity of KRT8 and PTK2 TF-intragenic enhancer increased more than
400% comparing their TF-distal enhancer around 300% in MCF7 cells. In
PANC1 cells, ZNF217 and ERRFI1 intragenic enhancer also increase
30 ~ 50% although they are already on high basic activity level before
TCF7L2 overexpressed (Fig. [119]5c). These exprimental results strongly
supported a notion that TF intragenic enhancers might contribute higher
in transcribing gene expression than TF distal enhancers. Taken
together, our experimental validations not only confirmed the
functionalities of annotated TCF7L2-associated enhancer states but also
strengthened the reliability of our method in unveiling novel
TF-states.
Fig. 5.
Fig. 5
[120]Open in a new tab
Functional validations of the predicted TCF7L2 enhancers by T-cep. a
Quantitative PCR analysis of selected gene associated with
TCF7L2-enhancers in MCF7 and PANC1 cells after TCF7L2 knockdown.
Selected genes contain both of TCF7L2 intragenic and distal enhancers
for further analysis. All data are normalized by GAPDH mRNA and
represent the average of three independent experiments with p-values
less than 0.05 (t-test). b Luciferase reporter activity assay in TCF7L2
knockdown cells. MCF7 and PANC1 cells are transient transfected with
intragenic or distal TFE-pGL3 vector at the TCF7L2-siRNA or
non-targeting siRNA conditions. c Luciferase reporter activity assay in
TCF7L2 overexpression cells. Intragenic or distal TF-enhancer-pGL3
vector and TCF7L2-pcDNA or empty vector are transient transfected into
MCF7 and PANC1 cells
Discussion
In this study, we have developed a novel computational method, T-cep,
to identify transcriptional factor associated combinatorial epigenetic
patterns from ChIP-seq data. T-cep applied an univariate HMM training,
selected best models by the BIC score and imposed second HMM
re-training for determine TF-state and non-TF-state. The unique
features of T-cep are the following: (1) encoding each mark as an
alphabet of 2^n observations as an input with ability of scaling up,
(2) training HMM twice to reduce background noise and infrequent
states, (3) outputting all combinatorial patterns for emission
probability, and (4) marginalizing on decoding each mark probability in
each state. This is a computationally intensive approach, allowing
states in T-cep to directly examine combinatorial marks together, since
they have emission probabilities for marks’ combinations rather than
individual marks. This is in contrast to other software tools which are
limited to treat all kind of data set as equal biological weight and
lack a focus on the key factor during HMM training. Importantly, using
the T-cep method, we were able to unveil a novel set of TF-associated
intragenic enhancers that are predicted to be higher transcriptional
activity than typical distal enhancers.
As a study case, TCF7L2 is chosen in that it is a key WNT downstream
regulator which is implicated in cancers and diabetes, as well as its
alternative splicing generates protein variants with differential
promoter-binding and transcriptional activation [[121]31, [122]32,
[123]46]. After trained by T-cep with the TCF7L2-omics data, we
uncovered four types of TCF7L2-associated chromatin states with
distinct combinatorial characteristics and genomic distribution. Of
them, TCF7L2 intragenic enhancer is particularly interesting. The Gene
Enrichment and Pathway analyses further revealed that genes with
TCF7L2-associated states are cancer type specific (Fig. [124]3a) and
their intragenic enhancers are more associated with advanced stage
markers in cancer such as Thyroid hormone signaling and Proteoglycan in
cell adhesion (Fig. [125]3c), illustrating their functional relevance
to specific cancer progression and metastasis. Our functional
validations found that such intragenic enhancers exert higher
transcriptional activities than those distal enhancers in TCF7L2
regulated gene expression. Our findings provide a new set of intragenic
enhancers consisting of epigenetic signatures H3K4me1, H3K27ac and
H3K36me3 [[126]47] that may play more important functional role in
diseases progression. This is in contrast to the role of conventional
distal enhancers with typical epigenetic marks of H3K4me1, H3K27ac but
the absence of significant H3K4me3 [[127]48, [128]49] in regulating the
cell type specificity.
One notable advantage of T-cep is that it can identify more TF-states
at a cell-type dependent manner in comparison to ChromHMM due to the
unique underlying algorithmic design. At a broader application, our
T-cep was able to identify five MYC-associated states, states 1, 2, 4,
5 and 6, the similar pattern found in TCF7L2 data sets: TF-promoter
(state 2), TF-intragenic enhancer (state 6), TF-distal enhancer (state
5) and TF-non-combinatorial state (state 1), as well as TF-inactive
enhancer state (state 4) (Additional file [129]1: Figure S10). Three of
these states (state 1, 4, 6) were noticeably missed by ChromHMM
(Additional file [130]1: Figure S11). In a broader aspect, T-cep may be
capable of deeply exploiting transcription factor associated epigenetic
patterns where many other tools are unable to, and may further uncover
their novel functionalities associated with various diseases or cancer
types, providing a rationale to explore the underlying mechanisms.
Conclusions
In summary, our study for the first time at a genome-wide scale reveals
the enhanced transcriptional activity of cell type-specific
TF-associated intragenic enhancers, allowing us more insights into
underlying epigenetic regulatory codes in different cell or tissue
types as well as in normal or disease conditions.
Methods
Training a hidden Markov model (HMM)
In HMMs, TF-state or non-TF-state are considered as hidden states and
all of the possible combinations of chromatin marks are as
observations. T-cep introduces a binary value for each segment that ‘1’
for mark presence or ‘0’ for mark absence based on a Poisson background
distribution(P < 10^−4). Observation symbols are made up by enumerating
each possible combination of mark. For example, a segment with TCF7L2,
H3K4me1 and H3K36me3 can be translated into 0b101000100, while 0 inside
means no correspondent mark in this segment. The HMMs sequence is
assumed as a segment-homogeneous first-order chain.
Let n be the number of distinct states and m be the number of distinct
observation symbols per state in the model. We denote the individual
hidden states as S = {S [1],S [2],…,S [n]} and individual combination
histone mark symbols as O = {O[1],O [2],…,O [m]}. The hidden state
transition probability matrix is A = {a[ij]}, where A [ij] = P(Y
[t+1] = S [j] | Y [t] = S [i]), i ≥ 1, j ≤ n. The emission is the
combination of TF, remodelers, histone marks symbol probability
distribution. We assume B = {b [j](O [t])} is the symbol in hidden
state S[j], where B [j](O [k]) = P(X [t] = O [k] | Y [t] = S [j]),
1 ≤ j ≤ n, 1 ≤ k ≤ m. X [t] is the Observation of symbol and Y [t]
represents the hidden state in bin t. To begin modeling, π is randomly
initialized state probability distribution and there are three
probability measures noted as λ = (A,B,π). Each HMM is trained for 300
iterations using a probability-scaling variant of the canonical
Baum-Welch expectation-maximization algorithm with a minimum
probability of 10^−6 enforced for all transition, emission and initial
probabilities to avoid potential numerical underflow. The best HMM is
selected by the lowest Bayesian Information Criterion (BIC) score.
After Viterbi decoding algorithm, those states with much less bins are
removed from the model. Another canonical Baum-Welch algorithm is
further modified by running multiple instances of the forward-backward
algorithm. Let ξ [t(i)] is the probability of the partial observation
symbols sequence (O [1],O [2],…,O [t]) until bin t and the state S [i]
at bin t, given the model λ. ξ [t(i)] can be written as P(X [1] = O
[1], X [2] = O [2] ,…,X [t] = O [t] , Y [t] = S [i] | λ), the
forward procedure can be calculated recursively by Eq 1, 2.
[MATH: ξ1i=πibiO1 :MATH]
1
[MATH: ξt+1j=bjOt+1∑i=1
nξtiai<
/mi>j :MATH]
2
We assume that the particular state (Y [t] = S [i]) is the
initialization. Let ζ [t(i)] = P(X [t+1] = O [t+1],X [t+2] = O
[t+2],…,X [T] = O [T]|Y [t] = S [i], λ), to be the probability of the
partial observation symbols sequence from t + 1 to end, given the state
S [i] at bin t and the model λ. The backward procedure can be computed
by Eq 3, 4.
[MATH: ζTi=1 :MATH]
3
[MATH: ζti=∑j=1
nbjOt+1ζt+1jai<
/mi>j :MATH]
4
In order to re-estimate the HMMs parameter λ, the HMMs is trained for a
further 100 iteration to produce the final states. We define P [(Si,t)]
as the probability of being in state S [i] at bin t, given the model
and the observation sequence and P [(Sij,t)] is the probability of
being in state S [i] at bin t and state S [j] at bin t + 1, given the
model and the observation symbols sequence. We calculate the P [(Si,t)]
and P [(Sij,t)] with the forward and backward variables ξ [t(i)] and ζ
[t(i)]. The HMM parameter λ can be re-estimated as following Eq 5:
[MATH: π¯=PSitt=1A¯=∑t=1
T−1PSij,t
/∑t=1
T−1PSitB¯=∑t=1
TPSi,t/∑t=1
Xt=O
mtr>TP<
/mi>Si,t
:MATH]
5
The second Viterbi decoding algorithm is used to compute the highest
probability of combinatorial epigenetic marks in correlation with the
hidden states. Each combination of epigenetic mark has one output which
is done to allow more direct interrogation of combinatorial epigenetic
patterns with states. The probabilities of emitting each epigenetic
mark independent are calculated by marginalization among all output
combinations of marks probabilities. The independent emission
probability follows Eq 6:
[MATH: PID=∑x=1
2nPxx&ID>00x&ID=0
:MATH]
6
where ID is the epigenetic mark binary value, x is the output number
and & is the bitwise AND operator.
Implementation and Application
T-cep is implemented in C++, runs on Linux, and depends on the OpenMP
and Boost APIs. T-cep contains a suite of scripts and program
implementing the preprocessing, HMM training and TF-State calling
steps. A univariate hidden Markov model (HMM) is applied to uncode
observed genome-wide epigenetic data into hidden functional chromatin
states. Several other scripts are provided for various utility
functions.
For the concern of scaling limitation, although it is not necessary in
this study (only 512 output combinations), T-cep also provided several
scripts for allowing large numbers of marks in application.
getOutputMap.pl can output the actually combination in all database.
manyOutputMapper.pl used for mapping output combinations in precompiled
datasets to an arbitrary ID system, which only for at least one bin
actually found in output combinations. Due to far fewer mark
combinations are expected to occur in practice, this script effectively
frees T-cep from the scalability for reasonable outputs instead of
considering every combination of marks as a possible output, which
grows according to 2^n. ManyOutputModelBackConvert.pl applied in
mapping the outputs of HMMs trained with this data back to the original
output combinations. This only considers mark combinations actually
found in the data sets an upper bound to model complexity corresponding
to the number of bins in all datasets.
The tool and scripts are available at our website,
[131]http://compbio.uthscsa.edu/T-cep/.
Raw data processing
Public ChIP-seq datasets were downloaded from ENCODE project and only
the 5’-end of uniquely mapped reads are used for the further analysis.
Alignment to human reference genome hg19 was performed with Bowtie.
Default settings were used with an exception that the number of allowed
alignments was restricted to 1 in order to obtain only unique mapped
reads, and seven threads were used. RNA-seq FASTQ files were aligned to
hg19 using Tophat2 and average gene expression of each state was
calculated by first finding the highest expression for any splice
variant of each gene in the RNA-seq results. The midpoint of each bin
is calculated, and compared to the 5’ and 3’ ends of annotated RefSeq
genes. The average gene expression reported as gene expression levels
are weighted by the number of bins close to them. The percentage of
bins within RepeatMasker regions was calculated for all states in all
cell lines. Hg19 RepeatMasker data was downloaded from
[132]http://www.repeatmasker.org/genomes/hg19/RepeatMasker-rm405-db2014
0131/. The raw RepeatMasker regions were extracted and converted to BED
format. The mergeBed program of the BEDTools package was used to merge
adjacent RepeatMasker regions. The merged regions were filtered to
exclude regions less than 500 bp length. Correlation between state bins
and these filtered regions was performed by BEDTools’ intersectBed.
Co-transfection and RT-qPCR
TCF7L2 siRNAs were purchased from Thermo Fisher Scientific Silencer®
Select siRNAs. For transfection of siRNA oligo, MCF7 and PANC1 cells
were seeded by 6 cell plate with Lipofectamine® RNAiMAX Transfection
Reagent for 48 h. Total RNAs from cells were extracted using Quick-RNA™
MiniPrep kit (Zymo Research). Then cDNA was prepared using RevertAid H
Minus First Strand cDNA Synthesis Kit (Thermo Fisher Scientific).
Expression of mRNA analysis was performed with LightCycler® 480 SYBR
Green I Masteron and LightCycler® 480 System Sequence Detection System
(Roche Applied Science) using GAPDH for normalization.
Luciferase reporter assay
Intragenic or distal TFEs were transient transfection into MCF7/PANC1
cells with Lipofectamine 2000 reagent with or without TCF7L2
co-transfection and β-galactosidase expression vector. Cells were
harvested and luciferase activity was determined using the Luciferase
Assay System (Promega) per the manufacturer’s protocol. The
β-galactosidase activity was performed on lysates as a control. The
experiments were performed in triplicate.
Additional files
[133]Additional file 1:^ (4.3MB, docx)
Supplementary Figures and Tables. (DOCX 4445 kb)
[134]Additional file 2:^ (330.2KB, xlsx)
Gene List associated with TF-states. (XLSX 330 kb)
Acknowledgments