Abstract
Epigenetic alteration is a fundamental characteristic of nearly all
human cancers. Tumor cells not only harbor genetic alterations, but
also are regulated by diverse epigenetic modifications. Identification
of epigenetic similarities across different cancer types is beneficial
for the discovery of treatments that can be extended to different
cancers. Nowadays, abundant epigenetic modification profiles have
provided a great opportunity to achieve this goal. Here, we propose a
new approach, TriPCE, which introduces a tri-clustering strategy to
integrative pan-cancer epigenomic analysis. The method identifies coherent
patterns of various epigenetic modifications across different cancer
types. To validate its capability, we applied the proposed TriPCE to
analyze six important epigenetic marks among seven cancer types, and
identified significant cross-cancer epigenetic similarities. These
results suggest that specific epigenetic patterns indeed exist among
these investigated cancers. Furthermore, the gene functional analysis
performed on the associated gene sets demonstrates strong relevance
to cancer development and reveals a consistent risk tendency among
the investigated cancer types.
Keywords: epigenetic analysis, pattern discovery, tri-clustering,
FP-growth algorithm, pan-cancer
Introduction
Cancer genetics and epigenetics are closely linked in driving the
cancer phenotype ([29]Bailey et al., 2018). The vast majority of human
cancers emerge from a gradual accumulation of somatic alterations and
epigenetic abnormalities, which together lead to the malignant growth
([30]Jones et al., 2016). Epigenetic changes can further enable tumor
cells to escape from host immune surveillance and various treatments
([31]You and Jones, 2012). Epigenetic abnormalities are usually
observed as disrupted DNA methylation patterns ([32]Chiappinelli
et al., 2015), abnormal histone post-translational modifications
([33]Sawan and Herceg, 2010), and aberrant changes in chromatin
organization ([34]Allis and Jenuwein, 2016). How to identify epigenetic
modification patterns that lead to the corresponding dysregulation in
diverse cancers has become a critical research issue of cancer studies
([35]Dawson, 2017; [36]Kelly and Issa, 2017).
Great advancements have been made in delineating the underlying
mechanisms of human cancers ([37]Lawrence et al., 2014;
[38]Martincorena and Campbell, 2015). Extensive research has centered
on the genetic aspect of cancers, such as how mutational activation and
inactivation of cancer genes influence the cellular pathways
([39]Vogelstein et al., 2013; [40]Waddell et al., 2015). Recently,
drug discovery efforts have increasingly targeted the cancer epigenome
([41]Flavahan et al., 2017). Several epigenome mapping projects have
been established. The Cancer Genome Atlas Network
(TCGA), BLUEPRINT, and the International Cancer Genome Consortium
(ICGC) define the genome-wide distribution of epigenetic marks in many
normal and cancerous tissues ([42]Beck et al., 2012; [43]Kundaje
et al., 2015; [44]Weinstein et al., 2015). Given the genome-wide
distribution of epigenetic modifications of different cancers, it is
urgent to decipher common epigenetic patterns across cancers and to
understand the underlying mechanisms of tumorigenesis. Key epigenomic
similarities shared by different cancer types would present an
important opportunity to design effective cancer treatment strategies
among cancers regardless of tissue or organ and enable the extension of
effective treatments from one cancer type to another ([45]Karlic
et al., 2010; [46]Gan et al., 2018).
To detect significant epigenetic patterns, existing computational
methods mainly focus on identifying combinatorial states of different
epigenetic marks. Specifically, CoSBI captures diverse histone
modification patterns based on the correlations of different histone
signals ([47]Ucar et al., 2011). ChromHMM and HiHMM both apply an HMM
model to annotate genomic sequences by the co-occurrence of multiple
epigenetic marks ([48]Ernst et al., 2011; [49]Sohn et al., 2015). RFECS
is developed mainly based on random forests ([50]Rajagopal et al.,
2013). IDEAS is able to jointly characterize epigenetic landscapes in
many cell types and detect differential regulatory regions ([51]Zhang
et al., 2016). These methods have successfully identified
combinatorial epigenetic patterns in specific cell types. However, the
relations among different cancer types still need to be investigated.
Because DNA methylation in cancers has been addressed elsewhere
([52]Kretzmer et al., 2015; [53]Yang et al., 2016), here we only focus
on the critical covalent histone modifications that are altered in
various cancers, particularly the well-studied acetylation and
methylation modifications.
In this paper, we propose a tri-clustering approach, named TriPCE, for
integrative pan-cancer epigenomic analysis. TriPCE adopts a
tri-clustering strategy to identify coherent patterns of various
epigenetic modifications across different cancer types. We applied
TriPCE to investigate six critical epigenetic marks among seven cancer
types and identified significant pan-cancer epigenetic modification
patterns. The results reveal a consistent epigenetic modification
tendency among these cancer types. Meanwhile, the gene function
analysis demonstrates that the associated genes are strongly relevant
to cancer-related cellular pathways.
Materials and Methods
Datasets
To detect epigenetic similarities among different cancers, we analyzed
the epigenome maps of seven cancer types, including A549, K562, HepG2,
HCT116, HeLa-S3, multiple myeloma-Cell Line, and sporadic Burkitt
lymphoma-Cell Line. For the epigenetic marks, we first filtered out
those marks that are not included in these seven cancer types, and then
focused on six widely studied ones, including H3K4me1, H3K4me3,
H3K9me3, H3K27ac, H3K27me3, and H3K36me3. The RNA expression
profiles of these cancers were also collected. In total, we obtained 42
epigenome maps and 7 RNA expression profiles for these cancers. The
datasets were downloaded from the website of the NIH Roadmap
Epigenomics Project.
General Scheme of the TriPCE Approach
We developed a tri-clustering approach TriPCE to dissect the pan-cancer
epigenetic pattern. The method not only explicitly detects
combinatorial states of various epigenetic marks in different genomic
segments, but also mines similar epigenetic patterns across different
cancer types. The proposed TriPCE model has three key components, as
shown in [54]Figure 1. First, the modification data of various
epigenetic marks in different cancer types are preprocessed. Second,
bi-Clusters are identified for each epigenetic mark based on the
FP-growth algorithm. Third, tri-Clusters with coherent epigenetic
modification patterns across different cancer types are mined.
Figure 1.
The flowchart of the proposed TriPCE approach. (A) Preprocessing the
epigenetic modification data of different cancer types. (B) For each
epigenetic mark, identifying bi-Clusters based on the FP-growth
algorithm. (C) Mining tri-Clusters with coherent epigenetic
modification patterns across different cancer types.
Step 1. Preprocess the epigenetic modification data of different cancer
types. First, the genome was divided into consecutive genomic
segments, with a typical segment size of 200 bp ([57]Gan et al.,
2017). For each epigenetic modification map, we computed the summary
tag count of every segment. Each segment was then associated with the
intensities of a set of epigenetic modifications in each cancer type.
To reduce the impact of noise resulting from spurious tag counts in
the ChIP-seq experiments, raw sequence read counts of each epigenetic
modification were further normalized by the total number of reads
followed by arcsine transformation ([58]Pinello et al., 2014). Finally,
according to the genome annotation data, the epigenetic distribution in
the promoter regions was extracted.
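The preprocessing described above can be sketched roughly as follows.
This is a minimal illustration, not the authors' implementation. The
text says only "arcsine transformation", so the arcsin(sqrt(x)) form
used here is an assumption, as are the function names, the toy read
positions, and the promoter coordinates.

```python
import math

SEGMENT_SIZE = 200  # bp, the segment size stated in the text


def bin_reads(read_starts, region_start, region_end, segment_size=SEGMENT_SIZE):
    """Count reads per consecutive segment in [region_start, region_end)."""
    n_bins = math.ceil((region_end - region_start) / segment_size)
    counts = [0] * n_bins
    for pos in read_starts:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // segment_size] += 1
    return counts


def normalize(counts, total_reads):
    """Scale by library size, then apply arcsin(sqrt(x)).

    The exact form of the arcsine transformation is an assumption here.
    """
    return [math.asin(math.sqrt(c / total_reads)) for c in counts]


# Toy data: read start positions around a hypothetical promoter at 1000-1600.
reads = [1005, 1010, 1199, 1200, 1450, 1599, 2000]
counts = bin_reads(reads, 1000, 1600)
profile = normalize(counts, total_reads=len(reads))
```

Each promoter then carries one such profile vector per epigenetic mark
and cancer type, which is the quantity compared in the later steps.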
After the preprocessing step, we obtained six epigenetic profiles of
seven cancer types along the promoter regions. Let
$G = \{g_1, g_2, \ldots, g_n\}$ be a set of n genes, let
$T = \{t_1, t_2, \ldots, t_7\}$ be the seven investigated cancer types,
and let $E = \{e_1, e_2, \ldots, e_6\}$ be the six epigenetic marks. For
each epigenetic mark, the epigenetic profiles of different cancer types
in the promoter regions of these genes are organized as a matrix
$$D^k = T \times G = \{t_{i,j}^k\}$$
(with $i \in \{1, 2, \ldots, 7\}$, $j \in \{1, 2, \ldots, n\}$,
$k \in \{1, 2, \ldots, 6\}$), where rows correspond to the cancer types
and columns correspond to the genes. Each entry $t_{i,j}^k$ is a vector
representing the epigenetic profile of $e_k$ in the ith cancer type
along the promoter region of gene j.
Step 2. Identify bi-clusters based on FP-growth algorithm for each
epigenetic mark. Given the preprocessed and reorganized epigenetic
modification data matrix of each epigenetic mark, we first computed the
Pearson correlation coefficients between the epigenetic profiles of any
two cancer types at every promoter region, and then obtained a
correlation coefficient matrix.
Specifically, for the promoter region $g_j$, we computed the Pearson
correlation coefficient between the epigenetic modification
distribution vectors of every pair of cancer types. If the calculated
correlation coefficient is higher than a given threshold, the
epigenetic modification trend in these two cancer types is regarded as
coherent in this promoter region. We then added these cancer types to
the corresponding itemset, which contains all the cancer types
exhibiting similar epigenetic patterns in this region. Based on
extensive experimental comparison, the identified epigenetic patterns
are clearly coherent when the correlation coefficient threshold is set
to 0.7. For each epigenetic mark, we constructed the corresponding
itemsets of similar cancer types for all promoter regions.
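The itemset construction for a single promoter and a single mark might
look like the sketch below. The text does not spell out how pairwise
correlations are merged into one itemset. Here a cancer type joins the
itemset if its profile correlates above the threshold with at least one
other type, which is an assumption. The helper names and the toy
profiles are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # flat profile: treated as uncorrelated (assumption)
    return cov / (sx * sy)


def coherent_itemset(profiles, threshold=0.7):
    """profiles: {cancer_type: profile vector} for one mark at one promoter."""
    items = set()
    for a, b in combinations(sorted(profiles), 2):
        if pearson(profiles[a], profiles[b]) > threshold:
            items.update((a, b))
    return frozenset(items)


# Toy profiles: A549 and K562 rise together, HepG2 falls.
promoter = {
    "A549": [1.0, 2.0, 3.0, 4.0],
    "K562": [2.0, 4.1, 6.0, 8.2],
    "HepG2": [4.0, 3.0, 2.0, 1.0],
}
itemset = coherent_itemset(promoter)
```

In this toy case only A549 and K562 end up in the itemset, because the
HepG2 profile is anti-correlated with both.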
Based on the resulting itemsets, we further identified the significant
coherent epigenetic patterns using FP-growth algorithm ([59]Han et al.,
2004). FP-growth algorithm is a data mining method that was originally
developed for frequent itemset mining in market basket analysis. Here,
we adopted the FP-tree model to represent in a compact way all the
cancer types with similar epigenetic patterns in different promoter
regions. Then, it can be used to mine potential frequent itemsets and
filter out most of the unrelated data. In this context, a typical
frequent itemset represents a group of cancer types that share similar
epigenetic patterns in many promoter regions. To retain only the
significant epigenetic states, we set the minimum support to
10% of the investigated genes. For each frequent itemset, we then
traced back the corresponding gene set to obtain the
bi-Cluster. Each resulting bi-Cluster takes the form (“genomic regions,”
“cancer types”), indicating that these cancer types exhibit similar
epigenetic patterns in these genes. In the same way, we obtained the
corresponding bi-Cluster sets for all investigated epigenetic marks.
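The mining step can be illustrated with a brute-force stand-in. The
sketch below does not build an FP-tree. It simply enumerates every
subset of cancer types, which is feasible for seven types (127 subsets)
and should return the same frequent itemsets that FP-growth would
find. The function name, the `min_types` parameter, and the toy
itemsets are hypothetical.

```python
from itertools import combinations


def mine_biclusters(itemsets, min_support=0.10, min_types=2):
    """Brute-force frequent itemset mining (a stand-in for FP-growth).

    itemsets: {gene: set of cancer types coherent at that promoter}.
    Returns {frozenset of cancer types: set of supporting genes}.
    """
    n_genes = len(itemsets)
    all_types = sorted(set().union(*itemsets.values()))
    biclusters = {}
    for size in range(min_types, len(all_types) + 1):
        for combo in combinations(all_types, size):
            group = frozenset(combo)
            # Trace back the genes whose itemset contains the whole group.
            genes = {g for g, items in itemsets.items() if group <= items}
            if len(genes) >= min_support * n_genes:
                biclusters[group] = genes
    return biclusters


# Toy itemsets for four genes; a 50% support is used so the filter is visible.
itemsets = {
    "g1": {"A549", "K562", "HepG2"},
    "g2": {"A549", "K562"},
    "g3": {"HepG2", "HCT116"},
    "g4": set(),
}
biclusters = mine_biclusters(itemsets, min_support=0.5)
```

With thousands of promoters the enumeration over subsets stays cheap,
but the per-subset scan over genes is what an FP-tree avoids.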
Step 3. Mine tri-Clusters with coherent epigenetic modification
patterns across different cancer types. After obtaining the bi-Cluster
sets for each epigenetic mark, we further mined the tri-Clusters by
enumerating the maximal subsets of epigenetic marks. In detail, we
computed the intersection of the bi-Cluster sets from two epigenetic
marks $e_k$ and $e_l$ and recorded each intersection together with the
two marks as a candidate tri-Cluster. Candidates with support lower
than the predefined minimum support were filtered out, leaving the
significant tri-Clusters. We then repeated the process with another
epigenetic mark until all the epigenetic marks had been analyzed. We
tried all such paths and kept only the maximal tri-Clusters. Each
tri-Cluster is represented as (“genomic regions,” “cancer types,”
“epigenetic marks”), listing a gene set with a similar trend of
epigenetic modifications in different cancer types. The resulting
tri-Clusters indicate that the conserved epigenetic signatures in these
genomic regions are shared by multiple cancer types.
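One way to read the tri-Cluster step is as a level-wise intersection of
bi-Clusters, sketched below. This is an interpretation, not the
authors' code. Intersecting the cancer-type sets as well as the gene
sets, requiring at least two cancer types, and defining "maximal" as
"not contained in another tri-Cluster on all three axes" are all
assumptions. The function name and toy inputs are hypothetical.

```python
def mine_triclusters(biclusters_by_mark, n_genes, min_support=0.10, min_types=2):
    """biclusters_by_mark: {mark: {frozenset(cancer types): set(genes)}}.

    Returns a list of (genes, cancer_types, marks) tuples of frozensets.
    """
    min_genes = min_support * n_genes
    # Level 1: every bi-Cluster is a candidate carrying a single mark.
    level = {
        (frozenset(genes), types, frozenset([mark]))
        for mark, bcs in biclusters_by_mark.items()
        for types, genes in bcs.items()
    }
    found = set()
    while level:
        extended = set()
        for genes, types, marks in level:
            for mark, bcs in biclusters_by_mark.items():
                if mark in marks:
                    continue
                for types2, genes2 in bcs.items():
                    g, t = genes & genes2, types & types2
                    if len(g) >= min_genes and len(t) >= min_types:
                        extended.add((g, t, marks | {mark}))
        found |= extended
        level = extended  # each pass adds one mark, so the loop terminates

    def covers(a, b):
        # a covers b if it is a different cluster and a superset on every axis.
        return a != b and all(x >= y for x, y in zip(a, b))

    return [c for c in found if not any(covers(o, c) for o in found)]


# Toy bi-Clusters for three marks over five genes (40% support = 2 genes).
biclusters_by_mark = {
    "H3K4me3": {frozenset({"A549", "K562", "HepG2"}): {"g1", "g2", "g3"}},
    "H3K27me3": {frozenset({"A549", "K562"}): {"g1", "g2", "g4"}},
    "H3K36me3": {frozenset({"A549", "K562", "HepG2"}): {"g2", "g5"}},
}
triclusters = mine_triclusters(biclusters_by_mark, n_genes=5, min_support=0.4)
```

The number of candidates can grow quickly with the number of marks and
bi-Clusters, so this exhaustive form is only practical for small inputs.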
Functional Analysis of the Genes
From the identified tri-Clusters, we can obtain the gene sets
associated with specific coherent epigenetic patterns. To investigate
the potential functions of these genes, we performed the gene ontology
(GO) enrichment analysis and pathway enrichment analysis via DAVID
bioinformatics resources ([60]Huang et al., 2007). The significant
enrichment lists were obtained with P-value < 0.005.
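For intuition about what an enrichment P-value measures, the sketch
below computes a plain hypergeometric upper-tail probability. DAVID
itself reports an EASE score, a modified Fisher exact test, so this is
a simplified stand-in rather than the DAVID calculation. The function
name and the gene counts are invented for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy numbers: 20 background genes, 5 annotated with the term,
# a tri-Cluster gene set of 4 genes, 3 of which carry the term.
p = hypergeom_pvalue(k=3, n=4, K=5, N=20)
significant = p < 0.005  # the cutoff used in the text
```

A term is reported as enriched only when this kind of probability falls
below the chosen cutoff, here 0.005.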
Results
Identifying Similar Epigenetic Patterns Across Different Cancer Types
We developed a tri-clustering approach, TriPCE, to capture similar
epigenetic patterns among different cancer types. TriPCE was applied to
the genome-wide epigenetic modification maps of seven cancer types,
including A549, K562, HepG2, HCT116, HeLa-S3, multiple myeloma-Cell
Line, and sporadic Burkitt lymphoma-Cell Line. For each epigenetic
mark, TriPCE first groups the promoter regions based on the epigenetic
modification profiles among different cancer types. [61]Figure 2 shows
a typical bi-Cluster of epigenetic mark H3K4me1, which contains
many genes with a similar modification pattern in four cancer types:
HeLa-S3, HepG2, K562, and A549. The figure shows that the epigenetic
profiles of these genes are similar in these cancer types. The
epigenetic profile shared by a cluster of promoter regions in multiple
cancer types is considered to be an epigenetic pattern. We also find
that different cancer types share similar epigenetic patterns. This
result is consistent with the previous finding that H3K9me3/me2 and
H3K36me3/me2 are frequently observed in breast cancer
([62]Liu et al., 2009), esophageal cancer ([63]Yang et al., 2000), MALT
lymphoma ([64]Vinatzer et al., 2008), and lung sarcomatoid carcinoma
([65]Italiano et al., 2006). Based on the identified bi-Clusters of
the investigated epigenetic marks, we noted that two cancers, HepG2 and
HCT116, are clustered together and share a larger number of epigenetic
marks, implying that they have more similar epigenetic regulation
mechanisms.
Figure 2.
The profiles of epigenetic mark H3K4me3 in a typical bi-Cluster exhibit
a similar pattern in four cancer types: HeLa-S3, HepG2, K562,
and A549.
To identify the significant modification patterns, we set the minimum
support to 10% of the investigated genes. With different
correlation coefficient thresholds, we obtained different
numbers of bi-Clusters for epigenetic marks H3K4me1, H3K4me3, H3K9me3,
H3K27me3, H3K36me3, and H3K27ac among these cancer types, as shown in
[68]Figure 3. The comparison indicates that the degree of cross-cancer
similarity differs considerably among these epigenetic marks. Under
all threshold settings, the epigenetic mark H3K4me3 has a relatively
small number of bi-Clusters, indicating that its profiles are less
conserved and exhibit more variable patterns among these cancer types
than those of the other epigenetic marks. In contrast, H3K4me1 and
H3K27me3 show more similar epigenetic patterns among different cancer
types ([69]Baylin and Jones, 2016). The plasticity of the epigenome
depends on diverse environmental factors. Thus, it is not surprising
that epigenotypes contribute to human developmental disorders and adult
diseases ([70]Brien et al., 2016). As the threshold only slightly
affects the trend among different epigenetic marks, we chose the
bi-Clusters obtained with a correlation threshold of 0.7 for further
analysis.
Figure 3.
The numbers of bi-Clusters with varied similarity thresholds for
different epigenetic marks.
Identifying Coherent Patterns Among Different Epigenetic Marks
From the above results, we notice that there are obvious differences
among the investigated epigenetic modifications. To identify the
conserved epigenetic states and explore the similar patterns of these
epigenetic modifications, we further clustered these epigenetic marks
based on the detected bi-Clusters. By systematically computing the
intersection of the bi-Cluster sets from different epigenetic marks, we
kept the tri-Clusters with the support higher than the predefined
minimum support. The identified tri-Clusters are represented as triples
(“genomic regions,” “cancer types,” “epigenetic marks”). Each
tri-Cluster indicates that the promoter regions of its genes exhibit
similar epigenetic modification patterns in the related cancer types.
Applying TriPCE to the data set, we initially obtained 175 significant
tri-Clusters. [72]Figure 4 shows the information of 15 typical
clusters, including the epigenetic marks, the cancer types, and the
supports of these tri-Clusters. The results indicate that specific
genomic regions indeed share combinatorial epigenetic patterns across
different cancer types. For example, the changing pattern of epigenetic
modifications (H3K4me3, H3K9me3, H3K27me3, and H3K36me3) is shared by
a large number of genes in cancer types A549, HepG2, and K562. In
contrast, some epigenetic modification patterns are coherent only in
certain cancer types. Among the resulting clusters, we observe that
the similar patterns of H3K36me3, H3K27ac, and H3K27me3 exist in fewer
cancer types, such as HepG2 and sporadic Burkitt lymphoma-Cell Line.
Notably, these identified tri-Clusters reveal more information about
the epigenetic patterns among these cancer types.
Figure 4.
Typical epigenetic tri-Clusters. (A) The epigenetic marks (column) in
each cluster (row). (B) The cancer types (column) in each cluster
(row). Fold enrichment was calculated as the ratio of the number
of genes in the tri-Cluster to the number of all genes.
Analyzing the Potential Roles of Associated Genes
Based on the detected tri-Clusters, we further obtained those gene sets
that exhibit coherent epigenetic patterns in different cancer types.
Previous studies have shown that the modification intensities are
significantly distinct between high-expression gene promoters and
low-expression gene promoters, which suggests that these chromatin
components have a significant effect on gene regulation ([75]Su et al.,
2012). To investigate the potential functions of those genes in the
cellular control pathways, we performed a systematic GO enrichment
analysis using DAVID tools ([76]https://david.ncifcrf.gov/). Then, for
the associated gene sets in the identified tri-Clusters, we
respectively summarized the key biological processes and pathways that
they are involved in.
Overall, we found that the genes in the tri-Clusters are enriched for
cancer-related functions. [77]Table 1 lists the
significant GO terms of a typical tri-Cluster (P-value < 0.005). In
this tri-Cluster, the genes exhibit coherent modification patterns on
epigenetic marks (H3K4me1, H3K4me3, H3K9me3, H3K27ac, and H3K27me3) in
cancer types (HeLa-S3, HepG2, multiple myeloma-Cell Line, and sporadic
Burkitt lymphoma-Cell Line). In the table, the terms “positive
regulation of cell proliferation” and “negative regulation of apoptotic
process” are enriched in this gene set. This result implies that the
identified genes in this tri-Cluster are essential for cell
proliferation and the apoptotic process, both of which have been
reported to be related to cancer development in previous studies
([78]Deng et al., 2016). Meanwhile, the term “positive regulation of
gene expression, epigenetic” is also enriched in the gene set, further
indicating that these genes might play important regulatory roles in
these cancers.
Table 1.
Functional enrichment of genes in the identified tri-Clusters.
Term type  Term name                                                    P-value
BP         Positive regulation of cell proliferation                    2.84E-06
BP         Translational initiation                                     1.18E-05
BP         mRNA processing                                              2.72E-05
BP         Cell division                                                4.08E-05
BP         rRNA processing                                              2.70E-04
BP         RNA splicing                                                 4.04E-04
BP         Positive regulation of gene expression, epigenetic           9.41E-04
BP         Protein targeting to Golgi                                   8.87E-05
BP         Nitrobenzene metabolic process                               1.14E-04
BP         Xenobiotic catabolic process                                 1.13E-03
BP         mRNA splicing, via spliceosome                               1.14E-03
BP         Sister chromatid cohesion                                    2.13E-03
BP         SRP-dependent cotranslational protein targeting to membrane  1.06E-03
BP         Negative regulation of transcription, DNA-templated          1.55E-03
BP         Negative regulation of apoptotic process                     1.88E-03
BP         Nucleosome assembly                                          3.86E-03
BP         Glutathione derivative biosynthetic process                  4.18E-03
MF         Protein binding                                              1.10E-12
MF         Poly(A) RNA binding                                          3.90E-10
MF         RNA binding                                                  2.13E-05
MF         Glutathione binding                                          7.85E-04
MF         Enzyme regulator activity                                    4.02E-03
MF         Nucleosomal DNA binding                                      4.25E-03
MF         Translation initiation factor activity                       4.30E-03
MF         Glutathione transferase activity                             8.00E-03
MF         Protein binding, bridging                                    4.33E-03
MF         ATP binding                                                  4.57E-03
CC         Nucleoplasm                                                  6.18E-13
CC         Cytosol                                                      3.96E-07
CC         Membrane                                                     7.68E-06
CC         Nucleus                                                      2.34E-04
CC         Cytoplasm                                                    2.69E-04
KEGG       Glutathione metabolism                                       1.09E-03
KEGG       Systemic lupus erythematosus                                 1.93E-03
Discussion
Identifying epigenetic patterns is important to understand epigenetic
mechanisms in various cancers. The detected patterns among different
cancers could demonstrate critical cross-cancer similarities, which
reveal some consistent clinical risks among different cancer types and
further suggest strong clinical relevance. Our knowledge about the
patterns of epigenetic modifications and their causes and consequences
is still limited. Computational approaches that exploit the complex
epigenomic landscapes and discover significant signatures in them
are required. Previous computational methods for analyzing epigenomes
primarily focus on the combinatorial states of different epigenetic
marks in a specific cell type. In contrast, we developed a
tri-clustering approach, TriPCE, for integrative pan-cancer epigenomic
analysis. Based on the FP-tree structure, TriPCE can compactly
represent all similar cancer types in the promoter regions for a
specific epigenetic mark. Using the constructed FP-tree, the frequent
patterns are then detected to yield the set of bi-Clusters of this
epigenetic mark, indicating the similar epigenetic pattern in these
cancer types along these genomic regions. TriPCE further mines the
final tri-Clusters based on the bi-Clusters of all investigated
epigenetic marks, explicitly detecting combinatorial epigenetic states
in different genomic segments and similar epigenetic changes across
different cancer types. In the proposed approach, the
tri-Cluster enumeration is an expensive operation. In the future, we
plan to develop heuristic techniques to prune the search space
efficiently and thereby speed up the mining of tri-Clusters. We
applied TriPCE to uncover the similar patterns of six epigenetic marks
among seven cancer types and successfully identified significant
cross-cancer epigenetic modification similarities, which suggests that
there is a consistent epigenetic modification tendency among the
investigated cancer types. Furthermore, the gene functional analysis
demonstrates that the associated genes are strongly relevant to
cancer-related cellular pathways.
Data Availability Statement
All datasets generated for this study are included in the
article/supplementary material.
Author Contributions
YG is responsible for the main idea, as well as the completion of the
manuscript. NL and YX have developed the algorithm and performed data
analysis. GZ has coordinated data preprocessing and supervised the
effort. All authors have read and approved the final manuscript.
Funding
This work and the publication costs were supported in part by the
National Natural Science Foundation of China (61772128, 61772367),
National Key Research and Development Program of China
(2016YFC0901704), Shanghai Natural Science Foundation
(17ZR1400200, 18ZR1414400), and the Fundamental Research Funds for the
Central Universities (2232016A3-05).
Conflict of Interest
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Acknowledgments