Abstract

   Epigenetic alteration is a fundamental characteristic of nearly all
   human cancers. Tumor cells not only harbor genetic alterations, but
   also are regulated by diverse epigenetic modifications. Identification
   of epigenetic similarities across different cancer types is beneficial
   for the discovery of treatments that can be extended to different
   cancers. Abundant epigenetic modification profiles are now available
   and provide a great opportunity to achieve this goal. Here, we propose
   a new approach, TriPCE, which introduces a tri-clustering strategy to
   integrative pan-cancer epigenomic analysis. The method is able to
   identify coherent patterns of various epigenetic modifications across
   different cancer types. To validate its capability, we applied TriPCE
   to six important epigenetic marks in seven cancer types and identified
   significant cross-cancer epigenetic similarities. These results
   suggest that specific epigenetic patterns indeed exist among the
   investigated cancers. Furthermore, functional analysis of the
   associated gene sets demonstrates strong relevance to cancer
   development and reveals a consistent risk tendency among the
   investigated cancer types.

   Keywords: epigenetic analysis, pattern discovery, tri-clustering,
   FP-growth algorithm, pan-cancer

Introduction

   Cancer genetics and epigenetics are closely linked in driving the
   cancer phenotype ([29]Bailey et al., 2018). The vast majority of human
   cancers emerge from a gradual accumulation of somatic alterations and
   epigenetic abnormalities, which together lead to the malignant growth
   ([30]Jones et al., 2016). Epigenetic changes can further enable tumor
   cells to escape from host immune surveillance and various treatments
   ([31]You and Jones, 2012). Epigenetic abnormalities are usually
   observed as disrupted DNA methylation patterns ([32]Chiappinelli
   et al., 2015), abnormal histone post-translational modifications
   ([33]Sawan and Herceg, 2010), and aberrant changes in chromatin
   organization ([34]Allis and Jenuwein, 2016). Identifying the epigenetic
   modification patterns that lead to the corresponding dysregulation in
   diverse cancers has become a critical issue in cancer research
   ([35]Dawson, 2017; [36]Kelly and Issa, 2017).

   Great advancements have been made in delineating the underlying
   mechanisms of human cancers ([37]Lawrence et al., 2014;
   [38]Martincorena and Campbell, 2015). Extensive research has centered
   on the genetic aspect of cancers, such as how mutational activation and
   inactivation of cancer genes influence the cellular pathways
   ([39]Vogelstein et al., 2013; [40]Waddell et al., 2015). Recently,
   drug discovery efforts have increasingly targeted the cancer epigenome
   ([41]Flavahan et al., 2017), and several epigenome mapping projects
   have been established. The Cancer Genome Atlas Network (TCGA),
   BLUEPRINT, and the International Cancer Genome Consortium (ICGC)
   define the genome-wide distribution of epigenetic marks in many
   normal and cancerous tissues ([42]Beck et al., 2012; [43]Kundaje
   et al., 2015; [44]Weinstein et al., 2015). Given the genome-wide
   distribution of epigenetic modifications in different cancers, there
   is a pressing need to decipher common epigenetic patterns across
   cancers and to understand the underlying mechanisms of tumorigenesis.
   Key epigenomic similarities shared by different cancer types would
   present an important opportunity to design treatment strategies that
   are effective across cancers regardless of tissue or organ, and would
   enable the extension of effective treatments from one cancer type to
   another ([45]Karlic et al., 2010; [46]Gan et al., 2018).

   To detect significant epigenetic patterns, existing computational
   methods mainly focus on identifying combinatorial states of different
   epigenetic marks. Specifically, CoSBI captures diverse histone
   modification patterns based on the correlations of different histone
   signals ([47]Ucar et al., 2011). ChromHMM and HiHMM both apply an HMM
   model to annotate genomic sequences by the co-occurrence of multiple
   epigenetic marks ([48]Ernst et al., 2011; [49]Sohn et al., 2015). RFECS
   is developed mainly based on random forests ([50]Rajagopal et al.,
   2013). IDEAS is able to jointly characterize epigenetic landscapes in
   many cell types and detect differential regulatory regions ([51]Zhang
   et al., 2016). These methods have successfully identified
   combinatorial epigenetic patterns in specific cell types. However, the
   relations among different cancer types still need to be investigated.
   Because DNA methylation in cancers has been addressed elsewhere
   ([52]Kretzmer et al., 2015; [53]Yang et al., 2016), here we only focus
   on the critical covalent histone modifications that are altered in
   various cancers, particularly the well-studied acetylation and
   methylation modifications.

   In this paper, we propose a tri-clustering approach, named TriPCE, for
   integrative pan-cancer epigenomic analysis. TriPCE adopts a
   tri-clustering strategy to identify the coherent patterns of various
   epigenetic modifications across different cancer types. We applied
   TriPCE to investigate six critical epigenetic marks in seven cancer
   types and identified significant pan-cancer epigenetic modification
   patterns. The results reveal a consistent epigenetic modification
   tendency among these cancer types. In addition, the gene function
   analysis demonstrates that the associated genes are strongly relevant
   to cancer-related cellular pathways.

Materials and Methods

Datasets

   To detect epigenetic similarities among different cancers, we analyzed
   the epigenome maps of seven cancer types: A549, K562, HepG2, HCT116,
   HeLa-S3, multiple myeloma-Cell Line, and sporadic Burkitt
   lymphoma-Cell Line. For the epigenetic marks, we first filtered out
   those marks that are not available for all seven cancer types, and
   then focused on six widely studied ones: H3K4me1, H3K4me3, H3K9me3,
   H3K27ac, H3K27me3, and H3K36me3. The RNA expression profiles of these
   cancers were also collected. In total, we obtained 42 epigenome maps
   and 7 RNA expression profiles for these cancers. The datasets were
   downloaded from the website of NIH Roadmap Epigenome Project.

General Scheme of the TriPCE Approach

   We developed a tri-clustering approach, TriPCE, to dissect pan-cancer
   epigenetic patterns. The method not only explicitly detects
   combinatorial states of various epigenetic marks in different genomic
   segments, but also mines similar epigenetic patterns across different
   cancer types. The proposed TriPCE model has three key components, as
   shown in [54]Figure 1. First, the modification data of the various
   epigenetic marks in the different cancer types are preprocessed.
   Second, bi-Clusters are identified for each epigenetic mark based on
   the FP-growth algorithm. Third, tri-Clusters with coherent epigenetic
   modification patterns across different cancer types are mined.

Figure 1.

   The flowchart of the proposed TriPCE approach. (A) Preprocessing the
   epigenetic modification data of different cancer types. (B) For each
   epigenetic mark, identifying bi-Clusters based on the FP-growth
   algorithm. (C) Mining tri-Clusters with coherent epigenetic
   modification patterns across different cancer types.

   Step 1. Preprocess the epigenetic modification data of different cancer
   types. First, the genome was divided into consecutive genomic
   segments, with a typical segment size of 200 bp ([57]Gan et al.,
   2017). For each epigenetic modification map, we computed the summary
   tag count of every segment, so that each segment is associated with the
   intensities of a set of epigenetic modifications in each cancer type.
   To reduce the impact of noise resulting from spurious tag counts in
   the ChIP-seq experiments, the raw sequence read counts of each
   epigenetic modification were normalized by the total number of reads
   and then arcsine transformed ([58]Pinello et al., 2014). Finally,
   according to the genome annotation data, the epigenetic distribution in
   the promoter regions was extracted.
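
   The binning and normalization in Step 1 can be sketched as follows.
   This is a minimal illustration rather than the authors' pipeline: the
   function names are hypothetical, and the arcsine square-root form of
   the transformation is an assumption, since the text only states that
   an arcsine transformation follows normalization by the total number
   of reads.

```python
import math

SEGMENT_SIZE = 200  # bp, the segment size stated in the text


def bin_tag_counts(tag_positions, region_start, region_end, segment_size=SEGMENT_SIZE):
    """Count tags in consecutive segments covering [region_start, region_end)."""
    n_segments = math.ceil((region_end - region_start) / segment_size)
    counts = [0] * n_segments
    for pos in tag_positions:
        if region_start <= pos < region_end:
            counts[(pos - region_start) // segment_size] += 1
    return counts


def normalize_counts(counts, total_reads):
    """Scale counts by library size, then apply arcsin(sqrt(p)) (assumed form)."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return [math.asin(math.sqrt(min(c / total_reads, 1.0))) for c in counts]


# Toy promoter region of 1,000 bp with five mapped tags.
counts = bin_tag_counts([100, 150, 250, 999, 1000], 0, 1000)
profile = normalize_counts(counts, total_reads=100)
```

   The resulting `profile` is one vector per promoter, per mark, and per
   cancer type, which is the unit that the later steps compare.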

   After the preprocessing step, we gained six epigenetic profiles of
   seven cancer types along the promoter regions. Let
   $G = \{g_1, g_2, \ldots, g_n\}$ be a set of $n$ genes, let
   $T = \{t_1, t_2, \ldots, t_7\}$ be the seven investigated cancer
   types, and let $E = \{e_1, e_2, \ldots, e_6\}$ be the six epigenetic
   marks. For each epigenetic mark, the epigenetic profiles of different
   cancer types in the promoter regions of these genes are organized as a
   matrix

   $$D_k = T \times G = \{t_{i,j}^{k}\}$$

   (with $i \in \{1, 2, \ldots, 7\}$, $j \in \{1, 2, \ldots, n\}$,
   $k \in \{1, 2, \ldots, 6\}$), where rows correspond to the cancer
   types and columns correspond to the genes. Each entry $t_{i,j}^{k}$ is
   a vector representing the epigenetic profile of $e_k$ in the $i$th
   cancer type along the promoter region of gene $j$.
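
   As an illustration of this layout, the sketch below arranges
   per-promoter profile vectors into one matrix per epigenetic mark,
   with rows indexed by cancer type and columns by gene. The function
   name and the dictionary-based input format are assumptions made for
   the example.

```python
def build_mark_matrices(profiles, marks, cancer_types, genes):
    """Return {mark: matrix}, where matrix[i][j] is the profile vector of
    that mark in cancer type i along the promoter of gene j.

    `profiles` maps (mark, cancer_type, gene) to a list of floats.
    """
    matrices = {}
    for mark in marks:
        matrices[mark] = [
            [list(profiles[(mark, cancer, gene)]) for gene in genes]
            for cancer in cancer_types
        ]
    return matrices


# Toy input: one mark, two cancer types, two genes, three segments each.
toy_profiles = {
    ("H3K4me3", "A549", "geneA"): [0.1, 0.4, 0.2],
    ("H3K4me3", "A549", "geneB"): [0.0, 0.1, 0.0],
    ("H3K4me3", "K562", "geneA"): [0.2, 0.5, 0.3],
    ("H3K4me3", "K562", "geneB"): [0.3, 0.1, 0.0],
}
D = build_mark_matrices(toy_profiles, ["H3K4me3"], ["A549", "K562"], ["geneA", "geneB"])
```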

   Step 2. Identify bi-Clusters based on the FP-growth algorithm for each
   epigenetic mark. Given the preprocessed and reorganized epigenetic
   modification data matrix of each epigenetic mark, we first computed the
   Pearson correlation coefficients between the epigenetic profiles of any
   two cancer types at every promoter region, and then obtained a
   correlation coefficient matrix.

   Specifically, for the promoter region $g_j$, we computed the Pearson
   correlation coefficient between the epigenetic modification
   distribution vectors of every pair of cancer types. If the calculated
   correlation coefficient is higher than a given threshold, the
   epigenetic modification trend in these two cancer types is regarded as
   coherent in this promoter region. The coherent cancer types were then
   added to the itemset of that promoter region, which contains all the
   cancer types exhibiting similar epigenetic patterns in the region.
   Based on extensive experimental comparison, the identified epigenetic
   patterns are clearly coherent when the correlation coefficient
   threshold is set to 0.7. For each epigenetic mark, we constructed the
   corresponding itemsets for all promoter regions.
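
   The itemset construction for a single promoter region can be sketched
   as follows. The text does not specify how pairwise coherence is
   combined into one itemset; this sketch assumes that a cancer type
   enters the itemset if it takes part in at least one pair whose
   correlation exceeds the threshold. The function names, and the choice
   to treat a constant profile as uncorrelated, are likewise assumptions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant profile counts as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def coherent_itemset(profiles_by_cancer, threshold=0.7):
    """Cancer types involved in at least one pair with r > threshold."""
    items = set()
    for a, b in combinations(sorted(profiles_by_cancer), 2):
        if pearson(profiles_by_cancer[a], profiles_by_cancer[b]) > threshold:
            items.update((a, b))
    return frozenset(items)


# Toy promoter: A549 and K562 rise together, HepG2 falls.
toy = {"A549": [1, 2, 3, 4], "K562": [2, 4, 6, 8], "HepG2": [4, 3, 2, 1]}
itemset = coherent_itemset(toy)
```

   Repeating this for every promoter region of one epigenetic mark yields
   the collection of itemsets that serves as input to frequent itemset
   mining.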

   Based on the resulting itemsets, we further identified the significant
   coherent epigenetic patterns using the FP-growth algorithm ([59]Han
   et al., 2004). FP-growth is a data mining method that was originally
   developed for frequent itemset mining in market basket analysis. Here,
   we adopted the FP-tree model to compactly represent all the cancer
   types with similar epigenetic patterns in different promoter regions.
   The tree can then be used to mine potential frequent itemsets and to
   filter out most of the unrelated data. In this context, a typical
   frequent itemset represents a group of cancer types that share similar
   epigenetic patterns in a large number of promoter regions. To obtain
   the significant epigenetic states, we set the minimum support to 10%
   of the investigated genes. For each frequent itemset, we then
   identified the corresponding gene set and thus obtained the
   bi-Cluster. The resulting bi-Cluster takes the form (“genomic
   regions,” “cancer types”), indicating that the cancer types exhibit
   similar epigenetic patterns in these genes. Similarly, we obtained the
   corresponding bi-Cluster sets for all investigated epigenetic marks.
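
   The mapping from itemsets to bi-Clusters can be illustrated with a
   short sketch. Note that it does not build an FP-tree: because only
   seven cancer types are involved, it enumerates candidate subsets by
   brute force. The frequent itemsets depend only on the itemsets and
   the minimum support, so this substitute finds the same ones that
   FP-growth would, without the compact tree representation. The
   function name and the minimum itemset size of two are assumptions.

```python
from itertools import combinations


def mine_biclusters(itemsets_by_gene, min_support=0.10, min_size=2):
    """Return {frozenset(cancer types): set(genes)} for each frequent itemset.

    `itemsets_by_gene` maps a gene to the set of cancer types that are
    coherent at its promoter. Brute-force stand-in for FP-growth.
    """
    min_count = min_support * len(itemsets_by_gene)
    items = sorted(set().union(*itemsets_by_gene.values()))
    biclusters = {}
    for size in range(min_size, len(items) + 1):
        for candidate in combinations(items, size):
            cancers = frozenset(candidate)
            # Genes whose itemset contains every cancer type of the candidate.
            genes = {g for g, s in itemsets_by_gene.items() if cancers <= s}
            if genes and len(genes) >= min_count:
                biclusters[cancers] = genes
    return biclusters


toy_itemsets = {
    "g1": frozenset({"A549", "K562", "HepG2"}),
    "g2": frozenset({"A549", "K562"}),
    "g3": frozenset({"HepG2", "HCT116"}),
    "g4": frozenset(),
}
toy_biclusters = mine_biclusters(toy_itemsets, min_support=0.5)
```

   With a 50% minimum support on four toy genes, only the pair A549 and
   K562 is frequent, supported by genes g1 and g2.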

   Step 3. Mine tri-Clusters with coherent epigenetic modification
   patterns across different cancer types. After obtaining the bi-Cluster
   sets for each epigenetic mark, we mined the tri-Clusters by
   enumerating the maximal subsets of epigenetic marks. In detail, we
   first computed the intersection of the bi-Cluster sets from two
   epigenetic marks e[k] and e[l]; each intersection was kept together
   with the two marks as a candidate tri-Cluster. Candidates whose
   support was lower than the predefined minimum support were filtered
   out, leaving the significant tri-Clusters. We then repeated the
   process with another epigenetic mark until all the epigenetic marks
   had been analyzed. We tried all such paths and kept only the maximal
   tri-Clusters. Each tri-Cluster is represented as (“genomic regions,”
   “cancer types,” “epigenetic marks”), listing a gene set with a similar
   trend of epigenetic modifications in different cancer types. The
   resulting tri-Clusters indicate that the conserved epigenetic
   signatures in these genomic regions are shared by multiple cancer
   types.
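
   One possible reading of Step 3 is sketched here. For every group of
   cancer types that forms a bi-Cluster, and for every subset of at
   least two marks in which that group occurs, the gene sets are
   intersected, and the candidate is kept if it still meets the minimum
   support. Candidates contained in another candidate along all three
   dimensions are then discarded. The exhaustive enumeration over mark
   subsets, the input format, and the function name are assumptions; the
   paper describes an iterative, path-based procedure whose details are
   not given.

```python
from itertools import combinations


def mine_triclusters(biclusters_by_mark, n_genes, min_support=0.10, min_marks=2):
    """Return maximal (genes, cancer types, marks) triples as frozensets.

    `biclusters_by_mark` maps mark -> {frozenset(cancer types): set(genes)}.
    """
    min_count = min_support * n_genes
    marks = sorted(biclusters_by_mark)
    cancer_sets = set()
    for clusters in biclusters_by_mark.values():
        cancer_sets.update(clusters)

    candidates = []
    for cancers in cancer_sets:
        for size in range(min_marks, len(marks) + 1):
            for subset in combinations(marks, size):
                if not all(cancers in biclusters_by_mark[m] for m in subset):
                    continue
                genes = set.intersection(
                    *(set(biclusters_by_mark[m][cancers]) for m in subset)
                )
                if genes and len(genes) >= min_count:
                    candidates.append((frozenset(genes), cancers, frozenset(subset)))

    def dominated(c):
        # True if another candidate contains c in all three dimensions.
        return any(
            o != c and c[0] <= o[0] and c[1] <= o[1] and c[2] <= o[2]
            for o in candidates
        )

    return [c for c in candidates if not dominated(c)]


AK = frozenset({"A549", "K562"})
AKH = frozenset({"A549", "K562", "HepG2"})
toy_marks = {
    "H3K4me3": {AK: {"g1", "g2", "g3"}, AKH: {"g1", "g2"}},
    "H3K27ac": {AK: {"g1", "g2", "g4"}},
}
toy_triclusters = mine_triclusters(toy_marks, n_genes=4, min_support=0.5)
```

   In the toy input, the only tri-Cluster is the gene set {g1, g2} in
   A549 and K562 for the marks H3K4me3 and H3K27ac.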

Functional Analysis of the Genes

   From the identified tri-Clusters, we can obtain the gene sets
   associated with specific coherent epigenetic patterns. To investigate
   the potential functions of these genes, we performed the gene ontology
   (GO) enrichment analysis and pathway enrichment analysis via DAVID
   bioinformatics resources ([60]Huang et al., 2007). The significant
   enrichment lists were obtained with P-value < 0.005.
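
   For intuition about what such an enrichment P-value measures, the
   sketch below computes a plain hypergeometric upper-tail probability
   for one term and applies the 0.005 cutoff. It is a stand-in rather
   than a reimplementation: DAVID reports a modified Fisher exact
   statistic (the EASE score), so its values differ from this
   calculation. The function names and the toy numbers are illustrative
   assumptions.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, of which n_annotated carry the term."""
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def significant_terms(p_values, alpha=0.005):
    """Keep terms whose P-value is below the cutoff, sorted by P-value."""
    kept = [(term, p) for term, p in p_values.items() if p < alpha]
    return sorted(kept, key=lambda item: item[1])


# Toy example: 10 background genes, 5 annotated, 5 selected, all 5 overlap.
p_toy = hypergeom_enrichment_p(10, 5, 5, 5)
```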

Results

Identifying Similar Epigenetic Patterns Across Different Cancer Types

   We developed a tri-clustering approach, TriPCE, to capture similar
   epigenetic patterns among different cancer types. TriPCE was applied to
   the genome-wide epigenetic modification maps of seven cancer types:
   A549, K562, HepG2, HCT116, HeLa-S3, multiple myeloma-Cell Line, and
   sporadic Burkitt lymphoma-Cell Line. For each epigenetic mark, TriPCE
   first groups the promoter regions based on the epigenetic modification
   profiles among different cancer types. [61]Figure 2 shows a typical
   bi-Cluster of epigenetic mark H3K4me1, which contains a large number
   of genes with a similar modification pattern in four cancer types:
   HeLa-S3, HepG2, K562, and A549. The figure shows that the epigenetic
   profiles of these genes are similar in these cancer types. An
   epigenetic profile shared by a cluster of promoter regions in multiple
   cancer types is considered to be an epigenetic pattern, and different
   cancer types share such patterns. This result is consistent with the
   previous finding that H3K9me3/me2 and H3K36me3/me2 are frequently
   observed in breast cancer ([62]Liu et al., 2009), esophageal cancer
   ([63]Yang et al., 2000), MALT lymphoma ([64]Vinatzer et al., 2008),
   and lung sarcomatoid carcinoma ([65]Italiano et al., 2006). Based on
   the identified bi-Clusters of the investigated epigenetic marks, we
   noted that HepG2 and HCT116 are clustered together and share a larger
   number of epigenetic marks, implying that their epigenetic regulation
   mechanisms are more similar.

Figure 2.

   The profiles of epigenetic mark H3K4me3 in a typical bi-Cluster exhibit
   a similar pattern in four cancer types, including Hela-S3, HepG2, K562
   and A549.

   To identify the significant modification patterns, we set the minimal
   support to 10% of the investigated genes. With different correlation
   coefficient thresholds, we obtained different numbers of bi-Clusters
   for the epigenetic marks H3K4me1, H3K4me3, H3K9me3, H3K27me3,
   H3K36me3, and H3K27ac among these cancer types, as shown in
   [68]Figure 3. The comparison indicates that the degree of cross-cancer
   similarity differs considerably between epigenetic marks. Under all
   threshold settings, the epigenetic mark H3K4me3 has a relatively small
   number of bi-Clusters, indicating that its profiles are less conserved
   and exhibit more variable patterns among these cancer types than those
   of the other epigenetic marks. In contrast, H3K4me1 and H3K27me3 show
   more similar epigenetic patterns among different cancer types
   ([69]Baylin and Jones, 2016). The plasticity of the epigenome depends
   on diverse environmental factors. Thus, it is not surprising that
   epigenotypes contribute to developmental human disorders and adult
   diseases ([70]Brien et al., 2016). Because the correlation coefficient
   threshold only slightly affects the trend among different epigenetic
   marks, we chose the bi-Clusters obtained with a threshold of 0.7 for
   further analysis.

Figure 3.

   The numbers of bi-Clusters with varied similarity thresholds for
   different epigenetic marks.

Identifying Coherent Patterns Among Different Epigenetic Marks

   The above results show clear differences among the investigated
   epigenetic modifications. To identify the conserved epigenetic states
   and explore the similar patterns of these epigenetic modifications, we
   further clustered the epigenetic marks based on the detected
   bi-Clusters. By systematically computing the intersection of the
   bi-Cluster sets from different epigenetic marks, we kept the
   tri-Clusters with support higher than the predefined minimum support.
   The identified tri-Clusters are represented as triples (“genomic
   regions,” “cancer types,” “epigenetic marks”). Each tri-Cluster
   indicates that the promoter regions of its genes exhibit similar
   epigenetic modification patterns in the related cancer types.

   Applying TriPCE to the data set, we initially obtained 175 significant
   tri-Clusters. [72]Figure 4 shows the information of 15 typical
   clusters, including the epigenetic marks, the cancer types, and the
   supports of these tri-Clusters. The results indicate that specific
   genomic regions indeed share combinatorial epigenetic patterns across
   different cancer types. For example, the changing pattern of the
   epigenetic modifications H3K4me3, H3K9me3, H3K27me3, and H3K36me3 is
   shared by a large number of genes in cancer types A549, HepG2, and
   K562. In contrast, some epigenetic modification patterns are coherent
   only in certain cancer types. Among the resulting clusters, we observe
   that the similar patterns of H3K36me3, H3K27ac, and H3K27me3 exist in
   fewer cancer types, such as HepG2 and sporadic Burkitt lymphoma-Cell
   Line. Notably, these identified tri-Clusters reveal more information
   about the epigenetic patterns among these cancer types.

Figure 4.

   Typical epigenetic tri-Clusters. (A) The epigenetic marks (column) in
   each cluster (row). (B) The cancer types (column) in each cluster
   (row). Fold enrichment was calculated as the ratio of the number of
   genes in the tri-Cluster to the number of all genes.

Analyzing the Potential Roles of Associated Genes

   Based on the detected tri-Clusters, we further obtained the gene sets
   that exhibit coherent epigenetic patterns in different cancer types.
   Previous studies have shown that modification intensities differ
   significantly between high-expression gene promoters and
   low-expression gene promoters, which suggests that these chromatin
   components have a significant effect on gene regulation ([75]Su et al.,
   2012). To investigate the potential functions of those genes in the
   cellular control pathways, we performed a systematic GO enrichment
   analysis using DAVID tools ([76]https://david.ncifcrf.gov/). Then, for
   the associated gene sets in the identified tri-Clusters, we summarized
   the key biological processes and pathways in which they are involved.

   Overall, we found that the genes in the tri-Clusters are enriched for
   cancer-related functions. [77]Table 1 lists the significant GO terms
   of a typical tri-Cluster (P-value < 0.005). In this tri-Cluster, the
   genes exhibit coherent modification patterns on epigenetic marks
   (H3K4me1, H3K4me3, H3K9me3, H3K27ac, and H3K27me3) in cancer types
   (HeLa-S3, HepG2, multiple myeloma-Cell Line, and sporadic Burkitt
   lymphoma-Cell Line). In the table, the terms “positive regulation of
   cell proliferation” and “negative regulation of apoptotic process” are
   enriched in this gene set. This result implies that the identified
   genes in this tri-Cluster are important for cell proliferation and the
   apoptotic process, both of which have been reported to be related to
   cancer development by previous research ([78]Deng et al., 2016).
   Meanwhile, the term “positive regulation of gene expression,
   epigenetic” is also enriched in the gene set, further indicating that
   these genes might perform important regulatory roles in these cancers.

Table 1.

   Functional enrichment of genes in the identified tri-Clusters.

   Term type  Term name                                                    P-value
   BP         Positive regulation of cell proliferation                    2.84E-06
   BP         Translational initiation                                     1.18E-05
   BP         mRNA processing                                              2.72E-05
   BP         Cell division                                                4.08E-05
   BP         rRNA processing                                              2.70E-04
   BP         RNA splicing                                                 4.04E-04
   BP         Positive regulation of gene expression, epigenetic           9.41E-04
   BP         Protein targeting to Golgi                                   8.87E-05
   BP         Nitrobenzene metabolic process                               1.14E-04
   BP         Xenobiotic catabolic process                                 1.13E-03
   BP         mRNA splicing, via spliceosome                               1.14E-03
   BP         Sister chromatid cohesion                                    2.13E-03
   BP         SRP-dependent cotranslational protein targeting to membrane  1.06E-03
   BP         Negative regulation of transcription, DNA-templated          1.55E-03
   BP         Negative regulation of apoptotic process                     1.88E-03
   BP         Nucleosome assembly                                          3.86E-03
   BP         Glutathione derivative biosynthetic process                  4.18E-03
   MF         Protein binding                                              1.10E-12
   MF         Poly(A) RNA binding                                          3.90E-10
   MF         RNA binding                                                  2.13E-05
   MF         Glutathione binding                                          7.85E-04
   MF         Enzyme regulator activity                                    4.02E-03
   MF         Nucleosomal DNA binding                                      4.25E-03
   MF         Translation initiation factor activity                       4.30E-03
   MF         Glutathione transferase activity                             8.00E-03
   MF         Protein binding, bridging                                    4.33E-03
   MF         ATP binding                                                  4.57E-03
   CC         Nucleoplasm                                                  6.18E-13
   CC         Cytosol                                                      3.96E-07
   CC         Membrane                                                     7.68E-06
   CC         Nucleus                                                      2.34E-04
   CC         Cytoplasm                                                    2.69E-04
   KEGG       Glutathione metabolism                                       1.09E-03
   KEGG       Systemic lupus erythematosus                                 1.93E-03

   BP, biological process; MF, molecular function; CC, cellular
   component; KEGG, KEGG pathway.

Discussion

   Identifying epigenetic patterns is important for understanding the
   epigenetic mechanisms of various cancers. Patterns detected across
   different cancers can demonstrate critical cross-cancer similarities,
   which reveal consistent clinical risk among different cancer types and
   suggest strong clinical relevance. However, our knowledge of the
   patterns of epigenetic modifications, and of their causes and
   consequences, is still limited. Computational approaches that exploit
   the complex epigenomic landscapes and discover significant signatures
   in them are therefore required.

   Previous computational methods for analyzing epigenomes primarily
   focus on the combinatorial states of different epigenetic marks in a
   specific cell type. In contrast, we developed a tri-clustering
   approach, TriPCE, for integrative pan-cancer epigenomic analysis.
   Based on the FP-tree structure, TriPCE can compactly represent all
   similar cancer types in the promoter regions for a specific epigenetic
   mark. Using the constructed FP-tree, the frequent patterns are then
   detected to yield the set of bi-Clusters of this epigenetic mark,
   indicating the similar epigenetic pattern in these cancer types along
   these genomic regions. TriPCE further mines the final tri-Clusters
   based on the bi-Clusters of all investigated epigenetic marks,
   explicitly detecting combinatorial epigenetic states in different
   genomic segments and similar epigenetic changes across different
   cancer types. In TriPCE, the tri-Cluster enumeration is an expensive
   operation. In the future, we plan to develop heuristic techniques to
   prune the search space and thereby improve the efficiency of mining
   the tri-Clusters.

   We applied TriPCE to uncover the similar patterns of six epigenetic
   marks among seven cancer types and successfully identified significant
   cross-cancer epigenetic modification similarities, which suggests that
   there is a consistent epigenetic modification tendency among the
   investigated cancer types. Furthermore, the gene functional analysis
   demonstrates that the associated genes are strongly relevant to
   cancer-related cellular pathways.

Data Availability Statement

   All datasets generated for this study are included in the
   article/supplementary material.

Author Contributions

   YG is responsible for the main idea, as well as the completion of the
   manuscript. NL and YX have developed the algorithm and performed data
   analysis. GZ has coordinated data preprocessing and supervised the
   effort. All authors have read and approved the final manuscript.

Funding

   This work and the publication costs were supported in part by the
   National Natural Science Foundation of China (61772128, 61772367),
   National Key Research and Development Program of China
   (2016YFC0901704), Shanghai Natural Science Foundation
   (17ZR1400200, 18ZR1414400), and the Fundamental Research Funds for the
   Central Universities (2232016A3-05).

Conflict of Interest

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Acknowledgments