Graphical abstract

   graphic file with name ga1.jpg
   [29]Open in a new tab

   Keywords: Genome, Epigenome, 3D chromatin domain, Regulatory mutation,
   Cancer, Machine learning, Integrative data analysis

   Abbreviations: differentially methylated region, DMR; topologically
   associated domain, TAD; follicular lymphoma, FL; single nucleotide
   variation, SNV; differentially expressed gene, DEG; principal component
   analysis, PCA; T-distributed stochastic neighbor embedding, t-SNE;
   group mean difference, GMD

Abstract

   A major challenge in human genetics is of the analysis of the interplay
   between genetic and epigenetic factors in a multifactorial disease like
   cancer. Here, a novel methodology is proposed to investigate
   genome-wide regulatory mechanisms in cancer, as studied with the
   example of follicular Lymphoma (FL). In a first phase, a new
   machine-learning method is designed to identify Differentially
   Methylated Regions (DMRs) by computing six attributes. In a second
   phase, an integrative data analysis method is developed to study
   regulatory mutations in FL, by considering differential methylation
   information together with DNA sequence variation, differential gene
   expression, 3D organization of genome (e.g., topologically associated
   domains), and enriched biological pathways. Resulting mutation
   block-gene pairs are further ranked to find out the significant ones.
   By this approach, BCL2 and BCL6 were identified as top-ranking
   FL-related genes with several mutation blocks and DMRs acting on their
   regulatory regions. Two additional genes, CDCA4 and CTSO, were also
   found in top rank with significant DNA sequence variation and
   differential methylation in neighboring areas, pointing towards their
   potential use as biomarkers for FL. This work combines both genomic and
   epigenomic information to investigate genome-wide gene regulatory
   mechanisms in cancer and contribute to devising novel treatment
   strategies.

1. Introduction

   Carcinogenesis involves epigenomic, genomic, transcriptomic and
   proteomic changes. The integration of these changes is important for
   gaining a better insight into cancer molecular biology and may lead to
   improved diagnosis and finding novel strategies for cure. Epigenetic
   alterations such as DNA methylation and chromatin modifications are
   interlinked [30][1]. DNA hypomethylation or loss of DNA methylation on
   CpG dinucleotides was the first epigenetic anomaly to be recognized in
   cancer cells [31][2]. DNA methylation represses gene expression.
   Hypomethylation of DNA by contrast, can induce expression of genes,
   including oncogenes. Further, it can result in activation of
   transposable elements and loss of genomic imprinting. Hypermethylation
   of DNA is also a common feature of cancer. DNA hypermethylation is
   often seen at promoter regions of tumor suppressor genes, inducing
   their epigenetic silencing. Key gatekeeper genes like cyclin-dependent
   kinase inhibitor 2A(CDKN2A) and BRCA1, for example, are silenced in
   this way [32][3], [33][4].

   Gene expression profiles of tumor cells have been useful in
   classification, prognostication and prediction of multiple types of
   cancers such as breast, colorectal and lung cancer etc. [34][5],
   [35][6], [36][7]. It has been demonstrated in multiple studies that
   differential expression of certain gene sets is linked to cancer
   progression [37][8]. This has led to development of gene signatures to
   predict prognosis of cancer [38][9]. Gene expression profiling can not
   only predict clinical outcome but also be used to select optimal
   personalized therapy. Nonetheless, expression profiles of cancer can be
   highly variable limiting the use of expression profiling in clinical
   practice. [39][10]. Incorporation of DNA sequence variations like
   Single Nucleotide Variations (SNVs) is therefore warranted, as they are
   typically present in cancer and are stably detected [40][11]. Multiple
   cancer genetic studies have reported SNVs that can contribute to cell
   transformation as gain-of-function or loss-of-function, for example by
   activating an oncogene or reducing the expression of a tumor suppressor
   [41][12]. However, it remains difficult to associate any SNV
   functionally to a gene, and consequently to cancer because of the
   regulatory complexity of higher order genomes. Due to presence of
   long-distance regulatory elements, physical proximity of any SNV to an
   oncogene or tumor suppressor gene is not sufficient evidence for it to
   be labeled as cancer driver SNV [42][13]. Chromatin architecture also
   comes into play here. The majority of the long-range chromatin
   interactions happen within and are limited by Topologically Associated
   Domains (TAD) boundaries. Coupling SNVs with the genes present in a
   similar TAD can give us an indication about the functional relevance of
   a SNV, and thereby its contribution as a cancer driver mutation
   [43][14], [44][15].

   Follicular lymphoma (FL) is a recurring lymphoma for which chromosomal
   translocation was identified [45][16]. However, this translocation
   alone is not sufficient to cause FL. Recurrent mutations have also been
   reported in FL in multiple developmental, signaling pathway and
   chromatin regulator genes [46][17]. In addition, epigenetic alterations
   have been observed in FL [47][18]. DNA methylation of tumor suppressor
   genes have been reported in FL [48][19]. DNA hypermethylation may also
   cause transcriptional repression of functionally important genes in FL
   [49][20].A gene-expression profiling study predicted the risk of
   progression of patients with follicular lymphoma using a 23-gene score
   [50][21]. Of note, the most previous research conducted in FL does not
   comprehensively address all genetic changes as explained earlier. For
   instance, an integrated data analysis pipeline [51][22] was developed
   previously to identify putative functional regulatory mutations in FL
   by considering both the gene expression profiles and the clustered
   distribution of SNVs. However, it was only able to predict regulatory
   mutations near the promoter region of genes. To have a comprehensive
   picture of oncogenesis of FL, we are motivated to design a new
   integrated data analysis method that takes into account epigenomic
   (e.g., DNA methylation), genomic (e.g., chromatin architecture- TAD),
   and transcriptomic (e.g., gene expression) information together with
   the distribution of genome wide SNVs in patients. In this way, our
   method makes it possible to predict functional regulatory mutations
   that affect gene regulation through a long-distance. This will be a
   great leap forward for the investigation of non-coding mutations in
   cancer or disease, by utilizing genome wide sequence technology in
   clinical studies.

   Publicly available biological data sets allow comprehensive analysis of
   data. However, efficient statistical and bioinformatic methods for such
   integrative analysis are lacking. A few tools like sTRAP and is-rSNP
   predict functional non-coding variants by hypothesizing that non-coding
   variants can affect gene expression by altering protein-DNA binding
   [52][23], [53][24]. Another method searched for SNP combinations for
   disease on the basis of the energy distribution difference considering
   an individual’s genotype data as a point with a unit of energy
   [54][25]. CADD and FunSeq2 can integrate some data types like predicted
   transcription factor binding sites, measured ChIP-Seq peaks of TFs,
   chromatin state marks, conservation scores and protein–protein
   interactions [55][26], [56][27]. Another study integrates DNA
   methylation, gene expression and somatic mutations to infer
   tissue-of-origin of a tumor [57][28]. In short, there are multiple
   integrative studies conducted on cancer, but none integrates all the
   genetic and epigentic changes (i.e. DNA methylation, gene expression,
   DNA sequence variation and Topologically associated domains) in
   cancer.[58][29], [59][30], [60][31], [61][32]. Moreover, there is less
   focus on using differential methylation for functional annotation of
   SNVs. Previous studies have reported a correlation between the regional
   methylation level and the rate of mutation at CpG sites in genomic
   regions [62][33]. Especially in the presence of allele-specific
   methylation, mutations in the driver genes can be inherently connected
   with the aberrant DNA methylation landscape in cancer [63][34],
   [64][35], [65][36]. This points towards the potential use of
   differential methylation to gauge the functional relevance of a SNV.

   Although many studies have highlighted the link of the DNA methylation
   and SNVs, there is no method to our knowledge which employs DNA
   methylation data to identify disease-related SNVs. While some methods
   for differential methylation detection are already available, it is
   important to have a score or rank explaining the magnitude of the
   differential methylation. In particular, if we want to profit from
   differentially methylated regions (DMRs) in an integrative study. To
   overcome this limitation, we have first devised a new method for
   significant differential methylation detection, which was used to
   analyze DMRs in FL through parallel computation of six attributes
   ([66]Fig. 1). Some of the computed attributes are used to identify high
   confidence DMRS (hcDMRs). HcDMRs then serve as a standard to rank the
   remaining methylated regions. The resulting model reports a set of DMRs
   with respective scores depicting the significance of a particular DMR.

Fig. 1.

   [67]Fig. 1
   [68]Open in a new tab

   Work flow for ranking differentially methylated regions between two
   groups by a new machine learning approach. This figure describes a new
   machine learning approach for predicting and ranking high quality
   differentially methylated regions (DMRs) with four steps: 1) search for
   methylated region (MR) in a genome-wide manner based on predefined
   criterias, 2) parallel computation of six attributes in each MR, 3)
   four of the attributes (e.g., the percentage of differentially
   methylated CpG methylation sites, the clustering accuracy of predicted
   sample group label based on 2-D t-SNE map, the percentage of the high
   and median methylation level changes, and the significance of Euclidean
   distance difference between the intra-group and the inter-group) are
   used to identify highly confidence DMRs (hcDMRs), 4) fits a logistic
   regression model for all available MRs by using hcDMRs as true targets.
   Probability value of each MR (from logistic regression model) is used
   to rank the DMRs per their significance.

   Acknowledging the role of differential methylation and chromatin
   architecture at the epigenetic level and DNA sequence variations like
   SNVs and gene expression profiles in cancer, we broaden the scope of
   our study by using a newly developed integrative data analysis method
   to investigate regulatory mutations in FL at a genome wide level.
   First, genomic blocks having a high SNV concentration were identified
   (we will address them as mutation blocks), and mapped to the DMRs and
   differentially expressed genes (DEG) that are present in the same TAD.
   Then, the frequency of occurrence of mutation blocks and its annotation
   to the genomic elements, differential expression level of the
   associated genes and the related DMR score are used as features to
   identify significant mutation block-gene pairs in FL. From the
   analysis, a final set of genes associated to mutation blocks is
   obtained which is further evaluated by a robustness analysis, based on
   an independent source (e.g., chromatin state segmentations) that was
   not used in the prediction. Mutation block-gene associations related to
   FL that passed robustness analysis with high statistical and biological
   support, are reported in this study. These set of mutation block-genes
   can be seen as cancer drivers. Similarly, the set of hcDMRs can have a
   strong potential of being used as biomarkers for diagnosis, prognosis,
   prediction and potential treatment of FL. Our study presents a robust
   method for DMR detection and integrative genomic analysis of regulatory
   mutations that can be applied to any malignancy.

2. Material and methods

2.1. Diverse high throughput sequencing data for follicular lymphoma patients

   Genome-wide sequencing data of 14 tumor-normal paired FL patients was
   obtained from a previous study [69][37], by getting access to
   controlled data kept on ICGC. Samples were downloaded from European
   Genome-phenome Archive [70][38] ([71]https://www.ebi.ac.uk/ega/) under
   accession numbers EGAD00001000645 and EGAD00001000355. RNA-Seq data of
   four control samples (Germinal center B-cell - GCB) from healthy people
   was downloaded from GEO database under accession number GSE4598265
   [72][39]. DNA methylation data of whole-genome bisulfite sequencing
   (WGBS) for 8 FL patients and 4 GCB control samples was acquired from an
   earlier work [73][40]. For data sets not available specifically for FL,
   we chose data sets closest possible to FL. Human common Topologically
   Associating Domains (TAD) and boundaries information was downloaded
   from supplementary Tables 1 and 4 of [74][41]. The common boundaries
   are from five human cell lines that represent three distinct embryonic
   germ layers (GM12878 and HMEC, mesoderm; IMR90, endoderm; HUVEC and
   NHEK, ectoderm). Those TAD boundaries were reported in the original
   study as significantly (83–85% with pval 〈10^−7) conserved between
   normal and malignant cells and were thus used in this study. Human
   enhancer annotations of 197 tissue/cell types were download from
   EnhancerAtlas 2.0 [75][42]. Annotations from 5 out of 197 cell lines
   (DOHH2, GC B cell, Namalwa, OCI-ly1 and OCI-Ly7) were later grouped as
   FL-related cell lines [76][43]. Information of KEGG, BIOCARTA, and GO
   Biology Process pathways was retrieved from DAVID functional annotation
   tool [77][44]. Here, all sequencing data were aligned to hs37D5, a
   variant of GRCh37 human genome assembly used by the 1000 Genomes
   project [78][45]. Genome-wide mutations were called by using Strelka
   [79][46] and MuTect [80][47] with the default parameters. An
   intersection of mutation calls from both programs was used in the
   further data analysis for each patient [81][48]. From mutations called
   by these programs, we only considered SNVs. For identifying the
   transcripts of all protein-coding genes, we used gene annotation from
   the UCSC hg19 [82][49]. Annotation to the reference genome were
   performed on four defined genomic regions: the TSS/TES regions between
   −5kb and + 1kp to the TSS (transcription start site)/TES (transcription
   end site) of protein-coding genes, and the gene body region between TSS
   and TES, and the 5′distance regions were calculated from 1 Mb to 5 kb
   upstream of the TSS. These four defined regions are similar to the
   previous publications [83][50], [84][51] in differential methylation
   analysis and identification of promoter-distal loops. Gene expression
   levels are measured as reads per kilobase of transcript per million
   mapped reads (RPKM) of RNA-Seq experiments and were computed by
   applying the featureCounts [85][52] and our in-house Python code on
   aligned BAM files. In this study, the genetic (SNVs), epigenetic (DNA
   methylation), and transcriptomic (gene expression) data are from the
   same FL patient cohort [86][40]. A brief description of these FL
   samples and the procedures for obtaining them are provided in
   supplementary Stable 1. The tumor cell content in the cryopreserved
   sample material was at least 60% in all cases [87][40]. More
   information of these FL samples and the basic characterization
   including histopathological panel review and immunohistochemical and
   FISH analyses can be seen in the previous publication [88][37].

2.2. Segmentation of human genome in functional regions based on chromatin
features

   Chromatin modifications are important epigenic makers in genome, which
   can be used to characterize functional regions (e.g., enhancer and TSS
   et al). We obtained predicted segmentation of human genome based on
   chromatin features from an earlier publication [89][53]. This
   segmentation of human genome is based on the predictions from two
   machine-learning methods (ChromHMM [90][54] and Segway [91][55]), by
   using multiple chromatin marks (e.g., H3K4me1, H3K4me2, H3K4me3,
   H3K9ac, H3K27ac, H3K27me3, H3K36me3, and H3K20me) known to be involved
   in enhancer, repression, or promoter regions across six human
   cell-lines (e.g., GM12878, H1 hESC, HeLa-S3, HepG2, HUVEC, and K562).
   In addition to chromatin marks, other genomic marks such as Pol2, CTCF,
   and nucleosome density were also considered. The final combined
   segmentation from the two predictions uses only seven chromatin states
   to segment the human genome in functional regions (e.g., TSS –
   predicted promoter region including TSS; PF – predicted promoter
   flanking region; E – predicted enhancer; WE – predicted weak enhancer
   or open chromatin cis regulatory element; CTCF – CTCF enriched element;
   T – predicted transcribed region; R – predicted repressed or low
   activity region).

2.3. Identifying methylation regions

   There are a few popular methods available for differential methylation
   analysis, but with their own limitations. For example, MethylKit
   predicts differential methylation at a single base pair resolution,
   neglecting the confounding effects of any neighboring methylated site
   [92][56]. This limitation is addressed by HMST-Seq-Analyzer by
   predicting DMRs instead of differentially methylated sites [93][51].
   However, HMST-Seq-Analyzer can only make one-to-one or one-to-many
   comparisons and cannot make group comparisons (many-to-many).
   Therefore, a new method for rigorous differential methylation detection
   is developed. It takes into account the confounding effect of
   neighboring methylated sites and can perform group comparisons with
   multiple samples in each group. As a foremost step, it tries to search
   for Methylated Regions (MR) in the genome. The selection criteria for
   MRs are that there should be a minimum number of CpG methylation sites
   in a MR (default of minimum 5), and any two neighboring CpGs must not
   be any further than a specified distance (by default 250 bp). Gathering
   MRs from genome-wide data can produce hundreds and thousands of MRs
   depending upon the type of sequencing method used for methylation
   detection. A brute force attempt for differential methylation analysis
   can be computationally exhaustive at the genome-wide scale. Therefore,
   once all MRs passing aforementioned filtering conditions are acquired
   from the sequencing data, a parallel computational algorithm is used to
   assess the significance of differential methylation between the two
   groups in each MR. The total number of observed MRs is equally divided
   on the available computer processers (e.g., 10 or 20 processes) and is
   ran in parallel, which significantly speeds up the calculation. Lastly,
   a rigorous set of DMRs is reported with their scores which can be used
   as input to any further integrative study.

2.4. Parallel computation of six attributes in each methylated region

   Six major attributes are computed and evaluated at the MR level in
   order to identify Differentially Methylated Region (DMR): 1)
   Interpolated and smoothed data curves of original methylation levels
   are computed for both tumor and control/normal groups, and a 95%
   confidence interval from the group mean is graphically illustrated. 2)
   Based on the smoothed methylation curves of all samples, the first
   three principal component elements of PCA (Principal Component Analysis
   - a linear dimensionality reduction method) are calculated and
   visualized in a 3D plot (e.g., samples are colored by their group
   label), from which the difference among samples due to methylation
   variation in a MR is revealed. 3) As a third attribute, a two sampled
   T-test is performed to evaluate the significance of differential
   methylation at each CpG site, respectively. The percentage of CpG
   methylation sites in a MR that reaches a predefined significant level
   (e.g., T-test P value < 0.05) is recorded. The higher the percentage
   the better the differentially methylated region. 4) As a fourth
   attribute, T-distributed Stochastic Neighbor Embedding (t-SNE, a
   non-linear dimensionality reduction method) is applied on the smoothed
   methylation profiles of each MR, and the corresponding 2-D plot of
   samples is generated (e.g., samples are colored by their group label).
   Subsequently, k-means clustering is applied on this 2-D t-SNE map, to
   estimate the clustering accuracy by comparing the predicted sample
   group label against the true sample group label. The higher the
   clustering accuracy the more significant the DMR. 5) The fifth
   attribute is categorizing the levels of differential methylation
   changes (or absolute group mean difference - GMD) at each CpG site of a
   MR. As a default, three levels are defined i.e., 0.07 < GMD<=0.1,
   0.1 < GMD<=0.2, and GMD > 0.2 for low, median and high methylation
   changes, respectively, between tumor and control group. The percentages
   of CpG sites in a MR with low, median, and high methylation changes are
   recorded. The higher the percentage of high methylation changes the
   more significant the DMR will be. 6) Finally, Euclidean distances for
   samples within a group (e.g., intra-group in either tumor or normal
   group) and between groups (e.g., inter-group such as between tumor and
   normal group) are calculated based on the methylation profiles,
   respectively. A two sampled T-test is used to evaluate the significance
   of the difference between the intra-group distance and the inter-group
   distance (e.g., Euclidean distances of samples within a tumor group vs.
   Euclidean distance of samples between tumor and normal). The
   corresponding P-values are recorded for each MR. The underlying
   hypothesis for computing this attribute is that the intra-group
   distance is usually significantly different from inter-group distance
   for a DMR. Thus, the more significant the difference of distance
   (between the intra and the inter groups), the more significant the DMR.
   Here, the first two attributes (attribute 1 and 2) are computed for the
   purpose of visualization and giving an impression of the distribution
   of data at a broader level. The rest of the four attributes and the
   recorded summary statistics are directly used as features for
   identifying high confidence DMRs (hcDMRs).

2.5. Predicting high confidence differentially methylated regions by a
two-levels approach

   After completing the parallel computation of six major attributes in
   all the identified MRs, a two-level filtering approach is adopted to
   determine a set of hcDMRs. The first filter is applied on the attribute
   5 i.e., the percentage of high methylation changes (e.g., mean group
   difference > 0.2) between two groups should be greater than zero. Then
   a second level filter is applied, where a putative hcDMR should meet at
   least one of the two conditions: either the percentage of CpG sites in
   a MR that are significantly different between two groups (e.g., T-test
   P value < 0.05 in attribute 3) is greater than a threshold value (e.g.,
   default > 0), or there should be a significant difference of Euclidean
   distances between the intra-group and the inter-group (T-test P
   value < 0.05 in attribute 6). After the two levels filtering, a set of
   putative hcDMRs are obtained. The strength of these DMRs is controlled
   by three key parameters such as 1) the percentage of high methylation
   changes (e.g., mean group changes > 0.2; default) between two group and
   2) the percentage of methylation sites show differential methylation in
   a MR are greater than zero, and 3) a significant difference (P
   value < 0.05 in default) of Euclidean distances between the intra-group
   and the inter-group.

2.6. Ranking differentially methylated regions through logistic regression

   Although hcDMRs are a solid set of DMRs, to improve the sensitivity of
   the method and to sort DMRs based on their significance, a ranking
   approach through logistic regression method is further introduced.
   Here, hcDMRs are used as true target sites in logistic regression to
   fit all the available MRs with pre-computed attributes. The four
   attributes used as regressors are: 1) the percentage of methylation
   sites in a MR that have passed significant level of differential
   methylation (from attribute 3), 2) the clustering accuracy for
   predicting group labels in a MR based on a 2-D projected t-SNE
   methylation profiles (from attribute 4), 3) the percentage of group
   mean methylation changes in the high and median level changes (from
   attribute 5) and 4) the log10 transformed P values of the significance
   of difference in Euclidean distance between the intra-group and the
   inter-group (from attribute 6). A probability value is assigned to each
   MR after fitting the logistic regression model to all MRs by using the
   high confidence DMRs as true target. These probability values can be
   used to sort and select the most significant DMRs. For example, if a
   probability of logistic regression model is P>=0.7, then around 90% of
   the hcDMRs from the initial two-step filtering (e.g., the mean group
   changes > 0.2 and, either the percentage of differentially methylated
   CpG sites is > 0 or there is a significant difference in Euclidean
   distance between the intra-group and the inter-group) will be included,
   endorsing that the initial two-steps filtration is powerful for
   identifying sturdy DMRs. Unlike other popular methods for differential
   methylation analysis, the length of DMRs predicted by the current
   method can range from dozens of bp to hundred thousand bp which is
   similar to actual behavior of DMRs. Moreover, only the distribution of
   methylation sites is considered in defining a MR and the same trend of
   methylation level changes in a MR is not forced. Therefore, three types
   of DMRs (hyper, hypo, and mix) are reported in the prediction.
   HyperDMRs indicate increase in methylation levels as compared to
   control/normal samples. HypoDMRs have decreased methylation levels as
   compared to control/normal samples. It can be misleading to assume that
   methylated regions, showing differential methylation, will either show
   increase or decrease of the methylation level. Some DMRs can show both
   increasing and decreasing levels of methylation at different sites
   within the region. The current method captures such a possibility as
   well and reports them as mixed DMRs. All DMRs can also be manually
   explored through the plots exported by our method. More information
   about both the identification of hcDMRs and the ranking of DMRs are
   shown in [94]Fig. 1.

2.7. Integrating differential Methylation, differential gene expression and
topologically associated domain information in regulatory mutation prediction

   Based on the aforementioned new method for analyzing and ranking DMRs
   between two groups of samples, it is possible to integrate the
   differential methylation with differentially expressed genes (DEG) data
   in predicting functional non-coding mutation in disease. First, SNVs
   from whole-genome-sequencing (WGS) data for the disease are identified
   by using Strelka and Mutect. An intersection of the SNVs predicted by
   the both programs was used for further analysis to strengthen the
   evidence, as performed in a previous publication [95][57]. A region
   harboring multiple SNVs can have more regulatory potential as compared
   to a single SNV. Hence, genome-wide identification of mutation blocks
   in patient samples is done by using BayesPI-BAR2 [96][58]. We first
   identify mutation clusters and then group them into mutation blocks.
   Mutation cluster will be a genomic region having a certain number of
   consecutive SNVs present in any of the patient sample. Here, any
   regions having at least one SNV in any of the patient samples were
   selected and grouped into mutation clusters to keep low stringency. In
   case of more than one SNVs the distance between adjacent SNVs should
   be<30 bp to be included in the same cluster. The mutation clusters were
   further grouped into mutation blocks. Mutation block will be a group of
   mutation clusters, where any consecutively located clusters are
   not>500 bp apart from each other. Then, these mutation blocks were
   annotated to multiple genomic regions based on annotated HG19 reference
   genome (i.e., Gene, TSS, TES, 5′distance and enhancers). In this study,
   only mutation blocks that were either overlapping with a DMR or
   associated with a DEG (e.g., a mutation block located in a the TSS,
   TES, 5′distance region or gene body of the DEG) in FL patients are
   considered. However, mutations can impact the expression profiles of
   target genes from long range interaction as well. To ensure the
   confidence in relevance of relationship between a mutation block and
   its long-distance target gene (e.g., DEG), the search for mutation
   block-DEG pair in 5′distance region is confined within the same
   topologically associated domains (TAD). The information of TAD that are
   common in five cell lines, is obtained from a recent paper [97][41].

   Usually, mutations or SNVs are more influential if present in the
   regulatory region such as enhancers, as they may disrupt or create a
   binding site for a transcription factor. The identified mutation blocks
   are further mapped to enhancer regions provided by EnhancerAtlas2.0
   [98][42]. Once all mutation blocks are linked to either DMR or DEG, the
   strength of genes associated (e.g., through gene, TSS, TES, or
   5′distance) to relevant mutation blocks is inferred by a method similar
   to weighted voting systems [99][59], where four features are used in
   ranking: 1) the number of patients having affected mutation blocks in a
   gene, 2) the probability of logistic regression fitting for DMRs
   associated to a gene, 3) the absolute log10 transformed P-values of
   DEG, and 4) the annotated genomic region that is linked to a mutation
   block (e.g., a mutation block locates in TSS, Gene, Enhancer, TES, and
   5′distance region will be assigned a weight 4, 4, 3, 2, and 1,
   respectively). An average of normalized four features’ scores (min–max
   normalization) is being used to rank the strength of associations
   between a mutation block and a gene (e.g., normalized score spans from
   0 to 1). Finally, mutation block-gene pairs with an average of
   normalized feature scores>0.5 are extracted, and subjected to GO and
   pathway enrichment analysis (P-value < 0.05) by using DAVID functional
   annotation tool [100][60]. In [101]Fig. 2, a workflow or pipeline for
   such integrated analysis (DMR, DEG, TAD information, and DNA sequence
   variation) of mutation blocks in FL is presented.

Fig. 2.

   [102]Fig. 2
   [103]Open in a new tab

   Work flow for integrative analysis of regulatory mutations in
   follicular lymphoma by using whole genome sequencing, differential
   methylation, differential expression and TAD information. First, SNVs
   from the whole genome sequencing (WGS) data of 14 follicular lymphoma
   (FL) patients were used to identify genome-wide mutation blocks in FL
   by using BayesPI-BAR2. Then, the mutation blocks were annotated to four
   genomic regions (gene, TSS, TES, and 5′distance regions) as well as to
   enhancers collected from EnhancerAtlas 2.0. Subsequently, mutation
   blocks overlapping with either differentially methylated region (DMR)
   or differentailly expressed gene (DEG) are recorded. For a mutation
   block-DEG association that is linked by a 5′distance region of gene,
   the method requests that both the mutation block and the gene are
   located in the same topologically associated domain (TAD). The
   importance of mutation block-gene associations is ranked by four
   features: 1) the number of patients having affected mutation blocks, 2)
   the significance of a DMR, 3) the significance of the differentially
   expressed gene, and 4) a weighted score for annotated genomic region
   (e.g., gene, TSS, TES, 5′distance or enhancers) that a mutation block
   is mapped to. Finally, a set of top ranked mutation block-gene
   associations are extracted, where the selected genes are enriched in a
   common pathway or GO biology process.

2.8. A robustness analysis of mutation block-gene associations by evaluating
seven chromatin states

   Here, a new robustness analysis is developed to evaluate the top ranked
   mutation block-gene associations obtained from an average of normalized
   feature scores (e.g., 327 genes in [104]Fig. 2). The new analysis is
   based on independent information i.e. segmentation of human genome to
   seven functional regions based on chromatin features [105][53] (e.g.,
   active or repressive histone modifications and nucleosome density),
   which was not used in the prediction. The robustness analysis utilized
   a permutation test to assess the significance of associated mutation
   blocks in seven chromatin states (e.g., R, T, TSS, enhancer, CTCF, WE,
   PF), respectively. The segmentation of human genome in chromatin states
   was predicted by two machine learning methods ChromHMM and Segway.
   First, for mutation blocks associated to each predicted target gene, if
   there are N number of mutation blocks in a gene, then the percentage of
   blocks (the actual percentage) located in the seven chromatin states
   are calculated, respectively. Then, we randomly draw 10,000 times of N
   mutation blocks from all ∼ 66467 blocks of 14 FL patients that overlap
   with the seven chromatin states, where the N mutation blocks predicted
   in the first step are excluded. Subsequently, for each sampled N random
   mutation blocks, their percentage (the expected percentage) in the
   seven chromatin states is computed, respectively. In each of the
   chromatin states, if the expected percentage obtained from the randomly
   sampled mutation blocks is greater than the actual percentage in the
   first step, then the chromatin state is incremented by one. Finally,
   for each tested gene, an expected P-value of its associated mutation
   blocks located in one of the seven chromatin states is calculated
   (e.g., P-value = the total number of the expected percentages greater
   than the actual ones divided by the number of samplings such as 10000),
   respectively. A filtered list of top ranked mutation block-gene
   associations will be obtained by assuming that the functional
   regulatory mutation blocks are significantly enriched (e.g., expected
   P-value < 0.05) in either TSS or enhancer regions. Thus, a final top
   ranked (e.g., within the top 20) mutation block-gene associations will
   be more reliable, if they passed such robust analysis in multiple
   predictions based on the same data.

3. Results

3.1. Predicting DMRs with a new machine learning method

   To identify DMRs between FL patients and normal samples, a newly
   developed machine learning approach (default parameters) was applied on
   the WGBS data of 8 tumors and 4 control samples, where multiple samples
   from each group were analyzed simultaneously. A probability P > 0.7 of
   logistic regression model was used as a cutoff value for detecting DMRs
   between case (FL) and control(normal) samples. Total 275,949 DMRs were
   predicted. These DMRs were annotated to four defined genomic regions
   (TSS, TES, gene, and 5′distance), as well as to human enhancers
   obtained from EnhancerAtlas [106][42] by using intersection function of
   BEDTools [107][61]. Example of predicted DMRs is: mr37 in chromosome 1
   (SFig. 1), having probability = 1 in the logistic regression model. A
   counter example of a region not considered DMR (mr60385 in chromosome
   three; probability = 5.582354e-16 in logistic regression model) is
   illustrated in SFig. 2. In SFig. 1, the predicted DMR is roughly 250 bp
   long with six attributes illustrated. A difference in the trend of
   smoothed methylation profiles between tumor and normal samples can be
   seen in the upper panel of SFig. 1. Upon calculating the PCA based on
   the smoothed methylation profiles, tumor samples clearly separate from
   the normal ones in a 3-D plot of the first three principal components.
   Nevertheless, it is crucial to see the shape of methylation level
   distribution in the lower panel of SFig. 1, where ∼ 53.8% of the
   methylated sites in the MR are differentially methylated
   (P-value < 0.05) between tumor and normal group. Clustering accuracy of
   K-means clustering for tumor and normal samples is high
   (equals ∼ 0.92), according to the two-dimensional t-SNE projections for
   all available methylation sites in the MR. Especially, the difference
   between intra-group and inter-group Euclidian distance is marginally
   significant for normal group (P-value < 0.06) but significant for tumor
   group (P-value < 0.0083), and the peak for mean group methylation
   changes is centered around −0.1. These six attributes in the SFig. 1
   reiterate the significant differential methylation in mr37 on
   chromosome one.

3.2. Comparison of predicted DMRs between the new method and the
HMST-seq-Analyzer

   Though HMST-Seq-Analyzer is a tool to predict DMRs from data obtained
   from multiple methylation detection methods including Whole Genome
   Bisulfite Sequencing (WGBS) [108][51], in the case of group
   comparisons, it performs one-to-many comparison only. In this study,
   mean methylation levels of 4 normal GCB WGBS samples were used to
   compare to 8 FL patient data [109][40], respectively, by using the
   default parameters of HMST-seq-Analyzer. Only those DMRs (∼60322)
   predicted in all 8 samples (∼78–90% of each prediction) were used for
   further comparison to the new results (∼259963 with probability > 0.8
   in logistic regression model of new machine learning approach). Around
   92% (239325) of the DMRs from the new prediction method were completely
   overlapping with that from the HMST-Seq-Analyzer. If the probability
   cutoff of logistic regression model is varied between 0.5 and 0.7, the
   overlap between the two results remains ∼ 91 to 92%. It is worthy to
   note that HMST-Seq-Analyzer generates much longer DMRs than the new
   method, which explains why more DMRs are predicted by the new method
   than the HMST-Seq-Analyzer while maintaining a high percentage of
   overlap between the both. Thus, DMRs detected by the new method are
   robust and their shorter length makes their annotation and integrated
   data analysis further easier.

3.3. Including differential methylation and differential gene expression in
regulatory mutation analysis

   There are 118867 and 81812 single nucleotide variants (SNVs) in 14 FL
   patients, called by MuTect and Strelka, respectively. About 71235 SNVs
   (∼87% overlap) were detected by both methods. They were selected to
   identify genome-wide mutation blocks in FL by using BayesPI-BAR2
   [110][58], [111][62]. The result spans to 66868 mutation blocks, which
   requests minimum one patient and one SNV in a cluster. The SNP cluster
   and block distance was kept as 30 bp and 500 bp, respectively. Here, a
   built-in mutation background model from BayesPI-BAR2 was not applied to
   select highly mutated blocks (i.e., SNVs from multiple patients are
   located in the same mutation block). Instead, a putative functional
   mutation block was selected based on a different criterion: either it
   overlaps with a DMR or triggers a nearby gene activity (e.g., DEG).
   This assumes that gene regulation or TF binding and DNA methylation
   often affect each other [112][63], [113][64]. For example, impact of
   SNV on TF binding may cause a variation of DNA methylation levels in
   neighboring regions or a dysregulation of gene expressions in the
   nearby location. Following this assumption, ∼13143 mutation blocks were
   found overlapping with the predicted DMRs. After considering genes that
   were associated to these DMRs through a gene body, TSS, TES, or
   5′distance regions, and mutation blocks that are overlapping with
   either DMR (∼4603 mutation blocks) or their associated genes are
   differentially expressed (∼1831 DEG; P < 0.05), previously published
   mutation block-gene associations in FL are recovered in this initial
   analysis: for example, three known regulatory mutation blocks near the
   promoters of dysregulated BCL6 and BCL2 genes [114][57] (up and down
   regulated in FL compared to normal, respectively). This is a result
   that supports the hypothesis that functional regulatory mutation may
   affect DNA methylation level and/or gene expression activity in the
   nearby region.

3.4. Ranking mutation block-gene association by considering diverse
information

   Though it is possible to narrow down the number of mutation block-gene
   association by considering both DMR and DEG information, it is a
   challenge to evaluate their significance. The problem is further
   complicated by the fact that a mutation block may be assigned to
   multiple 5′distance regions of different genes (e.g., from 1 Mb to 5 kb
   upstream of the TSS). Thus, an additional evaluation of mutation
   block-DEG associations through 5′distance regions is needed: for
   example, if a mutation block and a gene are not located in the same
   TAD, then their association through 5′distance region will be ignored.
   In this study, information of common TADs/boundaries in five cell lines
   (GM12878, HUVEC, IMR90, HMEC and NHEK) were obtained from a previous
   publication [115][41]. After thus filtering mutation block-gene pairs,
   ∼1453 differentially expressed genes (DEG in TSS, TES, gene, or
   5′distance) are associated with ∼ 5756 mutation blocks (e.g., either
   overlapping with DMRs or not). Among these mutation blocks, ∼3684 of
   them are located in enhancer regions [116][42]. To rank the
   significance of these inferred mutation block-gene associations, a
   weighted vote approach was used to rank them by integrating normalized
   four feature scores (e.g., the number of patients affected by mutation
   blocks, DMR significance, P-value to DEG, and the weighted genomic
   feature for a mutation block; [117]Fig. 2).

   In this work, an average of normalized feature scores was used to
   export the top ranked mutation block-genes associations: for example,
   with a mean feature score>=0.5, ∼327 genes are selected. Among these
   the BCL2 gene is ranked at the top. The number of mutation blocks
   located in enhancers/TES/TSS/Gene remains stable (∼76%) when a mean
   feature score cutoff value is decreased (e.g., <0.4). These 327 top
   ranked genes were used in further pathway analysis because functional
   mutations often influence multiple genes in the same pathway or
   biological process [118][65], [119][66]. Thus, GO enrichment analysis
   is applied on these top ranked genes by using DAVID functional
   annotation tool [120][60]. The enriched GO biological process and
   KEGG/BIOCARTA pathways (e.g., P < 0.05) are extracted for defining a
   final list of mutation block-gene associations, which includes 159
   genes involved in several important signaling pathways and biological
   processes related to FL. For example, intrinsic apoptotic signaling
   pathway in response to DNA damage, T cell receptor signaling pathway, B
   cell receptor signaling pathway, NF-kappa B signaling pathway,
   transcriptional misregulation in cancer, and immune response etc. The
   aforementioned enriched pathways are affected by mutation blocks from
   at least 13 FL patients (data in supplementary website). Notably, our
   previously predicted putative functional regulatory mutation blocks
   near BCL2 and BCL6 genes [121][57] are ranked in the top 10 (e.g.,
   ranked 1 and 9 for BCL2 and BCL6, respectively; supplementary Stable 2)
   by this new analysis. Additionally, several novel mutation block-gene
   associations in FL are also identified (e.g., mutation block associated
   to CTSO and CDCA4 are ranked at top 2 and 5, respectively;
   supplementary Stable 2). The presence of numerous mutation blocks and
   DMRs in the vicinity of these genes urges also to investigate the role
   of long-range interaction in their regulation. Coming sections will
   discuss the genes and the respective distal elements in detail.

3.5. Hypomethylation of mutation blocks and enhancers in the BCL2 promoter
region can contribute to BCL2 overexpression in follicular lymphoma.

   The BCL2 gene is located at chromosome 18q21 and codes for BCL-2
   protein which inhibits apoptosis and is important for normal B-cell
   development and differentiation. Follicular lymphoma shows the t(14;18)
   chromosomal translocation. This translocation involves BCL2 and causes
   its overexpression, thus providing survival advantage to the malignant
   B-cells [122][67]. A previous study aimed at predicted two mutation
   blocks in TSS region of BCL2 [123][57]. The same two mutation blocks of
   3113 bp and 726 bp (block_66303, block_66304 respectively) in the TSS
   region ([124]Fig. 3; stable 3) are detected by this new genome-wide
   analysis. However, the new method expands its scope beyond the promoter
   region, hence there are additional eleven mutation blocks predicted in
   gene body of BCL2 and one in 5′ distance of BCL2 ([125]Fig. 3).
   Especially, 9 mutation blocks out of total 14 ([126]Table 1) were found
   overlapping with enhancers from 197 tissue/cell types. Target genes of
   the majority of these mutation blocks related enhancers are VPS4B and
   KDSR. Both genes are nearby BCL2. This indicates that there is a
   potential impact of these mutation blocks on long ranged regulation of
   these two genes. VPS4B is differentially expressed in diffuse large
   B-cell lymphoma [127][68]. KDSR is significantly differentially
   expressed between FL patient and normal samples (e.g.,
   P-value < 0.0003; upregulated in FL compared to normal). Detailed
   enhancer target gene information of these mutation blocks is provided
   in the [128]supplementary data.

Fig. 3.

   [129]Fig. 3
   [130]Open in a new tab

   Mutation blocks, differentially methylated regions, and enhancers in a
   single TAD around BCL2 identified by the new integrative data analysis.
   This figure displays mutation blocks, DMRs and enhancers that are
   identified in a single TAD containing the BCL2 gene. First panel
   presenting brown color horizontal bars presents mean methylation levels
   of normal samples in the DMR that are predicted around BCL2 and
   overlaps with the mutation block and enhancers. Second panel containing
   green coloured horizontal bars presents mean methylation levels of the
   same DMR in FL samples. Third panel presents all predicted DMRs with
   respective DMR IDs in the region in form of red tiles. Fourth panel
   presents overlapping enhancers (with mutation blocks or DMRs) in the
   region from the 5 FL related cell lines (DOHH2, GC B cell, Namalwa,
   OCI-ly1 and OCI-Ly7) in light blue color. Fifth panel presents mutation
   blocks (with respective block IDs) predicted by BayesPI-BAR2 in dark
   blue tiles. Sixth panel presents the RefSeq genes (BCL2, KDSR) present
   in the region. Seventh panel presents the TAD boundaries around the
   region, linking the start of TAD with its end with a grey curve. A
   yellow vertical bar across the figure highlights the important
   overlapping mutation blocks, DMRs, enhancers discussed in result
   section. Coordinates for all these genomic features are mentioned in
   stable4. (For interpretation of the references of color in this figure