Graphical abstract graphic file with name ga1.jpg [29]Open in a new tab Keywords: Genome, Epigenome, 3D chromatin domain, Regulatory mutation, Cancer, Machine learning, Integrative data analysis Abbreviations: differentially methylated region, DMR; topologically associated domain, TAD; follicular lymphoma, FL; single nucleotide variation, SNV; differentially expressed gene, DEG; principal component analysis, PCA; T-distributed stochastic neighbor embedding, t-SNE; group mean difference, GMD Abstract A major challenge in human genetics is of the analysis of the interplay between genetic and epigenetic factors in a multifactorial disease like cancer. Here, a novel methodology is proposed to investigate genome-wide regulatory mechanisms in cancer, as studied with the example of follicular Lymphoma (FL). In a first phase, a new machine-learning method is designed to identify Differentially Methylated Regions (DMRs) by computing six attributes. In a second phase, an integrative data analysis method is developed to study regulatory mutations in FL, by considering differential methylation information together with DNA sequence variation, differential gene expression, 3D organization of genome (e.g., topologically associated domains), and enriched biological pathways. Resulting mutation block-gene pairs are further ranked to find out the significant ones. By this approach, BCL2 and BCL6 were identified as top-ranking FL-related genes with several mutation blocks and DMRs acting on their regulatory regions. Two additional genes, CDCA4 and CTSO, were also found in top rank with significant DNA sequence variation and differential methylation in neighboring areas, pointing towards their potential use as biomarkers for FL. This work combines both genomic and epigenomic information to investigate genome-wide gene regulatory mechanisms in cancer and contribute to devising novel treatment strategies. 1. Introduction Carcinogenesis involves epigenomic, genomic, transcriptomic and proteomic changes. The integration of these changes is important for gaining a better insight into cancer molecular biology and may lead to improved diagnosis and finding novel strategies for cure. Epigenetic alterations such as DNA methylation and chromatin modifications are interlinked [30][1]. DNA hypomethylation or loss of DNA methylation on CpG dinucleotides was the first epigenetic anomaly to be recognized in cancer cells [31][2]. DNA methylation represses gene expression. Hypomethylation of DNA by contrast, can induce expression of genes, including oncogenes. Further, it can result in activation of transposable elements and loss of genomic imprinting. Hypermethylation of DNA is also a common feature of cancer. DNA hypermethylation is often seen at promoter regions of tumor suppressor genes, inducing their epigenetic silencing. Key gatekeeper genes like cyclin-dependent kinase inhibitor 2A(CDKN2A) and BRCA1, for example, are silenced in this way [32][3], [33][4]. Gene expression profiles of tumor cells have been useful in classification, prognostication and prediction of multiple types of cancers such as breast, colorectal and lung cancer etc. [34][5], [35][6], [36][7]. It has been demonstrated in multiple studies that differential expression of certain gene sets is linked to cancer progression [37][8]. This has led to development of gene signatures to predict prognosis of cancer [38][9]. Gene expression profiling can not only predict clinical outcome but also be used to select optimal personalized therapy. Nonetheless, expression profiles of cancer can be highly variable limiting the use of expression profiling in clinical practice. [39][10]. Incorporation of DNA sequence variations like Single Nucleotide Variations (SNVs) is therefore warranted, as they are typically present in cancer and are stably detected [40][11]. Multiple cancer genetic studies have reported SNVs that can contribute to cell transformation as gain-of-function or loss-of-function, for example by activating an oncogene or reducing the expression of a tumor suppressor [41][12]. However, it remains difficult to associate any SNV functionally to a gene, and consequently to cancer because of the regulatory complexity of higher order genomes. Due to presence of long-distance regulatory elements, physical proximity of any SNV to an oncogene or tumor suppressor gene is not sufficient evidence for it to be labeled as cancer driver SNV [42][13]. Chromatin architecture also comes into play here. The majority of the long-range chromatin interactions happen within and are limited by Topologically Associated Domains (TAD) boundaries. Coupling SNVs with the genes present in a similar TAD can give us an indication about the functional relevance of a SNV, and thereby its contribution as a cancer driver mutation [43][14], [44][15]. Follicular lymphoma (FL) is a recurring lymphoma for which chromosomal translocation was identified [45][16]. However, this translocation alone is not sufficient to cause FL. Recurrent mutations have also been reported in FL in multiple developmental, signaling pathway and chromatin regulator genes [46][17]. In addition, epigenetic alterations have been observed in FL [47][18]. DNA methylation of tumor suppressor genes have been reported in FL [48][19]. DNA hypermethylation may also cause transcriptional repression of functionally important genes in FL [49][20].A gene-expression profiling study predicted the risk of progression of patients with follicular lymphoma using a 23-gene score [50][21]. Of note, the most previous research conducted in FL does not comprehensively address all genetic changes as explained earlier. For instance, an integrated data analysis pipeline [51][22] was developed previously to identify putative functional regulatory mutations in FL by considering both the gene expression profiles and the clustered distribution of SNVs. However, it was only able to predict regulatory mutations near the promoter region of genes. To have a comprehensive picture of oncogenesis of FL, we are motivated to design a new integrated data analysis method that takes into account epigenomic (e.g., DNA methylation), genomic (e.g., chromatin architecture- TAD), and transcriptomic (e.g., gene expression) information together with the distribution of genome wide SNVs in patients. In this way, our method makes it possible to predict functional regulatory mutations that affect gene regulation through a long-distance. This will be a great leap forward for the investigation of non-coding mutations in cancer or disease, by utilizing genome wide sequence technology in clinical studies. Publicly available biological data sets allow comprehensive analysis of data. However, efficient statistical and bioinformatic methods for such integrative analysis are lacking. A few tools like sTRAP and is-rSNP predict functional non-coding variants by hypothesizing that non-coding variants can affect gene expression by altering protein-DNA binding [52][23], [53][24]. Another method searched for SNP combinations for disease on the basis of the energy distribution difference considering an individual’s genotype data as a point with a unit of energy [54][25]. CADD and FunSeq2 can integrate some data types like predicted transcription factor binding sites, measured ChIP-Seq peaks of TFs, chromatin state marks, conservation scores and protein–protein interactions [55][26], [56][27]. Another study integrates DNA methylation, gene expression and somatic mutations to infer tissue-of-origin of a tumor [57][28]. In short, there are multiple integrative studies conducted on cancer, but none integrates all the genetic and epigentic changes (i.e. DNA methylation, gene expression, DNA sequence variation and Topologically associated domains) in cancer.[58][29], [59][30], [60][31], [61][32]. Moreover, there is less focus on using differential methylation for functional annotation of SNVs. Previous studies have reported a correlation between the regional methylation level and the rate of mutation at CpG sites in genomic regions [62][33]. Especially in the presence of allele-specific methylation, mutations in the driver genes can be inherently connected with the aberrant DNA methylation landscape in cancer [63][34], [64][35], [65][36]. This points towards the potential use of differential methylation to gauge the functional relevance of a SNV. Although many studies have highlighted the link of the DNA methylation and SNVs, there is no method to our knowledge which employs DNA methylation data to identify disease-related SNVs. While some methods for differential methylation detection are already available, it is important to have a score or rank explaining the magnitude of the differential methylation. In particular, if we want to profit from differentially methylated regions (DMRs) in an integrative study. To overcome this limitation, we have first devised a new method for significant differential methylation detection, which was used to analyze DMRs in FL through parallel computation of six attributes ([66]Fig. 1). Some of the computed attributes are used to identify high confidence DMRS (hcDMRs). HcDMRs then serve as a standard to rank the remaining methylated regions. The resulting model reports a set of DMRs with respective scores depicting the significance of a particular DMR. Fig. 1. [67]Fig. 1 [68]Open in a new tab Work flow for ranking differentially methylated regions between two groups by a new machine learning approach. This figure describes a new machine learning approach for predicting and ranking high quality differentially methylated regions (DMRs) with four steps: 1) search for methylated region (MR) in a genome-wide manner based on predefined criterias, 2) parallel computation of six attributes in each MR, 3) four of the attributes (e.g., the percentage of differentially methylated CpG methylation sites, the clustering accuracy of predicted sample group label based on 2-D t-SNE map, the percentage of the high and median methylation level changes, and the significance of Euclidean distance difference between the intra-group and the inter-group) are used to identify highly confidence DMRs (hcDMRs), 4) fits a logistic regression model for all available MRs by using hcDMRs as true targets. Probability value of each MR (from logistic regression model) is used to rank the DMRs per their significance. Acknowledging the role of differential methylation and chromatin architecture at the epigenetic level and DNA sequence variations like SNVs and gene expression profiles in cancer, we broaden the scope of our study by using a newly developed integrative data analysis method to investigate regulatory mutations in FL at a genome wide level. First, genomic blocks having a high SNV concentration were identified (we will address them as mutation blocks), and mapped to the DMRs and differentially expressed genes (DEG) that are present in the same TAD. Then, the frequency of occurrence of mutation blocks and its annotation to the genomic elements, differential expression level of the associated genes and the related DMR score are used as features to identify significant mutation block-gene pairs in FL. From the analysis, a final set of genes associated to mutation blocks is obtained which is further evaluated by a robustness analysis, based on an independent source (e.g., chromatin state segmentations) that was not used in the prediction. Mutation block-gene associations related to FL that passed robustness analysis with high statistical and biological support, are reported in this study. These set of mutation block-genes can be seen as cancer drivers. Similarly, the set of hcDMRs can have a strong potential of being used as biomarkers for diagnosis, prognosis, prediction and potential treatment of FL. Our study presents a robust method for DMR detection and integrative genomic analysis of regulatory mutations that can be applied to any malignancy. 2. Material and methods 2.1. Diverse high throughput sequencing data for follicular lymphoma patients Genome-wide sequencing data of 14 tumor-normal paired FL patients was obtained from a previous study [69][37], by getting access to controlled data kept on ICGC. Samples were downloaded from European Genome-phenome Archive [70][38] ([71]https://www.ebi.ac.uk/ega/) under accession numbers EGAD00001000645 and EGAD00001000355. RNA-Seq data of four control samples (Germinal center B-cell - GCB) from healthy people was downloaded from GEO database under accession number GSE4598265 [72][39]. DNA methylation data of whole-genome bisulfite sequencing (WGBS) for 8 FL patients and 4 GCB control samples was acquired from an earlier work [73][40]. For data sets not available specifically for FL, we chose data sets closest possible to FL. Human common Topologically Associating Domains (TAD) and boundaries information was downloaded from supplementary Tables 1 and 4 of [74][41]. The common boundaries are from five human cell lines that represent three distinct embryonic germ layers (GM12878 and HMEC, mesoderm; IMR90, endoderm; HUVEC and NHEK, ectoderm). Those TAD boundaries were reported in the original study as significantly (83–85% with pval 〈10^−7) conserved between normal and malignant cells and were thus used in this study. Human enhancer annotations of 197 tissue/cell types were download from EnhancerAtlas 2.0 [75][42]. Annotations from 5 out of 197 cell lines (DOHH2, GC B cell, Namalwa, OCI-ly1 and OCI-Ly7) were later grouped as FL-related cell lines [76][43]. Information of KEGG, BIOCARTA, and GO Biology Process pathways was retrieved from DAVID functional annotation tool [77][44]. Here, all sequencing data were aligned to hs37D5, a variant of GRCh37 human genome assembly used by the 1000 Genomes project [78][45]. Genome-wide mutations were called by using Strelka [79][46] and MuTect [80][47] with the default parameters. An intersection of mutation calls from both programs was used in the further data analysis for each patient [81][48]. From mutations called by these programs, we only considered SNVs. For identifying the transcripts of all protein-coding genes, we used gene annotation from the UCSC hg19 [82][49]. Annotation to the reference genome were performed on four defined genomic regions: the TSS/TES regions between −5kb and + 1kp to the TSS (transcription start site)/TES (transcription end site) of protein-coding genes, and the gene body region between TSS and TES, and the 5′distance regions were calculated from 1 Mb to 5 kb upstream of the TSS. These four defined regions are similar to the previous publications [83][50], [84][51] in differential methylation analysis and identification of promoter-distal loops. Gene expression levels are measured as reads per kilobase of transcript per million mapped reads (RPKM) of RNA-Seq experiments and were computed by applying the featureCounts [85][52] and our in-house Python code on aligned BAM files. In this study, the genetic (SNVs), epigenetic (DNA methylation), and transcriptomic (gene expression) data are from the same FL patient cohort [86][40]. A brief description of these FL samples and the procedures for obtaining them are provided in supplementary Stable 1. The tumor cell content in the cryopreserved sample material was at least 60% in all cases [87][40]. More information of these FL samples and the basic characterization including histopathological panel review and immunohistochemical and FISH analyses can be seen in the previous publication [88][37]. 2.2. Segmentation of human genome in functional regions based on chromatin features Chromatin modifications are important epigenic makers in genome, which can be used to characterize functional regions (e.g., enhancer and TSS et al). We obtained predicted segmentation of human genome based on chromatin features from an earlier publication [89][53]. This segmentation of human genome is based on the predictions from two machine-learning methods (ChromHMM [90][54] and Segway [91][55]), by using multiple chromatin marks (e.g., H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K27ac, H3K27me3, H3K36me3, and H3K20me) known to be involved in enhancer, repression, or promoter regions across six human cell-lines (e.g., GM12878, H1 hESC, HeLa-S3, HepG2, HUVEC, and K562). In addition to chromatin marks, other genomic marks such as Pol2, CTCF, and nucleosome density were also considered. The final combined segmentation from the two predictions uses only seven chromatin states to segment the human genome in functional regions (e.g., TSS – predicted promoter region including TSS; PF – predicted promoter flanking region; E – predicted enhancer; WE – predicted weak enhancer or open chromatin cis regulatory element; CTCF – CTCF enriched element; T – predicted transcribed region; R – predicted repressed or low activity region). 2.3. Identifying methylation regions There are a few popular methods available for differential methylation analysis, but with their own limitations. For example, MethylKit predicts differential methylation at a single base pair resolution, neglecting the confounding effects of any neighboring methylated site [92][56]. This limitation is addressed by HMST-Seq-Analyzer by predicting DMRs instead of differentially methylated sites [93][51]. However, HMST-Seq-Analyzer can only make one-to-one or one-to-many comparisons and cannot make group comparisons (many-to-many). Therefore, a new method for rigorous differential methylation detection is developed. It takes into account the confounding effect of neighboring methylated sites and can perform group comparisons with multiple samples in each group. As a foremost step, it tries to search for Methylated Regions (MR) in the genome. The selection criteria for MRs are that there should be a minimum number of CpG methylation sites in a MR (default of minimum 5), and any two neighboring CpGs must not be any further than a specified distance (by default 250 bp). Gathering MRs from genome-wide data can produce hundreds and thousands of MRs depending upon the type of sequencing method used for methylation detection. A brute force attempt for differential methylation analysis can be computationally exhaustive at the genome-wide scale. Therefore, once all MRs passing aforementioned filtering conditions are acquired from the sequencing data, a parallel computational algorithm is used to assess the significance of differential methylation between the two groups in each MR. The total number of observed MRs is equally divided on the available computer processers (e.g., 10 or 20 processes) and is ran in parallel, which significantly speeds up the calculation. Lastly, a rigorous set of DMRs is reported with their scores which can be used as input to any further integrative study. 2.4. Parallel computation of six attributes in each methylated region Six major attributes are computed and evaluated at the MR level in order to identify Differentially Methylated Region (DMR): 1) Interpolated and smoothed data curves of original methylation levels are computed for both tumor and control/normal groups, and a 95% confidence interval from the group mean is graphically illustrated. 2) Based on the smoothed methylation curves of all samples, the first three principal component elements of PCA (Principal Component Analysis - a linear dimensionality reduction method) are calculated and visualized in a 3D plot (e.g., samples are colored by their group label), from which the difference among samples due to methylation variation in a MR is revealed. 3) As a third attribute, a two sampled T-test is performed to evaluate the significance of differential methylation at each CpG site, respectively. The percentage of CpG methylation sites in a MR that reaches a predefined significant level (e.g., T-test P value < 0.05) is recorded. The higher the percentage the better the differentially methylated region. 4) As a fourth attribute, T-distributed Stochastic Neighbor Embedding (t-SNE, a non-linear dimensionality reduction method) is applied on the smoothed methylation profiles of each MR, and the corresponding 2-D plot of samples is generated (e.g., samples are colored by their group label). Subsequently, k-means clustering is applied on this 2-D t-SNE map, to estimate the clustering accuracy by comparing the predicted sample group label against the true sample group label. The higher the clustering accuracy the more significant the DMR. 5) The fifth attribute is categorizing the levels of differential methylation changes (or absolute group mean difference - GMD) at each CpG site of a MR. As a default, three levels are defined i.e., 0.07 < GMD<=0.1, 0.1 < GMD<=0.2, and GMD > 0.2 for low, median and high methylation changes, respectively, between tumor and control group. The percentages of CpG sites in a MR with low, median, and high methylation changes are recorded. The higher the percentage of high methylation changes the more significant the DMR will be. 6) Finally, Euclidean distances for samples within a group (e.g., intra-group in either tumor or normal group) and between groups (e.g., inter-group such as between tumor and normal group) are calculated based on the methylation profiles, respectively. A two sampled T-test is used to evaluate the significance of the difference between the intra-group distance and the inter-group distance (e.g., Euclidean distances of samples within a tumor group vs. Euclidean distance of samples between tumor and normal). The corresponding P-values are recorded for each MR. The underlying hypothesis for computing this attribute is that the intra-group distance is usually significantly different from inter-group distance for a DMR. Thus, the more significant the difference of distance (between the intra and the inter groups), the more significant the DMR. Here, the first two attributes (attribute 1 and 2) are computed for the purpose of visualization and giving an impression of the distribution of data at a broader level. The rest of the four attributes and the recorded summary statistics are directly used as features for identifying high confidence DMRs (hcDMRs). 2.5. Predicting high confidence differentially methylated regions by a two-levels approach After completing the parallel computation of six major attributes in all the identified MRs, a two-level filtering approach is adopted to determine a set of hcDMRs. The first filter is applied on the attribute 5 i.e., the percentage of high methylation changes (e.g., mean group difference > 0.2) between two groups should be greater than zero. Then a second level filter is applied, where a putative hcDMR should meet at least one of the two conditions: either the percentage of CpG sites in a MR that are significantly different between two groups (e.g., T-test P value < 0.05 in attribute 3) is greater than a threshold value (e.g., default > 0), or there should be a significant difference of Euclidean distances between the intra-group and the inter-group (T-test P value < 0.05 in attribute 6). After the two levels filtering, a set of putative hcDMRs are obtained. The strength of these DMRs is controlled by three key parameters such as 1) the percentage of high methylation changes (e.g., mean group changes > 0.2; default) between two group and 2) the percentage of methylation sites show differential methylation in a MR are greater than zero, and 3) a significant difference (P value < 0.05 in default) of Euclidean distances between the intra-group and the inter-group. 2.6. Ranking differentially methylated regions through logistic regression Although hcDMRs are a solid set of DMRs, to improve the sensitivity of the method and to sort DMRs based on their significance, a ranking approach through logistic regression method is further introduced. Here, hcDMRs are used as true target sites in logistic regression to fit all the available MRs with pre-computed attributes. The four attributes used as regressors are: 1) the percentage of methylation sites in a MR that have passed significant level of differential methylation (from attribute 3), 2) the clustering accuracy for predicting group labels in a MR based on a 2-D projected t-SNE methylation profiles (from attribute 4), 3) the percentage of group mean methylation changes in the high and median level changes (from attribute 5) and 4) the log10 transformed P values of the significance of difference in Euclidean distance between the intra-group and the inter-group (from attribute 6). A probability value is assigned to each MR after fitting the logistic regression model to all MRs by using the high confidence DMRs as true target. These probability values can be used to sort and select the most significant DMRs. For example, if a probability of logistic regression model is P>=0.7, then around 90% of the hcDMRs from the initial two-step filtering (e.g., the mean group changes > 0.2 and, either the percentage of differentially methylated CpG sites is > 0 or there is a significant difference in Euclidean distance between the intra-group and the inter-group) will be included, endorsing that the initial two-steps filtration is powerful for identifying sturdy DMRs. Unlike other popular methods for differential methylation analysis, the length of DMRs predicted by the current method can range from dozens of bp to hundred thousand bp which is similar to actual behavior of DMRs. Moreover, only the distribution of methylation sites is considered in defining a MR and the same trend of methylation level changes in a MR is not forced. Therefore, three types of DMRs (hyper, hypo, and mix) are reported in the prediction. HyperDMRs indicate increase in methylation levels as compared to control/normal samples. HypoDMRs have decreased methylation levels as compared to control/normal samples. It can be misleading to assume that methylated regions, showing differential methylation, will either show increase or decrease of the methylation level. Some DMRs can show both increasing and decreasing levels of methylation at different sites within the region. The current method captures such a possibility as well and reports them as mixed DMRs. All DMRs can also be manually explored through the plots exported by our method. More information about both the identification of hcDMRs and the ranking of DMRs are shown in [94]Fig. 1. 2.7. Integrating differential Methylation, differential gene expression and topologically associated domain information in regulatory mutation prediction Based on the aforementioned new method for analyzing and ranking DMRs between two groups of samples, it is possible to integrate the differential methylation with differentially expressed genes (DEG) data in predicting functional non-coding mutation in disease. First, SNVs from whole-genome-sequencing (WGS) data for the disease are identified by using Strelka and Mutect. An intersection of the SNVs predicted by the both programs was used for further analysis to strengthen the evidence, as performed in a previous publication [95][57]. A region harboring multiple SNVs can have more regulatory potential as compared to a single SNV. Hence, genome-wide identification of mutation blocks in patient samples is done by using BayesPI-BAR2 [96][58]. We first identify mutation clusters and then group them into mutation blocks. Mutation cluster will be a genomic region having a certain number of consecutive SNVs present in any of the patient sample. Here, any regions having at least one SNV in any of the patient samples were selected and grouped into mutation clusters to keep low stringency. In case of more than one SNVs the distance between adjacent SNVs should be<30 bp to be included in the same cluster. The mutation clusters were further grouped into mutation blocks. Mutation block will be a group of mutation clusters, where any consecutively located clusters are not>500 bp apart from each other. Then, these mutation blocks were annotated to multiple genomic regions based on annotated HG19 reference genome (i.e., Gene, TSS, TES, 5′distance and enhancers). In this study, only mutation blocks that were either overlapping with a DMR or associated with a DEG (e.g., a mutation block located in a the TSS, TES, 5′distance region or gene body of the DEG) in FL patients are considered. However, mutations can impact the expression profiles of target genes from long range interaction as well. To ensure the confidence in relevance of relationship between a mutation block and its long-distance target gene (e.g., DEG), the search for mutation block-DEG pair in 5′distance region is confined within the same topologically associated domains (TAD). The information of TAD that are common in five cell lines, is obtained from a recent paper [97][41]. Usually, mutations or SNVs are more influential if present in the regulatory region such as enhancers, as they may disrupt or create a binding site for a transcription factor. The identified mutation blocks are further mapped to enhancer regions provided by EnhancerAtlas2.0 [98][42]. Once all mutation blocks are linked to either DMR or DEG, the strength of genes associated (e.g., through gene, TSS, TES, or 5′distance) to relevant mutation blocks is inferred by a method similar to weighted voting systems [99][59], where four features are used in ranking: 1) the number of patients having affected mutation blocks in a gene, 2) the probability of logistic regression fitting for DMRs associated to a gene, 3) the absolute log10 transformed P-values of DEG, and 4) the annotated genomic region that is linked to a mutation block (e.g., a mutation block locates in TSS, Gene, Enhancer, TES, and 5′distance region will be assigned a weight 4, 4, 3, 2, and 1, respectively). An average of normalized four features’ scores (min–max normalization) is being used to rank the strength of associations between a mutation block and a gene (e.g., normalized score spans from 0 to 1). Finally, mutation block-gene pairs with an average of normalized feature scores>0.5 are extracted, and subjected to GO and pathway enrichment analysis (P-value < 0.05) by using DAVID functional annotation tool [100][60]. In [101]Fig. 2, a workflow or pipeline for such integrated analysis (DMR, DEG, TAD information, and DNA sequence variation) of mutation blocks in FL is presented. Fig. 2. [102]Fig. 2 [103]Open in a new tab Work flow for integrative analysis of regulatory mutations in follicular lymphoma by using whole genome sequencing, differential methylation, differential expression and TAD information. First, SNVs from the whole genome sequencing (WGS) data of 14 follicular lymphoma (FL) patients were used to identify genome-wide mutation blocks in FL by using BayesPI-BAR2. Then, the mutation blocks were annotated to four genomic regions (gene, TSS, TES, and 5′distance regions) as well as to enhancers collected from EnhancerAtlas 2.0. Subsequently, mutation blocks overlapping with either differentially methylated region (DMR) or differentailly expressed gene (DEG) are recorded. For a mutation block-DEG association that is linked by a 5′distance region of gene, the method requests that both the mutation block and the gene are located in the same topologically associated domain (TAD). The importance of mutation block-gene associations is ranked by four features: 1) the number of patients having affected mutation blocks, 2) the significance of a DMR, 3) the significance of the differentially expressed gene, and 4) a weighted score for annotated genomic region (e.g., gene, TSS, TES, 5′distance or enhancers) that a mutation block is mapped to. Finally, a set of top ranked mutation block-gene associations are extracted, where the selected genes are enriched in a common pathway or GO biology process. 2.8. A robustness analysis of mutation block-gene associations by evaluating seven chromatin states Here, a new robustness analysis is developed to evaluate the top ranked mutation block-gene associations obtained from an average of normalized feature scores (e.g., 327 genes in [104]Fig. 2). The new analysis is based on independent information i.e. segmentation of human genome to seven functional regions based on chromatin features [105][53] (e.g., active or repressive histone modifications and nucleosome density), which was not used in the prediction. The robustness analysis utilized a permutation test to assess the significance of associated mutation blocks in seven chromatin states (e.g., R, T, TSS, enhancer, CTCF, WE, PF), respectively. The segmentation of human genome in chromatin states was predicted by two machine learning methods ChromHMM and Segway. First, for mutation blocks associated to each predicted target gene, if there are N number of mutation blocks in a gene, then the percentage of blocks (the actual percentage) located in the seven chromatin states are calculated, respectively. Then, we randomly draw 10,000 times of N mutation blocks from all ∼ 66467 blocks of 14 FL patients that overlap with the seven chromatin states, where the N mutation blocks predicted in the first step are excluded. Subsequently, for each sampled N random mutation blocks, their percentage (the expected percentage) in the seven chromatin states is computed, respectively. In each of the chromatin states, if the expected percentage obtained from the randomly sampled mutation blocks is greater than the actual percentage in the first step, then the chromatin state is incremented by one. Finally, for each tested gene, an expected P-value of its associated mutation blocks located in one of the seven chromatin states is calculated (e.g., P-value = the total number of the expected percentages greater than the actual ones divided by the number of samplings such as 10000), respectively. A filtered list of top ranked mutation block-gene associations will be obtained by assuming that the functional regulatory mutation blocks are significantly enriched (e.g., expected P-value < 0.05) in either TSS or enhancer regions. Thus, a final top ranked (e.g., within the top 20) mutation block-gene associations will be more reliable, if they passed such robust analysis in multiple predictions based on the same data. 3. Results 3.1. Predicting DMRs with a new machine learning method To identify DMRs between FL patients and normal samples, a newly developed machine learning approach (default parameters) was applied on the WGBS data of 8 tumors and 4 control samples, where multiple samples from each group were analyzed simultaneously. A probability P > 0.7 of logistic regression model was used as a cutoff value for detecting DMRs between case (FL) and control(normal) samples. Total 275,949 DMRs were predicted. These DMRs were annotated to four defined genomic regions (TSS, TES, gene, and 5′distance), as well as to human enhancers obtained from EnhancerAtlas [106][42] by using intersection function of BEDTools [107][61]. Example of predicted DMRs is: mr37 in chromosome 1 (SFig. 1), having probability = 1 in the logistic regression model. A counter example of a region not considered DMR (mr60385 in chromosome three; probability = 5.582354e-16 in logistic regression model) is illustrated in SFig. 2. In SFig. 1, the predicted DMR is roughly 250 bp long with six attributes illustrated. A difference in the trend of smoothed methylation profiles between tumor and normal samples can be seen in the upper panel of SFig. 1. Upon calculating the PCA based on the smoothed methylation profiles, tumor samples clearly separate from the normal ones in a 3-D plot of the first three principal components. Nevertheless, it is crucial to see the shape of methylation level distribution in the lower panel of SFig. 1, where ∼ 53.8% of the methylated sites in the MR are differentially methylated (P-value < 0.05) between tumor and normal group. Clustering accuracy of K-means clustering for tumor and normal samples is high (equals ∼ 0.92), according to the two-dimensional t-SNE projections for all available methylation sites in the MR. Especially, the difference between intra-group and inter-group Euclidian distance is marginally significant for normal group (P-value < 0.06) but significant for tumor group (P-value < 0.0083), and the peak for mean group methylation changes is centered around −0.1. These six attributes in the SFig. 1 reiterate the significant differential methylation in mr37 on chromosome one. 3.2. Comparison of predicted DMRs between the new method and the HMST-seq-Analyzer Though HMST-Seq-Analyzer is a tool to predict DMRs from data obtained from multiple methylation detection methods including Whole Genome Bisulfite Sequencing (WGBS) [108][51], in the case of group comparisons, it performs one-to-many comparison only. In this study, mean methylation levels of 4 normal GCB WGBS samples were used to compare to 8 FL patient data [109][40], respectively, by using the default parameters of HMST-seq-Analyzer. Only those DMRs (∼60322) predicted in all 8 samples (∼78–90% of each prediction) were used for further comparison to the new results (∼259963 with probability > 0.8 in logistic regression model of new machine learning approach). Around 92% (239325) of the DMRs from the new prediction method were completely overlapping with that from the HMST-Seq-Analyzer. If the probability cutoff of logistic regression model is varied between 0.5 and 0.7, the overlap between the two results remains ∼ 91 to 92%. It is worthy to note that HMST-Seq-Analyzer generates much longer DMRs than the new method, which explains why more DMRs are predicted by the new method than the HMST-Seq-Analyzer while maintaining a high percentage of overlap between the both. Thus, DMRs detected by the new method are robust and their shorter length makes their annotation and integrated data analysis further easier. 3.3. Including differential methylation and differential gene expression in regulatory mutation analysis There are 118867 and 81812 single nucleotide variants (SNVs) in 14 FL patients, called by MuTect and Strelka, respectively. About 71235 SNVs (∼87% overlap) were detected by both methods. They were selected to identify genome-wide mutation blocks in FL by using BayesPI-BAR2 [110][58], [111][62]. The result spans to 66868 mutation blocks, which requests minimum one patient and one SNV in a cluster. The SNP cluster and block distance was kept as 30 bp and 500 bp, respectively. Here, a built-in mutation background model from BayesPI-BAR2 was not applied to select highly mutated blocks (i.e., SNVs from multiple patients are located in the same mutation block). Instead, a putative functional mutation block was selected based on a different criterion: either it overlaps with a DMR or triggers a nearby gene activity (e.g., DEG). This assumes that gene regulation or TF binding and DNA methylation often affect each other [112][63], [113][64]. For example, impact of SNV on TF binding may cause a variation of DNA methylation levels in neighboring regions or a dysregulation of gene expressions in the nearby location. Following this assumption, ∼13143 mutation blocks were found overlapping with the predicted DMRs. After considering genes that were associated to these DMRs through a gene body, TSS, TES, or 5′distance regions, and mutation blocks that are overlapping with either DMR (∼4603 mutation blocks) or their associated genes are differentially expressed (∼1831 DEG; P < 0.05), previously published mutation block-gene associations in FL are recovered in this initial analysis: for example, three known regulatory mutation blocks near the promoters of dysregulated BCL6 and BCL2 genes [114][57] (up and down regulated in FL compared to normal, respectively). This is a result that supports the hypothesis that functional regulatory mutation may affect DNA methylation level and/or gene expression activity in the nearby region. 3.4. Ranking mutation block-gene association by considering diverse information Though it is possible to narrow down the number of mutation block-gene association by considering both DMR and DEG information, it is a challenge to evaluate their significance. The problem is further complicated by the fact that a mutation block may be assigned to multiple 5′distance regions of different genes (e.g., from 1 Mb to 5 kb upstream of the TSS). Thus, an additional evaluation of mutation block-DEG associations through 5′distance regions is needed: for example, if a mutation block and a gene are not located in the same TAD, then their association through 5′distance region will be ignored. In this study, information of common TADs/boundaries in five cell lines (GM12878, HUVEC, IMR90, HMEC and NHEK) were obtained from a previous publication [115][41]. After thus filtering mutation block-gene pairs, ∼1453 differentially expressed genes (DEG in TSS, TES, gene, or 5′distance) are associated with ∼ 5756 mutation blocks (e.g., either overlapping with DMRs or not). Among these mutation blocks, ∼3684 of them are located in enhancer regions [116][42]. To rank the significance of these inferred mutation block-gene associations, a weighted vote approach was used to rank them by integrating normalized four feature scores (e.g., the number of patients affected by mutation blocks, DMR significance, P-value to DEG, and the weighted genomic feature for a mutation block; [117]Fig. 2). In this work, an average of normalized feature scores was used to export the top ranked mutation block-genes associations: for example, with a mean feature score>=0.5, ∼327 genes are selected. Among these the BCL2 gene is ranked at the top. The number of mutation blocks located in enhancers/TES/TSS/Gene remains stable (∼76%) when a mean feature score cutoff value is decreased (e.g., <0.4). These 327 top ranked genes were used in further pathway analysis because functional mutations often influence multiple genes in the same pathway or biological process [118][65], [119][66]. Thus, GO enrichment analysis is applied on these top ranked genes by using DAVID functional annotation tool [120][60]. The enriched GO biological process and KEGG/BIOCARTA pathways (e.g., P < 0.05) are extracted for defining a final list of mutation block-gene associations, which includes 159 genes involved in several important signaling pathways and biological processes related to FL. For example, intrinsic apoptotic signaling pathway in response to DNA damage, T cell receptor signaling pathway, B cell receptor signaling pathway, NF-kappa B signaling pathway, transcriptional misregulation in cancer, and immune response etc. The aforementioned enriched pathways are affected by mutation blocks from at least 13 FL patients (data in supplementary website). Notably, our previously predicted putative functional regulatory mutation blocks near BCL2 and BCL6 genes [121][57] are ranked in the top 10 (e.g., ranked 1 and 9 for BCL2 and BCL6, respectively; supplementary Stable 2) by this new analysis. Additionally, several novel mutation block-gene associations in FL are also identified (e.g., mutation block associated to CTSO and CDCA4 are ranked at top 2 and 5, respectively; supplementary Stable 2). The presence of numerous mutation blocks and DMRs in the vicinity of these genes urges also to investigate the role of long-range interaction in their regulation. Coming sections will discuss the genes and the respective distal elements in detail. 3.5. Hypomethylation of mutation blocks and enhancers in the BCL2 promoter region can contribute to BCL2 overexpression in follicular lymphoma. The BCL2 gene is located at chromosome 18q21 and codes for BCL-2 protein which inhibits apoptosis and is important for normal B-cell development and differentiation. Follicular lymphoma shows the t(14;18) chromosomal translocation. This translocation involves BCL2 and causes its overexpression, thus providing survival advantage to the malignant B-cells [122][67]. A previous study aimed at predicted two mutation blocks in TSS region of BCL2 [123][57]. The same two mutation blocks of 3113 bp and 726 bp (block_66303, block_66304 respectively) in the TSS region ([124]Fig. 3; stable 3) are detected by this new genome-wide analysis. However, the new method expands its scope beyond the promoter region, hence there are additional eleven mutation blocks predicted in gene body of BCL2 and one in 5′ distance of BCL2 ([125]Fig. 3). Especially, 9 mutation blocks out of total 14 ([126]Table 1) were found overlapping with enhancers from 197 tissue/cell types. Target genes of the majority of these mutation blocks related enhancers are VPS4B and KDSR. Both genes are nearby BCL2. This indicates that there is a potential impact of these mutation blocks on long ranged regulation of these two genes. VPS4B is differentially expressed in diffuse large B-cell lymphoma [127][68]. KDSR is significantly differentially expressed between FL patient and normal samples (e.g., P-value < 0.0003; upregulated in FL compared to normal). Detailed enhancer target gene information of these mutation blocks is provided in the [128]supplementary data. Fig. 3. [129]Fig. 3 [130]Open in a new tab Mutation blocks, differentially methylated regions, and enhancers in a single TAD around BCL2 identified by the new integrative data analysis. This figure displays mutation blocks, DMRs and enhancers that are identified in a single TAD containing the BCL2 gene. First panel presenting brown color horizontal bars presents mean methylation levels of normal samples in the DMR that are predicted around BCL2 and overlaps with the mutation block and enhancers. Second panel containing green coloured horizontal bars presents mean methylation levels of the same DMR in FL samples. Third panel presents all predicted DMRs with respective DMR IDs in the region in form of red tiles. Fourth panel presents overlapping enhancers (with mutation blocks or DMRs) in the region from the 5 FL related cell lines (DOHH2, GC B cell, Namalwa, OCI-ly1 and OCI-Ly7) in light blue color. Fifth panel presents mutation blocks (with respective block IDs) predicted by BayesPI-BAR2 in dark blue tiles. Sixth panel presents the RefSeq genes (BCL2, KDSR) present in the region. Seventh panel presents the TAD boundaries around the region, linking the start of TAD with its end with a grey curve. A yellow vertical bar across the figure highlights the important overlapping mutation blocks, DMRs, enhancers discussed in result section. Coordinates for all these genomic features are mentioned in stable4. (For interpretation of the references of color in this figure