Abstract

   DNA methylation plays a critical role in gene regulation, affecting
   cellular differentiation and disease progression, particularly in
   non‐coding regions. However, predicting the epigenetic consequences of
   non‐coding mutations at single‐cell resolution remains a challenge.
   Existing tools have limited prediction capacity and struggle to capture
   dynamic, cell‐type‐specific regulatory changes that are crucial for
   understanding disease mechanisms. Here, Methven, a deep learning
   framework designed to predict the effects of non‐coding mutations on
   DNA methylation at single‐cell resolution, is presented. Methven
   integrates DNA sequence with single‐cell ATAC‐seq data and models
   SNP‐CpG interactions over genomic distances of up to 100 kbp. By using a
   divide‐and‐conquer approach, Methven accurately predicts both short‐
   and long‐range regulatory interactions and leverages a pre‐trained
   DNA language model for enhanced precision in classification and
   regression tasks. Methven outperforms existing methods and demonstrates
   robust generalizability to monocyte datasets. Importantly, it
   identifies CpG sites associated with rheumatoid arthritis, revealing
   key pathways involved in immune regulation and disease progression.
   Methven's ability to detect progressive epigenetic changes provides
   crucial insights into gene regulation in complex diseases. These
   findings demonstrate Methven's potential as a powerful tool for basic
   research and clinical applications, advancing the understanding of
   non‐coding mutations and their role in disease, while offering new
   opportunities for personalized medicine.

   Keywords: deep learning, DNA methylation, non‐coding mutations,
   single‐cell resolution, SNP‐CpG interactions
     __________________________________________________________________

   Methven is a deep‐learning framework that predicts the impact of
   non‐coding SNPs on DNA methylation at single‐cell resolution. By
   integrating a pre‐trained DNA language model with single‐cell ATAC‐seq
   data, Methven models SNP‐CpG interactions across genomic distances up
   to 100 kbp. It supports both classification and regression outputs,
   enabling dynamic predictions of the direction and magnitude of
   methylation changes, and uncovers cell‐type‐specific mechanisms linked
   to diseases such as rheumatoid arthritis.

1. Introduction

   DNA methylation is a key epigenetic modification essential for
   regulating gene expression, with critical roles in cellular
   differentiation and disease pathogenesis.^[ [30]^1 , [31]^2 , [32]^3 ^]
   This modification is embedded within a complex regulatory network,
   influenced by non‐coding regions such as enhancers and silencers, which
   mediate chromatin structure, DNA accessibility, and protein
   interactions.^[ [33]^4 , [34]^5 , [35]^6 ^] Mutations in these
   non‐coding regions can disrupt chromatin loops or transcriptional
   machinery recruitment, leading to aberrant methylation patterns
   associated with diseases such as cancer and autoimmune disorders.^[
   [36]^7 , [37]^8 , [38]^9 ^] For instance, mutations in enhancer regions
   have been implicated in oncogene activation, while mutations linked to
   autoimmune diseases may alter the regulation of immune‐related genes.^[
   [39]^10 , [40]^11 , [41]^12 ^] Non‐coding mutations often act
   synergistically with other genetic or epigenetic modifications,
   amplifying disease progression.^[ [42]^13 ^] Thus, understanding how
   non‐coding mutations affect these mechanisms is crucial for building
   accurate models of gene regulation.

   Despite significant advances in identifying genetic variants,
   traditional approaches have primarily focused on direct genomic signals
   such as transcription factor binding and histone modifications.^[
   [43]^14 , [44]^15 , [45]^16 ^] While these approaches have provided
   valuable insights, they often overlook the regulatory effects mediated
   by DNA methylation, especially in non‐coding regions. This limitation
   hampers our understanding of the broader epigenetic landscape,
   particularly regarding the long‐range interactions that are critical
   for proper gene expression and cellular function.^[ [46]^17 ^]

   Moreover, DNA methylation, particularly at CpG sites, is known to be
   cell type‐specific, making it imperative to study this modification at
   the single‐cell level.^[ [47]^18 , [48]^19 , [49]^20 ^] Advances in
   single‐cell technologies have dramatically improved our ability to map
   methylation dynamics, uncovering regulatory mechanisms that were
   previously masked in bulk assays.^[ [50]^21 , [51]^22 ^] The ability to
   distinguish methylation differences at the single‐cell level is crucial
   for understanding disease pathology and for developing more precise
   therapeutic strategies.^[ [52]^23 ^] Aberrant methylation patterns in
   distinct cell populations have been linked to a wide spectrum of
   diseases, including cancer, autoimmune disorders, and
   neurodevelopmental disorders.^[ [53]^23 , [54]^24 , [55]^25 ^]

   However, despite the growing recognition of non‐coding variants and
   their role in disease, translating this knowledge into methods that
   predict methylation changes at the single‐cell level remains a
   challenge.^[ [56]^26 ^] Current tools, such as CpGenie,^[ [57]^27 ^]
   were designed to predict the impact of non‐coding variants on DNA
   methylation but are limited by a narrow receptive field of
   500 base pairs (bp), which hinders their ability to capture regulatory
   interactions over broader genomic regions. While effective for local
   SNP‐CpG interactions, CpGenie struggles with longer‐range
   interactions, such as those involving enhancers located tens of
   kilobases from their target genes. These long‐range interactions are
   critical for complex diseases like cancer, where distal enhancer
   elements play a pivotal role in gene regulation. Moreover, CpGenie is
   not designed for single‐cell resolution, limiting its ability to
   capture cell‐specific regulatory changes.^[ [58]^27 ^]

   Other models, such as DeepSEA^[ [59]^28 ^] and Enformer,^[ [60]^29 ^]
   offer broader receptive fields and can annotate DNA sequences for
   functional impacts across larger genomic regions. However, these models
   generate static predictions, lacking the flexibility to account for the
   dynamic nature of epigenetic regulation. In diseases such as cancer,
   where regulatory regions like enhancers and promoters undergo temporal
   changes, these models fall short of capturing the evolving regulatory
   landscape. Similarly, in autoimmune diseases, where immune cells
   dynamically respond to environmental stimuli,^[ [61]^30 ^] these models
   struggle to predict context‐specific methylation changes. Furthermore,
   these models cannot explain transcriptome variability between
   individuals, as they focus on general gene regulation without providing
   the cell‐type specificity needed to understand how non‐coding variants
   influence distinct cellular environments.^[ [62]^31 , [63]^32 ^] For
   instance, in neurodevelopmental disorders, where methylation patterns
   vary between neuronal subtypes, the lack of single‐cell data
   integration significantly reduces the accuracy of these models in
   predicting cell‐specific regulatory changes.^[ [64]^28 , [65]^29 ,
   [66]^33 ^]

   To address these limitations, we developed Methven, a deep learning
   framework designed to predict the effects of non‐coding mutations on
   DNA methylation at single‐cell resolution. Methven integrates DNA
   sequences with single‐cell ATAC‐seq data, employing a
   divide‐and‐conquer strategy to model SNP‐CpG interactions across
   genomic distances of up to 100 kbp. By supporting both classification
   and regression tasks, Methven aims to provide more accurate and
   comprehensive predictions of methylation dynamics, particularly for
   long‐range regulatory interactions. This framework addresses the gaps
   in existing models and offers potential applications in understanding
   the epigenetic underpinnings of complex diseases and advancing
   personalized medicine.

2. Results

2.1. Overview of Methven

   Methven was designed to predict the impact of non‐coding mutations on
   methylation sites within a 100 kbp range around a single‐nucleotide
   polymorphism (SNP), specifically at single‐cell resolution. To achieve
   this, we collected 244,491 cis‐methylation quantitative trait loci
   (meQTL) data^[ [67]^34 ^] from the meQTL EPIC Database,^[ [68]^35 ^]
   specifically focusing on CD4+ T cells (Table [69]S1, Supporting
   Information). These data were selected for their high‐quality
   annotation of SNP‐CpG interactions, which renders them well‐suited for
   modeling methylation changes across a broad range of genomic distances.
   The CD4+ T cell type was specifically chosen due to its pivotal role in
   immune regulation and frequent involvement in autoimmune diseases,
   positioning it as a valuable model for investigating methylation
   dynamics in disease contexts.

   In addition to meQTL data, the corresponding single‐cell ATAC‐seq data
   from the EpiMap Repository^[ [70]^36 ^] were incorporated. The use of
   single‐cell ATAC‐seq data ensures that Methven can capture chromatin
   accessibility at the single‐cell level, which is essential for
   understanding the regulatory dynamics of gene expression and
   methylation. This single‐cell resolution is crucial for making
   predictions about cell‐type‐specific methylation patterns that are
   often masked in bulk assays, enabling a more precise understanding of
   epigenetic changes in various disease states. By incorporating both
   meQTL and ATAC‐seq data, Methven mitigates model bias from
   false‐negative SNPs (i.e., SNPs that affect CpG sites but are not
   statistically captured), while ensuring that the model retains its
   generalizability and adaptability across other cell types or tissues.

   Methven's architecture comprises two core components (Figure [71] 1 ):
   (1) preprocessing of labeled SNP‐CpG pairs to generate embeddings
   suitable for training, and (2) a deep learning module designed to
   perform both classification and regression tasks. During preprocessing,
   CpG sites within a 100 kbp range around each SNP were annotated, and a
   comprehensive dataset comprising 50,190 SNP‐CpG pairs was constructed
   (Figure [72]1b). Recognizing that SNPs close to CpG sites may exert
   more direct and stronger effects, while those at greater distances
   might involve more complex long‐range regulatory mechanisms, we applied
   a divide‐and‐conquer strategy. The dataset was split into small pairs
   (distance <10 kbp, 19,874 pairs) and large pairs (distance between
   10 kbp and 100 kbp, 30,316 pairs). Independent models were trained for
   each subset, with the Methven‐small model targeting small pairs, and
   the Methven‐large model focusing on large pairs, allowing each model to
   capture the specific features relevant to their respective distances.
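
   The distance-based routing described above can be sketched as follows.
   This is a minimal illustration rather than the authors' code: the
   `SnpCpgPair` record, the function name, and the handling of a pair at
   exactly 10 kbp (sent to the large subset) are assumptions.

```python
from dataclasses import dataclass

SMALL_MAX_BP = 10_000    # "small" pairs: distance < 10 kbp
LARGE_MAX_BP = 100_000   # "large" pairs: 10 kbp <= distance <= 100 kbp


@dataclass(frozen=True)
class SnpCpgPair:
    """Hypothetical record for one labeled SNP-CpG pair."""
    snp_pos: int   # genomic coordinate of the SNP (bp)
    cpg_pos: int   # genomic coordinate of the CpG site (bp)
    slope: float   # meQTL slope used as the label


def split_by_distance(pairs):
    """Route each pair to the small or large subset by |SNP - CpG| distance."""
    small, large = [], []
    for pair in pairs:
        distance = abs(pair.snp_pos - pair.cpg_pos)
        if distance < SMALL_MAX_BP:
            small.append(pair)
        elif distance <= LARGE_MAX_BP:
            large.append(pair)
        # Pairs farther apart than 100 kbp fall outside the modeled range.
    return small, large


pairs = [
    SnpCpgPair(1_000_000, 1_004_000, 0.8),    # 4 kbp apart
    SnpCpgPair(1_000_000, 1_060_000, -1.2),   # 60 kbp apart
    SnpCpgPair(1_000_000, 1_250_000, 0.3),    # 250 kbp apart, dropped
]
small, large = split_by_distance(pairs)
```

   Each subset would then be used to train its own model (Methven‐small
   or Methven‐large).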

Figure 1.

   Overview of Methven. a) Prediction of mutation impacts on methylation
   using DNA sequences and single‐cell ATAC‐seq data. In the data
   collection phase, meQTL data from CD4+ T cells were used to label the
   impact of non‐coding mutations on CpG sites. A divide‐and‐conquer
   strategy was employed, segmenting the dataset into small pairs and
   large pairs based on the distance between SNPs and CpG sites. DNA
   sequences after positional‐wise cutting were fed into a DNA language
   model. The generated DNA embeddings and the corresponding ATAC‐seq data
   were averaged and concatenated within each cut unit. The concatenated
   embeddings were finally fed into the deep learning model designed to
   perform both classification and regression tasks. b) Illustration of
   the preprocessing pipeline. For each SNP, CpG sites within 100 kbp
   upstream and downstream were identified based on methylation changes.
   SNP‐CpG pairs within 10 kbp formed the small dataset, while those
   between 10 kbp and 100 kbp formed the large dataset. DNA sequences and
   corresponding ATAC‐seq data were extracted with the CpG site centered.
   The sequences, both pre‐ and post‐mutation, were then positionally cut
   around the CpG site, and input into a DNA language model to obtain
   embeddings. Finally, the DNA embeddings from each cut and the ATAC‐seq
   data were average pooled and concatenated. c) Details of the deep
   learning module. The concatenated embeddings are fed into two stacked
   Bidirectional Gated Recurrent Unit (BiGRU) modules, followed by batch
   normalization layers and fully connected layers. The classification and
   regression tasks are handled separately: the classification task
   predicts the direction of the SNP's impact on CpG methylation levels
   (upregulation/downregulation), while the regression task estimates the
   magnitude of this impact (slope).

   To enhance the representation of DNA sequences, we selected DNABert2^[
   [74]^37 ^] as the language model for generating pre‐trained DNA
   embeddings. We selected DNABert2 due to its efficient Byte Pair
   Encoding (BPE) tokenization, which improves computational efficiency
   and captures complex genomic patterns.^[ [75]^37 ^] Due to the input
   sequence length limitations of DNABert2, we implemented a
   positional‐wise cutting strategy, ensuring that each segment remains
   central while maximizing key information retention. For both pre‐ and
   post‐mutation sequences, the DNA embeddings and ATAC‐seq data for each
   segment were average‐pooled and concatenated.
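
   The pooling and concatenation step can be illustrated with a small
   sketch. It assumes the per-token embeddings of each segment have
   already been produced by the language model; the toy 2-dimensional
   embeddings, the function names, and the order of concatenation
   (pre-mutation, post-mutation, ATAC‐seq) are illustrative assumptions,
   not details taken from the original implementation.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def pair_features(ref_segments, alt_segments, atac_segments):
    """Build one feature row per cut unit.

    ref_segments / alt_segments: per-segment lists of token embeddings for
    the pre- and post-mutation sequence (assumed precomputed).
    atac_segments: per-segment lists of ATAC-seq signal values.
    """
    rows = []
    for ref_tokens, alt_tokens, atac in zip(ref_segments, alt_segments, atac_segments):
        atac_mean = sum(atac) / len(atac)
        # Assumed layout: [pooled pre-mutation | pooled post-mutation | mean ATAC]
        rows.append(mean_pool(ref_tokens) + mean_pool(alt_tokens) + [atac_mean])
    return rows


# One segment with two tokens and toy 2-dimensional embeddings.
ref_segments = [[[1.0, 3.0], [3.0, 5.0]]]
alt_segments = [[[0.0, 2.0], [2.0, 2.0]]]
atac_segments = [[0.0, 1.0, 2.0, 1.0]]
rows = pair_features(ref_segments, alt_segments, atac_segments)
```

   The resulting sequence of rows, one per cut unit, is the kind of
   input a recurrent module can consume in positional order.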

   The concatenated embeddings were then fed into Methven's deep learning
   module. This module consists of two stacked Bidirectional Gated
   Recurrent Unit (BiGRU) layers,^[ [76]^38 ^] followed by batch
   normalization layers^[ [77]^39 ^] and fully connected layers.^[ [78]^40
   ^] Methven supports two independent tasks: classification to predict
   the direction of the SNP's impact on CpG methylation levels
   (up‐regulation/down‐regulation), and regression to estimate the
   magnitude of this impact (slope). By separating these tasks, Methven
   minimizes task interference and achieves higher predictive accuracy,
   particularly in determining the direction of the methylation impact.

2.2. Methven Achieves High Accuracy in Internal Validation across Genomic
Distances

   Methven's prediction performance was first evaluated using an internal
   test set (partitioned from the same dataset as the training set, see
   “Methods”, Figure [79] 2a,b). Specifically, the dataset used for
   training and testing comprised 50,190 meQTL SNP‐CpG pairs from the meQTL EPIC
   Database, focusing on CD4+ T cells, alongside corresponding single‐cell
   ATAC‐seq data from the EpiMap Repository. This dataset was chosen for
   its comprehensive annotation of SNP‐CpG interactions and high relevance
   to immune regulation and autoimmune diseases.

Figure 2.

   Benchmarking and robustness evaluation of Methven on intra‐dataset. a)
   Performance of Methven on the small dataset (test samples = 1,988). The
   classification task is evaluated using a confusion matrix, ROC curve,
   AUC, accuracy (ACC), recall, precision, and F1‐score. The regression
   task performance is quantified by RMSE, R^2, and PCC, with the
   significance of PCC assessed under the condition of P‐value < 0.001. b)
   Performance of Methven on the large dataset (test samples = 3,032).
   Evaluation metrics are applied similarly to those in (a). c) Ablation
   study on the small dataset. Four ablation experiments were conducted:
   removing the input ATAC‐seq data, removing the DNA embedding, replacing
   DNABert2 embedding with OneHot, and replacing the BiGRU layers with
   fully connected layers. Grey lines represent the performance of the
   full Methven model, while green lines depict the performance of the
   ablated models. d) Similar to (c), with the same ablation experiments
   conducted on the large dataset.

   The internal test set was generated by splitting the original dataset
   into training (80%), validation (10%), and testing (10%) sets. Care was
   taken to ensure that the test set contained SNP‐CpG pairs that
   represented both small and large SNP‐CpG distances (<10 kbp and 10
   kbp‐100 kbp, respectively). The test set included 1,988 small‐distance
   SNP‐CpG pairs and 3,032 large‐distance SNP‐CpG pairs, enabling
   Methven's performance to be evaluated across a wide range of genomic
   distances (Table [81]S1, Supporting Information).
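
   As a rough illustration of the 80/10/10 partition, the sketch below
   shuffles pair indices with a fixed seed and carves off the test and
   validation sets. The seed, the function name, and the choice to round
   the 10% fractions up are assumptions; with that rounding, the test
   sizes for 19,874 and 30,316 pairs come out to 1,988 and 3,032, which
   matches the numbers reported above.

```python
import math
import random


def split_indices(n, val_frac=0.10, test_frac=0.10, seed=0):
    """Shuffle indices 0..n-1 and return (train, val, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n * test_frac)   # assumed rounding rule
    n_val = math.ceil(n * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


small_train, small_val, small_test = split_indices(19_874)
large_train, large_val, large_test = split_indices(30_316)
print(len(small_test), len(large_test))
```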

   In the classification task, predicting the direction of the SNP's
   impact on CpG methylation without considering the magnitude helped
   distinguish directional effects and reduce interference from small
   absolute slope values. On the small distance SNP‐CpG pairs set, the
   Methven classification model achieved an ACC of 0.920 and an AUC of
   0.969. For the large distance SNP‐CpG pairs set, the ACC was 0.837, and
   the AUC was 0.918. These results demonstrated the robust prediction
   capability of the Methven classification model, with a receptive field
   extending up to 100 kbp.
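
   For reference, the threshold-based classification metrics reported
   here (ACC, precision, recall, F1‐score) follow directly from the
   confusion matrix. The sketch below uses made-up labels, with 1 for
   increased and 0 for decreased methylation as an assumed encoding; it
   does not reproduce the reported values.

```python
def binary_metrics(y_true, y_pred):
    """Compute ACC, precision, recall and F1 from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "acc": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels: 1 = methylation increases, 0 = methylation decreases.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```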

   However, predicting only the direction of the SNP's impact on CpG
   methylation is insufficient. To address this, additional models were
   trained to regress the magnitude of the impact, specifically the meQTL
   slope. The Methven‐small model achieved an RMSE of 1.59 and a Pearson
   correlation coefficient (PCC) of 0.87 (Student t‐test p < 0.001), while
   the Methven‐large model recorded an RMSE of 2.10 and a PCC of 0.81
   (Student t‐test p < 0.001). The regression task provides finer‐grained
   predictions and complements the classification task. Together, these
   two tasks enable Methven to offer both high‐level and detailed
   insights, improving its overall predictive utility. Furthermore, when
   the Beta value (meQTL slope) predicted by the regression model has a
   small absolute value, this indicates that the model considers the
   mutation to have minimal impact on methylation levels. In such cases,
   it may be appropriate to classify the mutation as “non‐impactful” and
   exclude it from annotation as a meQTL.
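
   The regression metrics and the "non‐impactful" rule suggested above
   can be sketched as follows. The cutoff `min_abs_slope=0.1` is a
   purely hypothetical value chosen for illustration; the text does not
   specify a threshold.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted slopes."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson correlation coefficient (PCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def call_effect(pred_slope, min_abs_slope=0.1):
    """Label a predicted slope; the cutoff is a hypothetical choice."""
    if abs(pred_slope) < min_abs_slope:
        return "non-impactful"
    return "up" if pred_slope > 0 else "down"


observed = [1.0, 2.0, 3.0]
predicted = [2.0, 4.0, 6.0]
pcc = pearson(observed, predicted)     # perfectly linear toy data
error = rmse([0.0, 0.0], [3.0, 4.0])   # sqrt((9 + 16) / 2)
labels = [call_effect(s) for s in (0.05, -2.0, 1.5)]
```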

   Ablation experiments revealed that each critical component of Methven
   independently contributes to its overall performance. These key
   components include the ATAC‐seq input, DNA embeddings, and the BiGRU
   layers within the model architecture. We conducted a series of
   experiments where the ATAC‐seq input was removed, DNA embeddings were
   excluded, DNABert2 embeddings were replaced with OneHot encoding, and
   BiGRU layers were substituted with fully connected layers. As shown in
   Figure [82]2c,d, the performance degradation following the removal of
   the essential inputs (both DNA embedding and ATAC‐seq) underscores the
   validity of the Methven design.

   It is worth noting that when using OneHot encoding instead of DNABert2
   for generating DNA embeddings, the inability to perform average pooling
   due to the limitations of OneHot encoding leads to higher computational
   costs, especially with longer DNA sequences (Tables [83]S3,S4,
   Supporting Information). Although both embeddings achieved similar
   performance, DNABert2 substantially reduced the overall number of model
   parameters, which is highly beneficial for large‐scale DNA sequence
   predictions in improving efficiency and scalability.

   To further verify the representation learning ability of Methven, we
   utilized the t‐SNE algorithm^[ [84]^41 ^] to visualize the sample
   distribution based on the initial feature set (pre‐ and post‐mutation
   DNA embeddings) and the representation generated by Methven. All
   embeddings and representations were mapped into a two‐dimensional
   space. To visualize the sample distribution, SNP‐CpG pairs were colored
   according to the SNP's effect on methylation (Figure [85] 3 ). We
   observed that, in both the classification and regression tasks, SNP‐CpG
   pairs with different labels or slopes were completely intermixed in the
   two‐dimensional space of the initial characterization. However, in the
   Methven representation space, SNP‐CpG pairs were separated according to
   their classification labels (Figure [86]3a,b) and were distributed in
   an orderly manner according to slope values in the regression task
   (Figure [87]3c,d). These results indicate that Methven is able to
   efficiently generate high‐quality representation vectors for mutation
   effect prediction and maintain consistent performance in different
   tasks.

Figure 3.

   Visualization of representation ability of Methven. a) Visualization of
   representational ability in the Methven small model for the
   classification task. t‐SNE was used to perform dimensionality reduction
   and visualization on the DNA embeddings both pre‐ and post‐mutation, as
   well as on the outputs from the Methven model after the deep learning
   module. Green points represent SNP‐CpG pairs where the CpG methylation
   level increases, while yellow points represent pairs where the CpG
   methylation level decreases. b) Similar to (a), with t‐SNE applied to
   the large dataset model. c) Visualization of representational ability
   in the Methven small model for the regression task. t‐SNE was used to
   perform dimensionality reduction and visualization on the DNA
   embeddings both pre‐ and post‐mutation, as well as on the outputs from
   the Methven model after the deep learning module. Points are colored
   with a gradient from blue to red, representing slope values from low to
   high. d) Similar to (c), with t‐SNE applied to the large dataset model.

2.3. Methven Outperforms Existing Methods in Predicting Non‐Coding Mutation
Effects

   While Methven demonstrated high accuracy in internal validation, its
   true robustness lies in how it compares to other state‐of‐the‐art
   methods for predicting methylation changes induced by non‐coding
   mutations.

   Methven is capable of predicting the impact of SNPs on all CpG sites
   within a 100 kbp range, both upstream and downstream, whereas the
   previous state‐of‐the‐art model, CpGenie,^[ [89]^27 ^] was limited to a
   500 bp range. To comprehensively demonstrate Methven's effectiveness
   and robustness in predicting the effects of non‐coding mutations on
   methylation, we compared it with existing external tools on the
   classification task. CpGenie, a widely used method specifically
   designed for predicting the impact of mutations on methylation, served
   as the primary baseline. To ensure a fair comparison, we applied
   CpGenie to data with SNP‐CpG distances up to 100 kbp, consistent with
   the range used in Methven.

   We also included Enformer^[ [90]^42 ^] in our comparisons. While
   Enformer was originally designed to predict the impact of non‐coding
   mutations on gene expression, it excels at generating functional
   annotations of DNA sequences. On the other hand, Methven's features are
   derived from large‐scale pretraining on DNA sequences to learn semantic
   representations that capture deeper relationships within the genome.
   Comparing Methven with Enformer allowed us to evaluate which
   approach—functional annotations or semantic information—offers stronger
   predictive capabilities in assessing the impact of non‐coding mutations
   on methylation.

   The comparison was conducted using ten‐fold cross‐validation on the
   classification task. To fairly assess the representational power of
   different methods, we extracted the embeddings from the penultimate
   layer of each model (the embeddings used for final classification,
   representing the highest‐level features learned by the model) and then
   trained a decision tree with default parameters on these embeddings,
   evaluating performance on an internal test set (Figure [91] 4a; Figure
   [92]S3, Supporting Information).

Figure 4.

   External validation of Methven on existing methods, new cell type, and
   disease‐associated SNPs. a) Comparison of ten‐fold cross‐validation
   performance between Methven, Enformer, and CpGenie on the
   classification task (Table [94]S5, Supporting Information). The metrics
   used for comparison include ACC, Precision, Recall, F1‐score, and AUC.
   To ensure a fair comparison of these models' ability to learn the
   relationship between SNPs and CpG sites, embeddings from the layer
   preceding the output layer of each model were extracted, and a decision
   tree with identical parameters was trained on these embeddings. b)
   Methven's performance on monocyte single‐cell meQTL datasets. The
   experiments involved two approaches: end‐to‐end (e2e) training of
   Methven directly on the monocyte dataset, and fine‐tuning Methven
   pre‐trained on the CD4+ T cell dataset. c) Analysis of rheumatoid
   arthritis (RA)‐associated SNPs using the Methven regression model. The
   SNPs predicted were selected from genome‐wide association studies
   (GWAS) analysis as RA‐associated SNPs. Red bars indicate SNPs where the
   absolute difference in slope between case and control SNPs is greater
   than 0.5, with the control SNP having a positive slope, suggesting an
   up‐enhancement of methylation impact in RA cases. Blue bars represent
   SNPs where the absolute difference in slope is greater than 0.5, with
   the control SNP having a negative slope, indicating a down‐enhancement
   of methylation impact in RA cases. Grey bars indicate SNPs where the
   absolute difference in slope is less than 0.5, suggesting little
   association with RA in terms of methylation impact. Black bars denote
   SNPs where the absolute slope in RA cases is smaller than in controls,
   indicating a reduced impact on methylation in RA conditions. d) Pathway
   enrichment results for genes corresponding to Methven‐predicted
   affected CpGs. To compare the impact of varying signal‐to‐noise ratios,
   the SNPs within a 50 kbp range of the ATAC‐seq peak were analyzed
   separately and compared to the results from all SNPs.

   Methven outperformed all other methods across both small and large
   datasets on all evaluation metrics, consistently demonstrating balanced
   recognition of both positive and negative samples. Specifically, the
   Methven‐small model achieved a mean of ACC = 0.908 and a mean of AUC =
   0.908, while the Methven‐large model recorded a mean of ACC = 0.842 and
   a mean of AUC = 0.842 (Table [95]S5, Supporting Information). In
   contrast, while Enformer also demonstrated balanced recognition of
   positive and negative samples, its overall performance was lower than
   Methven's, likely because the functional annotation features learned
   during the pretraining of Enformer are not fully optimized for
   methylation‐focused tasks.

   CpGenie, which relies on OneHot encoding for DNA sequences and
   convolutional neural networks (CNN) for representation learning,
   struggled with longer sequence lengths (>500 bp reported), resulting in
   less stable and poorer performance. CpGenie achieved a mean of ACC =
   0.782 and a mean AUC = 0.807 for small‐distance pairs, and a mean ACC =
   0.706 and a mean AUC = 0.739 for large‐distance pairs (Table [96]S5,
   Supporting Information). These findings demonstrate that Methven
   effectively predicts the impact of non‐coding mutations on DNA
   methylation and offers improved generalization across genomic distances
   compared to state‐of‐the‐art methods.

   It is worth highlighting that even if CpGenie and Enformer were
   extended to single‐cell resolution, their assessment of SNP effects
   within specific cell types would remain static, as both rely
   exclusively on DNA sequence information for predictions. In contrast,
   Methven offers a unique advantage by incorporating personalized
   ATAC‐seq data as input, introducing greater flexibility. This enables
   Methven to identify distinct patterns for the same cell type across
   different individuals or to evaluate SNP effects across various stages
   of a disease—an adaptability not achievable with previous methods.

2.4. Methven Effectively Generalizes to Monocytes for Methylation Prediction

   Having established Methven's strong performance relative to existing
   methods, we next explored its generalizability across different cell
   types, starting with monocytes. Mutations can have different effects
   depending on the cellular environment. To assess Methven's ability to
   generalize to other cell types, we applied the Methven classification
   model to a dataset comprising monocyte single‐cell meQTLs. This dataset
   was downloaded from the EPIGEN MeQTL Database,^[ [97]^35 ^] with
   corresponding ATAC‐seq data sourced from the EpiMap Repository.^[
   [98]^36 ^] After preprocessing, the number of SNP‐CpG pairs used for
   training and testing was approximately one‐third of those in Methven's
   internal dataset (Tables [99]S1,S8, Supporting Information).

   We initially trained Methven directly on the monocyte external
   validation dataset using an end‐to‐end (e2e) approach. We observed
   solid classification performance on both the small SNP‐CpG pairs and
   large SNP‐CpG pairs (AUCs of 0.898 and 0.770, respectively,
   Figure [100]4b, Table [101]S6, Supporting Information), indicating its
   potential to generalize to other cell types or tissue types, provided
   that corresponding ATAC‐seq data is available.

   Next, we fine‐tuned Methven based on the pre‐trained model from CD4+ T
   cells. This fine‐tuned model showed a slight improvement in performance
   over the end‐to‐end training, with AUCs of 0.939 and 0.823,
   respectively (Figure [102]4b). This improvement is likely due to the
   larger training data size in Methven's internal dataset, which helped
   capture intrinsic relationships between meQTLs and ATAC‐seq. These
   results suggest that Methven could serve as a generalized pre‐trained
   model, especially as the amount and the cell/tissue type of training
   data increase over time.

2.5. Methven Reveals Methylation Changes Linked to Disease‐Associated SNPs in
Rheumatoid Arthritis

   After validating Methven's ability to generalize to different cell
   types, we applied it to uncover potential links between SNP‐induced
   methylation changes and specific diseases, focusing on rheumatoid
   arthritis (RA). In real‐world studies, investigating the effects of
   mutations often involves exploring disease mechanisms.^[ [103]^43 ,
   [104]^44 , [105]^45 ^] Methven's integration of ATAC‐seq inputs enables
   the analysis of mutations occurring in the same cell type under
   different disease processes. To evaluate Methven's potential in
   uncovering mutation‐disease connections, we selected SNPs highly
   associated with RA through genome‐wide association studies (GWAS)
   analysis.^[ [106]^46 ^] For training, we used ATAC‐seq data from CD4+ T
   cell lines, with cells stimulated for 24 hours with anti‐CD3/CD28
   serving as the case ATAC‐seq, and unstimulated cells as the control
   ATAC‐seq.^[ [107]^47 ^] To enhance the signal‐to‐noise ratio, we
   retained only SNPs located within 1 kbp upstream and downstream of the
   control ATAC‐seq peak regions since these regions are highly enriched
   for regulatory elements,^[ [108]^48 ^] resulting in 8 remaining SNPs.
   All CpG sites within a 100 kbp range upstream and downstream of these
   SNPs were annotated.

   We then mapped the predicted affected CpGs to the nearest transcription
   start site (TSS). To enhance the signal‐to‐noise ratio, we conducted
   pathway enrichment analysis for the affected CpG‐related genes observed
   by all SNPs and the SNPs located within a 50 kbp range of the ATAC‐seq
   peaks. We then identified six pathways, such as “clearance of foreign
   intracellular DNA” and “lymphocyte activation”, with significant
   adjusted P‐values (Mann‐Whitney U test P‐value < 0.05) that were common
   to both groups (Figure [109]4d). The pathway of clearance of foreign
   intracellular DNA is reported to be closely linked to the pathogenesis
   of RA through its role in the immune response and inflammation,^[ [110]^49 ^]
   while the signaling lymphocytic activation molecule family (SLAMF) may
   influence RA pathogenesis by participating in inflammation mediated by
   infiltrating immune cells.^[ [111]^50 ^] The identification of
   RA‐associated CpGs influenced by SNPs and recognized by Methven, which
   align with pathways known to be involved in RA pathogenesis,
   demonstrates Methven's capability to enhance the understanding of the
   role of “mutations affect methylation levels” in disease mechanisms.
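
   Mapping each predicted CpG to its nearest TSS, as described at the
   start of this analysis, can be sketched with a binary search. The gene
   labels and coordinates below are hypothetical placeholders, and the
   sketch assumes a single chromosome and ignores strand.

```python
from bisect import bisect_left


def map_cpgs_to_nearest_tss(cpg_positions, tss_list):
    """Map each CpG position to the (position, gene) of its nearest TSS.

    tss_list: iterable of (position, gene) tuples on the same chromosome.
    Ties are resolved in favour of the TSS with the smaller coordinate.
    """
    tss_sorted = sorted(tss_list)
    positions = [pos for pos, _ in tss_sorted]
    result = {}
    for cpg in cpg_positions:
        i = bisect_left(positions, cpg)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = tss_sorted[max(i - 1, 0): i + 1]
        result[cpg] = min(candidates, key=lambda t: abs(t[0] - cpg))
    return result


# Hypothetical TSS annotations (placeholder gene names, not real data).
tss = [(1_000, "GENE_A"), (5_000, "GENE_B"), (9_000, "GENE_C")]
mapping = map_cpgs_to_nearest_tss([100, 2_900, 3_100, 20_000], tss)
```

   The resulting gene list would then be passed to a pathway enrichment
   tool; that step is not shown here.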

   Next, we applied Methven's regression model to predict the impact of
   these SNPs in both case and control samples with corresponding ATAC‐seq
   data. Based on the differences in slopes between the two conditions, we
   categorized the annotated CpG sites into four groups: those where the
   mutation's impact on methylation was enhanced by disease occurrence
   (Up‐enhanced), negatively enhanced (Down‐enhanced), reduced
   (Reduced), or unaffected by the disease (Unaffected). Methven was able
   to distinguish among these four categories of CpG sites
   (Figure [112]4c), addressing a gap left by other methods in this area.
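
   One way such a four‐way grouping could be derived from a pair of
   predicted slopes is sketched below. The text does not state the exact
   decision rule, so the rule and the tolerance `eps` here are
   assumptions made purely for illustration.

```python
def categorize_cpg(case_slope, control_slope, eps=0.1):
    """Assign a CpG site to one of four groups from its two predicted slopes.

    Hypothetical rule (not taken from the paper):
      - slopes differ by at most eps           -> "Unaffected"
      - |case| is smaller than |control|       -> "Reduced"
      - otherwise, the sign of the case slope  -> "Up-enhanced" / "Down-enhanced"
    """
    if abs(case_slope - control_slope) <= eps:
        return "Unaffected"
    if abs(case_slope) < abs(control_slope):
        return "Reduced"
    return "Up-enhanced" if case_slope > 0 else "Down-enhanced"


# Slopes reported later in the text for the rs968567-cg06781209 pair:
# stimulated (case) = 0.09, unstimulated (control) = -1.19.
label = categorize_cpg(0.09, -1.19)
```

   Under this assumed rule, the rs968567 to cg06781209 pair falls into the
   Reduced group, which matches the interpretation given for Figure 5b.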

   As an example, we examined rs968567, an SNP proven to be highly
   associated with RA,^[ [113]^51 ^] and analyzed its impact on case and
   control ATAC‐seq peaks, along with the distribution of CpG sites across
   the four categories relative to the SNP (Figure [114] 5a, Table
   [115]S7, Supporting Information). We observed that the CpG sites with
   high predicted scores of each category (top CpG sites) were clustered
   around the ATAC‐seq peak regions, which likely contain functional
   regulatory elements. This clustering may explain why these CpG sites
   were more affected by the SNP and exhibited greater changes due to
   disease occurrence.

Figure 5.


   Methven analysis of rs968567 on disease process, a case of rs968567‐CpG
   associated pair, and pattern analysis across different cell types. a)
   Visualization of Methven's analysis of an RA‐associated SNP. rs968567
   has been reported as a key SNP for RA. For the different categories of
   CpG sites identified in Figure [117]4(c), top‐ranking CpG sites were
   selected as Top CpG sites, showing that most are located near ATAC‐seq
   peaks. b) Case study of Methven's prediction for the rs968567 to
   cg06781209 interaction. cg06781209 is a binding site for the
   transcription factor SREBF2. rs968567 in the promoter region of the
   FADS2 gene alters DNA methylation, disrupting the binding of SREBF2 and
   downregulating FADS2 expression, thereby reducing RA risk. In Methven's
   predictions, the absolute value of the unstimulated slope is greater
   than that of the case slope, indicating that in RA, the CpG site is
   less influenced by the SNP. This suggests a reduced ability of the SNP
   to modulate CpG methylation, thereby failing to suppress RA as
   effectively. c) Bar plot of the count of up‐regulated and
   down‐regulated CpG sites affected by rs968567 in CD4+ T cells and
   monocytes. CD4+ T cells have slightly more up‐regulated sites (154)
   compared to monocytes (149), while down‐regulated sites are nearly
   equal between the two cell types. This indicates rs968567's similar but
   distinct regulatory impacts on CD4+ T cells and monocytes. d) Line plot
   of the predicted regulatory slope of Methven for CD4+ T cells and
   monocytes across the CpG sites around rs968567. The patterns for both
   cell types are largely consistent, though subtle differences in slope
   magnitude reflect slight variations in the regulatory effect of
   rs968567 between the two cell types.

   One CpG site, cg06781209, is a binding site for the transcription
   factor SREBF2. The SNP rs968567, located in the promoter region of the
   FADS2 gene, has been reported to alter the methylation level of
   cg06781209.^[ [118]^52 , [119]^53 ^] This alteration disrupts the
   binding of SREBF2, downregulating FADS2 gene expression and
   subsequently reducing RA risk.^[ [120]^52 , [121]^53 ^] In other words,
   changes in the methylation level of cg06781209 are associated with the
   suppression of the effects of rs968567, thereby influencing RA risk.
   Methven's predictions showed that the SNP's influence on CpG
   methylation was less pronounced in the 24‐hour stimulated condition
   than in the unstimulated condition (Figure [122]5b, 24‐hour stimulated
   predicted slope = 0.09, unstimulated predicted slope = ‐1.19),
   suggesting diminished transcription factor binding and consequently
   reduced RA risk. These findings align with previous studies showing
   that the methylation of cg06781209 can suppress the effects of
   rs968567, thus influencing RA risk.^[ [123]^52 , [124]^53 ^]

   These results demonstrate that Methven can assist in determining
   whether the pathogenicity of disease‐associated SNPs is mediated by
   “changes in methylation affected by SNPs”, thus providing insights into
   the impact of individual mutations on disease risk and progression, and
   supporting the development of personalized treatment strategies.

2.6. Methven Captures Cell Type‐Specific Regulatory Patterns of SNP‐Induced
Methylation

   Given the importance of cell type‐specific methylation patterns in
   understanding disease risk, we next evaluated Methven's ability to
   capture such regulatory patterns across different cell types,
   specifically CD4+ T cells and monocytes. We applied Methven to assess
   the regulatory effects of SNP rs968567 on CpG sites in both cell types,
   focusing on up‐ and down‐regulation patterns (Figure [125]5c), as well
   as the predicted regulatory slopes (Figure [126]5d). The analysis
   revealed a balanced distribution of up‐ and down‐regulated CpG sites
   between the two cell types. CD4+ T cells showed slightly more
   up‐regulated sites (154) than monocytes (149), while the number of
   down‐regulated sites was comparable (109 in CD4+ T cells and 114 in
   monocytes). This balance demonstrates Methven's ability to capture cell
   type‐specific regulatory responses even when the underlying genetic
   perturbation, such as the SNP rs968567, is the same.

   When a disease‐associated SNP like rs968567 exerts specific methylation
   impacts on particular cell types, Methven can assess the consistency of
   methylation patterns across cell types. By comparing the predicted
   methylation changes with known patterns, clinicians may stratify
   individuals into high‐ or low‐risk categories based on their
   methylation response. This underscores Methven's potential in precision
   medicine, where cell type‐specific epigenetic changes can help predict
   individual disease susceptibility.^[ [127]^54 ^]

2.7. Methven Leverages Functional DNA Regions for Enhanced Methylation
Predictions

   To further illustrate Methven's prediction capabilities, we
   investigated its ability to leverage functional DNA regions, which play
   a crucial role in regulating gene expression and methylation.
   Non‐coding DNA often contains regulatory elements that influence both
   gene expression and methylation patterns. SNPs located near these
   functional regions are more likely to have an impact on these
   processes. To explore Methven's ability to understand and utilize
   sequence features, we aligned the hidden states of the BiGRU layers
   with the DNA sequences and analyzed them based on different functional
   regions, as well as whether the DNA bases were located within these
   functional regions. We obtained annotations for eight types of
   functional regions from the UCSC Genome Browser: active promoter,
   strong enhancer, transcriptional transition, transcriptional
   elongation, insulator, heterochromatin, repressed region, and repetitive
   element/copy number variation.

   The hidden states of the BiGRU layer showed significantly different
   activation values (Mann‐Whitney U test, P‐value < 0.005) between
   functional regulatory regions and non‐functional regions (Figure
   [128]6), suggesting that Methven can recognize and incorporate information
   from critical genomic regions. This ability to differentiate functional
   from non‐functional regions likely contributes to Methven's high
   classification performance in predicting the effects of non‐coding
   mutations on DNA methylation.
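
   The comparison of hidden‐state activations between functional and
   non‐functional positions relies on the Mann‐Whitney U test. A
   self‐contained sketch of that test is given below, using the normal
   approximation with tie correction and no continuity correction. The
   activation values are synthetic, and the function name is
   hypothetical; in practice a statistics library would normally be used.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U statistic for sample x, two-sided p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values at sorted positions i..j-1 share the average of ranks i+1..j.
        avg_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0
    z = (u_x - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_value


# Synthetic activations (illustrative only): positions inside vs. outside
# an annotated functional region.
functional = [0.90, 0.80, 0.85, 0.95, 0.70, 0.88, 0.92, 0.81]
non_functional = [0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.22, 0.18]
u_stat, p = mann_whitney_u(functional, non_functional)
```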

Figure 6.


   Hidden state analysis of Methven. a) Differences in the hidden states
   of stacked BiGRU layers in the Methven small model between functional
   and non‐functional regions. Eight functional regions were annotated
   using the UCSC Genome Browser and mapped onto the DNA sequences of the
   input SNP‐CpG pairs (Txn Transition: transcriptional transition, Txn
   Elongation: transcriptional elongation, Repetitive/CNV: repetitive
   element/copy number variation, Repressed: repressed region,
   Heterochrom/lo: heterochromatin). The differences between functional and
   non‐functional regions were assessed using the Mann‐Whitney U test. b)
   Similar to (a), with the analysis conducted using the Methven large
   model.

   Additionally, we observed differences in the distribution of hidden
   state activations between the Methven‐small and Methven‐large models.
   These differences suggest that Methven learns distinct regulatory
   patterns for short‐range and long‐range interactions. This observation
   reinforces the decision to train separate models for different genomic
   distances, as it highlights Methven's capacity to adapt to the unique
   regulatory mechanisms that operate at various scales within the genome.
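
   The idea of routing each SNP‐CpG pair to a distance‐specific model can
   be sketched as follows. The 100 kbp upper bound comes from the text,
   while the short‐range cutoff and the model labels are hypothetical
   placeholders rather than values confirmed in this section.

```python
MAX_RANGE_BP = 100_000       # receptive field stated in the text
SHORT_RANGE_MAX_BP = 10_000  # hypothetical cutoff between the two models


def select_model(snp_pos, cpg_pos,
                 short_max=SHORT_RANGE_MAX_BP, max_range=MAX_RANGE_BP):
    """Choose which distance-specific model should score a SNP-CpG pair."""
    distance = abs(snp_pos - cpg_pos)
    if distance > max_range:
        raise ValueError(f"pair is {distance} bp apart, beyond {max_range} bp")
    return "small" if distance <= short_max else "large"


# Example: a nearby pair and a distal pair on the same chromosome.
near = select_model(1_000, 6_000)
far = select_model(1_000, 61_000)
```

   Keeping the routing rule separate from the models makes it simple to
   test alternative cutoffs without retraining either model.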

3. Discussion

   Accurately predicting the epigenetic consequences of non‐coding
   mutations on DNA methylation, particularly at single‐cell resolution,
   remains a significant challenge in understanding gene regulation and
   its links to complex diseases. Despite advances in GWAS that have
   identified numerous genetic variants associated with diseases and
   traits, pinpointing causal variants and clarifying their pathogenic
   mechanisms remain difficult. Previous tools, such as CpGenie,^[ [130]^27
   ^] have pioneered non‐coding variant effect prediction on DNA
   methylation, but their limited receptive fields and static prediction
   capabilities hinder their application in broader genomic and cellular
   contexts. Similarly, models like DeepSEA^[ [131]^29 ^] and Enformer^[
   [132]^42 ^] provide valuable functional annotations from DNA sequences
   but struggle to account for the dynamic and cell‐specific regulatory
   changes that are critical for understanding disease progression.

   Methven addresses these limitations by integrating DNA sequences with
   single‐cell ATAC‐seq data, modeling SNP‐CpG interactions over genomic
   distances up to 100 kbp using a divide‐and‐conquer strategy. This
   approach allows Methven to capture both short‐ and long‐range
   regulatory interactions with greater accuracy than previous methods.
   Moreover, Methven's architecture supports predictions at single‐cell
   resolution, moving beyond static predictions to model the dynamic
   interactions between non‐coding mutations and the epigenome. By
   leveraging DNABert2 embeddings and single‐cell ATAC‐seq data, Methven
   predicts both the direction and magnitude of methylation changes,
   achieving a classification accuracy of 92.0% and an AUC of 0.969 for
   short‐range interactions. These improvements highlight Methven's
   flexibility in modeling complex genomic interactions, crucial for
   advancing our understanding of epigenetic regulation. By utilizing
   single‐cell ATAC‐seq data, Methven can dynamically learn the
   relationship between DNA sequences and the chromatin accessibility of
   the specific sample being predicted, enabling personalized predictions
   that would not be achievable with the prediction based only on DNA
   sequences.

   Methven introduces several advancements in the field of computational
   genomics. First, its divide‐and‐conquer strategy effectively captures
   both local and long‐range SNP‐CpG interactions, a feature that
   distinguishes Methven from earlier tools limited to short‐range
   predictions. Second, it utilizes DNABert2 to generate pre‐trained DNA
   embeddings, efficiently encoding complex regulatory relationships while
   maintaining a lightweight architecture for large‐scale predictions.
   Third, Methven is the first tool to predict non‐coding mutation effects
   on DNA methylation at single‐cell resolution, making it an important
   innovation for studying cell‐type‐specific regulation in diseases like
   cancer and autoimmune disorders. Additionally, the dual‐task
   architecture, supporting both classification and regression outputs,
   enables Methven to predict not only the direction but also the
   magnitude of methylation changes, contributing to a more comprehensive
   understanding of epigenetic regulation.

   The results demonstrate Methven's predictive power through its strong
   performance on the internal CD4+ T cell dataset, as well as its
   ability to generalize to an external monocyte dataset.
   Methven showed robust cell‐type‐specific predictions, particularly in
   identifying methylation changes associated with disease‐related SNPs,
   such as those linked to rheumatoid arthritis. Fine‐tuning Methven,
   originally trained on CD4+ T cells, on monocyte data further
   highlights its adaptability across different biological contexts, an
   important advantage over prior tools that often struggle to generalize
   beyond their training datasets.

   Methven's application to RA‐associated SNPs uncovered significant
   mechanistic links between non‐coding mutations and disease
   pathogenesis. By integrating ATAC‐seq data, Methven successfully
   captured both the spatial and temporal dynamics of chromatin
   accessibility, a critical factor in understanding how regulatory
   elements evolve over time in diseases like RA. Notably, Methven
   identified CpG sites involved in pathways such as “clearance of foreign
   intracellular DNA” and “lymphocyte activation”, both of which are
   pivotal to RA pathogenesis. This capability of Methven to integrate
   epigenetic data and predict cell‐type‐specific methylation responses
   provides significant potential for precision medicine, where it could
   be used to stratify patients based on their epigenetic risk profiles.

   In addition to predicting which CpG sites are affected by
   disease‐associated mutations, Methven is also able to categorize these
   sites based on the extent of methylation changes (e.g., up‐regulated,
   down‐regulated, reduced, or unaffected). This granularity in prediction
   offers a detailed understanding of how non‐coding variants affect gene
   expression through methylation changes, which is crucial for diseases
   driven by immune dysregulation, such as RA. For instance, Methven
   accurately predicted the impact of rs968567, an SNP strongly associated
   with RA, on the methylation of cg06781209, a CpG site involved in
   regulating the expression of FADS2. The model's predictions aligned
   with prior findings showing that changes in methylation at this site
   contribute to disease risk, demonstrating the utility of Methven in
   identifying actionable epigenetic biomarkers.

   Despite its strengths, Methven has limitations. One key issue is the
   limited availability of high‐quality single‐cell meQTL datasets
   obtained through fine‐mapping, which has restricted the current version
   of Methven. To address this limitation, we plan to develop
   computational approaches capable of generating large‐scale single‐cell
   meQTL datasets across diverse cell types, thereby enhancing Methven's
   pre‐training process. As an exploratory analysis, we conducted a
   preliminary evaluation of Methven's generalizability to tissue‐level
   meQTLs using a small‐scale retina meQTL dataset (Figure [133]S4,
   Supporting Information). Importantly, Methven's primary contribution
   lies in providing a pattern for learning the regulatory relationships
   between DNA sequences and epigenetic information. Therefore, Methven
   has the potential to be extended to other downstream tasks related to
   mutation impact prediction (Figure [134]S5, Supporting Information).

   Another issue worth discussing is Methven's reliance on high‐quality
   cell‐specific ATAC‐seq data. These data capture cell‐type‐specific
   chromatin states, which were shown to be beneficial to Methven's
   performance through ablation experiments. While recent advances in
   sequencing technologies are gradually increasing the availability of
   such datasets, their high cost and technical requirements may still
   pose challenges for wider adoption. To overcome this, future work will
   explore the integration of complementary omics data, such as histone
   modification patterns and chromatin interaction maps, to expand
   Methven's applicability and further refine its predictions.

   Future research should also refine Methven's pretraining strategies. As
   Methven serves as a theoretical framework, embeddings from other
   language models can be used in place of one‐hot encoding for sequence
   representation. With the ongoing development of new DNA pretraining
   models, we will continue to track advancements in this field and test
   additional models to enhance Methven's performance and applicability.
   Additionally, while Methven currently operates within a 100 kbp
   receptive field, future research will focus on exploring strategies to
   extend its range to even larger regulatory domains. We will also work
   on balancing the computational cost and ensuring model accuracy for
   these extended ranges. In summary, Methven's dynamic, cell‐specific
   approach offers insights into the epigenetic impact of non‐coding
   mutations and holds promise for both basic research and personalized
   medicine.

4. Experimental Section

Datasets

   The meQTL EPIC Database : The meQTL EPIC dataset^[ [135]^35 ^] was
   downloaded from the meQTL EPIC Database website
   ([136]https://epicmeqtl.kcl.ac.uk/), which reports the results of a
   meQTL analysis of 724,499 CpGs profiled in 2,358 blood samples from
   three UK cohorts. In this study, meQTL data from CD4+ T cells were
   obtained from the meQTL EPIC Database and used as the internal
   dataset. Additionally, monocyte meQTL data from the same database
   were used as one of the external validation datasets.

   EpiMap Repository : The corresponding ATAC‐seq data for matching CD4+ T
   cell meQTL and monocyte meQTL were downloaded from the EpiMap
   Repository^[ [137]^36 ^] ([138]https://compbio.mit.edu/epimap/), which
   includes aggregated and uniformly re‐processed functional genomics data
   from 3030 references across sources such as ENCODE and Roadmap.