Abstract Background With the rapid development of high-throughput sequencing technology, high-throughput sequencing data has grown on a massive scale, leading to the emergence of multiple public databases, such as EBI and GEO. Conducting secondary mining of high-throughput sequencing data in these databases can yield more valuable insights. Meta-analysis can quantitatively combine high-throughput sequencing data from the the same topic. It increases the sample size for data analysis, enhances statistical power, and results in more consistent and reliable conclusions. Results This study proposes a new between-study variance estimator [MATH: Em :MATH] . We prove that [MATH: Em :MATH] is non-negative and [MATH: Emτ^m2 :MATH] increases with the increase of [MATH: τ^m2 :MATH] , satisfying the general conditions of the between-study variance estimator. We get the DSLE2 (two-step estimation starting with the DSL estimate and the [MATH: Em :MATH] in the second step) random-effects meta-analysis model based on the between-study variance estimator Em. The accuracy and a series of evaluation metrics of the DSLE2 model are better than those of the other 6 meta-analysis models. DSLE2 model is applied to lung cancer and Parkinson’s methylation data. Significantly differentially methylated sites identified by DSLE2 model and the genes with significantly differentially methylated sites are closely related to two diseases, indicating the effectiveness of DSLE2 random-effects model. Conclusions This paper propose the DSLE2 random-effects meta-analysis model based on new between-study variance estimator Em. The DSLE2 model performs well for methylation data. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-025-11316-3. Keywords: Methylation sequencing data, Meta-analysis, Between-study variance, Random-effects model Background DNA methylation refers to the addition of a methyl group at the CpG site of the DNA chain [[30]1]. DNA methylation occurs at millions of CpG dinucleotide positions in the genome and changes with age and the external environment [[31]2]. It is an important epigenetic modification [[32]3]. In the correlation analysis between methylation and traits, HM450K (Infinium Human Methylation 450 K BeadChip) or other Illumina technologies use the methylated fluorescence signal intensity ( [MATH: M :MATH] ) and the unmethylated fluorescence signal intensity ( [MATH: U :MATH] ) of the CpG site to calculate DNA methylation levels [[33]4]. Two commonly used calculation methods for methylation levels are [MATH: β=M/(M+U) :MATH] and [MATH: M=log2(M/U) :MATH] [[34]5]. In the minfi R package, the getBeta function and getM function can be used directly to calculate the [MATH: β :MATH] value and [MATH: M :MATH] value of every probe [[35]6]. With the development of high-throughput sequencing technology, multiple public databases have been formed, such as the GEO (Gene Expression Omnibus) database and TCGA (The Cancer Genome Atlas database) database [[36]7]. People can obtain methylation sequencing data of the same disease from different databases [[37]8]. Due to differences in experimental conditions and sample processing procedures, the results of methylation-trait association analysis of the same disease in different studies are different [[38]9]. Meta-analysis can combine methylation data on the same topic to provide a more reliable list of significantly different methylation sites [[39]10]. In meta-analysis, datasets from independent studies with the same purpose often have heterogeneity [[40]11]. In the random-effects model, the comprehensive effect-size result depends on the quantification of heterogeneity between studies [[41]12]. Therefore, the estimation of heterogeneity is the most critical step in the random-effects meta-analysis model [[42]13, [43]14]. In addition to quantifying heterogeneity, investigating differences in experimental designs, participants, and interventions across studies can help to understand the causes of heterogeneity [[44]15, [45]16]. This paper aims to estimate the size of the between-study variance in a random-effects meta-analysis model. We propose the DSLE2 meta-analysis model based on new between-study variance estimator [MATH: Em :MATH] for methylation high-throughput sequencing data. Under three hypothesis testing conditions, this model is compared with DSLR2 (two-step estimation starting with the DSL estimate and the [MATH: R2 :MATH] in the second step), DL (DerSimonian and Laird estimate), EB (Empirical Bayes estimate), HO (Hedges and Olkin estimate), RML (restricted maximum likelihood estimate), and SJ (Sidik and Jonkman estimate) random-effects meta-analysis models using sensitivity, and other evaluation metrics. The results show that the DSLE2 methylation meta-analysis model performs well under the first hypothesis testing condition. We apply the DSLE2 random-effects model to lung cancer and Parkinson’s methylation data, further demonstrating the reliability of the DSLE2 meta-analysis model. The proposed between-study variance estimator is designed to accurately quantify the true variability in effect sizes across studies, distinguishing it from random error. This advancement enhances the precision and reliability of meta-analysis models, offering significant benefits for research methodology, personalized medicine, study design, and resource allocation. By implementing and validating this new estimator, we can achieve a clearer understanding of effect consistency across studies, ultimately leading to more robust and generalizable conclusions. The primary purpose of the proposed between-study variance estimator is to improve the modeling of heterogeneity in methylation high-throughput sequencing data. FEM meta-analysis models often struggle to accurately capture the complex variability inherent in such data due to differences in study design, populations, and technical procedures. Accurate effect size estimation is critical for identifying biologically significant methylation patterns and their potential clinical implications. To refine the estimation of true effect sizes by more accurately accounting for between-study variance is very important. Reliable identification of consistent methylation markers across studies is also crucial for developing personalized medical interventions. The improved model supports the discovery of biomarkers that are consistently reproducible, aiding in disease diagnosis, prognosis, and treatment customization. Methods The random-effects model is a generalization of the fixed-effects model. The fixed effects model assumes that all studies included in the same meta-analysis have the same true effect size, while the random-effects model assumes that the true effect sizes of different studies included in the same meta-analysis obey a normal distribution. The between-study variance is a statistic used by the random-effects model to measure the heterogeneity between studies from the same topic. If the between study variance is 0, the random-effects model degenerates into the fixed-effects model. The between-study variance estimator of DSL (DerSimonian and Laird estimate) random-effects model is simple and the most commonly used method for estimating the between-study heterogeneity. In addition to being easy to calculate, the between-study variance estimator is also suitable for effect sizes of different dimensions. However, in practice, due to the possible occurrence of negative values, DSL between-study variance estimators are often truncated. In this paper, we present a non-truncated estimator of between-study variance [MATH: Em :MATH] . DSLE2 random-effects meta-analysis model We propose a general heterogeneity variance estimator for methylation sequencing data that is applicable to effect sizes at any scale and is non-negative. Assume that [MATH: y1m,y2m, ⋯,ykm :MATH] are the effect sizes of k independent studies of methylation site m; [MATH: Qm :MATH] is a heterogeneity statistic for methylation site m; [MATH: ωim :MATH] is the weight of methylation site m for ith study in the fixed-effects meta-analysis model; [MATH: ωim∗ :MATH] is the weight of methylation site m for ith study in the random-effects meta-analysis model; [MATH: σim2 :MATH] is the within-study variability representing sampling errors of ith study; [MATH: τ^m,D SL2 :MATH] is the between-study variance estimator in DSL random-effects model; [MATH: μ^m,D SL :MATH] is the mean of effects in DSL random-effects model. And the between-study variance estimator [MATH: Em :MATH] is [MATH: Em=Qm-SMM ,m∑i=1 kωim∗ -∑i=1 k(ωim∗ )2/∑i=1 kωim∗ :MATH] 1 where [MATH: Qm=∑i=1 kωimyim-μ^m2< /mn> :MATH] [MATH: μ^m= ∑i=1 kωimyim∑i=1 kωim :MATH] [MATH: ωim=1σim2,i=1,2,⋯,k :MATH] [MATH: SMM,m=∑i=1 kωim∗ yim-μ^m,DSL2 :MATH] [MATH: μ^m,DSL=∑i=2 kωim∗ yim∑i=1 kωim∗ :MATH] [MATH: ωim∗ =1σim2+τ^m,D SL2,i=1,2,⋯,k :MATH] The algorithm of DLSE2 meta-analysis model is as follows: First, we use the fixed-effects model to calculate the weight and comprehensive effect value of each study [MATH: ωim=1σim2,i=1,2,⋯,k :MATH] and [MATH: μ^m= ∑i=1 kωimyim∑i=1 kωim :MATH] Then, we calculate the heterogeneity statistic [MATH: Qm :MATH] [MATH: Qm=∑i=1 kωim(yim-μ^m)2 :MATH] We use the DSL random-effects model to calculate the between-study variance estimator, the weight of each study, and the corresponding comprehensive effect size [MATH: τ^m,D SL2=max0,Qm-(k-1)S1< /mn>m-S2< mi>m/S1m :MATH] [MATH: Srm=∑i=1 kωimr,r< /mi>=1,2 :MATH] [MATH: ωim∗ =1σim2+τ^m,D SL2,i=1,2,⋯,k :MATH] [MATH: μ^m,DSL=∑i=1 kωim∗ yim∑i=1 kωim∗ :MATH] We further calculate [MATH: SMM,m :MATH] [MATH: SMM,m=∑i=1 kωim∗ yim-μ^m,DSL2 :MATH] then [MATH: Em=Qm-SMM ,m∑i=1 kωim∗ -∑i=1 kωim∗2/∑i=1 kωim∗ :MATH] The weight of each study based on the between-study variance [MATH: Em :MATH] is [MATH: ωim,DSLE2< /mtext>=1σim2+ Em,i=1,2,⋯,k :MATH] The comprehensive effect is [MATH: μ^m,DSLE2=∑i=1 kωim,< /mo>DSLE2yim∑i=1 kωim,< /mo>DSLE2 :MATH] The random-effect for each study is [MATH: ξ^ig=E< /mi>mσim2+ Emyim-μ^m,DSLE2(τ^m,D SLE22),i =1,2,⋯,k :MATH] The variance of comprehensive effect is [MATH: Var(μ^m,DSLE2)=1∑i=1 kωim,< /mo>DSLE2 :MATH] The [MATH: z :MATH] statistic of comprehensive effect is [MATH: zm(μ^m,DSLE2)=μ^m,DSLE2Var(μ^m,DSLE2)∼N(0,1) :MATH] The lower and upper bounds of the [MATH: 100(1-α)% :MATH] confidence interval are [MATH: LLμ^m,DSLE2=μ^m,DSLE2-z(1-α)/2(m)×Var< /mtext>μ^m,DSLE2 :MATH] and [MATH: ULμ^m,DSLE2=μ^m,DSLE2+z(1-α)/2(m)×Var< /mtext>μ^m,DSLE2 :MATH] the [MATH: p :MATH] -value of one-sided test is [MATH: p=1-Φzmμ^m,DSLE2 :MATH] the [MATH: p :MATH] -value of two-sided test is [MATH: p=21-Φ|zmμ^m,DSLE2| :MATH] where [MATH: Φ(·) :MATH] is the cumulative function of the standard normal distribution. Theorem 1 Assume that [MATH: y1m,y2m, ⋯,ynm :MATH] are the effect sizes of n independent studies of methylation site [MATH: m :MATH] , and the between-study variance estimator [MATH: Em :MATH] is [MATH: Em=Qm-SMM ,mA :MATH] 2 where [MATH: Qm=∑i=1 n1σim2yim-∑i=1 n1σim2yi∑i=1 n1σim2< /mfrac>2 :MATH] 3 [MATH: SMM,m=∑i=1 n1σ im2+τ^m2yim-∑i=1 n1σ im2+τ^m2yim∑i=1 n1σ im2+τ^m22 :MATH] 4 [MATH: A=∑i=1 n1σ im2+τ^m2-∑i=1 n1σim2+τ^m22< mo movablelimits="false">∑i=1 n1σ im2+τ^m2 :MATH] 5 Then, [MATH: Em(τ^m2) :MATH] is monotone and non-decreasing with respect to [MATH: τ^m2 :MATH] . Proof It can be obtained from ([46]3) [MATH: dQmdτ^m2=0 :MATH] 6 It can be obtained from ([47]4) [MATH: dSMM ,m(τ^m2)dτ^m2=-∑i=1 n1σim2+τ^m2yim-∑i=1 n1σ im2+τ^m2yim∑i=1 n1σ im2+τ^m2 2≤0 :MATH] 7 It can be obtained from ([48]5) [MATH: dA(τ^m2)dτm2^=-∑i=1 n1(σim2+τ^m2)2+2∑i=1 n1(σim2+τ^m2)3∑i=1 n1σ im2+τ^m2-∑i=1 n1(σim2+τ^m2)22∑i=1 n1σ im2+τ^m22 :MATH] let [MATH: xi=1σim2+τ^m2 :MATH] Then [MATH: xi>0 :MATH] [MATH: dA(τ^m2)dτ^m2=-∑i=1 nxi2< /mn>+2∑i=1 nxi3< /mn>∑i=1 nxi-(∑i=1 nxi2< /mn>)2∑i=1 nxi2 :MATH] [MATH: =-∑i=1 nxi2< /mn>∑i=1 nxi2+2∑i=1 nxi3< /mn>∑i=1 nxi-∑i=1 nxi2< /mn>2∑i=1 nxi2 :MATH] [MATH: ≤2∑i=1 nxi3< /mn>∑i=1 nxi-∑i=1 nxi2< /mn>2∑i=1 nxi2 :MATH] Let [MATH: f(y)=∑i=1 nxi (4-y)∑i=1 nxiy< /mi> :MATH] Then [MATH: dfdy=-∑i=1 nxi (4-y)lnx i∑i=1 nxiy< /mi>+∑i=1 nxi (4-y)∑i=1 nxiy< /mi>lnxi :MATH] So, when [MATH: y=2 :MATH] , [MATH: dfdy=0 :MATH] . We can get that [MATH: y=2 :MATH] is the stagnation point of [MATH: f(y) :MATH] . Then, we can get [MATH: ∑i=1 nxi3< /mn>∑i=1 nxi≤∑i=1 nxi2< /mn>2 :MATH] So [MATH: dA(τ^m2)dτ^m2≤2∑i=1 nxi3< /mn>∑i=1 nxi-∑i=1 nxi2< /mn>2∑i=1 nxi2≤0 :MATH] 8 It can be obtained from ([49]2) [MATH: d(Em(τ^m2))dτ^m2=dQm- SMM,m< mrow>(τ^m2)A(τ^m2)dτ^m2=-A(τ^m2)dS< /mi>MM,m< mo stretchy="false">(τ^m2)dτ^m2+Qm-SMM,m< /mfenced>dA(τ^m2)dτ^m2A(τ^m2)2 :MATH] 9 It can be obtained from ([50]3) and ([51]4) [MATH: Qm=S MM,m(0) :MATH] It is attainable via ([52]7) [MATH: Qm-S MM,m(τ^m2)≥0 :MATH] 10 It is available through ([53]6), ([54]7), ([55]8) and ([56]10) [MATH: d(Em(τ^m2))dτ^m2≥0 :MATH] So, [MATH: Em(τ^m2) :MATH] increases with the increase of [MATH: τ^m2 :MATH] . Theorem 2 Assume that [MATH: y1m,y2m, ⋯,ynm :MATH] are the effect sizes of n independent studies of methylation site m, and the between-study variance estimator [MATH: Em :MATH] is [MATH: Em=Qm-SMM ,mA :MATH] where [MATH: Qm=∑i=1 n1σim2yim-∑i=1 n1σim2yi∑i=1 n1σim2< /mfrac>2 :MATH] [MATH: SMM,m=∑i=1 n1σ im2+τ^m2yim-∑i=1 n1σ im2+τ^m2yim∑i=1 n1σ im2+τ^m22 :MATH] [MATH: A=∑i=1 n1σ im2+τ^m2-∑i=1 n1σim2+τ^m22< mo movablelimits="false">∑i=1 n1σ im2+τ^m2 :MATH] Then, [MATH: Em≥0 :MATH] . Proof Because [MATH: A(τ^m2)=∑i=1 n1σ im2+τ^m2-∑i=1 n1σim2+τ^m22< mo movablelimits="false">∑i=1 n1σ im2+τ^m2>0 :MATH] [MATH: Qm-S MM,m(τ^m2)≥0 :MATH] we have [MATH: Em(τ^m2)=∑i=1 n1σim2yim-∑i=1 n1σim2yi∑i=1 n1σim2< /mfrac>2-∑i=1 n1σ im2+τ^m2yim-∑i=1 n1σ im2+τ^m2yim∑i=1 n1σ im2+τ^m22∑i=1 n1σ im2+τ^m2-∑i=1 n1σim2+τ^m22< mo movablelimits="false">∑i=1 n1σ im2+τ^m2= Qm-SM< /mi>M,m(τ^m2)A(τ^m2)≥0 :MATH] Results Simulation study We simulated five sets of meta-analysis data. Each set of data has [MATH: K(K=4,6,8,10,12) :MATH] studies. Each study contains 20,000 methylation sites and 30 sample data. The first 15 samples are control data, and the last 15 sample data are experimental data. The methylation sites of each sample in each study are divided into 100 categories [MATH: (c=1,2,⋯,100) :MATH] , with 200 methylation sites in each category. The top 9000 methylation sites of each study are divided into K groups [MATH: (km=1 ,2,⋯,K) :MATH] . The first [MATH: 9000/K :MATH] methylation sites belong to the first group [MATH: (km=1 ) :MATH] , and the [MATH: 9000/K+1 :MATH] to [MATH: 9000∗2/K :MATH] methylation sites belong to the second group group [MATH: (km=2 ) :MATH] , and so on, the methylation site from [MATH: 9000(K-1)/K+1 :MATH] to 9000 methylation sites belong to the K-th group [MATH: (kg=K ) :MATH] , the 9001 to 20,000 methylation sites belong to the 0th group [MATH: (km=0 ) :MATH] . The algorithm for simulating data is summarized as follows: First, we randomly sampled the methylation sites of the [MATH: c(1≤c≤100< /mn>) :MATH] th class of the [MATH: k(1≤k≤K) :MATH] th study [MATH: k(1≤k≤K) :MATH] , where [MATH: k(1≤k≤K) :MATH] , [MATH: I200×200 :MATH] is identity matrix, [MATH: J200×200 :MATH] is a matrix with all elements [MATH: 1 :MATH] , [MATH: W-1 :MATH] is the inverse of the Wishart distribution, and [MATH: ∑ck′ :MATH] is normalized to [MATH: ∑ck :MATH] , such that all its diagonal elements are [MATH: 1 :MATH] . Then, sample the methylation level of the [MATH: c(1≤c≤100< /mn>) :MATH] th class site of the [MATH: n(1≤n≤30) :MATH] sample of the [MATH: k :MATH] th study [MATH: (Xmc1nk′,Xmc2nk′,⋯,Xmc200nk ′)T∼MVN(0,∑ck< mo stretchy="false">),1≤ k≤K :MATH] . We performed differential methylation settings on the first 9000 methylation sites, randomly sampling [MATH: δmk∈(0,1) :MATH] , so that [MATH: ∑k=1 Kδmk=km(km=1 ,2,⋯,K) :MATH] . When [MATH: δmk=1 :MATH] , site [MATH: m :MATH] of the [MATH: k :MATH] th study is a significantly differentially methylated site. When [MATH: δmk=0 :MATH] , site $m$ of the [MATH: k :MATH] th study is a non-significantly differential methylation site. This paper conducts random sampling [MATH: μmk∼U(0.5,3) :MATH] . The methylation level of the control group remains unchanged, and the methylation expression level of the experimental group is [MATH: Ymnk=Xm(n+N)k′+< msub>μmk∗ δmk :MATH] , where [MATH: 1≤m≤9000,1≤n≤15,1≤k≤K :MATH] . We compared the distribution of between study variance estimators of the DSLE2 random-effects model and six other meta-analysis models. As can be seen from Fig. [57]1, compared with the other six random-effects models, the between-study variance estimators of the DSLE2 meta-analysis model is relatively concentrated. The between-study variance estimators of DSLE2, DL, EB, HO, RML, and SJ random-effects models are mostly below 0.5. Compared with the other 6 random-effects models, the DSLE2 random-effects model has a relatively larger number of between-study variances distributed around 0. Fig. 1. [58]Fig. 1 [59]Open in a new tab Violin plot of between-study variance for seven random-effects meta-analysis models We also compared the accuracy, false negative rate, negative prediction rate, recall rate, Matthews correlation coefficient, PCMiss (Prediction-conditioned miss) value, SAR value, and F value of the DSLE2 meta-analysis model with DL, EB, HO, RML, and SJ meta-analysis models under three hypothesis. Accuracy We compared the accuracy of seven random-effects models under three hypothesis (see Fig. [60]2, Figure S1 and Figure S2). Under the first hypothesis, the DSLE2 meta-analysis model has the highest accuracy, the SJ random-effects model has the lowest accuracy, and the DSLR2, DL, EB, HO, and RML meta-analysis models have the similar accuracy. The accuracy of the DSLR2, DL, EB, HO, RML, and SJ meta-analysis models decreases as the number of studies increases. There is no obvious decreasing trend as the number increases for DSLE2 random-effects model. Under the second hypothesis, the DSLE2 meta-analysis model has the lowest accuracy, the SJ meta-analysis model has the highest accuracy.The accuracy of the DSLR2, DL, EB, HO, RML, and SJ meta-analysis models increases with the number of studies increases. Under the third hypothesis, the DSLE2 random-effects model has the lowest accuracy, and the DSLR2 meta-analysis model has the highest accuracy. Fig. 2. [61]Fig. 2 [62]Open in a new tab Plot of accuracy for 7 random-effects meta-analysis models under the first hypothesis False negative rate The false negative rate, also known as the second type error rate, refers to the proportion of the number of significantly differential methylation sites predicted by the model as non-significantly differential methylation sites to the number of all significantly differential methylation sites. We compared the false negative rate of seven meta-analysis models under three hypothesis (see Fig. [63]3, Figure S3 and Figure S4). Fig. 3. [64]Fig. 3 [65]Open in a new tab Plot of false negative rate curves for seven random-effects meta-analysis models under the first hypothesis Under the first hypothesis, the false negative rate of the DSLE2 model is the lowest among the seven random-effects models; the false negative rate of the SJ model is the highest among the seven random-effects models. The false negative rates of the DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close, and they are all lower than the false-negative rates of the SJ meta-analysis model, and they are all higher than the false negative rate of the DSLE2 meta-analysis model. The false negative rate of the DSLR2, DL, EB, HO, RML, and SJ meta-analysis models increases with the increase of the number of studies. Under the second hypothesis, the false negative rate of the DSLE2 meta-analysis model is the lowest among the seven random-effects models, and the false negative rates of the DL, EB, HO, and RML random-effects models are relatively close, and the false negative rate of the SJ model is the highest among the seven meta-analysis models. Under the third hypothesis, the false negative rate of the DSLE2 meta-analysis model is the lowest among the seven meta-analysis models; the false negative rate of the SJ random-effects model is the highest among the seven random-effects models. The false negative rates of the DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close, and they are all lower than the false negative rates of the SJ random-effects model, and they are all higher than the false negative rate of the DSLE2 meta-analysis model. The false negative rate of the DSLR2, DL, EB, HO, RML, and SJ random-effects models increases with the increase of the number of studies. Matthews correlation coefficient Matthews correlation coefficient is the correlation coefficient that describes the actual class and the predicted class. We compared the Matthews correlation coefficients of seven random-effects models under three hypothesis testing conditions (see Fig. [66]4, Figure S5 and Figure S6). Under the first hypothesis, when the number of studies is greater than 4, the Matthews correlation coefficient of the DSLE2 meta-analysis model is the highest among the Matthews correlation coefficient of the seven random-effects models. The Matthews correlation coefficient of SJ meta-analysis model is the lowest among the Matthews correlation coefficients of the seven meta-analysis models. The Matthews correlation coefficients of DSLR2, DL, EB, HO, and RML random-effects models are close, which are lower than the Matthews correlation coefficient of the DSLE2 meta-analysis model and higher than the Matthews correlation coefficient of the SJ random-effects model. The Matthews correlation coefficient of DSLR2, DL, EB, HO, RML, SJ meta-analysis models decreases as the number of studies increases. Under the second hypothesis, the Matthews correlation coefficient of the DSLE2 random-effects model is the lowest among the Matthews correlation coefficients of the seven meta-analysis models; the Matthews correlation coefficient of the SJ random-effects model is the highest among the Matthews correlation coefficients of the seven random-effects models; the Matthews correlation coefficients of DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close. The Matthews correlation coefficients of HO, RML, and SJ random-effects models increase with the increase of the number of studies. Under the third hypothesis, the Matthews correlation coefficient of the DSLE2 random effects meta-analysis model is the lowest among the Matthews correlation coefficients of the seven meta-analysis models, and the Matthews correlation coefficient of the DSLR2 meta-analysis model is the highest among the Matthews correlation coefficients of the seven random-effects meta-analysis models. Fig. 4. [67]Fig. 4 [68]Open in a new tab Plot of Matthews correlation coefficient curves for 7 random-effects models under the first hypothesis Negative predictive value Negative predictive value refers to the proportion of the number of non-significantly differentially methylated sites correctly predicted by meta-analysis model to the number of all non-significantly differentially methylated sites predicted by meta-analysis model. We compared the negative predictive values of seven meta-analysis models (see Fig. [69]5, Figure S7 and Figure S8). Under the first hypothesis, the negative predictive values of the DSLE2 meta-analysis model are the highest among seven random-effects models; the negative predictive values of the SJ meta-analysis model are the lowest among seven random-effects models. The negative predictive values of the DSLR2, DL, EB, HO, and RML random-effects models are relatively close, and they are all lower than the negative predictive value of the DSLE2 random-effects model and higher than the negative predictive values of SJ random-effects model. Under the second hypothesis, the negative predictive values of the SJ meta-analysis model are the lowest among the seven random-effects models. The negative predictive values of the DL, EB, HO, and RML random-effects models are relatively close. Under the third hypothesis, the negative predictive values of the DSLE2 meta-analysis model are the highest among the seven random effects models; the negative predictive values of the SJ random-effects model are the lowest among the seven meta-analysis models. The negative predictive rates of DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close. Fig. 5. [70]Fig. 5 [71]Open in a new tab Plot of negative predictive value curves for 7 random-effects models under the first hypothesis Prediction-conditioned miss PCMiss (Prediction-Conditioned Miss) refers to the proportion of the number of sites incorrectly predicted as non-significantly differentially methylated to the number of all non-significantly differentially methylated sites predicted by the meta-analysis model. We compared the PCMiss values of seven meta-analysis models under three hypothesis (see Fig. [72]6, Figure S9 and Figure S10). Under the first hypothesis, the PCMiss values of the DSLE2 random-effects model are the lowest among the PCMiss values of the seven meta-analysis models; the PCMiss values of the SJ random-effects model are the highest among the seven meta-analysis models. The PCMiss values of the DSLR2, DL, EB, HO, and RML random-effects meta-analysis models are relatively close; the PCMiss values of the seven models all increase with the increase of the number of studies. Under the second hypothesis, the PCMiss values of the SJ model are the highest among the PCMiss values of the seven models. The PCMiss values of the DL, EB, HO, and RML random-effects models are close. Under the third hypothesis testing, the PCMiss values of the DSLE2 model are the lowest among the PCMiss values of seven random effects meta-analysis models; the PCMiss values of the SJ meta-analysis model are the highest among the seven random-effects models. The PCMiss values of the DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close. Fig. 6. [73]Fig. 6 [74]Open in a new tab Plot of PCMiss-value curves for 7 random-effects models under the first hypothesis Recall This paper measured the recall rates of seven random-effects models under three hypothesis (see Fig. [75]7, Figure S11 and Figure S12). Under the first hypothesis, the recall rates of the DSLE2 meta-analysis model are the highest among seven random effects meta-analysis models; the recall rates of the SJ random-effects model are the lowest among seven meta-analysis models. The recall rates of the DL, EB, HO, and RML meta-analysis models are relatively close. The recall rates of the DSLR2, DL, EB, HO, RML, and SJ random-effects models decreases as the number of studies increases. Under the second hypothesis, the recall rates of the DSLE2 model are the highest among the seven meta-analysis models, followed by the recall rates of the DSLR2 model, and the recall rates of the SJ are the lowest among seven models. The recall rates of the DL, EB, HO, and RML random-effects models are relatively close. Under the third hypothesis, the recall rates of the DSLE2 meta-analysis model are the highest among the seven random-effects models, and the recall rate of the SJ model is the lowest among the seven meta-analysis models. The recall rates of DSLR2, DL, EB, HO, and RML random-effects models are relatively close. Fig. 7. [76]Fig. 7 [77]Open in a new tab Plot of recall curves for 7 random-effects meta-analysis models under the first hypothesis SAR SAR combines accuracy, AUC (the area under the receiver operating characteristic curve) value, and root mean square error. The SAR value is more robust than a single indicator. We computed the SAR values of seven meta-analysis models under three hypothesis (see Fig. [78]8, Figure S13 and Figure S14). Under the first hypothesis, the SAR values of the DSLE2 random-effects model are the highest among the SAR values of seven meta-analysis models, and the SAR values of SJ model are the lowest among the seven random-effects models. The SAR values of DL, EB, HO, and RML meta-analysis models are relatively close. Under the second hypothesis, the SAR values of DSLR2 random-effects model are the highest among the SAR values of seven meta-analysis models, and the SAR values of the DSLE2 model are lowest among the seven random-effects models. The SAR values of the DL, EB, HO, RML, and SJ random-effects meta-analysis models are relatively close. The SAR values of the seven meta-analysis models all increase with the increase of the number of studies. Under the third hypothesis, the SAR values of the DSLR2 meta-analysis model are the highest among the SAR values of seven random-effect models. Moreover, the SAR values of the DL, EB, HO, and RML meta-analysis models are relatively close. Fig. 8. [79]Fig. 8 [80]Open in a new tab Plot of SAR-value curves for 7 random-effects meta-analysis models under the first hypothesis Precision-recall F measure Precision and recall indicators sometimes conflict with each other. In this case, precision and recall indicators need to be considered comprehensively. The most common method is to calculate the F value of precision and recall. We computed the [MATH: F :MATH] values of 7 random-effects models under three hypothesis (see Fig. [81]9, Figure S15 and Figure S16). Under the first hypothesis, the F values of the DSLE2 random-effects model are the highest among the F values of seven random-effects models, and the F values of the SJ meta-analysis model are the lowest among the seven meta-analysis models. The F values of DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close. Under the second hypothesis, the F values of the SJ meta-analysis model are the highest among the F values of seven random-effects models, and the F values of the DSLE2 random-effects model are the lowest among the seven meta-analysis models. The F values of DSLR2, DL, EB, HO, and RML random-effects models are relatively close. Under the third hypothesis, the F values of the DSLE2 random-effects model is the lowest among the F values of seven meta-analysis models. Fig. 9. [82]Fig. 9 [83]Open in a new tab Plot of F-value curves for 7 random-effects meta-analysis models under the first hypothesis Application of DSLE2 random-effects meta-analysis model to lung cancer methylation data Lung cancer is the second most common cancer worldwide and the most common cancer among men. According to the World Cancer Statistics Center, more than 2.2 million cases of lung cancer occur every year [[84]17]. We collected three sets of lung cancer methylation data in the GEO database: [85]GSE63704, [86]GSE83842, and [87]GSE85845. The EPIC and 450 K methylation data of lung cancer in TCGA database were also collected. A total of 927 samples and 156,680 methylation sites were analyzed. The distribution of the number of significantly differential methylation sites determined by 7 random-effects models is shown in the Fig. [88]10. The number of significantly differential methylation sites determined by the DSLE2 random-effects model is more than the number of significantly differential methylation sites determined by DSLR2 and is less than the number of significantly differential methylation sites determined by DL, EB, HO, RML, and SJ meta-analysis models. Fig. 10. [89]Fig. 10 [90]Open in a new tab Histogram of the number distribution of significantly differential methylation sites identified by 7 random-effects meta-analysis models The DSLE2 random-effects model identified 59,146 significantly differential methylation sites, of which 2,230 significantly differential methylation sites distributed in 1stExon of 3,039 genes, and 2,003 significantly differential methylation sites were distributed in 3’UTR of 1,660 genes, and 4899 significantly differentially methylated sites distributed in the 5’UTR of 2754 genes, and 21,029 significantly differentially methylated sites distributed in the Body of 7056 genes, and 7500 significantly differentially methylated sites distributed in the TSS1500 of 5113 genes, and 6210 significantly differentially methylated sites distributed in TSS200 of 3963 genes. Research shows that lung cancer is closely related to A2BP1 [[91]8], AACS [[92]18], DNAH10 [[93]19], PINK1 [[94]20], and other genes with significantly differential methylation sites. The significantly differentially methylated sites identified by the DSLE2 meta-analysis model may affect the expression of the corresponding genes. Application of DSLE2 meta-analysis model to Parkinson’s methylation data Parkinson’s disease (PD) is a chronic neurodegenerative disease [[95]21]. According to the World Health Organization (WHO), there were 2.5 million and 6.1 million cases of PD worldwide in 1990 and 2016 [[96]22]. However, the number of PD patients increased significantly to 8.5 million, and it is estimated that by 2040, the number of PD patients worldwide may exceed 17 million [[97]23]. In 2019, PD caused 5.8 million disabilities, an 81% increase since 2000, and 329,000 deaths, an increase of more than 100 percent since 2000 [[98]23]. PD causes serious trouble to people’s life, but there is no clear cause and effective treatment. DSLE2 random-effects model was used to further study the causes of PD. We collected five sets of PD methylation data in the GEO database: [99]GSE72774, [100]GSE72776, [101]GSE111629, [102]GSE145361 and [103]GSE165081. A total of 3,080 samples and 161,261 methylation sites were analyzed. The DSLE2 model identified 26,244 significantly differentially methylation sites (Fig. [104]11). Most of significantly differentially methylation sites identified by DSLE2, SJ, HO, DSL, DSLR2 meta-analysis methods were same. DSLE2 model independently identified 2181 significantly differentially methylation sites. Fig. 11. Fig. 11 [105]Open in a new tab Venn of significantly differential methylation sites identified by 5 random-effects meta-analysis models We further analyzed the location of significantly differential methylation sites on genes (Figs. [106]12 and [107]13). Most of the significant differential methylation sites were located on the gene body, followed by TSS1500 and 5’UTR (Fig. [108]12). Relatively few Significantly differential methylation sites were located at 3’UTR and the first exon (Fig. [109]12). Most of the significantly differential methylation sites are uniquely located on a single gene and only a small number of genes contain multiple significantly differential methylation sites (Fig. [110]13). There are 51, 62 and 122 significantly differentially methylation sites for BCOR, HDAC4 and PRDM16, respectively. Research shows that PD is closely related to BCOR [[111]24], HDAC4 [[112]25], PRDM16 [[113]26], and other genes with significantly differential methylation sites. Fig. 12. [114]Fig. 12 [115]Open in a new tab Histogram of significantly differential methylation sites on genes. The horizontal axis refers features, and the vertical axis is the number of significantly differential methylation sites Fig. 13. [116]Fig. 13 [117]Open in a new tab Histogram of significantly differential methylation sites on genes. The horizontal axis is the number of significantly differential methylation sites, and the vertical axis is the number of genes Discussion and conclusion For high-throughput methylation sequencing data, we proposed the DSLE2 random effects meta-analysis model based on the between-study variance estimator [MATH: Em :MATH] . Under the alternative hypothesis that the effect sizes of all studies are not 0, the DSLE2 meta-analysis model performs better than the other 6 random-effects models in terms of accuracy, negative prediction rate, recall rate, Matthews correlation coefficient, PCMiss value, and SAR value, F value. We applied the DSLE2 meta-analysis model to the lung cancer methylation datasets. 59,146 significantly differential methylation sites were identified, with 2,230, 2,230 2003, 4899, 21,029, 7500, and 6210 significantly differentially methylated sites were located on 1stExon, 3'UTR, 5'UTR, Body, TSS1500, and TSS200 of 3039, 1660, 2754, 7056, 5113, and 3963 genes, respectively. Studies have shown that lung cancer is closely related to genes such as A2BP1, AACS, and DNAH10 with significantly differential methylation sites. We further applied the DSLE2 meta-analysis model to the PD methylation datasets. 26,244 significantly differentially methylation sites, with 952, 776, 2466, 7604, 3897 and 2052 significantly differentially methylated sites were located on 1stExon, 3'UTR, 5'UTR, Body, TSS1500, and TSS200 of 703, 700, 1240, 3059, 2495 and 1267 genes, respectively. Research shows that PD is closely related to BCOR, HDAC4, PRDM16, and other genes with significantly differential methylation sites. This work presents a new between-study variance estimator for meta-analysis model. First, the primary purpose of the between-study variance estimator is to measure the extent of variability in effect sizes across different studies that is attributable to true differences rather than within-study sampling error. Accurately estimating between-study variance allows us to distinguish between variability due to real differences in study effects and variability due to random error, providing a clearer understanding of the consistency of the effect across studies. Second, meta-analyses synthesize data from multiple studies to derive conclusions with greater statistical power. By accurately accounting for between-study variance, the proposed model enhances the validity of meta-analytic results, leading to more trustworthy conclusions. And it can improve the quality of evidence synthesis, promoting better decision-making in clinical practice, policy, and further research. Third, understanding heterogeneity is key to identifying which treatments work best for specific populations in medical research. A better estimation of between-study variance can help in understanding consistent treatment effects, aiding in the development of personalized therapeutic strategies. Moreover, efficiently allocating research resources requires an understanding of where variability lies. By identifying true sources of heterogeneity, researchers and funding bodies can focus efforts on areas with the most significant impact, optimizing the use of limited resources. In addition, after significant differential methylation sites identified by DSLE2 meta-analysis model, several software tools and platforms can be used for downstream analysis. These tools help in various aspects such as functional annotation, pathway enrichment analysis and so on. We can use Chip Analysis Methylation Pipeline (ChAMP) to annotate significantly differential methylated regions (DMRs), perform gene ontology (GO) analysis, and integrating with other epigenetic data. GREAT (Genomic Regions Enrichment of Annotations Tool) can be used to annotate and analyze the functional significance of sets of genomic regions, including DMRs, by associating them with nearby genes and functional terms. We can also use GSEA (Gene Set Enrichment Analysis) to do pathway enrichment analysis based on significant differential methylation sites identified by DSLE2. Supplementary Information [118]12864_2025_11316_MOESM1_ESM.zip^ (809.9KB, zip) Supplementary Material 1: Figure S1 to S16 are additional files which are plots of accuracy, false negative rates and other measures under the second and third hypothesis. Acknowledgements