Abstract
Background
   With the rapid development of high-throughput sequencing technology,
   high-throughput sequencing data has grown on a massive scale, leading
   to the emergence of multiple public databases, such as EBI and GEO.
   Conducting secondary mining of high-throughput sequencing data in these
   databases can yield more valuable insights. Meta-analysis can
   quantitatively combine high-throughput sequencing data from the the
   same topic. It increases the sample size for data analysis, enhances
   statistical power, and results in more consistent and reliable
   conclusions.
Results
   This study proposes a new between-study variance estimator
   [MATH: Em :MATH]
   . We prove that
   [MATH: Em :MATH]
   is non-negative and
   [MATH: Emτ^m2 :MATH]
   increases with the increase of
   [MATH: τ^m2 :MATH]
   , satisfying the general conditions of the between-study variance
   estimator. We get the DSLE2 (two-step estimation starting with the DSL
   estimate and the
   [MATH: Em :MATH]
   in the second step) random-effects meta-analysis model based on the
   between-study variance estimator Em. The accuracy and a series of
   evaluation metrics of the DSLE2 model are better than those of the
   other 6 meta-analysis models. DSLE2 model is applied to lung cancer and
   Parkinson’s methylation data. Significantly differentially methylated
   sites identified by DSLE2 model and the genes with significantly
   differentially methylated sites are closely related to two diseases,
   indicating the effectiveness of DSLE2 random-effects model.
Conclusions
   This paper propose the DSLE2 random-effects meta-analysis model based
   on new between-study variance estimator Em. The DSLE2 model performs
   well for methylation data.
Supplementary Information
   The online version contains supplementary material available at
   10.1186/s12864-025-11316-3.
   Keywords: Methylation sequencing data, Meta-analysis, Between-study
   variance, Random-effects model
Background
   DNA methylation refers to the addition of a methyl group at the CpG
   site of the DNA chain [[30]1]. DNA methylation occurs at millions of
   CpG dinucleotide positions in the genome and changes with age and the
   external environment [[31]2]. It is an important epigenetic
   modification [[32]3]. In the correlation analysis between methylation
   and traits, HM450K (Infinium Human Methylation 450 K BeadChip) or other
   Illumina technologies use the methylated fluorescence signal intensity
   (
   [MATH: M :MATH]
   ) and the unmethylated fluorescence signal intensity (
   [MATH: U :MATH]
   ) of the CpG site to calculate DNA methylation levels [[33]4]. Two
   commonly used calculation methods for methylation levels are
   [MATH: β=M/(M+U) :MATH]
   and
   [MATH:
   M=log2(M/U) :MATH]
   [[34]5]. In the minfi R package, the getBeta function and getM function
   can be used directly to calculate the
   [MATH: β :MATH]
   value and
   [MATH: M :MATH]
   value of every probe [[35]6].
   With the development of high-throughput sequencing technology, multiple
   public databases have been formed, such as the GEO (Gene Expression
   Omnibus) database and TCGA (The Cancer Genome Atlas database) database
   [[36]7]. People can obtain methylation sequencing data of the same
   disease from different databases [[37]8]. Due to differences in
   experimental conditions and sample processing procedures, the results
   of methylation-trait association analysis of the same disease in
   different studies are different [[38]9]. Meta-analysis can combine
   methylation data on the same topic to provide a more reliable list of
   significantly different methylation sites [[39]10]. In meta-analysis,
   datasets from independent studies with the same purpose often have
   heterogeneity [[40]11]. In the random-effects model, the comprehensive
   effect-size result depends on the quantification of heterogeneity
   between studies [[41]12]. Therefore, the estimation of heterogeneity is
   the most critical step in the random-effects meta-analysis model
   [[42]13, [43]14]. In addition to quantifying heterogeneity,
   investigating differences in experimental designs, participants, and
   interventions across studies can help to understand the causes of
   heterogeneity [[44]15, [45]16]. This paper aims to estimate the size of
   the between-study variance in a random-effects meta-analysis model.
   We propose the DSLE2 meta-analysis model based on new between-study
   variance estimator
   [MATH: Em :MATH]
   for methylation high-throughput sequencing data. Under three hypothesis
   testing conditions, this model is compared with DSLR2 (two-step
   estimation starting with the DSL estimate and the
   [MATH: R2 :MATH]
   in the second step), DL (DerSimonian and Laird estimate), EB (Empirical
   Bayes estimate), HO (Hedges and Olkin estimate), RML (restricted
   maximum likelihood estimate), and SJ (Sidik and Jonkman estimate)
   random-effects meta-analysis models using sensitivity, and other
   evaluation metrics. The results show that the DSLE2 methylation
   meta-analysis model performs well under the first hypothesis testing
   condition. We apply the DSLE2 random-effects model to lung cancer and
   Parkinson’s methylation data, further demonstrating the reliability of
   the DSLE2 meta-analysis model.
   The proposed between-study variance estimator is designed to accurately
   quantify the true variability in effect sizes across studies,
   distinguishing it from random error. This advancement enhances the
   precision and reliability of meta-analysis models, offering significant
   benefits for research methodology, personalized medicine, study design,
   and resource allocation. By implementing and validating this new
   estimator, we can achieve a clearer understanding of effect consistency
   across studies, ultimately leading to more robust and generalizable
   conclusions. The primary purpose of the proposed between-study variance
   estimator is to improve the modeling of heterogeneity in methylation
   high-throughput sequencing data. FEM meta-analysis models often
   struggle to accurately capture the complex variability inherent in such
   data due to differences in study design, populations, and technical
   procedures. Accurate effect size estimation is critical for identifying
   biologically significant methylation patterns and their potential
   clinical implications. To refine the estimation of true effect sizes by
   more accurately accounting for between-study variance is very
   important. Reliable identification of consistent methylation markers
   across studies is also crucial for developing personalized medical
   interventions. The improved model supports the discovery of biomarkers
   that are consistently reproducible, aiding in disease diagnosis,
   prognosis, and treatment customization.
Methods
   The random-effects model is a generalization of the fixed-effects
   model. The fixed effects model assumes that all studies included in the
   same meta-analysis have the same true effect size, while the
   random-effects model assumes that the true effect sizes of different
   studies included in the same meta-analysis obey a normal distribution.
   The between-study variance is a statistic used by the random-effects
   model to measure the heterogeneity between studies from the same topic.
   If the between study variance is 0, the random-effects model
   degenerates into the fixed-effects model. The between-study variance
   estimator of DSL (DerSimonian and Laird estimate) random-effects model
   is simple and the most commonly used method for estimating the
   between-study heterogeneity. In addition to being easy to calculate,
   the between-study variance estimator is also suitable for effect sizes
   of different dimensions. However, in practice, due to the possible
   occurrence of negative values, DSL between-study variance estimators
   are often truncated. In this paper, we present a non-truncated
   estimator of between-study variance
   [MATH: Em :MATH]
   .
DSLE2 random-effects meta-analysis model
   We propose a general heterogeneity variance estimator for methylation
   sequencing data that is applicable to effect sizes at any scale and is
   non-negative. Assume that
   [MATH:
   y1m,y2m,
   ⋯,ykm :MATH]
   are the effect sizes of k independent studies of methylation site m;
   [MATH: Qm :MATH]
   is a heterogeneity statistic for methylation site m;
   [MATH: ωim :MATH]
   is the weight of methylation site m for ith study in the fixed-effects
   meta-analysis model;
   [MATH: ωim∗
    :MATH]
   is the weight of methylation site m for ith study in the random-effects
   meta-analysis model;
   [MATH: σim2 :MATH]
   is the within-study variability representing sampling errors of ith
   study;
   [MATH: τ^m,D
   SL2 :MATH]
   is the between-study variance estimator in DSL random-effects model;
   [MATH: μ^m,D
   SL :MATH]
   is the mean of effects in DSL random-effects model. And the
   between-study variance estimator
   [MATH: Em :MATH]
   is
   [MATH:
   Em=Qm-SMM
   ,m∑i=1
   kωim∗
   -∑i=1
   k(ωim∗
   )2/∑i=1
   kωim∗
    :MATH]
   1
   where
   [MATH: Qm=∑i=1
   kωimyim-μ^m2<
   /mn> :MATH]
   [MATH: μ^m=
   ∑i=1
   kωimyim∑i=1
   kωim :MATH]
   [MATH: ωim=1σim2,i=1,2,⋯,k :MATH]
   [MATH:
   SMM,m=∑i=1
   kωim∗
   yim-μ^m,DSL2 :MATH]
   [MATH: μ^m,DSL=∑i=2
   kωim∗
   yim∑i=1
   kωim∗
    :MATH]
   [MATH: ωim∗
   =1σim2+τ^m,D
   SL2,i=1,2,⋯,k :MATH]
   The algorithm of DLSE2 meta-analysis model is as follows:
   First, we use the fixed-effects model to calculate the weight and
   comprehensive effect value of each study
   [MATH: ωim=1σim2,i=1,2,⋯,k :MATH]
   and
   [MATH: μ^m=
   ∑i=1
   kωimyim∑i=1
   kωim :MATH]
   Then, we calculate the heterogeneity statistic
   [MATH: Qm :MATH]
   [MATH: Qm=∑i=1
   kωim(yim-μ^m)2 :MATH]
   We use the DSL random-effects model to calculate the between-study
   variance estimator, the weight of each study, and the corresponding
   comprehensive effect size
   [MATH: τ^m,D
   SL2=max0,Qm
   mi>-(k-1)S1<
   /mn>m-S2<
   mi>m/S1m :MATH]
   [MATH: Srm=∑i=1
   kωimr,r<
   /mi>=1,2 :MATH]
   [MATH: ωim∗
   =1σim2+τ^m,D
   SL2,i=1,2,⋯,k :MATH]
   [MATH: μ^m,DSL=∑i=1
   kωim∗
   yim∑i=1
   kωim∗
    :MATH]
   We further calculate
   [MATH:
   SMM,m
   msub> :MATH]
   [MATH:
   SMM,m=∑i=1
   kωim∗
   yim-μ^m,DSL2 :MATH]
   then
   [MATH:
   Em=Qm-SMM
   ,m∑i=1
   kωim∗
   -∑i=1
   kωim∗2
   mn>/∑i=1
   kωim∗
    :MATH]
   The weight of each study based on the between-study variance
   [MATH: Em :MATH]
   is
   [MATH:
   ωim,DSLE2<
   /mtext>=1σim2+
   Em,i=1,2,⋯,k
    :MATH]
   The comprehensive effect is
   [MATH: μ^m,DSLE2
   mtext>=∑i=1
   kωim,<
   /mo>DSLE2yim∑i=1
   kωim,<
   /mo>DSLE2 :MATH]
   The random-effect for each study is
   [MATH: ξ^ig=E<
   /mi>mσim2+
   Emyim-μ^m,DSLE2
   mtext>(τ^m,D
   SLE22),i
   =1,2,⋯,k
    :MATH]
   The variance of comprehensive effect is
   [MATH: Var(μ^m,DSLE2
   mtext>)=1∑i=1
   kωim,<
   /mo>DSLE2 :MATH]
   The
   [MATH: z :MATH]
   statistic of comprehensive effect is
   [MATH: zm(μ^m,DSLE2
   mtext>)=μ^m,DSLE2
   mtext>Var(μ^m,DSLE2
   mtext>)∼N(0,1) :MATH]
   The lower and upper bounds of the
   [MATH: 100(1-α)% :MATH]
   confidence interval are
   [MATH: LLμ^m,DSLE2
   mtext>=μ^m,DSLE2
   mtext>-z(1-α)/2(m)×Var<
   /mtext>μ^m,DSLE2
   mtext> :MATH]
   and
   [MATH: ULμ^m,DSLE2
   mtext>=μ^m,DSLE2
   mtext>+z(1-α)/2(m)×Var<
   /mtext>μ^m,DSLE2
   mtext> :MATH]
   the
   [MATH: p :MATH]
   -value of one-sided test is
   [MATH: p=1-Φzmμ^m,DSLE2
   mtext> :MATH]
   the
   [MATH: p :MATH]
   -value of two-sided test is
   [MATH: p=21-Φ|zmμ^m,DSLE2
   mtext>|
    :MATH]
   where
   [MATH: Φ(·)
   :MATH]
   is the cumulative function of the standard normal distribution.
   Theorem 1 Assume that
   [MATH:
   y1m,y2m,
   ⋯,ynm :MATH]
   are the effect sizes of n independent studies of methylation site
   [MATH: m :MATH]
   , and the between-study variance estimator
   [MATH: Em :MATH]
   is
   [MATH:
   Em=Qm-SMM
   ,mA
   :MATH]
   2
   where
   [MATH: Qm=∑i=1
   n1σim2yim-∑i=1
   n1σim2yi∑i=1
   n1σim2<
   /mfrac>2 :MATH]
   3
   [MATH:
   SMM,m=∑i=1
   n1σ
   im2+τ^m2yim-∑i=1
   n1σ
   im2+τ^m2yim∑i=1
   n1σ
   im2+τ^m22 :MATH]
   4
   [MATH: A=∑i=1
   n1σ
   im2+τ^m2-∑i=1
   n1σim2+τ^m22<
   mo
   movablelimits="false">∑i=1
   n1σ
   im2+τ^m2 :MATH]
   5
   Then,
   [MATH: Em(τ^m2) :MATH]
   is monotone and non-decreasing with respect to
   [MATH: τ^m2 :MATH]
   .
   Proof It can be obtained from ([46]3)
   [MATH:
   dQmdτ^m2=0 :MATH]
   6
   It can be obtained from ([47]4)
   [MATH:
   dSMM
   ,m(τ^m2)dτ^m2=-∑i=1
   n1σim2+τ^m2yim-∑i=1
   n1σ
   im2+τ^m2yim∑i=1
   n1σ
   im2+τ^m2
   2≤0 :MATH]
   7
   It can be obtained from ([48]5)
   [MATH: dA(τ^m2)dτm2^=-∑i=1
   n1(σim2+τ^m2)2+2∑i=1
   n1(σim2+τ^m2)3∑i=1
   n1σ
   im2+τ^m2-∑i=1
   n1(σim2+τ^m2)22∑i=1
   n1σ
   im2+τ^m22
   :MATH]
   let
   [MATH:
   xi=1σim2+τ^m2 :MATH]
   Then
   [MATH:
   xi>0
   :MATH]
   [MATH: dA(τ^m2)dτ^m2=-∑i=1
   nxi2<
   /mn>+2∑i=1
   nxi3<
   /mn>∑i=1
   nxi-(∑i=1
   nxi2<
   /mn>)2∑i=1
   nxi2 :MATH]
   [MATH: =-∑i=1
   nxi2<
   /mn>∑i=1
   nxi2+2∑i=1
   nxi3<
   /mn>∑i=1
   nxi-∑i=1
   nxi2<
   /mn>2∑i=1
   nxi2 :MATH]
   [MATH: ≤2∑i=1
   nxi3<
   /mn>∑i=1
   nxi-∑i=1
   nxi2<
   /mn>2∑i=1
   nxi2 :MATH]
   Let
   [MATH: f(y)=∑i=1
   nxi
   (4-y)∑i=1
   nxiy<
   /mi> :MATH]
   Then
   [MATH:
   dfdy=-∑i=1
   nxi
   (4-y)lnx
   i∑i=1
   nxiy<
   /mi>+∑i=1
   nxi
   (4-y)∑i=1
   nxiy<
   /mi>lnxi
   :MATH]
   So, when
   [MATH: y=2 :MATH]
   ,
   [MATH:
   dfdy=0 :MATH]
   .
   We can get that
   [MATH: y=2 :MATH]
   is the stagnation point of
   [MATH: f(y) :MATH]
   .
   Then, we can get
   [MATH: ∑i=1
   nxi3<
   /mn>∑i=1
   nxi≤∑i=1
   nxi2<
   /mn>2 :MATH]
   So
   [MATH: dA(τ^m2)dτ^m2≤2∑i=1
   nxi3<
   /mn>∑i=1
   nxi-∑i=1
   nxi2<
   /mn>2∑i=1
   nxi2≤0 :MATH]
   8
   It can be obtained from ([49]2)
   [MATH: d(Em(τ^m2))dτ^m2=dQm-
   SMM,m<
   mrow>(τ^m2)A(τ^m2)d
   mtext>τ^m2=-A(τ^m2)dS<
   /mi>MM,m<
   mo stretchy="false">(τ^m2)dτ^m2+Qm-S
   mi>MM,m<
   /mfenced>dA(τ^m2)dτ^m2A(τ^m2)2 :MATH]
   9
   It can be obtained from ([50]3) and ([51]4)
   [MATH:
   Qm=S
   MM,m(0) :MATH]
   It is attainable via ([52]7)
   [MATH:
   Qm-S
   MM,m(τ^m2)≥0 :MATH]
   10
   It is available through ([53]6), ([54]7), ([55]8) and ([56]10)
   [MATH: d(Em(τ^m2))dτ^m2≥0 :MATH]
   So,
   [MATH: Em(τ^m2) :MATH]
   increases with the increase of
   [MATH: τ^m2 :MATH]
   .
   Theorem 2 Assume that
   [MATH:
   y1m,y2m,
   ⋯,ynm :MATH]
   are the effect sizes of n independent studies of methylation site m,
   and the between-study variance estimator
   [MATH: Em :MATH]
   is
   [MATH:
   Em=Qm-SMM
   ,mA
   :MATH]
   where
   [MATH: Qm=∑i=1
   n1σim2yim-∑i=1
   n1σim2yi∑i=1
   n1σim2<
   /mfrac>2 :MATH]
   [MATH:
   SMM,m=∑i=1
   n1σ
   im2+τ^m2yim-∑i=1
   n1σ
   im2+τ^m2yim∑i=1
   n1σ
   im2+τ^m22 :MATH]
   [MATH: A=∑i=1
   n1σ
   im2+τ^m2-∑i=1
   n1σim2+τ^m22<
   mo
   movablelimits="false">∑i=1
   n1σ
   im2+τ^m2 :MATH]
   Then,
   [MATH:
   Em≥0
   :MATH]
   .
   Proof Because
   [MATH: A(τ^m2)=∑i=1
   n1σ
   im2+τ^m2-∑i=1
   n1σim2+τ^m22<
   mo
   movablelimits="false">∑i=1
   n1σ
   im2+τ^m2>0 :MATH]
   [MATH:
   Qm-S
   MM,m(τ^m2)≥0 :MATH]
   we have
   [MATH: Em(τ^m2)=∑i=1
   n1σim2yim-∑i=1
   n1σim2yi∑i=1
   n1σim2<
   /mfrac>2-∑i=1
   n1σ
   im2+τ^m2yim-∑i=1
   n1σ
   im2+τ^m2yim∑i=1
   n1σ
   im2+τ^m22∑i=1
   n1σ
   im2+τ^m2-∑i=1
   n1σim2+τ^m22<
   mo
   movablelimits="false">∑i=1
   n1σ
   im2+τ^m2=
   Qm-SM<
   /mi>M,m(τ^m2)A(τ^m2)≥0
   :MATH]
Results
Simulation study
   We simulated five sets of meta-analysis data. Each set of data has
   [MATH: K(K=4,6,8,10,12) :MATH]
   studies. Each study contains 20,000 methylation sites and 30 sample
   data. The first 15 samples are control data, and the last 15 sample
   data are experimental data. The methylation sites of each sample in
   each study are divided into 100 categories
   [MATH: (c=1,2,⋯,100) :MATH]
   , with 200 methylation sites in each category. The top 9000 methylation
   sites of each study are divided into K groups
   [MATH: (km=1
   ,2,⋯,K) :MATH]
   . The first
   [MATH: 9000/K
   :MATH]
   methylation sites belong to the first group
   [MATH: (km=1
   ) :MATH]
   , and the
   [MATH: 9000/K+1 :MATH]
   to
   [MATH: 9000∗2/K :MATH]
   methylation sites belong to the second group group
   [MATH: (km=2
   ) :MATH]
   , and so on, the methylation site from
   [MATH: 9000(K-1)/K+1 :MATH]
   to 9000 methylation sites belong to the K-th group
   [MATH: (kg=K
   ) :MATH]
   , the 9001 to 20,000 methylation sites belong to the 0th group
   [MATH: (km=0
   ) :MATH]
   . The algorithm for simulating data is summarized as follows:
   First, we randomly sampled the methylation sites of the
   [MATH: c(1≤c≤100<
   /mn>) :MATH]
   th class of the
   [MATH: k(1≤k≤K) :MATH]
   th study
   [MATH: k(1≤k≤K) :MATH]
   , where
   [MATH: k(1≤k≤K) :MATH]
   ,
   [MATH:
   I200×200
   :MATH]
   is identity matrix,
   [MATH:
   J200×200
   :MATH]
   is a matrix with all elements
   [MATH: 1 :MATH]
   ,
   [MATH: W-1 :MATH]
   is the inverse of the Wishart distribution, and
   [MATH: ∑ck′ :MATH]
   is normalized to
   [MATH: ∑ck :MATH]
   , such that all its diagonal elements are
   [MATH: 1 :MATH]
   . Then, sample the methylation level of the
   [MATH: c(1≤c≤100<
   /mn>) :MATH]
   th class site of the
   [MATH: n(1≤n≤30
   mn>) :MATH]
   sample of the
   [MATH: k :MATH]
   th study
   [MATH: (Xmc1nk′,Xmc2nk′,⋯,Xmc200nk
   ′)T∼MVN(0,∑ck<
   mo
   stretchy="false">),1≤
   k≤K :MATH]
   .
   We performed differential methylation settings on the first 9000
   methylation sites, randomly sampling
   [MATH: δmk∈(0,1) :MATH]
   , so that
   [MATH: ∑k=1
   Kδmk=km(km=1
   ,2,⋯,K) :MATH]
   . When
   [MATH: δmk=1
   :MATH]
   , site
   [MATH: m :MATH]
   of the
   [MATH: k :MATH]
   th study is a significantly differentially methylated site. When
   [MATH: δmk=0
   :MATH]
   , site $m$ of the
   [MATH: k :MATH]
   th study is a non-significantly differential methylation site. This
   paper conducts random sampling
   [MATH: μmk∼U(0.5,3) :MATH]
   . The methylation level of the control group remains unchanged, and the
   methylation expression level of the experimental group is
   [MATH: Ymnk=Xm(n+N)k′+<
   msub>μmk∗
   δmk
   :MATH]
   , where
   [MATH:
   1≤m≤9000,1≤n≤15,1≤k≤K :MATH]
   .
   We compared the distribution of between study variance estimators of
   the DSLE2 random-effects model and six other meta-analysis models. As
   can be seen from Fig. [57]1, compared with the other six random-effects
   models, the between-study variance estimators of the DSLE2
   meta-analysis model is relatively concentrated. The between-study
   variance estimators of DSLE2, DL, EB, HO, RML, and SJ random-effects
   models are mostly below 0.5. Compared with the other 6 random-effects
   models, the DSLE2 random-effects model has a relatively larger number
   of between-study variances distributed around 0.
Fig. 1.
   [58]Fig. 1
   [59]Open in a new tab
   Violin plot of between-study variance for seven random-effects
   meta-analysis models
   We also compared the accuracy, false negative rate, negative prediction
   rate, recall rate, Matthews correlation coefficient, PCMiss
   (Prediction-conditioned miss) value, SAR value, and F value of the
   DSLE2 meta-analysis model with DL, EB, HO, RML, and SJ meta-analysis
   models under three hypothesis.
Accuracy
   We compared the accuracy of seven random-effects models under three
   hypothesis (see Fig. [60]2, Figure S1 and Figure S2). Under the first
   hypothesis, the DSLE2 meta-analysis model has the highest accuracy, the
   SJ random-effects model has the lowest accuracy, and the DSLR2, DL, EB,
   HO, and RML meta-analysis models have the similar accuracy. The
   accuracy of the DSLR2, DL, EB, HO, RML, and SJ meta-analysis models
   decreases as the number of studies increases. There is no obvious
   decreasing trend as the number increases for DSLE2 random-effects
   model. Under the second hypothesis, the DSLE2 meta-analysis model has
   the lowest accuracy, the SJ meta-analysis model has the highest
   accuracy.The accuracy of the DSLR2, DL, EB, HO, RML, and SJ
   meta-analysis models increases with the number of studies increases.
   Under the third hypothesis, the DSLE2 random-effects model has the
   lowest accuracy, and the DSLR2 meta-analysis model has the highest
   accuracy.
Fig. 2.
   [61]Fig. 2
   [62]Open in a new tab
   Plot of accuracy for 7 random-effects meta-analysis models under the
   first hypothesis
False negative rate
   The false negative rate, also known as the second type error rate,
   refers to the proportion of the number of significantly differential
   methylation sites predicted by the model as non-significantly
   differential methylation sites to the number of all significantly
   differential methylation sites. We compared the false negative rate of
   seven meta-analysis models under three hypothesis (see Fig. [63]3,
   Figure S3 and Figure S4).
Fig. 3.
   [64]Fig. 3
   [65]Open in a new tab
   Plot of false negative rate curves for seven random-effects
   meta-analysis models under the first hypothesis
   Under the first hypothesis, the false negative rate of the DSLE2 model
   is the lowest among the seven random-effects models; the false negative
   rate of the SJ model is the highest among the seven random-effects
   models. The false negative rates of the DSLR2, DL, EB, HO, and RML
   meta-analysis models are relatively close, and they are all lower than
   the false-negative rates of the SJ meta-analysis model, and they are
   all higher than the false negative rate of the DSLE2 meta-analysis
   model. The false negative rate of the DSLR2, DL, EB, HO, RML, and SJ
   meta-analysis models increases with the increase of the number of
   studies. Under the second hypothesis, the false negative rate of the
   DSLE2 meta-analysis model is the lowest among the seven random-effects
   models, and the false negative rates of the DL, EB, HO, and RML
   random-effects models are relatively close, and the false negative rate
   of the SJ model is the highest among the seven meta-analysis models.
   Under the third hypothesis, the false negative rate of the DSLE2
   meta-analysis model is the lowest among the seven meta-analysis models;
   the false negative rate of the SJ random-effects model is the highest
   among the seven random-effects models. The false negative rates of the
   DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close,
   and they are all lower than the false negative rates of the SJ
   random-effects model, and they are all higher than the false negative
   rate of the DSLE2 meta-analysis model. The false negative rate of the
   DSLR2, DL, EB, HO, RML, and SJ random-effects models increases with the
   increase of the number of studies.
Matthews correlation coefficient
   Matthews correlation coefficient is the correlation coefficient that
   describes the actual class and the predicted class. We compared the
   Matthews correlation coefficients of seven random-effects models under
   three hypothesis testing conditions (see Fig. [66]4, Figure S5 and
   Figure S6). Under the first hypothesis, when the number of studies is
   greater than 4, the Matthews correlation coefficient of the DSLE2
   meta-analysis model is the highest among the Matthews correlation
   coefficient of the seven random-effects models. The Matthews
   correlation coefficient of SJ meta-analysis model is the lowest among
   the Matthews correlation coefficients of the seven meta-analysis
   models. The Matthews correlation coefficients of DSLR2, DL, EB, HO, and
   RML random-effects models are close, which are lower than the Matthews
   correlation coefficient of the DSLE2 meta-analysis model and higher
   than the Matthews correlation coefficient of the SJ random-effects
   model. The Matthews correlation coefficient of DSLR2, DL, EB, HO, RML,
   SJ meta-analysis models decreases as the number of studies increases.
   Under the second hypothesis, the Matthews correlation coefficient of
   the DSLE2 random-effects model is the lowest among the Matthews
   correlation coefficients of the seven meta-analysis models; the
   Matthews correlation coefficient of the SJ random-effects model is the
   highest among the Matthews correlation coefficients of the seven
   random-effects models; the Matthews correlation coefficients of DSLR2,
   DL, EB, HO, and RML meta-analysis models are relatively close. The
   Matthews correlation coefficients of HO, RML, and SJ random-effects
   models increase with the increase of the number of studies. Under the
   third hypothesis, the Matthews correlation coefficient of the DSLE2
   random effects meta-analysis model is the lowest among the Matthews
   correlation coefficients of the seven meta-analysis models, and the
   Matthews correlation coefficient of the DSLR2 meta-analysis model is
   the highest among the Matthews correlation coefficients of the seven
   random-effects meta-analysis models.
Fig. 4.
   [67]Fig. 4
   [68]Open in a new tab
   Plot of Matthews correlation coefficient curves for 7 random-effects
   models under the first hypothesis
Negative predictive value
   Negative predictive value refers to the proportion of the number of
   non-significantly differentially methylated sites correctly predicted
   by meta-analysis model to the number of all non-significantly
   differentially methylated sites predicted by meta-analysis model. We
   compared the negative predictive values of seven meta-analysis models
   (see Fig. [69]5, Figure S7 and Figure S8). Under the first hypothesis,
   the negative predictive values of the DSLE2 meta-analysis model are the
   highest among seven random-effects models; the negative predictive
   values of the SJ meta-analysis model are the lowest among seven
   random-effects models. The negative predictive values of the DSLR2, DL,
   EB, HO, and RML random-effects models are relatively close, and they
   are all lower than the negative predictive value of the DSLE2
   random-effects model and higher than the negative predictive values of
   SJ random-effects model. Under the second hypothesis, the negative
   predictive values of the SJ meta-analysis model are the lowest among
   the seven random-effects models. The negative predictive values of the
   DL, EB, HO, and RML random-effects models are relatively close. Under
   the third hypothesis, the negative predictive values of the DSLE2
   meta-analysis model are the highest among the seven random effects
   models; the negative predictive values of the SJ random-effects model
   are the lowest among the seven meta-analysis models. The negative
   predictive rates of DSLR2, DL, EB, HO, and RML meta-analysis models are
   relatively close.
Fig. 5.
   [70]Fig. 5
   [71]Open in a new tab
   Plot of negative predictive value curves for 7 random-effects models
   under the first hypothesis
Prediction-conditioned miss
   PCMiss (Prediction-Conditioned Miss) refers to the proportion of the
   number of sites incorrectly predicted as non-significantly
   differentially methylated to the number of all non-significantly
   differentially methylated sites predicted by the meta-analysis model.
   We compared the PCMiss values of seven meta-analysis models under three
   hypothesis (see Fig. [72]6, Figure S9 and Figure S10). Under the first
   hypothesis, the PCMiss values of the DSLE2 random-effects model are the
   lowest among the PCMiss values of the seven meta-analysis models; the
   PCMiss values of the SJ random-effects model are the highest among the
   seven meta-analysis models. The PCMiss values of the DSLR2, DL, EB, HO,
   and RML random-effects meta-analysis models are relatively close; the
   PCMiss values of the seven models all increase with the increase of the
   number of studies. Under the second hypothesis, the PCMiss values of
   the SJ model are the highest among the PCMiss values of the seven
   models. The PCMiss values of the DL, EB, HO, and RML random-effects
   models are close. Under the third hypothesis testing, the PCMiss values
   of the DSLE2 model are the lowest among the PCMiss values of seven
   random effects meta-analysis models; the PCMiss values of the SJ
   meta-analysis model are the highest among the seven random-effects
   models. The PCMiss values of the DSLR2, DL, EB, HO, and RML
   meta-analysis models are relatively close.
Fig. 6.
   [73]Fig. 6
   [74]Open in a new tab
   Plot of PCMiss-value curves for 7 random-effects models under the first
   hypothesis
Recall
   This paper measured the recall rates of seven random-effects models
   under three hypothesis (see Fig. [75]7, Figure S11 and Figure S12).
   Under the first hypothesis, the recall rates of the DSLE2 meta-analysis
   model are the highest among seven random effects meta-analysis models;
   the recall rates of the SJ random-effects model are the lowest among
   seven meta-analysis models. The recall rates of the DL, EB, HO, and RML
   meta-analysis models are relatively close. The recall rates of the
   DSLR2, DL, EB, HO, RML, and SJ random-effects models decreases as the
   number of studies increases. Under the second hypothesis, the recall
   rates of the DSLE2 model are the highest among the seven meta-analysis
   models, followed by the recall rates of the DSLR2 model, and the recall
   rates of the SJ are the lowest among seven models. The recall rates of
   the DL, EB, HO, and RML random-effects models are relatively close.
   Under the third hypothesis, the recall rates of the DSLE2 meta-analysis
   model are the highest among the seven random-effects models, and the
   recall rate of the SJ model is the lowest among the seven meta-analysis
   models. The recall rates of DSLR2, DL, EB, HO, and RML random-effects
   models are relatively close.
Fig. 7.
   [76]Fig. 7
   [77]Open in a new tab
   Plot of recall curves for 7 random-effects meta-analysis models under
   the first hypothesis
SAR
   SAR combines accuracy, AUC (the area under the receiver operating
   characteristic curve) value, and root mean square error. The SAR value
   is more robust than a single indicator. We computed the SAR values of
   seven meta-analysis models under three hypothesis (see Fig. [78]8,
   Figure S13 and Figure S14). Under the first hypothesis, the SAR values
   of the DSLE2 random-effects model are the highest among the SAR values
   of seven meta-analysis models, and the SAR values of SJ model are the
   lowest among the seven random-effects models. The SAR values of DL, EB,
   HO, and RML meta-analysis models are relatively close. Under the second
   hypothesis, the SAR values of DSLR2 random-effects model are the
   highest among the SAR values of seven meta-analysis models, and the SAR
   values of the DSLE2 model are lowest among the seven random-effects
   models. The SAR values of the DL, EB, HO, RML, and SJ random-effects
   meta-analysis models are relatively close. The SAR values of the seven
   meta-analysis models all increase with the increase of the number of
   studies. Under the third hypothesis, the SAR values of the DSLR2
   meta-analysis model are the highest among the SAR values of seven
   random-effect models. Moreover, the SAR values of the DL, EB, HO, and
   RML meta-analysis models are relatively close.
Fig. 8.
   [79]Fig. 8
   [80]Open in a new tab
   Plot of SAR-value curves for 7 random-effects meta-analysis models
   under the first hypothesis
Precision-recall F measure
   Precision and recall indicators sometimes conflict with each other. In
   this case, precision and recall indicators need to be considered
   comprehensively. The most common method is to calculate the F value of
   precision and recall. We computed the
   [MATH: F :MATH]
   values of 7 random-effects models under three hypothesis (see
   Fig. [81]9, Figure S15 and Figure S16). Under the first hypothesis, the
   F values of the DSLE2 random-effects model are the highest among the F
   values of seven random-effects models, and the F values of the SJ
   meta-analysis model are the lowest among the seven meta-analysis
   models. The F values of DSLR2, DL, EB, HO, and RML meta-analysis models
   are relatively close. Under the second hypothesis, the F values of the
   SJ meta-analysis model are the highest among the F values of seven
   random-effects models, and the F values of the DSLE2 random-effects
   model are the lowest among the seven meta-analysis models. The F values
   of DSLR2, DL, EB, HO, and RML random-effects models are relatively
   close. Under the third hypothesis, the F values of the DSLE2
   random-effects model is the lowest among the F values of seven
   meta-analysis models.
Fig. 9.
   [82]Fig. 9
   [83]Open in a new tab
   Plot of F-value curves for 7 random-effects meta-analysis models under
   the first hypothesis
Application of DSLE2 random-effects meta-analysis model to lung cancer
methylation data
   Lung cancer is the second most common cancer worldwide and the most
   common cancer among men. According to the World Cancer Statistics
   Center, more than 2.2 million cases of lung cancer occur every year
   [[84]17]. We collected three sets of lung cancer methylation data in
   the GEO database: [85]GSE63704, [86]GSE83842, and [87]GSE85845. The
   EPIC and 450 K methylation data of lung cancer in TCGA database were
   also collected. A total of 927 samples and 156,680 methylation sites
   were analyzed. The distribution of the number of significantly
   differential methylation sites determined by 7 random-effects models is
   shown in the Fig. [88]10. The number of significantly differential
   methylation sites determined by the DSLE2 random-effects model is more
   than the number of significantly differential methylation sites
   determined by DSLR2 and is less than the number of significantly
   differential methylation sites determined by DL, EB, HO, RML, and SJ
   meta-analysis models.
Fig. 10.
   [89]Fig. 10
   [90]Open in a new tab
   Histogram of the number distribution of significantly differential
   methylation sites identified by 7 random-effects meta-analysis models
   The DSLE2 random-effects model identified 59,146 significantly
   differential methylation sites, of which 2,230 significantly
   differential methylation sites distributed in 1stExon of 3,039 genes,
   and 2,003 significantly differential methylation sites were distributed
   in 3’UTR of 1,660 genes, and 4899 significantly differentially
   methylated sites distributed in the 5’UTR of 2754 genes, and 21,029
   significantly differentially methylated sites distributed in the Body
   of 7056 genes, and 7500 significantly differentially methylated sites
   distributed in the TSS1500 of 5113 genes, and 6210 significantly
   differentially methylated sites distributed in TSS200 of 3963 genes.
   Research shows that lung cancer is closely related to A2BP1 [[91]8],
   AACS [[92]18], DNAH10 [[93]19], PINK1 [[94]20], and other genes with
   significantly differential methylation sites. The significantly
   differentially methylated sites identified by the DSLE2 meta-analysis
   model may affect the expression of the corresponding genes.
Application of DSLE2 meta-analysis model to Parkinson’s methylation data
   Parkinson’s disease (PD) is a chronic neurodegenerative disease
   [[95]21]. According to the World Health Organization (WHO), there were
   2.5 million and 6.1 million cases of PD worldwide in 1990 and 2016
   [[96]22]. However, the number of PD patients increased significantly to
   8.5 million, and it is estimated that by 2040, the number of PD
   patients worldwide may exceed 17 million [[97]23]. In 2019, PD caused
   5.8 million disabilities, an 81% increase since 2000, and 329,000
   deaths, an increase of more than 100 percent since 2000 [[98]23]. PD
   causes serious trouble to people’s life, but there is no clear cause
   and effective treatment. DSLE2 random-effects model was used to further
   study the causes of PD.
   We collected five sets of PD methylation data in the GEO database:
   [99]GSE72774, [100]GSE72776, [101]GSE111629, [102]GSE145361 and
   [103]GSE165081. A total of 3,080 samples and 161,261 methylation sites
   were analyzed. The DSLE2 model identified 26,244 significantly
   differentially methylation sites (Fig. [104]11). Most of significantly
   differentially methylation sites identified by DSLE2, SJ, HO, DSL,
   DSLR2 meta-analysis methods were same. DSLE2 model independently
   identified 2181 significantly differentially methylation sites.
Fig. 11.
   Fig. 11
   [105]Open in a new tab
   Venn of significantly differential methylation sites identified by 5
   random-effects meta-analysis models
   We further analyzed the location of significantly differential
   methylation sites on genes (Figs. [106]12 and [107]13). Most of the
   significant differential methylation sites were located on the gene
   body, followed by TSS1500 and 5’UTR (Fig. [108]12). Relatively few
   Significantly differential methylation sites were located at 3’UTR and
   the first exon (Fig. [109]12). Most of the significantly differential
   methylation sites are uniquely located on a single gene and only a
   small number of genes contain multiple significantly differential
   methylation sites (Fig. [110]13). There are 51, 62 and 122
   significantly differentially methylation sites for BCOR, HDAC4 and
   PRDM16, respectively. Research shows that PD is closely related to BCOR
   [[111]24], HDAC4 [[112]25], PRDM16 [[113]26], and other genes with
   significantly differential methylation sites.
Fig. 12.
   [114]Fig. 12
   [115]Open in a new tab
   Histogram of significantly differential methylation sites on genes. The
   horizontal axis refers features, and the vertical axis is the number of
   significantly differential methylation sites
Fig. 13.
   [116]Fig. 13
   [117]Open in a new tab
   Histogram of significantly differential methylation sites on genes. The
   horizontal axis is the number of significantly differential methylation
   sites, and the vertical axis is the number of genes
Discussion and conclusion
   For high-throughput methylation sequencing data, we proposed the DSLE2
   random effects meta-analysis model based on the between-study variance
   estimator
   [MATH: Em :MATH]
   . Under the alternative hypothesis that the effect sizes of all studies
   are not 0, the DSLE2 meta-analysis model performs better than the other
   6 random-effects models in terms of accuracy, negative prediction rate,
   recall rate, Matthews correlation coefficient, PCMiss value, and SAR
   value, F value.
   We applied the DSLE2 meta-analysis model to the lung cancer methylation
   datasets. 59,146 significantly differential methylation sites were
   identified, with 2,230, 2,230 2003, 4899, 21,029, 7500, and 6210
   significantly differentially methylated sites were located on 1stExon,
   3'UTR, 5'UTR, Body, TSS1500, and TSS200 of 3039, 1660, 2754, 7056,
   5113, and 3963 genes, respectively. Studies have shown that lung cancer
   is closely related to genes such as A2BP1, AACS, and DNAH10 with
   significantly differential methylation sites.
   We further applied the DSLE2 meta-analysis model to the PD methylation
   datasets. 26,244 significantly differentially methylation sites, with
   952, 776, 2466, 7604, 3897 and 2052 significantly differentially
   methylated sites were located on 1stExon, 3'UTR, 5'UTR, Body, TSS1500,
   and TSS200 of 703, 700, 1240, 3059, 2495 and 1267 genes, respectively.
   Research shows that PD is closely related to BCOR, HDAC4, PRDM16, and
   other genes with significantly differential methylation sites.
   This work presents a new between-study variance estimator for
   meta-analysis model. First, the primary purpose of the between-study
   variance estimator is to measure the extent of variability in effect
   sizes across different studies that is attributable to true differences
   rather than within-study sampling error. Accurately estimating
   between-study variance allows us to distinguish between variability due
   to real differences in study effects and variability due to random
   error, providing a clearer understanding of the consistency of the
   effect across studies. Second, meta-analyses synthesize data from
   multiple studies to derive conclusions with greater statistical power.
   By accurately accounting for between-study variance, the proposed model
   enhances the validity of meta-analytic results, leading to more
   trustworthy conclusions. And it can improve the quality of evidence
   synthesis, promoting better decision-making in clinical practice,
   policy, and further research. Third, understanding heterogeneity is key
   to identifying which treatments work best for specific populations in
   medical research. A better estimation of between-study variance can
   help in understanding consistent treatment effects, aiding in the
   development of personalized therapeutic strategies. Moreover,
   efficiently allocating research resources requires an understanding of
   where variability lies. By identifying true sources of heterogeneity,
   researchers and funding bodies can focus efforts on areas with the most
   significant impact, optimizing the use of limited resources.
   In addition, after significant differential methylation sites
   identified by DSLE2 meta-analysis model, several software tools and
   platforms can be used for downstream analysis. These tools help in
   various aspects such as functional annotation, pathway enrichment
   analysis and so on. We can use Chip Analysis Methylation Pipeline
   (ChAMP) to annotate significantly differential methylated regions
   (DMRs), perform gene ontology (GO) analysis, and integrating with other
   epigenetic data. GREAT (Genomic Regions Enrichment of Annotations Tool)
   can be used to annotate and analyze the functional significance of sets
   of genomic regions, including DMRs, by associating them with nearby
   genes and functional terms. We can also use GSEA (Gene Set Enrichment
   Analysis) to do pathway enrichment analysis based on significant
   differential methylation sites identified by DSLE2.
Supplementary Information
   [118]12864_2025_11316_MOESM1_ESM.zip^ (809.9KB, zip)
   Supplementary Material 1: Figure S1 to S16 are additional files which
   are plots of accuracy, false negative rates and other measures under
   the second and third hypothesis.
Acknowledgements