Abstract
Background
With the rapid development of high-throughput sequencing technology,
high-throughput sequencing data has grown on a massive scale, leading
to the emergence of multiple public databases, such as EBI and GEO.
Conducting secondary mining of high-throughput sequencing data in these
databases can yield more valuable insights. Meta-analysis can
quantitatively combine high-throughput sequencing data from the the
same topic. It increases the sample size for data analysis, enhances
statistical power, and results in more consistent and reliable
conclusions.
Results
This study proposes a new between-study variance estimator
[MATH: Em :MATH]
. We prove that
[MATH: Em :MATH]
is non-negative and
[MATH: Emτ^m2 :MATH]
increases with the increase of
[MATH: τ^m2 :MATH]
, satisfying the general conditions of the between-study variance
estimator. We get the DSLE2 (two-step estimation starting with the DSL
estimate and the
[MATH: Em :MATH]
in the second step) random-effects meta-analysis model based on the
between-study variance estimator Em. The accuracy and a series of
evaluation metrics of the DSLE2 model are better than those of the
other 6 meta-analysis models. DSLE2 model is applied to lung cancer and
Parkinson’s methylation data. Significantly differentially methylated
sites identified by DSLE2 model and the genes with significantly
differentially methylated sites are closely related to two diseases,
indicating the effectiveness of DSLE2 random-effects model.
Conclusions
This paper propose the DSLE2 random-effects meta-analysis model based
on new between-study variance estimator Em. The DSLE2 model performs
well for methylation data.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12864-025-11316-3.
Keywords: Methylation sequencing data, Meta-analysis, Between-study
variance, Random-effects model
Background
DNA methylation refers to the addition of a methyl group at the CpG
site of the DNA chain [[30]1]. DNA methylation occurs at millions of
CpG dinucleotide positions in the genome and changes with age and the
external environment [[31]2]. It is an important epigenetic
modification [[32]3]. In the correlation analysis between methylation
and traits, HM450K (Infinium Human Methylation 450 K BeadChip) or other
Illumina technologies use the methylated fluorescence signal intensity
(
[MATH: M :MATH]
) and the unmethylated fluorescence signal intensity (
[MATH: U :MATH]
) of the CpG site to calculate DNA methylation levels [[33]4]. Two
commonly used calculation methods for methylation levels are
[MATH: β=M/(M+U) :MATH]
and
[MATH:
M=log2(M/U) :MATH]
[[34]5]. In the minfi R package, the getBeta function and getM function
can be used directly to calculate the
[MATH: β :MATH]
value and
[MATH: M :MATH]
value of every probe [[35]6].
With the development of high-throughput sequencing technology, multiple
public databases have been formed, such as the GEO (Gene Expression
Omnibus) database and TCGA (The Cancer Genome Atlas database) database
[[36]7]. People can obtain methylation sequencing data of the same
disease from different databases [[37]8]. Due to differences in
experimental conditions and sample processing procedures, the results
of methylation-trait association analysis of the same disease in
different studies are different [[38]9]. Meta-analysis can combine
methylation data on the same topic to provide a more reliable list of
significantly different methylation sites [[39]10]. In meta-analysis,
datasets from independent studies with the same purpose often have
heterogeneity [[40]11]. In the random-effects model, the comprehensive
effect-size result depends on the quantification of heterogeneity
between studies [[41]12]. Therefore, the estimation of heterogeneity is
the most critical step in the random-effects meta-analysis model
[[42]13, [43]14]. In addition to quantifying heterogeneity,
investigating differences in experimental designs, participants, and
interventions across studies can help to understand the causes of
heterogeneity [[44]15, [45]16]. This paper aims to estimate the size of
the between-study variance in a random-effects meta-analysis model.
We propose the DSLE2 meta-analysis model based on new between-study
variance estimator
[MATH: Em :MATH]
for methylation high-throughput sequencing data. Under three hypothesis
testing conditions, this model is compared with DSLR2 (two-step
estimation starting with the DSL estimate and the
[MATH: R2 :MATH]
in the second step), DL (DerSimonian and Laird estimate), EB (Empirical
Bayes estimate), HO (Hedges and Olkin estimate), RML (restricted
maximum likelihood estimate), and SJ (Sidik and Jonkman estimate)
random-effects meta-analysis models using sensitivity, and other
evaluation metrics. The results show that the DSLE2 methylation
meta-analysis model performs well under the first hypothesis testing
condition. We apply the DSLE2 random-effects model to lung cancer and
Parkinson’s methylation data, further demonstrating the reliability of
the DSLE2 meta-analysis model.
The proposed between-study variance estimator is designed to accurately
quantify the true variability in effect sizes across studies,
distinguishing it from random error. This advancement enhances the
precision and reliability of meta-analysis models, offering significant
benefits for research methodology, personalized medicine, study design,
and resource allocation. By implementing and validating this new
estimator, we can achieve a clearer understanding of effect consistency
across studies, ultimately leading to more robust and generalizable
conclusions. The primary purpose of the proposed between-study variance
estimator is to improve the modeling of heterogeneity in methylation
high-throughput sequencing data. FEM meta-analysis models often
struggle to accurately capture the complex variability inherent in such
data due to differences in study design, populations, and technical
procedures. Accurate effect size estimation is critical for identifying
biologically significant methylation patterns and their potential
clinical implications. To refine the estimation of true effect sizes by
more accurately accounting for between-study variance is very
important. Reliable identification of consistent methylation markers
across studies is also crucial for developing personalized medical
interventions. The improved model supports the discovery of biomarkers
that are consistently reproducible, aiding in disease diagnosis,
prognosis, and treatment customization.
Methods
The random-effects model is a generalization of the fixed-effects
model. The fixed effects model assumes that all studies included in the
same meta-analysis have the same true effect size, while the
random-effects model assumes that the true effect sizes of different
studies included in the same meta-analysis obey a normal distribution.
The between-study variance is a statistic used by the random-effects
model to measure the heterogeneity between studies from the same topic.
If the between study variance is 0, the random-effects model
degenerates into the fixed-effects model. The between-study variance
estimator of DSL (DerSimonian and Laird estimate) random-effects model
is simple and the most commonly used method for estimating the
between-study heterogeneity. In addition to being easy to calculate,
the between-study variance estimator is also suitable for effect sizes
of different dimensions. However, in practice, due to the possible
occurrence of negative values, DSL between-study variance estimators
are often truncated. In this paper, we present a non-truncated
estimator of between-study variance
[MATH: Em :MATH]
.
DSLE2 random-effects meta-analysis model
We propose a general heterogeneity variance estimator for methylation
sequencing data that is applicable to effect sizes at any scale and is
non-negative. Assume that
[MATH:
y1m,y2m,
⋯,ykm :MATH]
are the effect sizes of k independent studies of methylation site m;
[MATH: Qm :MATH]
is a heterogeneity statistic for methylation site m;
[MATH: ωim :MATH]
is the weight of methylation site m for ith study in the fixed-effects
meta-analysis model;
[MATH: ωim∗
:MATH]
is the weight of methylation site m for ith study in the random-effects
meta-analysis model;
[MATH: σim2 :MATH]
is the within-study variability representing sampling errors of ith
study;
[MATH: τ^m,D
SL2 :MATH]
is the between-study variance estimator in DSL random-effects model;
[MATH: μ^m,D
SL :MATH]
is the mean of effects in DSL random-effects model. And the
between-study variance estimator
[MATH: Em :MATH]
is
[MATH:
Em=Qm-SMM
,m∑i=1
kωim∗
-∑i=1
k(ωim∗
)2/∑i=1
kωim∗
:MATH]
1
where
[MATH: Qm=∑i=1
kωimyim-μ^m2<
/mn> :MATH]
[MATH: μ^m=
∑i=1
kωimyim∑i=1
kωim :MATH]
[MATH: ωim=1σim2,i=1,2,⋯,k :MATH]
[MATH:
SMM,m=∑i=1
kωim∗
yim-μ^m,DSL2 :MATH]
[MATH: μ^m,DSL=∑i=2
kωim∗
yim∑i=1
kωim∗
:MATH]
[MATH: ωim∗
=1σim2+τ^m,D
SL2,i=1,2,⋯,k :MATH]
The algorithm of DLSE2 meta-analysis model is as follows:
First, we use the fixed-effects model to calculate the weight and
comprehensive effect value of each study
[MATH: ωim=1σim2,i=1,2,⋯,k :MATH]
and
[MATH: μ^m=
∑i=1
kωimyim∑i=1
kωim :MATH]
Then, we calculate the heterogeneity statistic
[MATH: Qm :MATH]
[MATH: Qm=∑i=1
kωim(yim-μ^m)2 :MATH]
We use the DSL random-effects model to calculate the between-study
variance estimator, the weight of each study, and the corresponding
comprehensive effect size
[MATH: τ^m,D
SL2=max0,Qm
mi>-(k-1)S1<
/mn>m-S2<
mi>m/S1m :MATH]
[MATH: Srm=∑i=1
kωimr,r<
/mi>=1,2 :MATH]
[MATH: ωim∗
=1σim2+τ^m,D
SL2,i=1,2,⋯,k :MATH]
[MATH: μ^m,DSL=∑i=1
kωim∗
yim∑i=1
kωim∗
:MATH]
We further calculate
[MATH:
SMM,m
msub> :MATH]
[MATH:
SMM,m=∑i=1
kωim∗
yim-μ^m,DSL2 :MATH]
then
[MATH:
Em=Qm-SMM
,m∑i=1
kωim∗
-∑i=1
kωim∗2
mn>/∑i=1
kωim∗
:MATH]
The weight of each study based on the between-study variance
[MATH: Em :MATH]
is
[MATH:
ωim,DSLE2<
/mtext>=1σim2+
Em,i=1,2,⋯,k
:MATH]
The comprehensive effect is
[MATH: μ^m,DSLE2
mtext>=∑i=1
kωim,<
/mo>DSLE2yim∑i=1
kωim,<
/mo>DSLE2 :MATH]
The random-effect for each study is
[MATH: ξ^ig=E<
/mi>mσim2+
Emyim-μ^m,DSLE2
mtext>(τ^m,D
SLE22),i
=1,2,⋯,k
:MATH]
The variance of comprehensive effect is
[MATH: Var(μ^m,DSLE2
mtext>)=1∑i=1
kωim,<
/mo>DSLE2 :MATH]
The
[MATH: z :MATH]
statistic of comprehensive effect is
[MATH: zm(μ^m,DSLE2
mtext>)=μ^m,DSLE2
mtext>Var(μ^m,DSLE2
mtext>)∼N(0,1) :MATH]
The lower and upper bounds of the
[MATH: 100(1-α)% :MATH]
confidence interval are
[MATH: LLμ^m,DSLE2
mtext>=μ^m,DSLE2
mtext>-z(1-α)/2(m)×Var<
/mtext>μ^m,DSLE2
mtext> :MATH]
and
[MATH: ULμ^m,DSLE2
mtext>=μ^m,DSLE2
mtext>+z(1-α)/2(m)×Var<
/mtext>μ^m,DSLE2
mtext> :MATH]
the
[MATH: p :MATH]
-value of one-sided test is
[MATH: p=1-Φzmμ^m,DSLE2
mtext> :MATH]
the
[MATH: p :MATH]
-value of two-sided test is
[MATH: p=21-Φ|zmμ^m,DSLE2
mtext>|
:MATH]
where
[MATH: Φ(·)
:MATH]
is the cumulative function of the standard normal distribution.
Theorem 1 Assume that
[MATH:
y1m,y2m,
⋯,ynm :MATH]
are the effect sizes of n independent studies of methylation site
[MATH: m :MATH]
, and the between-study variance estimator
[MATH: Em :MATH]
is
[MATH:
Em=Qm-SMM
,mA
:MATH]
2
where
[MATH: Qm=∑i=1
n1σim2yim-∑i=1
n1σim2yi∑i=1
n1σim2<
/mfrac>2 :MATH]
3
[MATH:
SMM,m=∑i=1
n1σ
im2+τ^m2yim-∑i=1
n1σ
im2+τ^m2yim∑i=1
n1σ
im2+τ^m22 :MATH]
4
[MATH: A=∑i=1
n1σ
im2+τ^m2-∑i=1
n1σim2+τ^m22<
mo
movablelimits="false">∑i=1
n1σ
im2+τ^m2 :MATH]
5
Then,
[MATH: Em(τ^m2) :MATH]
is monotone and non-decreasing with respect to
[MATH: τ^m2 :MATH]
.
Proof It can be obtained from ([46]3)
[MATH:
dQmdτ^m2=0 :MATH]
6
It can be obtained from ([47]4)
[MATH:
dSMM
,m(τ^m2)dτ^m2=-∑i=1
n1σim2+τ^m2yim-∑i=1
n1σ
im2+τ^m2yim∑i=1
n1σ
im2+τ^m2
2≤0 :MATH]
7
It can be obtained from ([48]5)
[MATH: dA(τ^m2)dτm2^=-∑i=1
n1(σim2+τ^m2)2+2∑i=1
n1(σim2+τ^m2)3∑i=1
n1σ
im2+τ^m2-∑i=1
n1(σim2+τ^m2)22∑i=1
n1σ
im2+τ^m22
:MATH]
let
[MATH:
xi=1σim2+τ^m2 :MATH]
Then
[MATH:
xi>0
:MATH]
[MATH: dA(τ^m2)dτ^m2=-∑i=1
nxi2<
/mn>+2∑i=1
nxi3<
/mn>∑i=1
nxi-(∑i=1
nxi2<
/mn>)2∑i=1
nxi2 :MATH]
[MATH: =-∑i=1
nxi2<
/mn>∑i=1
nxi2+2∑i=1
nxi3<
/mn>∑i=1
nxi-∑i=1
nxi2<
/mn>2∑i=1
nxi2 :MATH]
[MATH: ≤2∑i=1
nxi3<
/mn>∑i=1
nxi-∑i=1
nxi2<
/mn>2∑i=1
nxi2 :MATH]
Let
[MATH: f(y)=∑i=1
nxi
(4-y)∑i=1
nxiy<
/mi> :MATH]
Then
[MATH:
dfdy=-∑i=1
nxi
(4-y)lnx
i∑i=1
nxiy<
/mi>+∑i=1
nxi
(4-y)∑i=1
nxiy<
/mi>lnxi
:MATH]
So, when
[MATH: y=2 :MATH]
,
[MATH:
dfdy=0 :MATH]
.
We can get that
[MATH: y=2 :MATH]
is the stagnation point of
[MATH: f(y) :MATH]
.
Then, we can get
[MATH: ∑i=1
nxi3<
/mn>∑i=1
nxi≤∑i=1
nxi2<
/mn>2 :MATH]
So
[MATH: dA(τ^m2)dτ^m2≤2∑i=1
nxi3<
/mn>∑i=1
nxi-∑i=1
nxi2<
/mn>2∑i=1
nxi2≤0 :MATH]
8
It can be obtained from ([49]2)
[MATH: d(Em(τ^m2))dτ^m2=dQm-
SMM,m<
mrow>(τ^m2)A(τ^m2)d
mtext>τ^m2=-A(τ^m2)dS<
/mi>MM,m<
mo stretchy="false">(τ^m2)dτ^m2+Qm-S
mi>MM,m<
/mfenced>dA(τ^m2)dτ^m2A(τ^m2)2 :MATH]
9
It can be obtained from ([50]3) and ([51]4)
[MATH:
Qm=S
MM,m(0) :MATH]
It is attainable via ([52]7)
[MATH:
Qm-S
MM,m(τ^m2)≥0 :MATH]
10
It is available through ([53]6), ([54]7), ([55]8) and ([56]10)
[MATH: d(Em(τ^m2))dτ^m2≥0 :MATH]
So,
[MATH: Em(τ^m2) :MATH]
increases with the increase of
[MATH: τ^m2 :MATH]
.
Theorem 2 Assume that
[MATH:
y1m,y2m,
⋯,ynm :MATH]
are the effect sizes of n independent studies of methylation site m,
and the between-study variance estimator
[MATH: Em :MATH]
is
[MATH:
Em=Qm-SMM
,mA
:MATH]
where
[MATH: Qm=∑i=1
n1σim2yim-∑i=1
n1σim2yi∑i=1
n1σim2<
/mfrac>2 :MATH]
[MATH:
SMM,m=∑i=1
n1σ
im2+τ^m2yim-∑i=1
n1σ
im2+τ^m2yim∑i=1
n1σ
im2+τ^m22 :MATH]
[MATH: A=∑i=1
n1σ
im2+τ^m2-∑i=1
n1σim2+τ^m22<
mo
movablelimits="false">∑i=1
n1σ
im2+τ^m2 :MATH]
Then,
[MATH:
Em≥0
:MATH]
.
Proof Because
[MATH: A(τ^m2)=∑i=1
n1σ
im2+τ^m2-∑i=1
n1σim2+τ^m22<
mo
movablelimits="false">∑i=1
n1σ
im2+τ^m2>0 :MATH]
[MATH:
Qm-S
MM,m(τ^m2)≥0 :MATH]
we have
[MATH: Em(τ^m2)=∑i=1
n1σim2yim-∑i=1
n1σim2yi∑i=1
n1σim2<
/mfrac>2-∑i=1
n1σ
im2+τ^m2yim-∑i=1
n1σ
im2+τ^m2yim∑i=1
n1σ
im2+τ^m22∑i=1
n1σ
im2+τ^m2-∑i=1
n1σim2+τ^m22<
mo
movablelimits="false">∑i=1
n1σ
im2+τ^m2=
Qm-SM<
/mi>M,m(τ^m2)A(τ^m2)≥0
:MATH]
Results
Simulation study
We simulated five sets of meta-analysis data. Each set of data has
[MATH: K(K=4,6,8,10,12) :MATH]
studies. Each study contains 20,000 methylation sites and 30 sample
data. The first 15 samples are control data, and the last 15 sample
data are experimental data. The methylation sites of each sample in
each study are divided into 100 categories
[MATH: (c=1,2,⋯,100) :MATH]
, with 200 methylation sites in each category. The top 9000 methylation
sites of each study are divided into K groups
[MATH: (km=1
,2,⋯,K) :MATH]
. The first
[MATH: 9000/K
:MATH]
methylation sites belong to the first group
[MATH: (km=1
) :MATH]
, and the
[MATH: 9000/K+1 :MATH]
to
[MATH: 9000∗2/K :MATH]
methylation sites belong to the second group group
[MATH: (km=2
) :MATH]
, and so on, the methylation site from
[MATH: 9000(K-1)/K+1 :MATH]
to 9000 methylation sites belong to the K-th group
[MATH: (kg=K
) :MATH]
, the 9001 to 20,000 methylation sites belong to the 0th group
[MATH: (km=0
) :MATH]
. The algorithm for simulating data is summarized as follows:
First, we randomly sampled the methylation sites of the
[MATH: c(1≤c≤100<
/mn>) :MATH]
th class of the
[MATH: k(1≤k≤K) :MATH]
th study
[MATH: k(1≤k≤K) :MATH]
, where
[MATH: k(1≤k≤K) :MATH]
,
[MATH:
I200×200
:MATH]
is identity matrix,
[MATH:
J200×200
:MATH]
is a matrix with all elements
[MATH: 1 :MATH]
,
[MATH: W-1 :MATH]
is the inverse of the Wishart distribution, and
[MATH: ∑ck′ :MATH]
is normalized to
[MATH: ∑ck :MATH]
, such that all its diagonal elements are
[MATH: 1 :MATH]
. Then, sample the methylation level of the
[MATH: c(1≤c≤100<
/mn>) :MATH]
th class site of the
[MATH: n(1≤n≤30
mn>) :MATH]
sample of the
[MATH: k :MATH]
th study
[MATH: (Xmc1nk′,Xmc2nk′,⋯,Xmc200nk
′)T∼MVN(0,∑ck<
mo
stretchy="false">),1≤
k≤K :MATH]
.
We performed differential methylation settings on the first 9000
methylation sites, randomly sampling
[MATH: δmk∈(0,1) :MATH]
, so that
[MATH: ∑k=1
Kδmk=km(km=1
,2,⋯,K) :MATH]
. When
[MATH: δmk=1
:MATH]
, site
[MATH: m :MATH]
of the
[MATH: k :MATH]
th study is a significantly differentially methylated site. When
[MATH: δmk=0
:MATH]
, site $m$ of the
[MATH: k :MATH]
th study is a non-significantly differential methylation site. This
paper conducts random sampling
[MATH: μmk∼U(0.5,3) :MATH]
. The methylation level of the control group remains unchanged, and the
methylation expression level of the experimental group is
[MATH: Ymnk=Xm(n+N)k′+<
msub>μmk∗
δmk
:MATH]
, where
[MATH:
1≤m≤9000,1≤n≤15,1≤k≤K :MATH]
.
We compared the distribution of between study variance estimators of
the DSLE2 random-effects model and six other meta-analysis models. As
can be seen from Fig. [57]1, compared with the other six random-effects
models, the between-study variance estimators of the DSLE2
meta-analysis model is relatively concentrated. The between-study
variance estimators of DSLE2, DL, EB, HO, RML, and SJ random-effects
models are mostly below 0.5. Compared with the other 6 random-effects
models, the DSLE2 random-effects model has a relatively larger number
of between-study variances distributed around 0.
Fig. 1.
[58]Fig. 1
[59]Open in a new tab
Violin plot of between-study variance for seven random-effects
meta-analysis models
We also compared the accuracy, false negative rate, negative prediction
rate, recall rate, Matthews correlation coefficient, PCMiss
(Prediction-conditioned miss) value, SAR value, and F value of the
DSLE2 meta-analysis model with DL, EB, HO, RML, and SJ meta-analysis
models under three hypothesis.
Accuracy
We compared the accuracy of seven random-effects models under three
hypothesis (see Fig. [60]2, Figure S1 and Figure S2). Under the first
hypothesis, the DSLE2 meta-analysis model has the highest accuracy, the
SJ random-effects model has the lowest accuracy, and the DSLR2, DL, EB,
HO, and RML meta-analysis models have the similar accuracy. The
accuracy of the DSLR2, DL, EB, HO, RML, and SJ meta-analysis models
decreases as the number of studies increases. There is no obvious
decreasing trend as the number increases for DSLE2 random-effects
model. Under the second hypothesis, the DSLE2 meta-analysis model has
the lowest accuracy, the SJ meta-analysis model has the highest
accuracy.The accuracy of the DSLR2, DL, EB, HO, RML, and SJ
meta-analysis models increases with the number of studies increases.
Under the third hypothesis, the DSLE2 random-effects model has the
lowest accuracy, and the DSLR2 meta-analysis model has the highest
accuracy.
Fig. 2.
[61]Fig. 2
[62]Open in a new tab
Plot of accuracy for 7 random-effects meta-analysis models under the
first hypothesis
False negative rate
The false negative rate, also known as the second type error rate,
refers to the proportion of the number of significantly differential
methylation sites predicted by the model as non-significantly
differential methylation sites to the number of all significantly
differential methylation sites. We compared the false negative rate of
seven meta-analysis models under three hypothesis (see Fig. [63]3,
Figure S3 and Figure S4).
Fig. 3.
[64]Fig. 3
[65]Open in a new tab
Plot of false negative rate curves for seven random-effects
meta-analysis models under the first hypothesis
Under the first hypothesis, the false negative rate of the DSLE2 model
is the lowest among the seven random-effects models; the false negative
rate of the SJ model is the highest among the seven random-effects
models. The false negative rates of the DSLR2, DL, EB, HO, and RML
meta-analysis models are relatively close, and they are all lower than
the false-negative rates of the SJ meta-analysis model, and they are
all higher than the false negative rate of the DSLE2 meta-analysis
model. The false negative rate of the DSLR2, DL, EB, HO, RML, and SJ
meta-analysis models increases with the increase of the number of
studies. Under the second hypothesis, the false negative rate of the
DSLE2 meta-analysis model is the lowest among the seven random-effects
models, and the false negative rates of the DL, EB, HO, and RML
random-effects models are relatively close, and the false negative rate
of the SJ model is the highest among the seven meta-analysis models.
Under the third hypothesis, the false negative rate of the DSLE2
meta-analysis model is the lowest among the seven meta-analysis models;
the false negative rate of the SJ random-effects model is the highest
among the seven random-effects models. The false negative rates of the
DSLR2, DL, EB, HO, and RML meta-analysis models are relatively close,
and they are all lower than the false negative rates of the SJ
random-effects model, and they are all higher than the false negative
rate of the DSLE2 meta-analysis model. The false negative rate of the
DSLR2, DL, EB, HO, RML, and SJ random-effects models increases with the
increase of the number of studies.
Matthews correlation coefficient
Matthews correlation coefficient is the correlation coefficient that
describes the actual class and the predicted class. We compared the
Matthews correlation coefficients of seven random-effects models under
three hypothesis testing conditions (see Fig. [66]4, Figure S5 and
Figure S6). Under the first hypothesis, when the number of studies is
greater than 4, the Matthews correlation coefficient of the DSLE2
meta-analysis model is the highest among the Matthews correlation
coefficient of the seven random-effects models. The Matthews
correlation coefficient of SJ meta-analysis model is the lowest among
the Matthews correlation coefficients of the seven meta-analysis
models. The Matthews correlation coefficients of DSLR2, DL, EB, HO, and
RML random-effects models are close, which are lower than the Matthews
correlation coefficient of the DSLE2 meta-analysis model and higher
than the Matthews correlation coefficient of the SJ random-effects
model. The Matthews correlation coefficient of DSLR2, DL, EB, HO, RML,
SJ meta-analysis models decreases as the number of studies increases.
Under the second hypothesis, the Matthews correlation coefficient of
the DSLE2 random-effects model is the lowest among the Matthews
correlation coefficients of the seven meta-analysis models; the
Matthews correlation coefficient of the SJ random-effects model is the
highest among the Matthews correlation coefficients of the seven
random-effects models; the Matthews correlation coefficients of DSLR2,
DL, EB, HO, and RML meta-analysis models are relatively close. The
Matthews correlation coefficients of HO, RML, and SJ random-effects
models increase with the increase of the number of studies. Under the
third hypothesis, the Matthews correlation coefficient of the DSLE2
random effects meta-analysis model is the lowest among the Matthews
correlation coefficients of the seven meta-analysis models, and the
Matthews correlation coefficient of the DSLR2 meta-analysis model is
the highest among the Matthews correlation coefficients of the seven
random-effects meta-analysis models.
Fig. 4.
[67]Fig. 4
[68]Open in a new tab
Plot of Matthews correlation coefficient curves for 7 random-effects
models under the first hypothesis
Negative predictive value
Negative predictive value refers to the proportion of the number of
non-significantly differentially methylated sites correctly predicted
by meta-analysis model to the number of all non-significantly
differentially methylated sites predicted by meta-analysis model. We
compared the negative predictive values of seven meta-analysis models
(see Fig. [69]5, Figure S7 and Figure S8). Under the first hypothesis,
the negative predictive values of the DSLE2 meta-analysis model are the
highest among seven random-effects models; the negative predictive
values of the SJ meta-analysis model are the lowest among seven
random-effects models. The negative predictive values of the DSLR2, DL,
EB, HO, and RML random-effects models are relatively close, and they
are all lower than the negative predictive value of the DSLE2
random-effects model and higher than the negative predictive values of
SJ random-effects model. Under the second hypothesis, the negative
predictive values of the SJ meta-analysis model are the lowest among
the seven random-effects models. The negative predictive values of the
DL, EB, HO, and RML random-effects models are relatively close. Under
the third hypothesis, the negative predictive values of the DSLE2
meta-analysis model are the highest among the seven random effects
models; the negative predictive values of the SJ random-effects model
are the lowest among the seven meta-analysis models. The negative
predictive rates of DSLR2, DL, EB, HO, and RML meta-analysis models are
relatively close.
Fig. 5.
[70]Fig. 5
[71]Open in a new tab
Plot of negative predictive value curves for 7 random-effects models
under the first hypothesis
Prediction-conditioned miss
PCMiss (Prediction-Conditioned Miss) refers to the proportion of the
number of sites incorrectly predicted as non-significantly
differentially methylated to the number of all non-significantly
differentially methylated sites predicted by the meta-analysis model.
We compared the PCMiss values of seven meta-analysis models under three
hypothesis (see Fig. [72]6, Figure S9 and Figure S10). Under the first
hypothesis, the PCMiss values of the DSLE2 random-effects model are the
lowest among the PCMiss values of the seven meta-analysis models; the
PCMiss values of the SJ random-effects model are the highest among the
seven meta-analysis models. The PCMiss values of the DSLR2, DL, EB, HO,
and RML random-effects meta-analysis models are relatively close; the
PCMiss values of the seven models all increase with the increase of the
number of studies. Under the second hypothesis, the PCMiss values of
the SJ model are the highest among the PCMiss values of the seven
models. The PCMiss values of the DL, EB, HO, and RML random-effects
models are close. Under the third hypothesis testing, the PCMiss values
of the DSLE2 model are the lowest among the PCMiss values of seven
random effects meta-analysis models; the PCMiss values of the SJ
meta-analysis model are the highest among the seven random-effects
models. The PCMiss values of the DSLR2, DL, EB, HO, and RML
meta-analysis models are relatively close.
Fig. 6.
[73]Fig. 6
[74]Open in a new tab
Plot of PCMiss-value curves for 7 random-effects models under the first
hypothesis
Recall
This paper measured the recall rates of seven random-effects models
under three hypothesis (see Fig. [75]7, Figure S11 and Figure S12).
Under the first hypothesis, the recall rates of the DSLE2 meta-analysis
model are the highest among seven random effects meta-analysis models;
the recall rates of the SJ random-effects model are the lowest among
seven meta-analysis models. The recall rates of the DL, EB, HO, and RML
meta-analysis models are relatively close. The recall rates of the
DSLR2, DL, EB, HO, RML, and SJ random-effects models decreases as the
number of studies increases. Under the second hypothesis, the recall
rates of the DSLE2 model are the highest among the seven meta-analysis
models, followed by the recall rates of the DSLR2 model, and the recall
rates of the SJ are the lowest among seven models. The recall rates of
the DL, EB, HO, and RML random-effects models are relatively close.
Under the third hypothesis, the recall rates of the DSLE2 meta-analysis
model are the highest among the seven random-effects models, and the
recall rate of the SJ model is the lowest among the seven meta-analysis
models. The recall rates of DSLR2, DL, EB, HO, and RML random-effects
models are relatively close.
Fig. 7.
[76]Fig. 7
[77]Open in a new tab
Plot of recall curves for 7 random-effects meta-analysis models under
the first hypothesis
SAR
SAR combines accuracy, AUC (the area under the receiver operating
characteristic curve) value, and root mean square error. The SAR value
is more robust than a single indicator. We computed the SAR values of
seven meta-analysis models under three hypothesis (see Fig. [78]8,
Figure S13 and Figure S14). Under the first hypothesis, the SAR values
of the DSLE2 random-effects model are the highest among the SAR values
of seven meta-analysis models, and the SAR values of SJ model are the
lowest among the seven random-effects models. The SAR values of DL, EB,
HO, and RML meta-analysis models are relatively close. Under the second
hypothesis, the SAR values of DSLR2 random-effects model are the
highest among the SAR values of seven meta-analysis models, and the SAR
values of the DSLE2 model are lowest among the seven random-effects
models. The SAR values of the DL, EB, HO, RML, and SJ random-effects
meta-analysis models are relatively close. The SAR values of the seven
meta-analysis models all increase with the increase of the number of
studies. Under the third hypothesis, the SAR values of the DSLR2
meta-analysis model are the highest among the SAR values of seven
random-effect models. Moreover, the SAR values of the DL, EB, HO, and
RML meta-analysis models are relatively close.
Fig. 8.
[79]Fig. 8
[80]Open in a new tab
Plot of SAR-value curves for 7 random-effects meta-analysis models
under the first hypothesis
Precision-recall F measure
Precision and recall indicators sometimes conflict with each other. In
this case, precision and recall indicators need to be considered
comprehensively. The most common method is to calculate the F value of
precision and recall. We computed the
[MATH: F :MATH]
values of 7 random-effects models under three hypothesis (see
Fig. [81]9, Figure S15 and Figure S16). Under the first hypothesis, the
F values of the DSLE2 random-effects model are the highest among the F
values of seven random-effects models, and the F values of the SJ
meta-analysis model are the lowest among the seven meta-analysis
models. The F values of DSLR2, DL, EB, HO, and RML meta-analysis models
are relatively close. Under the second hypothesis, the F values of the
SJ meta-analysis model are the highest among the F values of seven
random-effects models, and the F values of the DSLE2 random-effects
model are the lowest among the seven meta-analysis models. The F values
of DSLR2, DL, EB, HO, and RML random-effects models are relatively
close. Under the third hypothesis, the F values of the DSLE2
random-effects model is the lowest among the F values of seven
meta-analysis models.
Fig. 9.
[82]Fig. 9
[83]Open in a new tab
Plot of F-value curves for 7 random-effects meta-analysis models under
the first hypothesis
Application of DSLE2 random-effects meta-analysis model to lung cancer
methylation data
Lung cancer is the second most common cancer worldwide and the most
common cancer among men. According to the World Cancer Statistics
Center, more than 2.2 million cases of lung cancer occur every year
[[84]17]. We collected three sets of lung cancer methylation data in
the GEO database: [85]GSE63704, [86]GSE83842, and [87]GSE85845. The
EPIC and 450 K methylation data of lung cancer in TCGA database were
also collected. A total of 927 samples and 156,680 methylation sites
were analyzed. The distribution of the number of significantly
differential methylation sites determined by 7 random-effects models is
shown in the Fig. [88]10. The number of significantly differential
methylation sites determined by the DSLE2 random-effects model is more
than the number of significantly differential methylation sites
determined by DSLR2 and is less than the number of significantly
differential methylation sites determined by DL, EB, HO, RML, and SJ
meta-analysis models.
Fig. 10.
[89]Fig. 10
[90]Open in a new tab
Histogram of the number distribution of significantly differential
methylation sites identified by 7 random-effects meta-analysis models
The DSLE2 random-effects model identified 59,146 significantly
differential methylation sites, of which 2,230 significantly
differential methylation sites distributed in 1stExon of 3,039 genes,
and 2,003 significantly differential methylation sites were distributed
in 3’UTR of 1,660 genes, and 4899 significantly differentially
methylated sites distributed in the 5’UTR of 2754 genes, and 21,029
significantly differentially methylated sites distributed in the Body
of 7056 genes, and 7500 significantly differentially methylated sites
distributed in the TSS1500 of 5113 genes, and 6210 significantly
differentially methylated sites distributed in TSS200 of 3963 genes.
Research shows that lung cancer is closely related to A2BP1 [[91]8],
AACS [[92]18], DNAH10 [[93]19], PINK1 [[94]20], and other genes with
significantly differential methylation sites. The significantly
differentially methylated sites identified by the DSLE2 meta-analysis
model may affect the expression of the corresponding genes.
Application of DSLE2 meta-analysis model to Parkinson’s methylation data
Parkinson’s disease (PD) is a chronic neurodegenerative disease
[[95]21]. According to the World Health Organization (WHO), there were
2.5 million and 6.1 million cases of PD worldwide in 1990 and 2016
[[96]22]. However, the number of PD patients increased significantly to
8.5 million, and it is estimated that by 2040, the number of PD
patients worldwide may exceed 17 million [[97]23]. In 2019, PD caused
5.8 million disabilities, an 81% increase since 2000, and 329,000
deaths, an increase of more than 100 percent since 2000 [[98]23]. PD
causes serious trouble to people’s life, but there is no clear cause
and effective treatment. DSLE2 random-effects model was used to further
study the causes of PD.
We collected five sets of PD methylation data in the GEO database:
[99]GSE72774, [100]GSE72776, [101]GSE111629, [102]GSE145361 and
[103]GSE165081. A total of 3,080 samples and 161,261 methylation sites
were analyzed. The DSLE2 model identified 26,244 significantly
differentially methylation sites (Fig. [104]11). Most of significantly
differentially methylation sites identified by DSLE2, SJ, HO, DSL,
DSLR2 meta-analysis methods were same. DSLE2 model independently
identified 2181 significantly differentially methylation sites.
Fig. 11.
Fig. 11
[105]Open in a new tab
Venn of significantly differential methylation sites identified by 5
random-effects meta-analysis models
We further analyzed the location of significantly differential
methylation sites on genes (Figs. [106]12 and [107]13). Most of the
significant differential methylation sites were located on the gene
body, followed by TSS1500 and 5’UTR (Fig. [108]12). Relatively few
Significantly differential methylation sites were located at 3’UTR and
the first exon (Fig. [109]12). Most of the significantly differential
methylation sites are uniquely located on a single gene and only a
small number of genes contain multiple significantly differential
methylation sites (Fig. [110]13). There are 51, 62 and 122
significantly differentially methylation sites for BCOR, HDAC4 and
PRDM16, respectively. Research shows that PD is closely related to BCOR
[[111]24], HDAC4 [[112]25], PRDM16 [[113]26], and other genes with
significantly differential methylation sites.
Fig. 12.
[114]Fig. 12
[115]Open in a new tab
Histogram of significantly differential methylation sites on genes. The
horizontal axis refers features, and the vertical axis is the number of
significantly differential methylation sites
Fig. 13.
[116]Fig. 13
[117]Open in a new tab
Histogram of significantly differential methylation sites on genes. The
horizontal axis is the number of significantly differential methylation
sites, and the vertical axis is the number of genes
Discussion and conclusion
For high-throughput methylation sequencing data, we proposed the DSLE2
random effects meta-analysis model based on the between-study variance
estimator
[MATH: Em :MATH]
. Under the alternative hypothesis that the effect sizes of all studies
are not 0, the DSLE2 meta-analysis model performs better than the other
6 random-effects models in terms of accuracy, negative prediction rate,
recall rate, Matthews correlation coefficient, PCMiss value, and SAR
value, F value.
We applied the DSLE2 meta-analysis model to the lung cancer methylation
datasets. 59,146 significantly differential methylation sites were
identified, with 2,230, 2,230 2003, 4899, 21,029, 7500, and 6210
significantly differentially methylated sites were located on 1stExon,
3'UTR, 5'UTR, Body, TSS1500, and TSS200 of 3039, 1660, 2754, 7056,
5113, and 3963 genes, respectively. Studies have shown that lung cancer
is closely related to genes such as A2BP1, AACS, and DNAH10 with
significantly differential methylation sites.
We further applied the DSLE2 meta-analysis model to the PD methylation
datasets. 26,244 significantly differentially methylation sites, with
952, 776, 2466, 7604, 3897 and 2052 significantly differentially
methylated sites were located on 1stExon, 3'UTR, 5'UTR, Body, TSS1500,
and TSS200 of 703, 700, 1240, 3059, 2495 and 1267 genes, respectively.
Research shows that PD is closely related to BCOR, HDAC4, PRDM16, and
other genes with significantly differential methylation sites.
This work presents a new between-study variance estimator for
meta-analysis model. First, the primary purpose of the between-study
variance estimator is to measure the extent of variability in effect
sizes across different studies that is attributable to true differences
rather than within-study sampling error. Accurately estimating
between-study variance allows us to distinguish between variability due
to real differences in study effects and variability due to random
error, providing a clearer understanding of the consistency of the
effect across studies. Second, meta-analyses synthesize data from
multiple studies to derive conclusions with greater statistical power.
By accurately accounting for between-study variance, the proposed model
enhances the validity of meta-analytic results, leading to more
trustworthy conclusions. And it can improve the quality of evidence
synthesis, promoting better decision-making in clinical practice,
policy, and further research. Third, understanding heterogeneity is key
to identifying which treatments work best for specific populations in
medical research. A better estimation of between-study variance can
help in understanding consistent treatment effects, aiding in the
development of personalized therapeutic strategies. Moreover,
efficiently allocating research resources requires an understanding of
where variability lies. By identifying true sources of heterogeneity,
researchers and funding bodies can focus efforts on areas with the most
significant impact, optimizing the use of limited resources.
In addition, after significant differential methylation sites
identified by DSLE2 meta-analysis model, several software tools and
platforms can be used for downstream analysis. These tools help in
various aspects such as functional annotation, pathway enrichment
analysis and so on. We can use Chip Analysis Methylation Pipeline
(ChAMP) to annotate significantly differential methylated regions
(DMRs), perform gene ontology (GO) analysis, and integrating with other
epigenetic data. GREAT (Genomic Regions Enrichment of Annotations Tool)
can be used to annotate and analyze the functional significance of sets
of genomic regions, including DMRs, by associating them with nearby
genes and functional terms. We can also use GSEA (Gene Set Enrichment
Analysis) to do pathway enrichment analysis based on significant
differential methylation sites identified by DSLE2.
Supplementary Information
[118]12864_2025_11316_MOESM1_ESM.zip^ (809.9KB, zip)
Supplementary Material 1: Figure S1 to S16 are additional files which
are plots of accuracy, false negative rates and other measures under
the second and third hypothesis.
Acknowledgements