Abstract
The causes of many complex human diseases are still largely unknown.
Genetics plays an important role in uncovering the molecular mechanisms
of complex human diseases. A key step to characterize the genetics of a
complex human disease is to unbiasedly identify disease-associated gene
transcripts on a whole-genome scale. Confounding factors could cause
false positives. Paired design, such as measuring gene expression
before and after treatment for the same subject, can reduce the effect
of known confounding factors. However, not all known confounding
factors can be controlled in a paired/match design. Model-based
clustering, such as mixtures of hierarchical models, has been proposed
to detect gene transcripts differentially expressed between paired
samples. To the best of our knowledge, no model-based gene clustering
methods have the capacity to adjust for the effects of covariates yet.
In this article, we proposed a novel mixture of hierarchical models
with covariate adjustment in identifying differentially expressed
transcripts using high-throughput whole-genome data from paired design.
Both simulation study and real data analysis show the good performance
of the proposed method.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12859-023-05556-x.
Keywords: Curse of dimensionality, Confounding, EM algorithm, RNAseq
Introduction
Genome-wide differential gene expression analysis is widely used for
the elucidation of the molecular mechanisms of complex human diseases.
One popular and powerful approach to detect differentially expressed
genes is the probe-wise linear regression analysis combined with the
control of multiple testing, such as limma [[29]1]. That is, we first
perform linear regression for each probe and then adjust p-values for
controlling multiple testing. One advantage of this approach is its
capacity to adjust for potential confounding factors.
Another approach for detecting differentially expressed genes is the
model-based clustering via mixture of Bayesian hierarchical models
(MBHM) [[30]2–[31]7], which can borrow information across genes to
cluster genes. Probe clustering based on MBHMs treats gene transcripts
as “samples” and samples as “variables”. Therefore, transcript
clustering based on MBHMs has large number of “samples” and relatively
small number of “variables”, hence does not have the curse-of
dimensionality problem. In addition, unlike transcript-specific tests
that have several parameters per transcript, transcript clustering
based on MBHMs has only a few hyperparameters per cluster to be
estimated and could borrow information across transcripts to estimate
model hyperparameters. These approaches generally assume that samples
under two groups are obtained independently. [[32]8] proposed a
constrained MBHM to identify genetic outcomes measured from
paired/matched designs.
Paired design is commonly used in study design for its homogeneous
external environment for comparing measurements under different
conditions. However, not all known confounding factors can be
controlled in a paired/match design. Hence, we might still need to
adjust the effects of confounding factors for data from a
paired/matched design.
Mixture of regressions or mixture of experts model [[33]9–[34]11] have
been proposed in literature to do clustering with capacity to adjust
for covariates. To best of our knowledge, this approach does not have
constraints on positive, negative, and constant means and has not been
applied to detect differentially expressed genes.
In this article, we proposed a novel mixture of hierarchical models
with covariate adjustment in identifying differentially expressed
transcripts using high-throughput whole genome data from paired design.
Method
We assumed that gene transcripts can be roughly classified into 3
clusters based on their expression levels in subjects after treatment
(denoted as condition 1) relative to those before treatment (denoted as
condition 2):
1. Transcripts after treatment have higher expression levels than
those before treatment, i.e., over-expressed (OE) in condition 1;
2. Transcripts after treatment have lower expression levels than those
before treatment, i.e., under-expressed (UE) in condition 1;
3. Transcripts after treatment have same expression levels than those
before treatment, i.e., non-differentially expressed (NE) between
condition 1 and matched condition 2.
We followed [[35]8] to directly model the marginal distributions of
gene transcripts in the 3 clusters. In [[36]8], they proposed a mixture
of three-component hierarchical distributions to characterize the
within-pair difference of gene expression. We extended their model by
incorporating potential confounding factors (such as Age and Sex) in
the mixture of hierarchical models, which might affect the response of
gene expression to drug treatment.
Note that this extension is non-trivial, just like multiple linear
regression is not just a simple extension to simple linear regression.
We assumed that data have been processed so that the distributions of
mRNA expression levels are close to normal distributions. For RNAseq
data, we can apply VOOM transformation [[37]12] or countTransformers
[[38]13] before applying eLNNpairedCov.
A mixture of hierarchical models
For the
[MATH: gth :MATH]
gene transcript, let
[MATH: xgl :MATH]
and
[MATH: ygl :MATH]
denote the expression levels of the
[MATH: lth :MATH]
subject under two different conditions, e.g., before and after
treatment,
[MATH:
g=1,…,G
:MATH]
,
[MATH:
l=1,…,n
:MATH]
, where G is the number of transcripts and n is the number of subjects
(i.e., the number of pairs). Let
[MATH: dgl=log<
mn>2(ygl)-log2(xgl) :MATH]
be the log2 difference for the
[MATH: gth :MATH]
gene transcript of
[MATH: lth :MATH]
subject. Denote
[MATH: dg=dg1,<
/mo>…,dgnT
mrow> :MATH]
. We assumed that
[MATH: dg :MATH]
is conditionally normally distributed given mean vector and covariance
matrix. Let
[MATH: WT :MATH]
be the
[MATH: n×(p+1) :MATH]
design matrix, where p is the number of covariates. The first column of
[MATH: WT :MATH]
is the vector of ones, indicating intercept. Let
[MATH: η :MATH]
be the
[MATH: (p+1)×1 :MATH]
vector of coefficients for the intercept and covariate effects. We
assume following mixture of three-component hierarchical models:
For gene transcripts over-expressed (OE) in post-treatment samples, we
expect that the mean log2 differences are positive. Hence, we assume
[MATH: dg|μg,τg∼Nμg,τg-1Inμg|τg∼Nexp[WTη1],k1τg-1Inτg∼Γα1,β1 :MATH]
where
[MATH:
k1>0,α1>0 :MATH]
and
[MATH:
β1>0
:MATH]
.
[MATH: Γα1,β1 :MATH]
denotes the Gamma distribution with shape parameter
[MATH: α1 :MATH]
and rate parameter
[MATH: β1 :MATH]
. That is, we assume that (1) the mean vectors
[MATH: μg :MATH]
,
[MATH: g=1 :MATH]
,
[MATH: … :MATH]
, G, given the variance
[MATH:
τg-1 :MATH]
follow a multivariate normal distribution with mean vector
[MATH: expWTη1 :MATH]
and covariance matrix
[MATH:
k1τg-1In :MATH]
; and (2) the variances
[MATH:
τg-1 :MATH]
,
[MATH: g=1 :MATH]
,
[MATH: … :MATH]
, G, follow a Gamma distribution with shape parameter
[MATH: α1 :MATH]
and rate parameter
[MATH: β1 :MATH]
.
Note that the exponential of the intercept
[MATH: exp(η10) :MATH]
indicates the mean of log2 difference is positive.
For gene transcripts under-expressed (UE) in post-treatment samples, we
expect that the mean log2 differences are negative. Hence, we assume
[MATH: dg|μg,τg∼Nμg,τg-1Inμg|τg∼N-exp[WTη2],k2τg-1Inτg∼Γα2,β2 :MATH]
where
[MATH:
k2>0,α2>0 :MATH]
,
[MATH:
β2>0
:MATH]
, and
[MATH: WT :MATH]
is the design matrix.
Note that the negative exponential of the intercept
[MATH: -exp(η20) :MATH]
indicates the mean of log2 difference is negative.
For gene transcripts non-differentially expressed (NE) between pre- and
post-treatment samples, we expect the mean log2 differences are zero.
Hence, we assume
[MATH: dg|τg∼NUTθg,τg-1Inθg|τg∼N(η3,k3τg-1Ip)τg∼Γα3,β3 :MATH]
where
[MATH:
k3>0,α3>0 :MATH]
and
[MATH:
β3>0
:MATH]
.
[MATH: UT :MATH]
is the design matrix without intercept column. That is, the intercepts
are zero. Note that the intercepts indicate mean log2 differences.
Hence,
[MATH: η3 :MATH]
is a
[MATH: p×1 :MATH]
vector of coefficients for the covariates.
Note that
[MATH: θg :MATH]
measure effects of confounding factors for NE genes. The true effect of
NE genes are zero (i.e., the intercept of
[MATH: UTθg
:MATH]
is zero in the above model).
The hyperparameters
[MATH: αc :MATH]
and
[MATH: βc :MATH]
are shape and rate parameters for the Gamma distribution, respectively,
[MATH:
c=1,2,3
:MATH]
. As for
[MATH:
k1,k2<
/mn> :MATH]
and
[MATH: k3 :MATH]
, the variation of the mean vector
[MATH: μg :MATH]
should be smaller than that of the observations
[MATH: dg :MATH]
. So we expect
[MATH:
0<kc<1,c=1,2,3 :MATH]
.
Note that the marginal distribution for each component of the mixture
is a multivariate t distribution [[39]14, Section 3.7.6]. However, to
model differentially expressed genes, the multivariate t distributions
derived from our models have special structure of mean vector and
covariance matrix.
For continuous covariates, we require that they are standardized so
that they have mean zero and variance one. Standardizing continuous
covariates would make
[MATH: expWTη1 :MATH]
and
[MATH: expWTη2 :MATH]
be numerically finite.
Ideally, we should require
[MATH: μg>0
:MATH]
[MATH: (μg<0
) :MATH]
for all transcripts in cluster 1 (cluster 2). To do so, we can assume a
log normal prior distribution for
[MATH: μg :MATH]
in cluster 1, for instance. However, a log normal distribution could
not be a conjugate prior for the mean of a normal distribution. It
would increase the computational burden if non-conjugate priors were
used. Other alternative models can also be used, such as assuming
[MATH: μg|η10=
mo>exp(η10)+WTη1
:MATH]
and
[MATH: η10 :MATH]
follows a normal distribution. However, these models do not have
closed-form marginal densities. Hence, they would substantially
increase computational burden. Besides, the empirical distribution of
the mean log2 difference
[MATH: dg :MATH]
of the differentially expressed gene probes has shown a right-skewed
pattern, while that of non-differentially expressed genes demonstrates
an approximate bell shape (see in Additional file [40]1: Figures
A2-A4). Hence, we require the mean
[MATH: E(μg)>0 :MATH]
(
[MATH: E(μg)<0 :MATH]
) for cluster 1 (cluster 2) by assuming
[MATH: E(μg) :MATH]
for cluster 1 (cluster 2) to be
[MATH: exp[WTη1] :MATH]
(
[MATH: -exp[WTη2] :MATH]
).
The proposed mixture models have meaningful biological interpretations
for mean structures. In particular, for the OE cluster, the intercept
[MATH: exp(η10) :MATH]
can be interpreted as the expected average log2 difference of gene
transcripts when the value of all the p covariates are zero; the
coefficient
[MATH: η1i :MATH]
of covariate i can be interpreted as there exists
[MATH: exp(η1i) :MATH]
fold-change associated with the one unit increase in covariate i while
the values of the remaining
[MATH: (p-1) :MATH]
covariates are fixed; for the UE cluster, the intercept
[MATH: -exp(η20) :MATH]
can be interpreted as the expected average log2 difference of gene
transcripts when the value of all the p covariates are zero; the
coefficient
[MATH: η2i :MATH]
of covariate i can be interpreted as there exists
[MATH: exp(η2i) :MATH]
fold-change associated with the one unit increase in covariate i while
the values of the remaining
[MATH: (p-1) :MATH]
covariates are fixed; while for the NE cluster, the coefficient
[MATH: η3i :MATH]
of covariate i can be interpreted as
[MATH: η3i :MATH]
unit increase of expected log2 difference of gene transcripts
associated with the one unit increase in covariate i while the values
of the remaining
[MATH: (p-1) :MATH]
covariates are fixed. They also are convenient to get closed-form
marginal densities so that we can use Expectation-Maximization (EM)
algorithm to estimate hyperparameters, instead of using
computational-intensive algorithms, such as Markov chain Monte Carlo
(MCMC).
Marginal density functions
Let
[MATH: f1(dg|ψ) :MATH]
,
[MATH: f2(dg|ψ) :MATH]
,
[MATH: f3(dg|ψ) :MATH]
be the marginal densities of the 3 hierarchical models, and
[MATH: π :MATH]
[MATH: = :MATH]
[MATH: (π1,π2,π3
) :MATH]
be the vector of cluster mixture proportions, where
[MATH: ψ=α1,β1,k1,η1T,α2,β
2,k2,η2T,α3,β
3,k3,η3T
mfenced>T :MATH]
. Then the marginal density of
[MATH: dg :MATH]
is:
[MATH: f(dg|ψ)=π1f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ). :MATH]
Determining transcript cluster membership
The transcript-cluster membership is determined based on the posterior
probabilities,
[MATH: ζgc :MATH]
[MATH: = :MATH]
[MATH: Pr(gth :MATH]
gene transcript in cluster c
[MATH: |dg) :MATH]
. We can get
[MATH: ζgc=πcfc(dg|ψ)π1
msub>f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ),c
mi>=1,2,3.
:MATH]
1
We determine a transcript’s cluster membership as follows: If the
maximum value among
[MATH: ζgi,i=1,2,3 :MATH]
is
[MATH: ζgc :MATH]
, then the transcript g belongs to cluster c.
The true values of
[MATH: π1 :MATH]
,
[MATH: π2 :MATH]
,
[MATH: π3 :MATH]
, and
[MATH: ψ :MATH]
are unknown. We use estimated values to determine transcripts’ cluster
membership.
Parameter estimation via EM algorithm
We used expectation-maximization (EM) algorithm [[41]15] to estimate
the model parameters
[MATH: π=π1,π2,π3
T :MATH]
and
[MATH: ψ :MATH]
.
Let
[MATH: zg=(zg1,zg2,zg3) :MATH]
to be the indicator vector indicating if gene transcript g belongs to a
cluster or not. To stablize the estimate of
[MATH: π :MATH]
when
[MATH: πc :MATH]
is very small, we assume that the cluster mixture proportions
[MATH: π :MATH]
follows a symmetric Dirichlet
[MATH: D(b) :MATH]
distribution, i.e.,
[MATH: f(π)=Γ(∑c=13bc)∏c
=13Γ(bc)∏<
mi>c=13π<
/mi>cbc-1 :MATH]
. Therefore, the likelihood function for the complete data
[MATH: (d,z,π) :MATH]
is
[MATH: L(ψ|d,z,π)=f(d,z,π|ψ)=f(d,z|π,ψ)f(π|ψ)=f(d,z|π,ψ)f(π)=∏g=1Gf(dg,zg|ψ,π)Dir<
mrow>(b)=∏g=1G(π1f1(dg|ψ))zg1(π2f2(dg|ψ))zg2(π3f3(dg|ψ))zg3×Γ(∑c=13bc)∏c
=13Γ(bc)∏c=13
πcbc-1.
:MATH]
Then the log complete-data likelihood function is:
[MATH: l(ψ|d,z,π)=∑g=1G((zg1logf1(dg|ψ)+zg2logf2<
mrow>(dg|ψ))+zg3logf3<
mrow>(dg|ψ))+∑g=1G(zg1logπ1+zg2logπ
2+zg3
mn>logπ3)+logΓ(∑c=13bc)∏c
=13Γ(bc)+∑c=13
munderover>(bc-1
)logπc. :MATH]
The EM algorithm is used to estimate parameters
[MATH: π :MATH]
and
[MATH: ψ :MATH]
. Since
[MATH: z :MATH]
is unknown random vector, we integrate it out from the log
complete-data likelhood function. Here,
[MATH: zg=(zg1,zg2,zg3) :MATH]
.
[MATH: ζg1
msub>=E(zg1|dg,π,ψ)=Pr(zg1=1|dg,π,ψ)=π1f1(dg|ψ)π1
msub>f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ)ζg
2=π2f2(dg|ψ)π1
msub>f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ)ζg
3=π3f3(dg|ψ)π1
msub>f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ) :MATH]
2
E-step. Denote
[MATH: Q(t)π,ψ|d,z(t),π(t) :MATH]
as the expected log complete-data likelihood function at t-th iteration
of the EM algorithm, we have
[MATH: Q(t)=Ezlψd,z,π|d,z(t),π(t)
=∑g=1G((ζg1(t)logf1(dg|ψ)+ζg<
/mi>2(t)logf2(dg|ψ))+ζg<
/mi>3(t)logf3(dg|ψ))+∑g=1G(ζg1(t)logπ1+ζg2
mrow>(t)logπ2+ζg3
mrow>(t)logπ3)+logΓ(∑c=13bc)∏c
=13Γ(bc)+∑c=13
munderover>(bc-1
)logπc, :MATH]
where
[MATH: ζgc(t)=Ezgc|dg,π(t),ψ
mrow>(t)
=πc(t)fc(dg|ψ(t))π1(t)f1(dg|ψ(t))+π2(t)f2(dg|ψ(t))+π3(t)f3(dg|ψ(t)),c=1,2
,3.
:MATH]
3
M-step. Maximize
[MATH: Q(t)π,ψ|d,z(t),π(t) :MATH]
to find the optimal values of
[MATH: π :MATH]
and
[MATH: ψ :MATH]
, and use these optimal values as estimates for the parameters
[MATH: π :MATH]
and
[MATH: ψ :MATH]
.
To maximize
[MATH: Q(t)π,ψ|d,z(t),π(t) :MATH]
, we use the “L-BFGS-B” method developed by Byrd et al. (1995)
[[42]16], which utilizes the first partial derivatives of
[MATH: Q(t)π,ψ|d,z(t),π(t) :MATH]
and allows box constraints, that is each variable can be given a lower
and/or upper bound.
Simulated annealing modification
EM algorithm may be trapped in a local maximum since it is strictly
ascending. As introduced by Celeux and Govaert (1992) [[43]17],
simulated annealing (SA) is widely used to help EM algorithm escape
from local maximum by adding randomness with a stochastic step.
Specifically, the conditional expectation in ([44]2) is modified in a
SA algorithm as follows
[MATH: ζ~gc(t)=πc(t)fc(dg|ψ(t))1/m(t)∑
c=13πc(t)fc(dg|ψ(t))1/m(t),c=1,2
,3.
:MATH]
4
where m is the temperature used to control the randomness. Usually, the
temperature m starts with a relatively high value since larger m leads
to larger randomness. At iteration t, the temperature is updated by
[MATH: m(t+1)=r×m(t) :MATH]
with the cooling rate r controls the speed of reduction. As suggested
in [[45]18, [46]19], we use
[MATH: m(0)=2 :MATH]
and
[MATH: r=0.9 :MATH]
.
We denoted eLNNpairedCov as the proposed method using the traditional
EM algorithm to obtain parameter estimates and denoted
eLNNpairedCov.SEM as the proposed method using the EM with
SA-modification to obtain parameter estimates.
We stop the expectation-maximization iterations based on a proportional
change, i.e. if the maximum of the absolute value of the differences of
model parameter estimates between current iteration and previous
iteration over the absolute value of the previous iteration estimates
is smaller than a small constant (e.g.
[MATH:
1.0×10-3
:MATH]
).
More details about the EM algorithm are shown in Supplementary Document
[see Additional file [47]1].
A real data study
We used the dataset [48]GSE24742 [[49]20], which can be downloaded from
the Gene Expression Omnibus
[[50]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24742], to
evaluate the performance of the proposed model-based clustering methods
(denoted as eLNNpairedCov and eLNNpairedCov.SEM ).
The dataset is from a study that investigated the gene expression
before and after administrating rituximab, a drug for treating anti-TNF
resistant rheumatoid arthritis (RA). There are 12 subjects, each having
2 samples (one sample is before treatment and the other is after
treatment). Age and sex are also available. Expression levels of 54,675
gene probes were measured for each of the 24 samples by using
Affymetrix HUman Genome U133 Plus 2.0 array. The dataset has been
preprocessed by the dataset contributor. We further kept only 43,505
gene probes in the autosomal chromosomes (i.e., chromosomes 1 to 22).
We then performed log2 transformation for gene expression levels. We
next obtained the within-subject difference of the log2 transformed
expression levels (log2 expression after-treatment minus log2
expression before-treatment). By examining the histogram (Figure A1)
[see Additional file [51]1] of the estimated standard deviations of
log2 differences of within-subject gene expression for the 43,505 gene
probes, we found a bimodal distribution. Based on Figure A1 [see
Additional file [52]1], where the histogram of estimated standard
deviations exhibits two modes, we choose to exclude gene probes with
standard deviation
[MATH: <1 :MATH]
corresponding to the first mode. It is a common practice to remove
genes with low variation [[53]21–[54]23]. Finally, 23,948 gene probes
kept in the down-stream analysis.
A simulation study
We performed a simulation study to compare the performance of the
proposed methods eLNNpairedCov, eLNNpairedCov.SEM with transcript-wise
test limma and Li et al.’s [[55]8] method (denoted as eLNNpaired).
eLNNpairedCov, eLNNpairedCov.SEM and limma adjust covariate effects,
while eLNNpaired does not. For eLNNpaired, we first regress out
covariates effect for each gene to make a fair comparison between
eLNNpaired and other methods.
The limma approach first performs an empirical-Bayes-based linear
regression for each transcript. In this linear regression, the
within-subject log2 difference of transcript expression is the outcome
and intercept indicating if the transcript is over-expressed
(intercept>0), under-expressed (intercept<0), or non-differentially
expressed (intercept = 0), adjusting for potential confounding factors.
A transcript is claimed as OE if its intercept estimate is positive and
corresponding FDR-adjusted p-value
[MATH: <0.05 :MATH]
, where FDR stands for false discovery rate. A transcript is claimed as
UE if its intercept estimate is negative and corresponding FDR-adjusted
p-value
[MATH: <0.05 :MATH]
. Other transcripts are claimed as NE.
The parameter values (
[MATH: π :MATH]
,
[MATH: ψ :MATH]
, and proportion of women) in the simulation study are based on the
estimates via eLNNpairedCov.SEM from the analysis of the pre-processed
real dataset [56]GSE24742 described in Subsection “A real data study”.
In this simulation study, we considered two sets with different
covariate coefficients for differentially expressed genes clusters. In
the first set (Set 1), parameter values are the estimates of parameters
based on the eLNNpairedCov.SEM method from real dataset. That is,
[MATH:
π1=0.00246 :MATH]
,
[MATH:
π2=0.01470 :MATH]
,
[MATH:
π3=0.98284 :MATH]
,
[MATH:
α1=3.53
:MATH]
,
[MATH:
β1=3.45
:MATH]
,
[MATH:
k1=0.26
:MATH]
,
[MATH:
η10=0.18
:MATH]
,
[MATH:
η11=0.00
:MATH]
,
[MATH:
η12=-1.05 :MATH]
,
[MATH:
α2=3.53
:MATH]
,
[MATH:
β2=3.45
:MATH]
,
[MATH:
k2=0.26
:MATH]
,
[MATH:
η20=0.18
:MATH]
,
[MATH:
η21=0.00
:MATH]
,
[MATH:
η22=-1.05 :MATH]
,
[MATH:
α3=2.86
:MATH]
,
[MATH:
β3=2.20
:MATH]
,
[MATH:
k3=0.72
:MATH]
,
[MATH:
η31=-0.01 :MATH]
,
[MATH:
η32=0.00
:MATH]
. In the second set (Set 2), we set
[MATH: η10 :MATH]
=
[MATH: η20 :MATH]
=0.08 instead of 0.18. For each set, we considered two scenarios. In
the first scenario (Scenario1), the number of subjects is equal to 30.
In the second scenario (Scenario2), the number of subjects is equal to
100.
For each scenario, we generated 100 datasets. Each simulated dataset
contains
[MATH:
G=20,000
:MATH]
gene transcripts. There are two covariates: standardized age (denoted
as Age.s) and Sex. Age.s follows normal distribution with mean 0 and
standard deviation 1. Seventy five percent (75%) of subjects are women.
Evaluation criteria
Two agreement indices and two error rates are used to compare the
predicted cluster membership and true cluster membership of all genes.
The two agreement indices are accuracy (i.e., proportion of predicted
cluster membership equal to the true cluster membership) and Jaccard
index [[57]24]. For perfect agreement, these indices have a value of
one. If an index takes a value close to zero, then the agreement
between the true transcript cluster membership and the estimated
transcript cluster membership is likely due to chance. The two error
rates are false positive rate (FPR) and false negative rate (FNR). FPR
is the percentage of detected DE transcripts among truly NE
transcripts. FNR is the percentage of detected NE transcripts among
truly DE transcripts. We also examined the user time and number of EM
iterations for running each simulated dataset.
Results
Results of the real data analysis
For the real dataset, we adjusted standardized age and sex for
eLNNpairedCov, eLNNpairedCov.SEM, and limma. We standardized age so
that it has mean zero and variance one. For each transcript, we also
scaled its expression across subjects so that its variance is equal
one. For eLNNpaired, we first regressed out the effect of standardized
age and sex for each transcript.
The estimates of parameters in our model are listed in Table [58]1.
Note that the proposed eLNNpairedCov and eLNNpairedCov.SEM have the
same estimates for the parameters in these three clusters, except for
the proportions of three clusters. The proportions of OE and UE
estimated by eLNNpairedCov method are 0.0376% and 0.346%, respectively.
The proportions of OE and UE estimated by eLNNpairedCov.SEM method are
0.246% and 1.47%, respectively.
Table 1.
Parameter estimates of OE, UE and NE clusters from eLNNpairedCov and
eLNNpairedCov.SEM
OE UE NE
[MATH: β1 :MATH]
3.445543
[MATH: β2 :MATH]
3.445543
[MATH: β3 :MATH]
3.445543
[MATH: k1 :MATH]
0.264565
[MATH: k2 :MATH]
0.264565
[MATH: k3 :MATH]
0.264565
[MATH: η10 :MATH]
0.176007
[MATH: η20 :MATH]
0.176007
[MATH: η11 :MATH]
−0.000609
[MATH: η21 :MATH]
−0.000609
[MATH: η31 :MATH]
−0.013796
[MATH: η12 :MATH]
−1.051257
[MATH: η22 :MATH]
−1.051257
[MATH: η32 :MATH]
−0.000017
[59]Open in a new tab
For the OE cluster,
[MATH: exp(η10)=exp(0.176007)=1.192 :MATH]
can be interpreted as the expected log2 difference for a male subject (
[MATH: sex=0
:MATH]
) whose age is equal to mean age (
[MATH: age=0
:MATH]
is the mean-centered age);
[MATH:
η11=-0.00060
9 :MATH]
indicates that one-unit increase in age leads to
[MATH: exp(-0.000609)=0.999 :MATH]
fold-changes in expected log2 difference, while
[MATH:
η12=-1.05125
7 :MATH]
indicates that there is
[MATH: exp(-1.051257)=0.349 :MATH]
fold-changes between male subjects and female subjects in expected log2
difference if they are at the same age. For the UE cluster,
[MATH:
η20=0.176007 :MATH]
can be interpreted as the expected log2 difference for a male subject (
[MATH: sex=0
:MATH]
) whose age is equal to mean age (
[MATH: age=0
:MATH]
is the mean-centered age) is
[MATH: -exp(0.176007)=-1.192 :MATH]
;
[MATH:
η21=-0.00060
9 :MATH]
indicates that one-unit increase in age leads to
[MATH: exp(-0.000609)=0.999 :MATH]
fold-changes in expected log2 difference, while
[MATH:
η22=-1.05125
7 :MATH]
indicates that there is
[MATH: exp(-1.051257)=0.349 :MATH]
fold-changes between male subjects and female subjects in expected log2
difference if they are at the same age. For the NE cluster,
[MATH:
η31=-0.01379
6 :MATH]
indicates that one-unit increase in age leads to 0.01379 decreases in
expected log2 difference, and
[MATH:
η32=-0.00001
7 :MATH]
indicates that there is 0.000017 decrease from female subjects to male
subjects in the expected log2 difference if they are at the same age.
The number of differentially expressed genes detected by each method is
listed in Table [60]2.
Table 2.
Number of Differentially expressed genes detected by limma, eLNNpaired,
eLNNpairedCov and eLNNpairedCov.SEM in [61]GSE24742
limma eLNNpaired eLNNpairedCov eLNNpairedCov.SEM
OE 0 0 55 59
UE 6 0 355 352
[62]Open in a new tab
The limma method detected 6 under-expressed gene transcripts
(Figure [63]1 and Table S1), while eLNNpaired did not find any positive
signals (i.e.,
[MATH: π^3=1 :MATH]
). The proposed methods eLNNpairedCov and eLNNpairedCov.SEM detected 55
OE transcripts (Table S2) and 59 OE transcripts (Table S3),
respectively (Upper two panels of Fig. [64]2) and 355 UE transcripts
(Table S4) and 352 UE transcripts (Table S5), respectively (Lower two
panels of Figure [65]2). The 6 UE transcripts detected by limma is also
selected as UE transcripts by eLNNpairedCov and eLNNpairedCov.SEM. Note
that the 55 OE genes detected by eLNNpairedCov are also detected by
eLNNpariedCov.SEM. The 352 UE genes detected by eLNNpairedCov.SEM are
also detected by eLNNpariedCov.
Fig. 1.
Fig. 1
[66]Open in a new tab
Parallel boxplots of log2 within-subject difference of gene expression
for 6 UE transcripts detected by limma for pre-processed [67]GSE24742
dataset. Red horizontal line indicates log2 difference equal to zero
Fig. 2.
[68]Fig. 2
[69]Open in a new tab
Parallel boxplots of log2 within-subject difference of gene expression
for differentially expressed transcripts detected by eLNNpairedCov and
eLNNpairedCov.SEM for pre-processed [70]GSE24742 dataset. Upper two
panels: 55 OE transcripts and 59 OE transcripts, respectively; Lower
two panels: 355 UE transcripts and 352 UE transcripts, respectively.
Red horizontal lines indicate log2 difference equal to zero
It is assuring that several genes corresponding to the DE transcripts
identified by eLNNpairedCov and eLNNpairedCov.SEM have been associated
to rheumatoid arthritis (RA) in literature. For example, Humby et al.
(2019) [[71]25] reported that genes ZNF365 (OE), IL36RN (OE), MRVI1-AS1
(OE), WFDC6 (UE), UBE2H (UE), are associated with RA.
We performed pathway enrichment analysis through the use of IPA (QIAGEN
Inc.,
[72]https://www.qiagenbioinformatics.com/products/ingenuitypathway-anal
ysis) for 352 UE and 55 OE genes identified by eLNNpairedCov.SEM. The
top enriched canonical pathways are shown in Tables [73]3 and [74]4.
Evidence in literature shows that these pathways are relevant to RA.
S100 protein family plays an important role in rheumatoid arthritis (
[[75]26]). Literature shows consistent crucial role of the PD-1/PD-L
pathway in the pathogenesis of rheumatic diseases ( [[76]27, [77]28]).
It has been shown that RA can lead to lung tissue damage, resulting in
pulmonary fibrosis ( [[78]29]). Macrophage is a key player in the
pathogenesis of autoimmune diseases, such as RA ( [[79]30]). RA and
osteoarthritis (OA) are two common arthritis with different
pathogenesis ( [[80]31]). It is interesting to see Osteoarthritis
pathway is a significantly enriched pathway for UE genes. It is
consistent with literature that similar focal and systemic alterations
exist in RA and OA [[81]32].
Table 3.
Top canonical pathways for 352 UE genes by eLNNpairedCov.SEM
Name p-value
S100 Family Signaling Pathway
[MATH: 2.97E-06 :MATH]
PD-1, PD-L1 cancer immunotherapy pathway
[MATH: 7.54E-05 :MATH]
Pulmonary Fibrosis Idiopathic Signaling pathway
[MATH: 3.45E-04 :MATH]
Phagosome Formation
[MATH: 7.56E-04 :MATH]
Osteoarthritis Pathway
[MATH: 1.04E-03 :MATH]
[82]Open in a new tab
Table 4.
Top canonical pathways for 55 OE genes by eLNNpairedCov.SEM
Name p-value
Ribonucleotide Reductase Signaling Pathway
[MATH: 5.34E-03 :MATH]
Leukocyte Extravasation Signaling
[MATH: 7.57E-03 :MATH]
Cell Cycle: G1/S Checkpoint Regulation
[MATH: 8.85E-03 :MATH]
Tetrahydrofolate Salvage from 5,10- methenyltetrahydrofolate
[MATH: 1.04E-02 :MATH]
Role of Osteoblasts, Osteoclasts and Chondrocytes in Rheumatoid
Arthritis
[MATH: 1.19E-02 :MATH]
[83]Open in a new tab
Ribonucleotide Reductase (RNR) is the enzyme providing the precursors
needed for both synthesis and repair of DNA, which could be a potential
drug for RA ( [[84]33, [85]34]). Leukocyte extravasation through the
endothelial barrier is important in the pathogenesis of RA ( [[86]35]).
It has been shown that the limb bud and heart development (LBH) gene is
a key dysregulated gene in RA and other autoimmune diseases and there
are some evidence showing LBH could modulate the cell cycle [[87]36].
Osteoblasts, osteoclasts and chondrocytes play importan roles in
Rheumatoid Arthritis ( [[88]37–[89]39]). We did not find literature
linking Tetrahydrofolate Salvage from 5,10- methenyltetrahydrofolate to
RA yet, indicating this enrichment might be novel.
Results of the simulation study
For Scenario 1 (
[MATH: n=30 :MATH]
), the jittered scatter plots of the performance indices versus methods
are shown in Fig. [90]3 (Set 1) and Fig. [91]5 (Set 2) and the jittered
scatter plots of the difference of the performance indices versus
methods are shown in Fig. [92]4 (Set 1) and Figure [93]6 (Set 2).
Fig. 3.
[94]Fig. 3
[95]Open in a new tab
Jittered scatter plots of performance indices versus method for Set 1,
Scenario 1 (number of pairs
[MATH: =30 :MATH]
). Red solid horizontal lines indicate the median performance indices
of eLNNpairedCov.SEM
Fig. 5.
[96]Fig. 5
[97]Open in a new tab
Jittered scatter plots of performance indices versus method for Set 2,
Scenario 1 (number of pairs
[MATH: =30 :MATH]
). Red solid horizontal lines indicate the median performance indices
of eLNNpairedCov.SEM
Fig. 4.
[98]Fig. 4
[99]Open in a new tab
Jittered scatter plots of difference of performance indices versus
method for Set 1, Scenario 1 (number of pairs
[MATH: =30 :MATH]
). Red solid horizontal lines indicate y-axis equal to zero
Fig. 6.
[100]Fig. 6
[101]Open in a new tab
Jittered scatter plots of difference of performance indices versus
method for Set 2, Scenario 1 (number of pairs
[MATH: =30 :MATH]
). Red solid horizontal lines indicate y-axis equal to zero
The differences of performance indices are between eLNNpairedCov.SEM
and the other three methods (limma, eLNNpaired and eLNNpairedCov). A
positive difference indicates that the performance indices of the other
method is larger than that of eLNNpairedCov.SEM. A negative difference
indicates that the performance indices of the other method is smaller
than that of eLNNpairedCov.SEM.
The upper panel of Figs. [102]3, [103]4, [104]5 and [105]6 show that
both the eLNNpairedCov and eLNNpairedCov.SEM have higher agreement
indices (Jaccard and accuracy) than limma, which in turn have higher
agreement indices than eLNNpaired.
The middle panel of Figures 3-6 show that the proposed eLNNpairedCov
and eLNNpairedCov.SEM methods have similar performance, They have lower
FPR than limma, while eLNNpaired has an exceedingly low FPR (close to
0). The middle panel also show that eLNNpairedCov, eLNNpairedCov.SEM
have smaller FNR than limma, while eLNNpaired has an exceedingly high
FNR (close to 1). The extreme values in FPR and FNR of eLNNpaired can
be attributed to the fact that it did not detect any differentially
expressed genes in this case.
Additionally, Figs. [106]3, [107]4, [108]5 and [109]6 also show that
compared with the performances of these methods in Set 1 (
[MATH:
η101
:MATH]
[MATH:
=η20=
:MATH]
0.18), those in Set 2 (
[MATH:
η101
:MATH]
[MATH:
=η20=
:MATH]
0.08) have lower agreement indices and higher error rates except for
eLNNpaired, which fails to detect any differentially expressed genes in
both Set 1 and Set 2.
The bottom panel of Figs. [110]3 and [111]5 show that limma runs very
fast, while eLNNpaired, eLNNpairedCov and eLNNpairedCov.SEM run in
reasonable time (i.e., less than 30 s per dataset that has
[MATH:
G=20,000
:MATH]
genes and
[MATH: n=30 :MATH]
subjects). On average eLNNpairedCov and eLNNpairedCov.SEM spend a
little more time than eLNNpaired. The bottom panel of Fig. [112]3 and
[113]5 also show that eLNNpaired uses less than 5 EM iterations, while
eLNNpairedCov and eLNNpairedCov.SEM tend to use more EM iterations. In
particular, eLNNpairedCov.SEM uses 10 EM iterations, which is the
maximum number of iterations we set to save computing time. Note that
the EM iteration number for limma is set to be one, which does not use
EM algorithm to obtain parameter estimates.
The simulation results for Scenario 2 (
[MATH: n=100 :MATH]
) are shown in Figures A5-A8 [see Additional file [114]1], which have
similar patterns to those for Scenario 1 (
[MATH: n=30 :MATH]
), except that both eLNNpairedCov and eLNNpairedCov.SEM have smaller
FPR which are close to 0. Note that eLNNpairedCov,eLNNpairedCov.SEM and
limma have small FNR (close to 0), while eLNNpaired still has huge FNR
(close to 1).
Discussion and conclusion
In this article, we proposed a novel model-based clustering approach to
detect differential expressed transcripts between samples before
treatment and samples after treatment, with the capacity to adjust for
potential confounding factors. This is novel in that to the best of our
knowledge, all existing model-based gene clustering methods do not yet
have the capacity to adjust for covariates.
The proposed approach is different from transcript-wise test followed
by multiplicity adjustment in that it does not involve hypothesis
testing. Hence, no multiplicity adjustment is needed. The simulation
study showed that if the difference of gene expression between samples
before treatment and samples after treatment follows the mixture of
hierarchical models in Subsection “A mixture of hierarchical models”,
then the proposed method can outperform limma, which is a fast and
powerful transcript-wise test method. The real data analysis also
showed the proposed method eLNNpairedCov can detect more differentially
expressed gene transcripts, which include the transcripts detected by
limma.
Although we classify genes to three distinct clusters, the transitions
between these clusters could be smooth. This would be reflected by a
gene’s posterior probability that might be large in two of three
clusters, e.g., 0.49 for cluster 1, 0.01 for cluster 2, and 0.5 for
cluster 3. On the other hand, expression changes could be split up into
more than 3 clusters, e.g., groups behaving differently. In this
article, we are only interested in identifying three clusters of genes:
over-expressed in condition 1, under-expressed in condition 1, and
non-differentially expressed.
There are other model-based clustering methods in literature, such as
[[115]40]. However, they were not designed to detect differentially
expressed genes. For example, we can set the number K of clusters as 3
for their model. However, there is no constraints that the intercepts
for the three clusters have to be positive, negative, and zero. That
is, the three clusters identified might not correspond to
over-expressed, under-expressed, and non-differentially expressed
genes.
It is well-known in literature that EM algorithm might stuck at local
optimal solution. In this article, we used EM with SA-modification to
help escape from local optimal solutions. In future, we plan to try the
hybrid algorithm of the DPSO (Discrete Particle Swarm Optimization) and
the EM approach to improve the global search performance [[116]41].
In our models, the three gene groups allow to have different
coefficients of covariates. In future, we could test if these
coefficients are same or not. If no significant difference, we could
use a model assuming equal coefficients.
RNAseq and single-cell RNAseq data are cutting-edge tools to
investigate molecular mechanisms of complex human diseases. However, it
is quite challenging to analyze these count data with inflated zero
counts. In future, we will evaluate if eLNNpairedCov can be used to
analyze single-cell RNAseq data by first transforming counts to
continuous scale (e.g., via VOOM [[117]12] or countTransformers
[[118]13]) and then to apply eLNNpairedCov to the transformed data.
We implemented the proposed methods to an R package eLNNpairedCov,
which will be freely available to researchers.
Supplementary Information
[119]12859_2023_5556_MOESM1_ESM.pdf^ (468.4KB, pdf)
Additional file 1. Supplementary Document.
[120]12859_2023_5556_MOESM2_ESM.xlsx^ (10.8KB, xlsx)
Additional file 2: Table S1.Gene list of 6 UE transcripts detected by
limma.
[121]12859_2023_5556_MOESM3_ESM.xlsx^ (12.5KB, xlsx)
Additional file 3: Table S2.Gene list of 55 OE transcripts detected by
eLNNpairedCov.
[122]12859_2023_5556_MOESM4_ESM.xlsx^ (12.7KB, xlsx)
Additional file 4: Table S3.Gene list of 59 OE transcripts detected by
eLNNpairedCov.SEM.
[123]12859_2023_5556_MOESM5_ESM.xlsx^ (23.9KB, xlsx)
Additional file 5: Table S4.Gene list of 355 UE transcripts detected by
eLNNpairedCov.
[124]12859_2023_5556_MOESM6_ESM.xlsx^ (23.8KB, xlsx)
Additional file 6: Table S5.Gene list of 352 UE transcripts detected by
eLNNpairedCov.SEM.
Acknowledgements