Abstract The causes of many complex human diseases are still largely unknown. Genetics plays an important role in uncovering the molecular mechanisms of complex human diseases. A key step to characterize the genetics of a complex human disease is to unbiasedly identify disease-associated gene transcripts on a whole-genome scale. Confounding factors could cause false positives. Paired design, such as measuring gene expression before and after treatment for the same subject, can reduce the effect of known confounding factors. However, not all known confounding factors can be controlled in a paired/match design. Model-based clustering, such as mixtures of hierarchical models, has been proposed to detect gene transcripts differentially expressed between paired samples. To the best of our knowledge, no model-based gene clustering methods have the capacity to adjust for the effects of covariates yet. In this article, we proposed a novel mixture of hierarchical models with covariate adjustment in identifying differentially expressed transcripts using high-throughput whole-genome data from paired design. Both simulation study and real data analysis show the good performance of the proposed method. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05556-x. Keywords: Curse of dimensionality, Confounding, EM algorithm, RNAseq Introduction Genome-wide differential gene expression analysis is widely used for the elucidation of the molecular mechanisms of complex human diseases. One popular and powerful approach to detect differentially expressed genes is the probe-wise linear regression analysis combined with the control of multiple testing, such as limma [[29]1]. That is, we first perform linear regression for each probe and then adjust p-values for controlling multiple testing. One advantage of this approach is its capacity to adjust for potential confounding factors. Another approach for detecting differentially expressed genes is the model-based clustering via mixture of Bayesian hierarchical models (MBHM) [[30]2–[31]7], which can borrow information across genes to cluster genes. Probe clustering based on MBHMs treats gene transcripts as “samples” and samples as “variables”. Therefore, transcript clustering based on MBHMs has large number of “samples” and relatively small number of “variables”, hence does not have the curse-of dimensionality problem. In addition, unlike transcript-specific tests that have several parameters per transcript, transcript clustering based on MBHMs has only a few hyperparameters per cluster to be estimated and could borrow information across transcripts to estimate model hyperparameters. These approaches generally assume that samples under two groups are obtained independently. [[32]8] proposed a constrained MBHM to identify genetic outcomes measured from paired/matched designs. Paired design is commonly used in study design for its homogeneous external environment for comparing measurements under different conditions. However, not all known confounding factors can be controlled in a paired/match design. Hence, we might still need to adjust the effects of confounding factors for data from a paired/matched design. Mixture of regressions or mixture of experts model [[33]9–[34]11] have been proposed in literature to do clustering with capacity to adjust for covariates. To best of our knowledge, this approach does not have constraints on positive, negative, and constant means and has not been applied to detect differentially expressed genes. In this article, we proposed a novel mixture of hierarchical models with covariate adjustment in identifying differentially expressed transcripts using high-throughput whole genome data from paired design. Method We assumed that gene transcripts can be roughly classified into 3 clusters based on their expression levels in subjects after treatment (denoted as condition 1) relative to those before treatment (denoted as condition 2): 1. Transcripts after treatment have higher expression levels than those before treatment, i.e., over-expressed (OE) in condition 1; 2. Transcripts after treatment have lower expression levels than those before treatment, i.e., under-expressed (UE) in condition 1; 3. Transcripts after treatment have same expression levels than those before treatment, i.e., non-differentially expressed (NE) between condition 1 and matched condition 2. We followed [[35]8] to directly model the marginal distributions of gene transcripts in the 3 clusters. In [[36]8], they proposed a mixture of three-component hierarchical distributions to characterize the within-pair difference of gene expression. We extended their model by incorporating potential confounding factors (such as Age and Sex) in the mixture of hierarchical models, which might affect the response of gene expression to drug treatment. Note that this extension is non-trivial, just like multiple linear regression is not just a simple extension to simple linear regression. We assumed that data have been processed so that the distributions of mRNA expression levels are close to normal distributions. For RNAseq data, we can apply VOOM transformation [[37]12] or countTransformers [[38]13] before applying eLNNpairedCov. A mixture of hierarchical models For the [MATH: gth :MATH] gene transcript, let [MATH: xgl :MATH] and [MATH: ygl :MATH] denote the expression levels of the [MATH: lth :MATH] subject under two different conditions, e.g., before and after treatment, [MATH: g=1,,G :MATH] , [MATH: l=1,,n :MATH] , where G is the number of transcripts and n is the number of subjects (i.e., the number of pairs). Let [MATH: dgl=log< mn>2(ygl)-log2(xgl) :MATH] be the log2 difference for the [MATH: gth :MATH] gene transcript of [MATH: lth :MATH] subject. Denote [MATH: dg=dg1,< /mo>,dgnT :MATH] . We assumed that [MATH: dg :MATH] is conditionally normally distributed given mean vector and covariance matrix. Let [MATH: WT :MATH] be the [MATH: n×(p+1) :MATH] design matrix, where p is the number of covariates. The first column of [MATH: WT :MATH] is the vector of ones, indicating intercept. Let [MATH: η :MATH] be the [MATH: (p+1)×1 :MATH] vector of coefficients for the intercept and covariate effects. We assume following mixture of three-component hierarchical models: For gene transcripts over-expressed (OE) in post-treatment samples, we expect that the mean log2 differences are positive. Hence, we assume [MATH: dg|μg,τgNμg,τg-1Inμg|τgNexp[WTη1],k1τg-1InτgΓα1,β1 :MATH] where [MATH: k1>0,α1>0 :MATH] and [MATH: β1>0 :MATH] . [MATH: Γα1,β1 :MATH] denotes the Gamma distribution with shape parameter [MATH: α1 :MATH] and rate parameter [MATH: β1 :MATH] . That is, we assume that (1) the mean vectors [MATH: μg :MATH] , [MATH: g=1 :MATH] , [MATH: :MATH] , G, given the variance [MATH: τg-1 :MATH] follow a multivariate normal distribution with mean vector [MATH: expWTη1 :MATH] and covariance matrix [MATH: k1τg-1In :MATH] ; and (2) the variances [MATH: τg-1 :MATH] , [MATH: g=1 :MATH] , [MATH: :MATH] , G, follow a Gamma distribution with shape parameter [MATH: α1 :MATH] and rate parameter [MATH: β1 :MATH] . Note that the exponential of the intercept [MATH: exp(η10) :MATH] indicates the mean of log2 difference is positive. For gene transcripts under-expressed (UE) in post-treatment samples, we expect that the mean log2 differences are negative. Hence, we assume [MATH: dg|μg,τgNμg,τg-1Inμg|τgN-exp[WTη2],k2τg-1InτgΓα2,β2 :MATH] where [MATH: k2>0,α2>0 :MATH] , [MATH: β2>0 :MATH] , and [MATH: WT :MATH] is the design matrix. Note that the negative exponential of the intercept [MATH: -exp(η20) :MATH] indicates the mean of log2 difference is negative. For gene transcripts non-differentially expressed (NE) between pre- and post-treatment samples, we expect the mean log2 differences are zero. Hence, we assume [MATH: dg|τgNUTθg,τg-1Inθg|τgN(η3,k3τg-1Ip)τgΓα3,β3 :MATH] where [MATH: k3>0,α3>0 :MATH] and [MATH: β3>0 :MATH] . [MATH: UT :MATH] is the design matrix without intercept column. That is, the intercepts are zero. Note that the intercepts indicate mean log2 differences. Hence, [MATH: η3 :MATH] is a [MATH: p×1 :MATH] vector of coefficients for the covariates. Note that [MATH: θg :MATH] measure effects of confounding factors for NE genes. The true effect of NE genes are zero (i.e., the intercept of [MATH: UTθg :MATH] is zero in the above model). The hyperparameters [MATH: αc :MATH] and [MATH: βc :MATH] are shape and rate parameters for the Gamma distribution, respectively, [MATH: c=1,2,3 :MATH] . As for [MATH: k1,k2< /mn> :MATH] and [MATH: k3 :MATH] , the variation of the mean vector [MATH: μg :MATH] should be smaller than that of the observations [MATH: dg :MATH] . So we expect [MATH: 0<kc<1,c=1,2,3 :MATH] . Note that the marginal distribution for each component of the mixture is a multivariate t distribution [[39]14, Section 3.7.6]. However, to model differentially expressed genes, the multivariate t distributions derived from our models have special structure of mean vector and covariance matrix. For continuous covariates, we require that they are standardized so that they have mean zero and variance one. Standardizing continuous covariates would make [MATH: expWTη1 :MATH] and [MATH: expWTη2 :MATH] be numerically finite. Ideally, we should require [MATH: μg>0 :MATH] [MATH: (μg<0 ) :MATH] for all transcripts in cluster 1 (cluster 2). To do so, we can assume a log normal prior distribution for [MATH: μg :MATH] in cluster 1, for instance. However, a log normal distribution could not be a conjugate prior for the mean of a normal distribution. It would increase the computational burden if non-conjugate priors were used. Other alternative models can also be used, such as assuming [MATH: μg|η10=exp(η10)+WTη1 :MATH] and [MATH: η10 :MATH] follows a normal distribution. However, these models do not have closed-form marginal densities. Hence, they would substantially increase computational burden. Besides, the empirical distribution of the mean log2 difference [MATH: dg :MATH] of the differentially expressed gene probes has shown a right-skewed pattern, while that of non-differentially expressed genes demonstrates an approximate bell shape (see in Additional file [40]1: Figures A2-A4). Hence, we require the mean [MATH: E(μg)>0 :MATH] ( [MATH: E(μg)<0 :MATH] ) for cluster 1 (cluster 2) by assuming [MATH: E(μg) :MATH] for cluster 1 (cluster 2) to be [MATH: exp[WTη1] :MATH] ( [MATH: -exp[WTη2] :MATH] ). The proposed mixture models have meaningful biological interpretations for mean structures. In particular, for the OE cluster, the intercept [MATH: exp(η10) :MATH] can be interpreted as the expected average log2 difference of gene transcripts when the value of all the p covariates are zero; the coefficient [MATH: η1i :MATH] of covariate i can be interpreted as there exists [MATH: exp(η1i) :MATH] fold-change associated with the one unit increase in covariate i while the values of the remaining [MATH: (p-1) :MATH] covariates are fixed; for the UE cluster, the intercept [MATH: -exp(η20) :MATH] can be interpreted as the expected average log2 difference of gene transcripts when the value of all the p covariates are zero; the coefficient [MATH: η2i :MATH] of covariate i can be interpreted as there exists [MATH: exp(η2i) :MATH] fold-change associated with the one unit increase in covariate i while the values of the remaining [MATH: (p-1) :MATH] covariates are fixed; while for the NE cluster, the coefficient [MATH: η3i :MATH] of covariate i can be interpreted as [MATH: η3i :MATH] unit increase of expected log2 difference of gene transcripts associated with the one unit increase in covariate i while the values of the remaining [MATH: (p-1) :MATH] covariates are fixed. They also are convenient to get closed-form marginal densities so that we can use Expectation-Maximization (EM) algorithm to estimate hyperparameters, instead of using computational-intensive algorithms, such as Markov chain Monte Carlo (MCMC). Marginal density functions Let [MATH: f1(dg|ψ) :MATH] , [MATH: f2(dg|ψ) :MATH] , [MATH: f3(dg|ψ) :MATH] be the marginal densities of the 3 hierarchical models, and [MATH: π :MATH] [MATH: = :MATH] [MATH: (π1,π2,π3 ) :MATH] be the vector of cluster mixture proportions, where [MATH: ψ=α1,β1,k1,η1T,α2,β 2,k2,η2T,α3,β 3,k3,η3TT :MATH] . Then the marginal density of [MATH: dg :MATH] is: [MATH: f(dg|ψ)=π1f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ). :MATH] Determining transcript cluster membership The transcript-cluster membership is determined based on the posterior probabilities, [MATH: ζgc :MATH] [MATH: = :MATH] [MATH: Pr(gth :MATH] gene transcript in cluster c [MATH: |dg) :MATH] . We can get [MATH: ζgc=πcfc(dg|ψ)π1f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ),c=1,2,3. :MATH] 1 We determine a transcript’s cluster membership as follows: If the maximum value among [MATH: ζgi,i=1,2,3 :MATH] is [MATH: ζgc :MATH] , then the transcript g belongs to cluster c. The true values of [MATH: π1 :MATH] , [MATH: π2 :MATH] , [MATH: π3 :MATH] , and [MATH: ψ :MATH] are unknown. We use estimated values to determine transcripts’ cluster membership. Parameter estimation via EM algorithm We used expectation-maximization (EM) algorithm [[41]15] to estimate the model parameters [MATH: π=π1,π2,π3 T :MATH] and [MATH: ψ :MATH] . Let [MATH: zg=(zg1,zg2,zg3) :MATH] to be the indicator vector indicating if gene transcript g belongs to a cluster or not. To stablize the estimate of [MATH: π :MATH] when [MATH: πc :MATH] is very small, we assume that the cluster mixture proportions [MATH: π :MATH] follows a symmetric Dirichlet [MATH: D(b) :MATH] distribution, i.e., [MATH: f(π)=Γ(c=13bc)c =13Γ(bc)< mi>c=13π< /mi>cbc-1 :MATH] . Therefore, the likelihood function for the complete data [MATH: (d,z,π) :MATH] is [MATH: L(ψ|d,z,π)=f(d,z,π|ψ)=f(d,z|π,ψ)f(π|ψ)=f(d,z|π,ψ)f(π)=g=1Gf(dg,zg|ψ,π)Dir< mrow>(b)=g=1G(π1f1(dg|ψ))zg1(π2f2(dg|ψ))zg2(π3f3(dg|ψ))zg3×Γ(c=13bc)c =13Γ(bc)c=13 πcbc-1. :MATH] Then the log complete-data likelihood function is: [MATH: l(ψ|d,z,π)=g=1G((zg1logf1(dg|ψ)+zg2logf2< mrow>(dg|ψ))+zg3logf3< mrow>(dg|ψ))+g=1G(zg1logπ1+zg2logπ 2+zg3logπ3)+logΓ(c=13bc)c =13Γ(bc)+c=13(bc-1 )logπc. :MATH] The EM algorithm is used to estimate parameters [MATH: π :MATH] and [MATH: ψ :MATH] . Since [MATH: z :MATH] is unknown random vector, we integrate it out from the log complete-data likelhood function. Here, [MATH: zg=(zg1,zg2,zg3) :MATH] . [MATH: ζg1=E(zg1|dg,π,ψ)=Pr(zg1=1|dg,π,ψ)=π1f1(dg|ψ)π1f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ)ζg 2=π2f2(dg|ψ)π1f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ)ζg 3=π3f3(dg|ψ)π1f1(dg|ψ)+π2f2(dg|ψ)+π3f3(dg|ψ) :MATH] 2 E-step. Denote [MATH: Q(t)π,ψ|d,z(t),π(t) :MATH] as the expected log complete-data likelihood function at t-th iteration of the EM algorithm, we have [MATH: Q(t)=Ezlψd,z,π|d,z(t),π(t) =g=1G((ζg1(t)logf1(dg|ψ)+ζg< /mi>2(t)logf2(dg|ψ))+ζg< /mi>3(t)logf3(dg|ψ))+g=1G(ζg1(t)logπ1+ζg2(t)logπ2+ζg3(t)logπ3)+logΓ(c=13bc)c =13Γ(bc)+c=13(bc-1 )logπc, :MATH] where [MATH: ζgc(t)=Ezgc|dg,π(t),ψ(t) =πc(t)fc(dg|ψ(t))π1(t)f1(dg|ψ(t))+π2(t)f2(dg|ψ(t))+π3(t)f3(dg|ψ(t)),c=1,2 ,3. :MATH] 3 M-step. Maximize [MATH: Q(t)π,ψ|d,z(t),π(t) :MATH] to find the optimal values of [MATH: π :MATH] and [MATH: ψ :MATH] , and use these optimal values as estimates for the parameters [MATH: π :MATH] and [MATH: ψ :MATH] . To maximize [MATH: Q(t)π,ψ|d,z(t),π(t) :MATH] , we use the “L-BFGS-B” method developed by Byrd et al. (1995) [[42]16], which utilizes the first partial derivatives of [MATH: Q(t)π,ψ|d,z(t),π(t) :MATH] and allows box constraints, that is each variable can be given a lower and/or upper bound. Simulated annealing modification EM algorithm may be trapped in a local maximum since it is strictly ascending. As introduced by Celeux and Govaert (1992) [[43]17], simulated annealing (SA) is widely used to help EM algorithm escape from local maximum by adding randomness with a stochastic step. Specifically, the conditional expectation in ([44]2) is modified in a SA algorithm as follows [MATH: ζ~gc(t)=πc(t)fc(dg|ψ(t))1/m(t)c=13πc(t)fc(dg|ψ(t))1/m(t),c=1,2 ,3. :MATH] 4 where m is the temperature used to control the randomness. Usually, the temperature m starts with a relatively high value since larger m leads to larger randomness. At iteration t, the temperature is updated by [MATH: m(t+1)=r×m(t) :MATH] with the cooling rate r controls the speed of reduction. As suggested in [[45]18, [46]19], we use [MATH: m(0)=2 :MATH] and [MATH: r=0.9 :MATH] . We denoted eLNNpairedCov as the proposed method using the traditional EM algorithm to obtain parameter estimates and denoted eLNNpairedCov.SEM as the proposed method using the EM with SA-modification to obtain parameter estimates. We stop the expectation-maximization iterations based on a proportional change, i.e. if the maximum of the absolute value of the differences of model parameter estimates between current iteration and previous iteration over the absolute value of the previous iteration estimates is smaller than a small constant (e.g. [MATH: 1.0×10-3 :MATH] ). More details about the EM algorithm are shown in Supplementary Document [see Additional file [47]1]. A real data study We used the dataset [48]GSE24742 [[49]20], which can be downloaded from the Gene Expression Omnibus [[50]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24742], to evaluate the performance of the proposed model-based clustering methods (denoted as eLNNpairedCov and eLNNpairedCov.SEM ). The dataset is from a study that investigated the gene expression before and after administrating rituximab, a drug for treating anti-TNF resistant rheumatoid arthritis (RA). There are 12 subjects, each having 2 samples (one sample is before treatment and the other is after treatment). Age and sex are also available. Expression levels of 54,675 gene probes were measured for each of the 24 samples by using Affymetrix HUman Genome U133 Plus 2.0 array. The dataset has been preprocessed by the dataset contributor. We further kept only 43,505 gene probes in the autosomal chromosomes (i.e., chromosomes 1 to 22). We then performed log2 transformation for gene expression levels. We next obtained the within-subject difference of the log2 transformed expression levels (log2 expression after-treatment minus log2 expression before-treatment). By examining the histogram (Figure A1) [see Additional file [51]1] of the estimated standard deviations of log2 differences of within-subject gene expression for the 43,505 gene probes, we found a bimodal distribution. Based on Figure A1 [see Additional file [52]1], where the histogram of estimated standard deviations exhibits two modes, we choose to exclude gene probes with standard deviation [MATH: <1 :MATH] corresponding to the first mode. It is a common practice to remove genes with low variation [[53]21–[54]23]. Finally, 23,948 gene probes kept in the down-stream analysis. A simulation study We performed a simulation study to compare the performance of the proposed methods eLNNpairedCov, eLNNpairedCov.SEM with transcript-wise test limma and Li et al.’s [[55]8] method (denoted as eLNNpaired). eLNNpairedCov, eLNNpairedCov.SEM and limma adjust covariate effects, while eLNNpaired does not. For eLNNpaired, we first regress out covariates effect for each gene to make a fair comparison between eLNNpaired and other methods. The limma approach first performs an empirical-Bayes-based linear regression for each transcript. In this linear regression, the within-subject log2 difference of transcript expression is the outcome and intercept indicating if the transcript is over-expressed (intercept>0), under-expressed (intercept<0), or non-differentially expressed (intercept = 0), adjusting for potential confounding factors. A transcript is claimed as OE if its intercept estimate is positive and corresponding FDR-adjusted p-value [MATH: <0.05 :MATH] , where FDR stands for false discovery rate. A transcript is claimed as UE if its intercept estimate is negative and corresponding FDR-adjusted p-value [MATH: <0.05 :MATH] . Other transcripts are claimed as NE. The parameter values ( [MATH: π :MATH] , [MATH: ψ :MATH] , and proportion of women) in the simulation study are based on the estimates via eLNNpairedCov.SEM from the analysis of the pre-processed real dataset [56]GSE24742 described in Subsection “A real data study”. In this simulation study, we considered two sets with different covariate coefficients for differentially expressed genes clusters. In the first set (Set 1), parameter values are the estimates of parameters based on the eLNNpairedCov.SEM method from real dataset. That is, [MATH: π1=0.00246 :MATH] , [MATH: π2=0.01470 :MATH] , [MATH: π3=0.98284 :MATH] , [MATH: α1=3.53 :MATH] , [MATH: β1=3.45 :MATH] , [MATH: k1=0.26 :MATH] , [MATH: η10=0.18 :MATH] , [MATH: η11=0.00 :MATH] , [MATH: η12=-1.05 :MATH] , [MATH: α2=3.53 :MATH] , [MATH: β2=3.45 :MATH] , [MATH: k2=0.26 :MATH] , [MATH: η20=0.18 :MATH] , [MATH: η21=0.00 :MATH] , [MATH: η22=-1.05 :MATH] , [MATH: α3=2.86 :MATH] , [MATH: β3=2.20 :MATH] , [MATH: k3=0.72 :MATH] , [MATH: η31=-0.01 :MATH] , [MATH: η32=0.00 :MATH] . In the second set (Set 2), we set [MATH: η10 :MATH] = [MATH: η20 :MATH] =0.08 instead of 0.18. For each set, we considered two scenarios. In the first scenario (Scenario1), the number of subjects is equal to 30. In the second scenario (Scenario2), the number of subjects is equal to 100. For each scenario, we generated 100 datasets. Each simulated dataset contains [MATH: G=20,000 :MATH] gene transcripts. There are two covariates: standardized age (denoted as Age.s) and Sex. Age.s follows normal distribution with mean 0 and standard deviation 1. Seventy five percent (75%) of subjects are women. Evaluation criteria Two agreement indices and two error rates are used to compare the predicted cluster membership and true cluster membership of all genes. The two agreement indices are accuracy (i.e., proportion of predicted cluster membership equal to the true cluster membership) and Jaccard index [[57]24]. For perfect agreement, these indices have a value of one. If an index takes a value close to zero, then the agreement between the true transcript cluster membership and the estimated transcript cluster membership is likely due to chance. The two error rates are false positive rate (FPR) and false negative rate (FNR). FPR is the percentage of detected DE transcripts among truly NE transcripts. FNR is the percentage of detected NE transcripts among truly DE transcripts. We also examined the user time and number of EM iterations for running each simulated dataset. Results Results of the real data analysis For the real dataset, we adjusted standardized age and sex for eLNNpairedCov, eLNNpairedCov.SEM, and limma. We standardized age so that it has mean zero and variance one. For each transcript, we also scaled its expression across subjects so that its variance is equal one. For eLNNpaired, we first regressed out the effect of standardized age and sex for each transcript. The estimates of parameters in our model are listed in Table [58]1. Note that the proposed eLNNpairedCov and eLNNpairedCov.SEM have the same estimates for the parameters in these three clusters, except for the proportions of three clusters. The proportions of OE and UE estimated by eLNNpairedCov method are 0.0376% and 0.346%, respectively. The proportions of OE and UE estimated by eLNNpairedCov.SEM method are 0.246% and 1.47%, respectively. Table 1. Parameter estimates of OE, UE and NE clusters from eLNNpairedCov and eLNNpairedCov.SEM OE UE NE [MATH: β1 :MATH] 3.445543 [MATH: β2 :MATH] 3.445543 [MATH: β3 :MATH] 3.445543 [MATH: k1 :MATH] 0.264565 [MATH: k2 :MATH] 0.264565 [MATH: k3 :MATH] 0.264565 [MATH: η10 :MATH] 0.176007 [MATH: η20 :MATH] 0.176007 [MATH: η11 :MATH] −0.000609 [MATH: η21 :MATH] −0.000609 [MATH: η31 :MATH] −0.013796 [MATH: η12 :MATH] −1.051257 [MATH: η22 :MATH] −1.051257 [MATH: η32 :MATH] −0.000017 [59]Open in a new tab For the OE cluster, [MATH: exp(η10)=exp(0.176007)=1.192 :MATH] can be interpreted as the expected log2 difference for a male subject ( [MATH: sex=0 :MATH] ) whose age is equal to mean age ( [MATH: age=0 :MATH] is the mean-centered age); [MATH: η11=-0.00060 9 :MATH] indicates that one-unit increase in age leads to [MATH: exp(-0.000609)=0.999 :MATH] fold-changes in expected log2 difference, while [MATH: η12=-1.05125 7 :MATH] indicates that there is [MATH: exp(-1.051257)=0.349 :MATH] fold-changes between male subjects and female subjects in expected log2 difference if they are at the same age. For the UE cluster, [MATH: η20=0.176007 :MATH] can be interpreted as the expected log2 difference for a male subject ( [MATH: sex=0 :MATH] ) whose age is equal to mean age ( [MATH: age=0 :MATH] is the mean-centered age) is [MATH: -exp(0.176007)=-1.192 :MATH] ; [MATH: η21=-0.00060 9 :MATH] indicates that one-unit increase in age leads to [MATH: exp(-0.000609)=0.999 :MATH] fold-changes in expected log2 difference, while [MATH: η22=-1.05125 7 :MATH] indicates that there is [MATH: exp(-1.051257)=0.349 :MATH] fold-changes between male subjects and female subjects in expected log2 difference if they are at the same age. For the NE cluster, [MATH: η31=-0.01379 6 :MATH] indicates that one-unit increase in age leads to 0.01379 decreases in expected log2 difference, and [MATH: η32=-0.00001 7 :MATH] indicates that there is 0.000017 decrease from female subjects to male subjects in the expected log2 difference if they are at the same age. The number of differentially expressed genes detected by each method is listed in Table [60]2. Table 2. Number of Differentially expressed genes detected by limma, eLNNpaired, eLNNpairedCov and eLNNpairedCov.SEM in [61]GSE24742 limma eLNNpaired eLNNpairedCov eLNNpairedCov.SEM OE 0 0 55 59 UE 6 0 355 352 [62]Open in a new tab The limma method detected 6 under-expressed gene transcripts (Figure [63]1 and Table S1), while eLNNpaired did not find any positive signals (i.e., [MATH: π^3=1 :MATH] ). The proposed methods eLNNpairedCov and eLNNpairedCov.SEM detected 55 OE transcripts (Table S2) and 59 OE transcripts (Table S3), respectively (Upper two panels of Fig. [64]2) and 355 UE transcripts (Table S4) and 352 UE transcripts (Table S5), respectively (Lower two panels of Figure [65]2). The 6 UE transcripts detected by limma is also selected as UE transcripts by eLNNpairedCov and eLNNpairedCov.SEM. Note that the 55 OE genes detected by eLNNpairedCov are also detected by eLNNpariedCov.SEM. The 352 UE genes detected by eLNNpairedCov.SEM are also detected by eLNNpariedCov. Fig. 1. Fig. 1 [66]Open in a new tab Parallel boxplots of log2 within-subject difference of gene expression for 6 UE transcripts detected by limma for pre-processed [67]GSE24742 dataset. Red horizontal line indicates log2 difference equal to zero Fig. 2. [68]Fig. 2 [69]Open in a new tab Parallel boxplots of log2 within-subject difference of gene expression for differentially expressed transcripts detected by eLNNpairedCov and eLNNpairedCov.SEM for pre-processed [70]GSE24742 dataset. Upper two panels: 55 OE transcripts and 59 OE transcripts, respectively; Lower two panels: 355 UE transcripts and 352 UE transcripts, respectively. Red horizontal lines indicate log2 difference equal to zero It is assuring that several genes corresponding to the DE transcripts identified by eLNNpairedCov and eLNNpairedCov.SEM have been associated to rheumatoid arthritis (RA) in literature. For example, Humby et al. (2019) [[71]25] reported that genes ZNF365 (OE), IL36RN (OE), MRVI1-AS1 (OE), WFDC6 (UE), UBE2H (UE), are associated with RA. We performed pathway enrichment analysis through the use of IPA (QIAGEN Inc., [72]https://www.qiagenbioinformatics.com/products/ingenuitypathway-anal ysis) for 352 UE and 55 OE genes identified by eLNNpairedCov.SEM. The top enriched canonical pathways are shown in Tables [73]3 and [74]4. Evidence in literature shows that these pathways are relevant to RA. S100 protein family plays an important role in rheumatoid arthritis ( [[75]26]). Literature shows consistent crucial role of the PD-1/PD-L pathway in the pathogenesis of rheumatic diseases ( [[76]27, [77]28]). It has been shown that RA can lead to lung tissue damage, resulting in pulmonary fibrosis ( [[78]29]). Macrophage is a key player in the pathogenesis of autoimmune diseases, such as RA ( [[79]30]). RA and osteoarthritis (OA) are two common arthritis with different pathogenesis ( [[80]31]). It is interesting to see Osteoarthritis pathway is a significantly enriched pathway for UE genes. It is consistent with literature that similar focal and systemic alterations exist in RA and OA [[81]32]. Table 3. Top canonical pathways for 352 UE genes by eLNNpairedCov.SEM Name p-value S100 Family Signaling Pathway [MATH: 2.97E-06 :MATH] PD-1, PD-L1 cancer immunotherapy pathway [MATH: 7.54E-05 :MATH] Pulmonary Fibrosis Idiopathic Signaling pathway [MATH: 3.45E-04 :MATH] Phagosome Formation [MATH: 7.56E-04 :MATH] Osteoarthritis Pathway [MATH: 1.04E-03 :MATH] [82]Open in a new tab Table 4. Top canonical pathways for 55 OE genes by eLNNpairedCov.SEM Name p-value Ribonucleotide Reductase Signaling Pathway [MATH: 5.34E-03 :MATH] Leukocyte Extravasation Signaling [MATH: 7.57E-03 :MATH] Cell Cycle: G1/S Checkpoint Regulation [MATH: 8.85E-03 :MATH] Tetrahydrofolate Salvage from 5,10- methenyltetrahydrofolate [MATH: 1.04E-02 :MATH] Role of Osteoblasts, Osteoclasts and Chondrocytes in Rheumatoid Arthritis [MATH: 1.19E-02 :MATH] [83]Open in a new tab Ribonucleotide Reductase (RNR) is the enzyme providing the precursors needed for both synthesis and repair of DNA, which could be a potential drug for RA ( [[84]33, [85]34]). Leukocyte extravasation through the endothelial barrier is important in the pathogenesis of RA ( [[86]35]). It has been shown that the limb bud and heart development (LBH) gene is a key dysregulated gene in RA and other autoimmune diseases and there are some evidence showing LBH could modulate the cell cycle [[87]36]. Osteoblasts, osteoclasts and chondrocytes play importan roles in Rheumatoid Arthritis ( [[88]37–[89]39]). We did not find literature linking Tetrahydrofolate Salvage from 5,10- methenyltetrahydrofolate to RA yet, indicating this enrichment might be novel. Results of the simulation study For Scenario 1 ( [MATH: n=30 :MATH] ), the jittered scatter plots of the performance indices versus methods are shown in Fig. [90]3 (Set 1) and Fig. [91]5 (Set 2) and the jittered scatter plots of the difference of the performance indices versus methods are shown in Fig. [92]4 (Set 1) and Figure [93]6 (Set 2). Fig. 3. [94]Fig. 3 [95]Open in a new tab Jittered scatter plots of performance indices versus method for Set 1, Scenario 1 (number of pairs [MATH: =30 :MATH] ). Red solid horizontal lines indicate the median performance indices of eLNNpairedCov.SEM Fig. 5. [96]Fig. 5 [97]Open in a new tab Jittered scatter plots of performance indices versus method for Set 2, Scenario 1 (number of pairs [MATH: =30 :MATH] ). Red solid horizontal lines indicate the median performance indices of eLNNpairedCov.SEM Fig. 4. [98]Fig. 4 [99]Open in a new tab Jittered scatter plots of difference of performance indices versus method for Set 1, Scenario 1 (number of pairs [MATH: =30 :MATH] ). Red solid horizontal lines indicate y-axis equal to zero Fig. 6. [100]Fig. 6 [101]Open in a new tab Jittered scatter plots of difference of performance indices versus method for Set 2, Scenario 1 (number of pairs [MATH: =30 :MATH] ). Red solid horizontal lines indicate y-axis equal to zero The differences of performance indices are between eLNNpairedCov.SEM and the other three methods (limma, eLNNpaired and eLNNpairedCov). A positive difference indicates that the performance indices of the other method is larger than that of eLNNpairedCov.SEM. A negative difference indicates that the performance indices of the other method is smaller than that of eLNNpairedCov.SEM. The upper panel of Figs. [102]3, [103]4, [104]5 and [105]6 show that both the eLNNpairedCov and eLNNpairedCov.SEM have higher agreement indices (Jaccard and accuracy) than limma, which in turn have higher agreement indices than eLNNpaired. The middle panel of Figures 3-6 show that the proposed eLNNpairedCov and eLNNpairedCov.SEM methods have similar performance, They have lower FPR than limma, while eLNNpaired has an exceedingly low FPR (close to 0). The middle panel also show that eLNNpairedCov, eLNNpairedCov.SEM have smaller FNR than limma, while eLNNpaired has an exceedingly high FNR (close to 1). The extreme values in FPR and FNR of eLNNpaired can be attributed to the fact that it did not detect any differentially expressed genes in this case. Additionally, Figs. [106]3, [107]4, [108]5 and [109]6 also show that compared with the performances of these methods in Set 1 ( [MATH: η101 :MATH] [MATH: =η20= :MATH] 0.18), those in Set 2 ( [MATH: η101 :MATH] [MATH: =η20= :MATH] 0.08) have lower agreement indices and higher error rates except for eLNNpaired, which fails to detect any differentially expressed genes in both Set 1 and Set 2. The bottom panel of Figs. [110]3 and [111]5 show that limma runs very fast, while eLNNpaired, eLNNpairedCov and eLNNpairedCov.SEM run in reasonable time (i.e., less than 30 s per dataset that has [MATH: G=20,000 :MATH] genes and [MATH: n=30 :MATH] subjects). On average eLNNpairedCov and eLNNpairedCov.SEM spend a little more time than eLNNpaired. The bottom panel of Fig. [112]3 and [113]5 also show that eLNNpaired uses less than 5 EM iterations, while eLNNpairedCov and eLNNpairedCov.SEM tend to use more EM iterations. In particular, eLNNpairedCov.SEM uses 10 EM iterations, which is the maximum number of iterations we set to save computing time. Note that the EM iteration number for limma is set to be one, which does not use EM algorithm to obtain parameter estimates. The simulation results for Scenario 2 ( [MATH: n=100 :MATH] ) are shown in Figures A5-A8 [see Additional file [114]1], which have similar patterns to those for Scenario 1 ( [MATH: n=30 :MATH] ), except that both eLNNpairedCov and eLNNpairedCov.SEM have smaller FPR which are close to 0. Note that eLNNpairedCov,eLNNpairedCov.SEM and limma have small FNR (close to 0), while eLNNpaired still has huge FNR (close to 1). Discussion and conclusion In this article, we proposed a novel model-based clustering approach to detect differential expressed transcripts between samples before treatment and samples after treatment, with the capacity to adjust for potential confounding factors. This is novel in that to the best of our knowledge, all existing model-based gene clustering methods do not yet have the capacity to adjust for covariates. The proposed approach is different from transcript-wise test followed by multiplicity adjustment in that it does not involve hypothesis testing. Hence, no multiplicity adjustment is needed. The simulation study showed that if the difference of gene expression between samples before treatment and samples after treatment follows the mixture of hierarchical models in Subsection “A mixture of hierarchical models”, then the proposed method can outperform limma, which is a fast and powerful transcript-wise test method. The real data analysis also showed the proposed method eLNNpairedCov can detect more differentially expressed gene transcripts, which include the transcripts detected by limma. Although we classify genes to three distinct clusters, the transitions between these clusters could be smooth. This would be reflected by a gene’s posterior probability that might be large in two of three clusters, e.g., 0.49 for cluster 1, 0.01 for cluster 2, and 0.5 for cluster 3. On the other hand, expression changes could be split up into more than 3 clusters, e.g., groups behaving differently. In this article, we are only interested in identifying three clusters of genes: over-expressed in condition 1, under-expressed in condition 1, and non-differentially expressed. There are other model-based clustering methods in literature, such as [[115]40]. However, they were not designed to detect differentially expressed genes. For example, we can set the number K of clusters as 3 for their model. However, there is no constraints that the intercepts for the three clusters have to be positive, negative, and zero. That is, the three clusters identified might not correspond to over-expressed, under-expressed, and non-differentially expressed genes. It is well-known in literature that EM algorithm might stuck at local optimal solution. In this article, we used EM with SA-modification to help escape from local optimal solutions. In future, we plan to try the hybrid algorithm of the DPSO (Discrete Particle Swarm Optimization) and the EM approach to improve the global search performance [[116]41]. In our models, the three gene groups allow to have different coefficients of covariates. In future, we could test if these coefficients are same or not. If no significant difference, we could use a model assuming equal coefficients. RNAseq and single-cell RNAseq data are cutting-edge tools to investigate molecular mechanisms of complex human diseases. However, it is quite challenging to analyze these count data with inflated zero counts. In future, we will evaluate if eLNNpairedCov can be used to analyze single-cell RNAseq data by first transforming counts to continuous scale (e.g., via VOOM [[117]12] or countTransformers [[118]13]) and then to apply eLNNpairedCov to the transformed data. We implemented the proposed methods to an R package eLNNpairedCov, which will be freely available to researchers. Supplementary Information [119]12859_2023_5556_MOESM1_ESM.pdf^ (468.4KB, pdf) Additional file 1. Supplementary Document. [120]12859_2023_5556_MOESM2_ESM.xlsx^ (10.8KB, xlsx) Additional file 2: Table S1.Gene list of 6 UE transcripts detected by limma. [121]12859_2023_5556_MOESM3_ESM.xlsx^ (12.5KB, xlsx) Additional file 3: Table S2.Gene list of 55 OE transcripts detected by eLNNpairedCov. [122]12859_2023_5556_MOESM4_ESM.xlsx^ (12.7KB, xlsx) Additional file 4: Table S3.Gene list of 59 OE transcripts detected by eLNNpairedCov.SEM. [123]12859_2023_5556_MOESM5_ESM.xlsx^ (23.9KB, xlsx) Additional file 5: Table S4.Gene list of 355 UE transcripts detected by eLNNpairedCov. [124]12859_2023_5556_MOESM6_ESM.xlsx^ (23.8KB, xlsx) Additional file 6: Table S5.Gene list of 352 UE transcripts detected by eLNNpairedCov.SEM. Acknowledgements