Abstract

   RNA sequencing (RNA-seq) has become a widely used technology for
   analyzing global gene-expression changes during certain biological
   processes. It is generally acknowledged that RNA-seq data displays
   equidispersion and overdispersion characteristics; therefore, most
   RNA-seq analysis methods were developed based on a negative binomial
   model capable of capturing both equidispersed and overdispersed data.
   In this study, we reported that in addition to equidispersion and
   overdispersion, RNA-seq data also displays underdispersion
   characteristics that cannot be adequately captured by general RNA-seq
   analysis methods. Based on a double Poisson model capable of capturing
   all data characteristics, we developed a new RNA-seq analysis method
   (DREAMSeq). Comparison of DREAMSeq with five other frequently used
   RNA-seq analysis methods using simulated datasets showed that its
   performance was comparable to or exceeded that of other methods in
   terms of type I error rate, statistical power, receiver operating
   characteristics (ROC) curve, area under the ROC curve, precision-recall
   curve, and the ability to detect the number of differentially expressed
   genes, especially in situations involving underdispersion. These
   results were validated by quantitative real-time polymerase chain
   reaction using a real Foxtail dataset. Our findings demonstrated
   DREAMSeq as a reliable, robust, and powerful new method for RNA-seq
   data mining. The DREAMSeq R package is available at
   [27]http://tanglab.hebtu.edu.cn/tanglab/Home/DREAMSeq.

   Keywords: RNA-seq, DREAMSeq, equidispersion, overdispersion,
   underdispersion, double Poisson model, negative binomial model

Introduction

   With the development of next-generation sequencing technology, RNA
   sequencing (RNA-seq) has become a routine and powerful method for
   evaluating global dynamic changes in gene expression during certain
   biological processes. Compared with microarray technologies, RNA-seq
   technologies have several advantages, including a wider measurable
   range of expression levels, higher throughput, less noise, more
   information for detecting allele-specific expression, and a higher
   capability to detect novel promoters and alternative gene-splicing
   isoforms (Marioni et al., [28]2008; Mortazavi et al., [29]2008; Sultan
   et al., [30]2008; Wang et al., [31]2009, [32]2010b; Oshlack et al.,
   [33]2010). Therefore, developing powerful, reliable, and unbiased
   RNA-seq data-mining methods would facilitate the use of RNA-seq to
   explore basic biological questions in this era of big data.

   Typically, RNA-seq experimental procedures can be divided into six
   steps: (1) sequencing the RNA samples to obtain raw reads, (2)
   filtering out low-quality reads, (3) mapping the high-quality reads to
   a reference genome or transcriptome, (4) summarizing the read counts
   for each gene, (5) detecting differentially expressed genes (DEGs), and
   (6) performing systems biology analysis [e.g., cluster analysis,
   principal components analysis (PCA), gene ontology (GO) analysis, and
   pathway enrichment analysis] (Oshlack et al., [34]2010). Of these
   steps, identifying DEGs across treatments/conditions is the key task
   and often the primary goal of RNA-seq data analysis. There are numerous
   statistical methods focusing directly on read-count data for DEG
   identification, with these classified into two categories: (1)
   parametric methods that rely on assumptions about discrete probability
   models and include methods based on a Poisson model, such as DEGseq
   (Wang et al., [35]2010a) and TSPM (Auer and Doerge, [36]2011), methods
   based on a negative binomial (NB) model, such as edgeR (Robinson et
   al., [37]2010), DESeq (Anders and Huber, [38]2010), baySeq (Hardcastle
   and Kelly, [39]2010), NBPSeq (Di et al., [40]2011), EBSeq (Leng et al.,
   [41]2013), ShrinkSeq (Van De Wiel et al., [42]2013), and DESeq2 (Love
   et al., [43]2014), methods based on a beta-binomial model, such as
   BBSeq (Zhou et al., [44]2011), methods based on a multivariate Poisson
   log-normal (LN) model, such as PLNseq (Zhang et al., [45]2015), and
   methods based on a generalized Poisson (GP) model, such as GPseq
   (Srivastava and Chen, [46]2010) and deGPS (Chu et al., [47]2015); and
   (2) non-parametric methods, such as NOISeq (Tarazona et al., [48]2011)
   and SAMseq (Li and Tibshirani, [49]2013), that do not assume any
   particular model.

   Among count-based RNA-seq data-analysis methods, non-parametric methods
   were developed based on large-sample asymptotic theory and exhibit
   statistical power sufficient to detect DEGs only when the number of
   replicates per treatment condition is ≥5 (Tarazona et al., [50]2011;
   Seyednasrollah et al., [51]2013; Soneson and Delorenzi, [52]2013).
   However, due to the high cost of RNA-seq, the general sample size in a
   typical RNA-seq experiment is < 5 replicates, which limits the
   application of non-parametric methods in RNA-seq data mining.
   Therefore, the most popular RNA-seq data-analysis methods are
   parametric methods based on Poisson and NB models. In early RNA-seq
   studies where only technical replicates were used, the traditional
   Poisson model was highly capable of fitting read-count data
   characterized by equidispersion (i.e., the variance is equal to the
   mean) (Marioni et al., [53]2008; Bullard et al., [54]2010). However,
   when biological replicates are available, read-count data often
   exhibits more variability than the Poisson model expects, which limits
   the use of a Poisson model for analyzing RNA-seq data (Anders and
   Huber, [55]2010). Fortunately, the NB model, as a Gamma-Poisson
   mixture, can address the overdispersion issue (i.e., when the variance
   is larger than the mean), as well as capture equidispersion (Anders and
   Huber, [56]2010). Additionally, recent studies reported that some
   RNA-seq data demonstrates characteristics of underdispersion (i.e., the
   variance is smaller than the mean), which might be caused by RNA-seq
   coverage, as well as zero-inflation, cluster, or low expression level
   of the count data, and could lead to underestimation of DEGs (Famoye,
   [57]1993; Srivastava and Chen, [58]2010; Rau et al., [59]2011; Mi et
   al., [60]2015; Choo-Wosoba et al., [61]2016; Low et al., [62]2017).
   However, neither a traditional Poisson model nor the NB model works
   well at mining underdispersed data.

   The GP model is a generalization of the Poisson model with an
   additional parameter. This method can process data characterized by
   underdispersion and non-underdispersion (equidispersion and
   overdispersion) (LuValle, [63]1990), but can only capture certain
   levels of dispersion, because the model is truncated under certain
   conditions regarding its bounded dispersion parameter (Famoye,
   [64]1993). For example, the program deGPS employs the GP model to fit
   read-count data characterized by non-underdispersion (Chu et al.,
   [65]2015), whereas GPseq uses this model to consider potential
   positional bias during DEG analysis and handle position-level counts
   instead of gene-level counts, which is different from other methods
   (Srivastava and Chen, [66]2010). Therefore, these methods derived from
   different discrete models can potentially perform poorly at fitting
   underdispersed count data due to the restrictions associated with the
   inherent properties in the models.

   In this study, we described a mixed Poisson model called double Poisson
   (DP), which offers the advantage of flexibility in fitting a wide range
   of data exhibiting underdispersion and non-underdispersion using only
   two parameters (Efron, [67]1986). Based on this model, we developed a
   novel differential relative expression-analysis method for RNA-seq data
   mining (DREAMSeq). Because the results of differential gene-expression
   analysis are dependent upon the discrete model used to fit the RNA-seq
   data (Consortium, [68]2010), we also added NB-model functionality to
   the DREAMSeq pipeline in order to optimize the performance of our
   method. Therefore, depending on the model used in the pipeline, our
   method can be divided into three approaches: DREAMSeq.DP (based on the
   DP model), DREAMSeq.NB (based on the NB model), and DREAMSeq.Mix (based
   on the mixture of the DP and NB models, with the lower p-value between
   two p-values generated based on the DP and NB models chosen as the
   final p-value) in order to fit variable RNA-seq data. In order to
   evaluate the performance of DREAMSeq, we generated three simulated
   datasets using three real RNA-seq datasets. Because the DEGs can only
   be effectively identified when the sample size is ≥3 (Conesa et al.,
   [69]2016; Lin et al., [70]2016), to assess DREAMSeq using the most
   common RNA-seq scenario, we focused on detecting DEGs under small
   sample sizes (three replicates per condition) and between two groups.
   Our results indicated that the performance of DREAMSeq at effectively
   detecting DEGs was comparable to other popular RNA-seq data-analysis
   methods, including edgeR, DESeq, DESeq2, NBPSeq, and TSPM, in
   non-underdispersion situations, but outperformed most of the other
   methods in underdispersion situations. This conclusion was validated by
   quantitative real-time polymerase chain reaction (qRT-PCR) using a real
   Foxtail dataset generated in our laboratory. Our findings demonstrated
   DREAMSeq as a reliable and robust DEG-detection method that provides an
   additional option in the RNA-seq data-analysis toolbox, especially for
   underdispersed-data mining.

Materials and methods

Models and normalization

   In this study, let Y represent the observed count and X the
   corresponding underlying gene expression (unknown) in an RNA-seq
   experiment. Let Y[ijk] and X[ijk] denote the read count and the true
   gene expression of gene i from sample j in treatment group k, where i =
   1, …, I (the number of genes), j = 1, …, J (the number of replicates;
   here, J = 3), and k = 1, …, K (the number of groups; here, K = 2),
   respectively.

NB model

   We assume that Y follows an NB model with two parameters: the mean, μ,
   and the dispersion, ϕ. The probability mass function (PMF) of the NB
   model is given as:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>Y</mi><mo>=</mo><mi>y</mi><mo>|</mo><m
   i>μ</mi><mo>,</mo><mo>ϕ</mo></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><mo>Γ</mo><mrow><m
   o
   stretchy="false">(</mo><mrow><mi>y</mi><mo>+</mo><msup><mrow><mi>ϕ</mi>
   </mrow><mrow><mo>-</mo><mn>1</mn></mrow></msup></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mrow><mi>y</mi><mo>!</mo><mo>Γ</m
   o><mrow><mo
   stretchy="false">(</mo><mrow><msup><mrow><mi>ϕ</mi></mrow><mrow><mo>-</
   mo><mn>1</mn></mrow></msup></mrow><mo
   stretchy="false">)</mo></mrow></mrow></mfrac><msup><mrow><mrow><mo
   stretchy="true">(</mo><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mn>1</
   mn><mo>+</mo><mi>μ</mi><mi>ϕ</mi></mrow></mfrac></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><msup><mrow><mi>ϕ</mi></mrow>
   <mrow><mo>-</mo><mn>1</mn></mrow></msup></mrow></msup><msup><mrow><mrow
   ><mo
   stretchy="false">(</mo><mrow><mfrac><mrow><mi>μ</mi><mi>ϕ</mi></mrow><m
   row><mn>1</mn><mo>+</mo><mi>μ</mi><mi>ϕ</mi></mrow></mfrac></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mrow><mi>y</mi></mrow></msup><mo>
   .</mo></mtd></mtr></mtable> :MATH]
   (1)

   The expected value is estimated as:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>E</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>Y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mi>μ</mi><mo>.</mo></mtd></mtr
   ></mtable> :MATH]
   (2)

   We parameterize the variance of the NB model according to a previous
   study (Robinson and Smyth, [71]2007):
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msup><mrow><mi>V</mi><mi>a</mi><mi>r</mi>
   <mrow><mo stretchy="false">(</mo><mrow><mi>Y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mi>σ</mi></mrow><mrow><mn>2</m
   n></mrow></msup><mo>=</mo><mi>μ</mi><mo>+</mo><msup><mrow><mi>μ</mi></m
   row><mrow><mn>2</mn></mrow></msup><mi>ϕ</mi><mo>,</mo></mtd></mtr></mta
   ble> :MATH]
   (3)

   where ϕ ≥ 0 and determines the extra variability as compared with the
   Poisson model. When ϕ > 0, σ^2 > μ; and when ϕ = 0, σ^2 = μ; the NB
   model collapses to the Poisson model, which can be viewed as a special
   NB model with zero dispersion (Robinson and Smyth, [72]2007).
   Therefore, the NB model allows for both overdispersion and
   equidispersion.

DP model

   We assume that Y follows a DP model with two parameters: the mean, μ,
   and the dispersion, θ. The approximate PMF of the DP model is given as:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>Y</mi><mo>=</mo><mi>y</mi><mo>|</mo><m
   i>μ</mi><mo>,</mo><mi>θ</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><msub><mrow><mi>f</mi></mrow><m
   row><mi>μ</mi><mo>,</mo><mi>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mrow><mo
   stretchy="false">(</mo><mrow><msup><mrow><mi>θ</mi></mrow><mrow><mfrac>
   <mrow><mn>1</mn></mrow><mrow><mn>2</mn></mrow></mfrac></mrow></msup><ms
   up><mrow><mi>e</mi></mrow><mrow><mo>-</mo><mi>θ</mi><mi>μ</mi></mrow></
   msup></mrow><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">(</mo><mrow><mfrac><mrow><msup><mrow><mi>e</mi></mrow>
   <mrow><mo>-</mo><mi>y</mi></mrow></msup><msup><mrow><mi>y</mi></mrow><m
   row><mi>y</mi></mrow></msup></mrow><mrow><mi>y</mi><mo>!</mo></mrow></m
   frac></mrow><mo stretchy="false">)</mo></mrow><msup><mrow><mrow><mo
   stretchy="false">(</mo><mrow><mfrac><mrow><mi>e</mi><mi>μ</mi></mrow><m
   row><mi>y</mi></mrow></mfrac></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mrow><mi>θ</mi><mi>y</mi></mrow><
   /msup><mo>.</mo></mtd></mtr></mtable> :MATH]
   (4)

   The exact DP density is:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>Y</mi><mo>=</mo><mi>y</mi><mo>|</mo><m
   i>μ</mi><mo>,</mo><mi>θ</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><msub><mrow><mover
   accent="true"><mrow><mi>f</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >μ</mi><mo>,</mo><mi>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mi>c</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>μ</mi><mo>,</mo><mi>θ</mi></mrow><mo
   stretchy="false">)</mo></mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>μ<
   /mi><mo>,</mo><mi>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>,</mo></mtd></mtr></mtable> :MATH]
   (5)

   where the factor c(μ,θ) can be calculated as:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mfrac><mrow><mn>1</mn></mrow><mrow><mi>c<
   /mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>μ</mi><mo>,</mo><mi>θ</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow></mfrac><mo>=</mo><msubsup><mrow><
   mo>∑</mo></mrow><mrow><mi>y</mi><mo>=</mo><mn>0</mn></mrow><mrow><mi>∞<
   /mi></mrow></msubsup><msub><mrow><mi>f</mi></mrow><mrow><mi>μ</mi><mo>,
   </mo><mi>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>≈</mo><mn>1</mn><mo>+</mo><mfrac><mro
   w><mn>1</mn><mo>-</mo><mi>θ</mi></mrow><mrow><mn>12</mn><mi>μ</mi><mi>θ
   </mi></mrow></mfrac><mrow><mo
   stretchy="false">(</mo><mrow><mn>1</mn><mo>+</mo><mfrac><mrow><mn>1</mn
   ></mrow><mrow><mi>μ</mi><mi>θ</mi></mrow></mfrac></mrow><mo
   stretchy="false">)</mo></mrow></mtd></mtr></mtable> :MATH]
   (6)

   with c(μ, θ) being the normalizing constant nearly equal to 1. The
   constant c(μ, θ) ensures that the density integrates to unity. The
   expected value and the variance of the DP model in reference to the
   exact density
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>f</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >μ</mi><mo>,</mo><mi>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>y</mi></mrow><mo
   stretchy="false">)</mo></mrow> :MATH]
   are estimated as follows:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>E</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>Y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>≈</mo><mi>μ</mi></mtd></mtr></mtable>
   :MATH]
   (7)

   and
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msup><mrow><mi>V</mi><mi>a</mi><mi>r</mi>
   <mrow><mo stretchy="false">(</mo><mrow><mi>Y</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mi>σ</mi></mrow><mrow><mn>2</m
   n></mrow></msup><mo>=</mo><mfrac><mrow><mi>μ</mi></mrow><mrow><mi>θ</mi
   ></mrow></mfrac><mo>,</mo></mtd></mtr></mtable> :MATH]
   (8)

   respectively, where θ > 0 under RNA-seq data circumstances. The Poisson
   model is nested in the DP model for θ = 1, indicating that the DP model
   can fit equidispersed read-count data when θ = 1. Additionally, the DP
   model allows for both overdispersion (0 < θ < 1) and underdispersion (θ
   > 1) (Efron, [73]1986).

Normalization

   Here, we assume that the expectation of Y[ijk], μ[ijk], is the product
   of X[ijk] and s[jk]:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</
   mi><mi>j</mi><mi>k</mi></mrow></msub><mo>=</mo><msub><mrow><mi>X</mi></
   mrow><mrow><mi>i</mi><mi>j</mi><mi>k</mi></mrow></msub><msub><mrow><mi>
   s</mi></mrow><mrow><mi>j</mi><mi>k</mi></mrow></msub><mo>,</mo></mtd></
   mtr></mtable> :MATH]
   (9)

   where s[jk] is the size factor corresponding to sample j in treatment
   group k, which can be estimated using various existing normalization
   methods, such as total counts, upper quartile (Bullard et al.,
   [74]2010), median (Dillies et al., [75]2012), quantile (Bolstad et al.,
   [76]2003; Irizarry et al., [77]2003), trimmed mean of M-values (TMM)
   (Robinson and Oshlack, [78]2010), DESeq normalization (DESeq) (Anders
   and Huber, [79]2010), reads per kilobase per million (RPKM) (Mortazavi
   et al., [80]2008), to remove unwanted variation (Risso et al.,
   [81]2014). Normalization is a process that makes unit-less data
   comparable among measurements by adjusting for sequencing depth and
   potentially other technical effects of different samples. Dillies et
   al. ([82]2012) and Lin et al. ([83]2016) found that TMM and DESeq
   normalization methods performed much better than the other methods
   described here. Therefore, the most widely used TMM method was chosen
   as the default data-normalization method in DREAMSeq and similar to
   previous studies (Robinson et al., [84]2010; Kadota et al., [85]2012;
   Soneson and Delorenzi, [86]2013; Sun et al., [87]2013).

Dispersion estimations

   Estimating the dispersion parameter is a crucial step in DEG detection.
   Various dispersion-parameter estimation methods, including
   pseudo-likelihood (Smyth, [88]2003), quasi-likelihood (Nelder,
   [89]2000; Lund et al., [90]2012), conditional maximum likelihood (CML)
   (Smyth and Verbyla, [91]1996), quantile-adjusted CML (Robinson and
   Smyth, [92]2008), and shrinkage-estimation methods (Anders and Huber,
   [93]2010; Robinson et al., [94]2010), have been discussed previously.
   In particular, many Bayesian-based shrinkage-estimation methods,
   including baySeq, ShrinkSeq, DSS (Wu et al., [95]2013), and DESeq2,
   have been developed and are capable of obtaining accurate and robust
   estimates by sharing information across all genes when the sample size
   is small (Ji and Liu, [96]2010). Therefore, we also utilized an
   empirical Bayesian framework to shrink the dispersion parameter. Our
   strategy to estimate the dispersion parameter was divided into five
   steps described as follows.

Initial dispersion estimators

   We first applied the method-of-moments (MoMs) described by Love et al.
   ([97]2014) to estimate the initial value of dispersion for each gene.
   According to previous studies (Anders and Huber, [98]2010; Robinson et
   al., [99]2010), we first use the normalized sample mean,
   [MATH: <msub><mrow><mover accent="false"
   class="mml-overline"><mrow><mi>X</mi></mrow><mo
   accent="true">¯</mo></mover></mrow><mrow><mi>i</mi><mi>k</mi></mrow></m
   sub> :MATH]
   , to estimate the expectation for the i^th gene in group k:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</
   mi><mi>k</mi></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mro
   w><mi>J</mi></mrow></mfrac><msub><mrow><mover accent="false"
   class="mml-overline"><mrow><mi>X</mi></mrow><mo
   accent="true">¯</mo></mover></mrow><mrow><mi>i</mi><mi>k</mi></mrow></m
   sub><msub><mrow><mo>∑</mo></mrow><mrow><mi>j</mi></mrow></msub><msub><m
   row><mi>s</mi></mrow><mrow><mi>j</mi><mi>k</mi></mrow></msub><mo>.</mo>
   </mtd></mtr></mtable> :MATH]
   (10)

   We assume that the dispersions between two groups are the same under
   small sample sizes. Therefore, we denote n = KJ and substitute equation
   (10) with the following equation:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</
   mi></mrow></msub><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>n</m
   i></mrow></mfrac><msub><mrow><mover accent="false"
   class="mml-overline"><mrow><mi>X</mi></mrow><mo
   accent="true">¯</mo></mover></mrow><mrow><mi>i</mi></mrow></msub><msub>
   <mrow><mo>∑</mo></mrow><mrow><mi>n</mi></mrow></msub><msub><mrow><mi>s<
   /mi></mrow><mrow><mi>j</mi><mi>k</mi></mrow></msub><mo>,</mo></mtd></mt
   r></mtable> :MATH]
   (11)

   where μ[i] and
   [MATH: <msub><mrow><mover accent="false"
   class="mml-overline"><mrow><mi>X</mi></mrow><mo
   accent="true">¯</mo></mover></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   are the expectation and sample mean, respectively, of the i^th gene. We
   then estimate the variance of the i^th gene,
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   , by pooling count data from different groups using approaches
   previously described by Anders and Huber ([100]2010) and Wu et al.
   ([101]2013). For the NB model, the initial dispersion for the i^th gene
   can be estimated by:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mi>i</mi><mi>n</mi><mi>i</mi><mi>t</mi></mrow></msu
   bsup><mo>=</mo><mfrac><mrow><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i
   </mi></mrow><mrow><mn>2</mn></mrow></msubsup><mo>-</mo><msub><mrow><mi>
   μ</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow><msubsup><mrow>
   <mi>μ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn></mrow></msubsu
   p></mrow></mfrac><mo>.</mo></mtd></mtr></mtable> :MATH]
   (12)

   Note that
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>i</mi>
   <mi>n</mi><mi>i</mi><mi>t</mi></mrow></msubsup> :MATH]
   is often artificially assigned with an extremely low positive value
   (e.g., 1 × 10^−8 in DESeq) when
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mo><</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi
   ></mrow></msub> :MATH]
   , because the NB model cannot fit underdispersed read-count data. A
   similar conservative strategy was also utilized for underdispersion in
   a previous study (Schissler et al., [102]2015). Under this scenario,
   the initial dispersion can be overestimated, which results in a
   conservative DEG test (Robinson and Smyth, [103]2008). By contrast,
   instead of the NB model, the DP model is capable of handling this kind
   of data. For the DP model, the initial dispersion for the i^th gene can
   be estimated by:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msubsup><mrow><mi>θ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mi>i</mi><mi>n</mi><mi>i</mi><mi>t</mi></mrow></msu
   bsup><mo>=</mo><mfrac><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mrow><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow></mfrac><mo>.</mo>
   </mtd></mtr></mtable> :MATH]
   (13)

Gene-wise dispersion estimators

   In RNA-seq experiments, there are typically tens of thousands of genes,
   but only a few replicates per treatment group, which describes the
   “large p and small n” phenomenon. It is quite difficult to estimate a
   reliable gene-specific dispersion with the MoMs described in such a
   scenario. To address this problem, we used maximum likelihood estimate
   (MLE) methods based on the initial dispersion estimator,
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>i</mi>
   <mi>n</mi><mi>i</mi><mi>t</mi></mrow></msubsup> :MATH]
   (or
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>i</mi>
   <mi>n</mi><mi>i</mi><mi>t</mi></mrow></msubsup> :MATH]
   ), to estimate a gene-wise dispersion,
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>g</mi>
   <mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi><
   /mrow></msubsup> :MATH]
   (or
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>g</mi>
   <mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi><
   /mrow></msubsup> :MATH]
   ), for gene, i. The MLE of the dispersion parameters in the NB and DP
   models can be obtained by maximizing the log-likelihood summed over all
   reads between conditions for the i^th gene:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><m
   i>i</mi><mi>s</mi><mi>e</mi></mrow></msubsup><mo>=</mo><mi>a</mi><mi>r<
   /mi><mi>g</mi><mi>m</mi><mi>a</mi><msub><mrow><mi>x</mi></mrow><mrow><m
   i>ϕ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mo>∑</mo></mrow><mrow><mi>n</
   mi></mrow></msub><mo class="qopname">log</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>N</
   mi><mi>B</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>Y</mi></mrow><mrow><mi>i</
   mi><mi>j</mi><mi>k</mi></mrow></msub><mo>,</mo><msub><mrow><mi>μ</mi></
   mrow><mrow><mi>i</mi><mi>k</mi></mrow></msub><mo>,</mo><mi>ϕ</mi></mrow
   ><mo stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow></mtd></mtr></mtable> :MATH]
   (14)

   and
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msubsup><mrow><mi>θ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><m
   i>i</mi><mi>s</mi><mi>e</mi></mrow></msubsup><mo>=</mo><mi>a</mi><mi>r<
   /mi><mi>g</mi><mi>m</mi><mi>a</mi><msub><mrow><mi>x</mi></mrow><mrow><m
   i>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mo>∑</mo></mrow><mrow><mi>n</
   mi></mrow></msub><mo class="qopname">log</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>D</
   mi><mi>P</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>Y</mi></mrow><mrow><mi>i</
   mi><mi>j</mi><mi>k</mi></mrow></msub><mo>,</mo><msub><mrow><mi>μ</mi></
   mrow><mrow><mi>i</mi><mi>k</mi></mrow></msub><mo>,</mo><mi>θ</mi></mrow
   ><mo stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow><mo>,</mo></mtd></mtr></mtable> :MATH]
   (15)

   respectively, where
   [MATH:
   <mi>ϕ</mi><mo>=</mo><msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></m
   row><mrow><mi>i</mi><mi>n</mi><mi>i</mi><mi>t</mi></mrow></msubsup>
   :MATH]
   ,
   [MATH:
   <mi>θ</mi><mo>=</mo><msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></m
   row><mrow><mi>i</mi><mi>n</mi><mi>i</mi><mi>t</mi></mrow></msubsup>
   :MATH]
   , and f[NB](·) and f[DP](·) are the PMF of the NB and DP models,
   respectively.

Common dispersion estimators

   It is essential for reliable dispersion estimation that information is
   shared between genes, especially when few replicates are available
   (Robinson and Smyth, [104]2008). The simplest method of sharing
   information is to assume that the dispersion parameters are common for
   all genes and then to use the entire dataset to directly calculate a
   precise common dispersion. However, it is generally not true that each
   gene has the same dispersion in practice (Robinson and Smyth,
   [105]2007). Consequently, we should seek a more general common
   dispersion-estimation approach that compromises between entirely
   individual gene-wise dispersions and an entirely shared common
   dispersion. Here, we assumed that the dispersions are common across all
   genes having similar expression strengths, suggesting that if the means
   for some genes are similar, the dispersions (or variances) for these
   genes are also similar. We adopted a similar locally weighted
   regression as that for voom (Law et al., [106]2014) in order to obtain
   the common dispersion estimators (
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>c</mi>
   <mi>o</mi><mi>m</mi><mi>m</mi><mi>o</mi><mi>n</mi></mrow></msubsup>
   :MATH]
   for the NB model or
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>c</mi>
   <mi>o</mi><mi>m</mi><mi>m</mi><mi>o</mi><mi>n</mi></mrow></msubsup>
   :MATH]
   for the DP model) for the i^th gene by regressing the gene-wise
   dispersion estimators,
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>g</mi>
   <mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi><
   /mrow></msubsup> :MATH]
   (or
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>g</mi>
   <mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi><
   /mrow></msubsup> :MATH]
   ), onto the means, μ[i], of the normalized read counts. This is similar
   to the data-driven parameter estimation used by DESeq through the
   smooth function by modeling the observed mean-variance (or
   mean-dispersion) relationship for the genes in the read-count data
   (Anders and Huber, [107]2010).

Shrinkage-dispersion estimators

   Shrinkage estimation can effectively improve statistical tests for
   differential gene expression in the case of a small number of samples
   (Cui et al., [108]2005). As mentioned previously, in order to obtain a
   more accurate and robust estimate, an empirical Bayes (EB) approach has
   been used to shrink gene-wise dispersions toward common dispersions,
   which could effectively allow the borrowing of information between
   genes (Robinson and Smyth, [109]2007; Robinson et al., [110]2010). The
   DSS and DESeq2 methods use an EB approach incorporating shrinkage with
   an NB model to squeeze the gene-wise dispersion estimates toward an LN
   prior, where the strength of shrinkage is dependent upon how reliably
   the individual gene-wise dispersions can be estimated (Wu et al.,
   [111]2013; Love et al., [112]2014). Here, we assumed that the gene-wise
   dispersions, α, followed an LN prior with two parameters: the mean,
   m[0], and the standard deviation (SD), τ. The PMF of the LN model is
   given as:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>P</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>α</mi><mo>|</mo><msub><mrow><mi>m</mi>
   </mrow><mrow><mn>0</mn></mrow></msub><mo>,</mo><mi>τ</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><
   mrow><mi>α</mi><msqrt><mrow><mn>2</mn><mi>π</mi><msup><mrow><mi>τ</mi><
   /mrow><mrow><mn>2</mn></mrow></msup></mrow></msqrt></mrow></mfrac><msup
   ><mrow><mi>e</mi></mrow><mrow><mo>-</mo><mfrac><mrow><msup><mrow><mrow>
   <mo stretchy="false">(</mo><mrow><mo class="qopname">log</mo><mrow><mo
   stretchy="false">(</mo><mrow><mi>α</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mrow><mi>m</mi></mrow><m
   row><mn>0</mn></mrow></msub></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></mr
   ow><mrow><mn>2</mn><msup><mrow><mi>τ</mi></mrow><mrow><mn>2</mn></mrow>
   </msup></mrow></mfrac></mrow></msup><mo>,</mo></mtd></mtr></mtable>
   :MATH]
   (16)

   where α represents
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>g</mi>
   <mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi><
   /mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>g</mi>
   <mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</mi><
   /mrow></msubsup> :MATH]
   for the NB and DP models, respectively. The two parameters of the LN
   model are estimated as follows:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msub><mrow><mi>m</mi></mrow><mrow><mn>0</
   mn></mrow></msub><mo>=</mo><mi>m</mi><mi>e</mi><mi>d</mi><mi>i</mi><mi>
   a</mi><mi>n</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>l</mi><mi>o</mi><mi>g</mi><mrow><mo
   stretchy="false">(</mo><mrow><mi>β</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow></mtd></mtr></mtable> :MATH]
   (17)

   and
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>τ</mi><mo>=</mo><mi>m</mi><mi>a</mi><m
   i>d</mi><mrow><mo stretchy="false">(</mo><mrow><mo
   class="qopname">log</mo><mrow><mo
   stretchy="false">(</mo><mrow><mi>α</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>-</mo><mi>l</mi><mi>o</mi><mi>g</mi><
   mrow><mo stretchy="false">(</mo><mrow><mi>β</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow><mo>,</mo></mtd></mtr></mtable> :MATH]
   (18)

   respectively, where mad represents the median absolute deviation, and β
   represents
   [MATH:
   <msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>c</mi>
   <mi>o</mi><mi>m</mi><mi>m</mi><mi>o</mi><mi>n</mi></mrow></msubsup>
   :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>c</mi>
   <mi>o</mi><mi>m</mi><mi>m</mi><mi>o</mi><mi>n</mi></mrow></msubsup>
   :MATH]
   for the NB and DP models, respectively.

   We adopted the same strategy as the DSS and DESeq2 methods to estimate
   the shrinkage dispersions for the i^th gene in the NB and DP models:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mi>s</mi><mi>h</mi><mi>r</mi><mi>i</mi><mi>n</mi><m
   i>k</mi><mi>a</mi><mi>g</mi><mi>e</mi></mrow></msubsup><mo>=</mo><mi>a<
   /mi><mi>r</mi><mi>g</mi><mi>m</mi><mi>a</mi><msub><mrow><mi>x</mi></mro
   w><mrow><mi>ϕ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mo>∑</mo></mrow><mrow><mi>n</
   mi></mrow></msub><mo class="qopname">log</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>N</
   mi><mi>B</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>Y</mi></mrow><mrow><mi>i</
   mi><mi>j</mi><mi>k</mi></mrow></msub><mo>,</mo><msub><mrow><mi>μ</mi></
   mrow><mrow><mi>i</mi><mi>k</mi></mrow></msub><mo>,</mo><mi>ϕ</mi></mrow
   ><mo stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow><mo>+</mo><msub><mrow><mi>f</mi></mrow><m
   row><mi>L</mi><mi>N</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>ϕ</mi><mo>,</mo><msub><mrow><mi>m</mi>
   </mrow><mrow><mn>0</mn></mrow></msub><mo>,</mo><mi>τ</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow></mtd></mtr></mtable> :MATH]
   (19)

   and
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><msubsup><mrow><mi>θ</mi></mrow><mrow><mi>
   i</mi></mrow><mrow><mi>s</mi><mi>h</mi><mi>r</mi><mi>i</mi><mi>n</mi><m
   i>k</mi><mi>a</mi><mi>g</mi><mi>e</mi></mrow></msubsup><mo>=</mo><mi>a<
   /mi><mi>r</mi><mi>g</mi><mi>m</mi><mi>a</mi><msub><mrow><mi>x</mi></mro
   w><mrow><mi>θ</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mo>∑</mo></mrow><mrow><mi>n</
   mi></mrow></msub><mo class="qopname">log</mo><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>D</
   mi><mi>P</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><msub><mrow><mi>Y</mi></mrow><mrow><mi>i</
   mi><mi>j</mi><mi>k</mi></mrow></msub><mo>,</mo><msub><mrow><mi>μ</mi></
   mrow><mrow><mi>i</mi><mi>k</mi></mrow></msub><mo>,</mo><mi>θ</mi></mrow
   ><mo stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow><mo>+</mo><msub><mrow><mi>f</mi></mrow><m
   row><mi>L</mi><mi>N</mi></mrow></msub><mrow><mo
   stretchy="false">(</mo><mrow><mi>θ</mi><mo>,</mo><msub><mrow><mi>m</mi>
   </mrow><mrow><mn>0</mn></mrow></msub><mo>,</mo><mi>τ</mi></mrow><mo
   stretchy="false">)</mo></mrow></mrow><mo
   stretchy="false">)</mo></mrow></mtd></mtr></mtable> :MATH]
   (20)

   respectively, where
   [MATH:
   <mi>ϕ</mi><mo>=</mo><msubsup><mrow><mi>ϕ</mi></mrow><mrow><mi>i</mi></m
   row><mrow><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><
   mi>s</mi><mi>e</mi></mrow></msubsup> :MATH]
   ,
   [MATH:
   <mi>θ</mi><mo>=</mo><msubsup><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></m
   row><mrow><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>w</mi><mi>i</mi><
   mi>s</mi><mi>e</mi></mrow></msubsup> :MATH]
   , and f[NB](·), f[DP](·), and f[LN](·) are the PMF of the NB, DP, and
   LN models, respectively.

Final dispersion estimators

   Bias in dispersion estimation has serious effects on the expected
   false-positive rates (FPRs) in small-sample situations (Robinson and
   Smyth, [113]2008). To avoid bias, DESeq by default chooses the maximum
   value from the two dispersion estimators: the individual dispersion and
   the fitted dispersion as a final dispersion for the gene (Anders and
   Huber, [114]2010). However, DESeq is often overly conservative due to
   overestimation of the dispersion and results in conservation tests
   (Robinson and Smyth, [115]2008; Soneson and Delorenzi, [116]2013). For
   this reason, we proposed a compromise approach called “window scan” to
   obtain the final dispersion estimators in five steps: (1) rank the
   genes from smallest to largest according to the means of samples across
   all conditions; (2) open a default 1-count window, where the mean is
   smallest; (3) based on the relationship between the
   shrinkage-dispersion estimator and the common-dispersion estimator, all
   genes in this window are divided into I-type genes (its
   shrinkage-dispersion estimator ≥ its common dispersion estimator) and
   II-type gene (its shrinkage dispersion estimator < its common
   dispersion estimator); (4) estimate the final dispersion of each I-type
   gene (or II-type gene) by choosing the larger value between its
   shrinkage-dispersion estimator and the median of the
   shrinkage-dispersion estimators of all I-type genes (or II-type genes)
   for the NB model (or choosing the smaller value for the DP model); and
   (5) shift the window to the larger mean and repeat steps (3,4) until
   all of the genes are scanned.

Test statistic and method evaluation

Test statistic

   For DEGs detected between two treatment groups, we tested the
   hypotheses of the form H[0]: μ[i, 1] = μ[i, 2] for the gene i, where
   μ[i, 1] and μ[i, 2] are the expectations for the i^th gene in groups 1
   and 2, respectively. The Wald test has been widely applied in many
   previous studies because of its simplicity and flexibility (Ng and
   Tang, [117]2005; Chen et al., [118]2011; Yu et al., [119]2017). Similar
   to DSS and DESeq2, we constructed the Wald test statistic as:
   [MATH: <mtable class="eqnarray"
   columnalign="left"><mtr><mtd><mi>W</mi><mo>=</mo><mfrac><mrow><mo>|</mo
   ><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mo>,</mo><mn>1</mn></mro
   w></msub><mo>-</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi><mo>,</
   mo><mn>2</mn></mrow></msub><mo>|</mo></mrow><mrow><msqrt><mrow><msubsup
   ><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><mo>,</mo><mn>1</mn></mrow><mro
   w><mn>2</mn></mrow></msubsup><mo>+</mo><msubsup><mrow><mi>σ</mi></mrow>
   <mrow><mi>i</mi><mo>,</mo><mn>2</mn></mrow><mrow><mn>2</mn></mrow></msu
   bsup></mrow></msqrt></mrow></mfrac><mo>,</mo></mtd></mtr></mtable>
   :MATH]
   (21)

   where
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><mo>,</mo><mn>1</mn></m
   row><mrow><mn>2</mn></mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><mo>,</mo><mn>2</mn></m
   row><mrow><mn>2</mn></mrow></msubsup> :MATH]
   are the variances for the i^th gene in groups 1 and 2, respectively,
   and can be estimated using the final dispersion according to equation
   (3) in the NB model and equation (8) in the DP model.

Method evaluation

   All methods analyzed will return nominal p-values. In order to obtain a
   more reliable list of DEGs, the p-values were adjusted by the
   Benjamini-Hochberg (BH) procedure (Benjamini and Hochberg, [120]1995).
   We evaluated the type I error rates (i.e., FPRs) and statistical powers
   (i.e., true-positive rates; TPRs) of different methods with a
   significance level of 0.05. Additionally, we used a receiver operating
   characteristic (ROC) curve, the area under the ROC curve (AUC), and a
   precision-recall curve (PRC) to compare the performances of eight
   methods in the simulated datasets. It is common for biologists to be
   interested in detecting genes with fold changes (FCs) estimated
   according to the ratios of the mean normalized counts between two
   treatment groups. Therefore, some methods use FC as an indicator of DE,
   such as DEGseq and AMAP.Seq (Si and Liu, [121]2013). Here, we defined
   the genes satisfying either FC < 0.67 or FC > 1.5, and an adjusted p <
   0.05 as DEGs according to previous studies (Peart et al., [122]2005; Si
   and Liu, [123]2013). This quantitative filter combines the significance
   level with the FC threshold and might be considered more practical by
   biologists. Therefore, we also identified DEGs using this filter.

   The performances of different methods were further validated by qRT-PCR
   analysis.

Datasets

Real datasets

   We chose three real datasets to represent different characteristics of
   RNA-seq data. The Pickrell dataset and the Hammer dataset were
   downloaded from the ReCount database
   ([124]http://bowtie-bio.sourceforge.net/recount) (Frazee et al.,
   [125]2011). The Pickrell dataset was obtained from lymphoblastoid cell
   lines derived from 69 unrelated Nigerian individuals as part of the
   International HapMap project (Pickrell et al., [126]2010) and contains
   69 biological replicates. The Hammer dataset contains four biological
   replicates in each of two treatment groups: rat L4
   dorsal-root-ganglion-treated groups in the presence or absence of
   induced chronic neuropathic pain (Hammer et al., [127]2010). The third
   real dataset was the Arab dataset provided as “arab” in the NBPSeq R
   package and that includes three biological replicates, where
   Arabidopsis leaves were inoculated with either a defense-eliciting
   ΔhrcC mutant of Pseudomonas syringae pv. tomato DC3000 or 10 mM MgCl[2]
   as a mock-treatment control (Di et al., [128]2011).

Simulated datasets

   Simulation studies represent necessary processes for investigating the
   properties associated with certain statistical methods, given that the
   “true” DEGs are known in simulated data. An ideal simulation would
   generate data with similar characteristics to those produced in real
   RNA-seq experiments. Therefore, similar to Landau and Liu ([129]2013),
   we generated three independent simulated datasets using a DP model
   based on three real datasets, respectively. The simulation processes
   were repeated 30 times to ensure reasonable precision in parameter
   estimation. Each simulated dataset contains 10,000 genes, including
   2,000 DEGs and 8,000 non-DEGs, two treatment groups, and three
   replicates for each treatment group.

Foxtail dataset

   Foxtail millet (Setaria italica) is an important cereal crop in
   northern China, and the whole-genome sequence of Foxtail millet (Yugu-1
   cultivar) was published in 2012 (Bennetzen et al., [130]2012; Zhang et
   al., [131]2012). In this study, we used a Foxtail RNA-seq dataset
   obtained by our own laboratory to compare the performance of DREAMSeq
   with other methods. This Foxtail dataset includes three biological
   replicates, in which roots from 1-week-old Foxtail millet seedlings
   (Yugu-1 cultivar) were treated with or without 1 μM epi-Brassinolide
   (eBL) for 2 h, followed by total RNA extraction using Trizol reagent
   (Invitrogen, Carlsbad, CA, Unites States). Extracted total RNA (2 μg
   per sample) was sequenced on an Illumina HiSeq X-ten platform, and the
   remaining RNA was used for qRT-PCR validation. The paired-end reads
   were aligned to the Foxtail millet reference genome (JGIv2.0.34)
   (Bennetzen et al., [132]2012; Goodstein et al., [133]2012) using TopHat
   (version 2.0.12) (Trapnell et al., [134]2009; Kim et al., [135]2013),
   and gene read counts were obtained using the program htseq-count from
   the python package HTSeq (version 0.61) (Anders et al., [136]2015).

qRT-PCR

   First-strand cDNA was synthesized from 1 μg total RNA using Reverse
   Transcriptase M-MLV (Takara Bio, Otsu, Japan) according to manufacturer
   instructions. qRT-PCR was performed according to the standard protocol
   using a Bio-Rad CFX Connect real-time PCR system (Bio-Rad Laboratories,
   Hercules, CA, Untied States). Primers used are listed in Table [137]S1.
   The expression of target genes was normalized to Foxtail Actin, and the
   relative expression between treatment and control groups was averaged
   from three independent experiments, with the p-value calculated using a
   one-sample t-test. We defined genes satisfying relative expression >1.5
   or < 0.67 and p < 0.05 as “true” DEGs.

Results

The mean–variance relationship in real datasets

   When analyzing the Hammer, Arab, and Foxtail datasets, we found strong
   relationships between the variances and the means on the log-log scale
   for the read counts from different real datasets (Figure [138]S1). For
   convenience of notation and calculation, we used the unit line to
   represent a Poisson assumption-exhibited equidispersion. The data
   points on and above that line exhibit non-underdispersion, whereas the
   data points below that line exhibit underdispersion. Figure [139]S1
   shows that 2,606 of 18,635 genes (14.0%) in the Hammer dataset, 2,015
   of 26,222 genes (7.7%) in the Arab dataset, and 4,412 of 35,158 genes
   (12.5%) in the Foxtail dataset were estimated as underdispersed genes.
   Therefore, there are a considerable proportion of underdispersed genes
   in the RNA-seq data. Furthermore, we noted that the underdispersed data
   points mostly distributed at low read-count regions (Figure [140]S1).
   These results suggested that in addition to non-underdispersion,
   underdispersion also exists in RNA-seq data and should be properly
   handled during the RNA-seq data-mining process.

   Most RNA-seq analysis methods were developed based on an NB model,
   which is able to capture both equidispersed and overdispersed data but
   not underdispersed data. In comparison, a DP model can capture all
   RNA-seq data (Efron, [141]1986). Using real Hammer, Arab, and Foxtail
   datasets, we found that both DP and NB models were able to fit
   read-count data very well (Figure [142]S2). This suggested that the DP
   model can be used to mine RNA-seq data.

Generation of simulated datasets

   Wu et al. ([143]2013) reported that using real data-driven simulations
   provided a better estimate for gene-wise dispersions and improved DEG
   detection, because the true DE status of each gene is known by
   controlling the settings (Wu et al., [144]2013). Therefore, we
   generated three simulated datasets with mean and dispersion parameters
   estimated from three real datasets based on a commonly used DP model
   and denoted these as simPickrell, simHammer, and simArab, respectively.
   The average number of underdispersed genes in simPickrell, simHammer,
   and simArab was 1299 (13%), 1935 (19%), and 1432 (14%), respectively.
   As shown in Figure [145]S3, all simulated datasets were very similar to
   the corresponding real datasets in terms of distributions of the means
   and dispersions and relationships between means and dispersions. This
   indicated that our simulated data closely mimicked the real data.

Type I error rate

   Using the three simulated datasets, we first evaluated the type I error
   rates (i.e., FPRs) of the three DREAMSeq methods (DREAMSeq.DP,
   DREAMSeq.NB, and DREAMSeq.Mix) and five other widely used RNA-seq
   data-analysis methods (edgeR, DESeq, DESeq2, NBPSeq, and TSPM) under
   the null hypothesis. We found that except for TSPM, all other methods
   were able to control type I error rates well in both
   non-underdispersion and underdispersion situations (Figure [146]1). In
   comparison, DESeq was very conservative in term of type I error rate,
   whereas the abilities of FPR control by both DREAMSeq.NB and NBPSeq
   clearly varied between non-underdispersion and underdispersion
   situations. In contrast, the median FPRs of DREAMSeq.DP, DREAMSeq.Mix,
   edgeR, and DESeq2 were relatively stable and consistently lower than or
   very close to the nominal type I error rate of 0.05 under all
   situations.

Figure 1.

   [147]Figure 1
   [148]Open in a new tab

   Comparison of type I error rates between different methods. Boxplots
   show the type I error rates (i.e., FPRs) of different methods, which
   were calculated over 30 simulations for the simPickrell, simHammer, and
   simArab datasets under the null hypothesis. The horizontal dotted lines
   indicate the nominal type I error rate of 0.05 in non-underdispersion
   and underdispersion scenarios. nud, non-underdispersion; ud,
   underdispersion.

Statistical power, ROC, AUC, PRC, and number of DEGs

   We then evaluated the statistical powers (i.e., TPRs) of different
   methods using the simulated datasets under the alternative hypothesis
   (Figure [149]2). The results showed that in underdispersion situations,
   the TPR of DREAMSeq.Mix was slightly higher than that of DREAMSeq.DP,
   although that of both methods was higher than those of DREAMSeq.NB,
   edgeR, DESeq, DESeq2, and NBPSeq (Figure [150]2). In
   non-underdispersion situations, the TPRs of DREAMSeq.Mix and
   DREAMSeq.DP were comparable with the other methods. Interestingly, TSPM
   consistently showed higher TPRs. Given that TSPM also showed higher
   FPRs in similar situations, it is likely that the TSPM method increased
   statistical power at the cost of poor FPR control.

Figure 2.

   [151]Figure 2
   [152]Open in a new tab

   Statistical power comparison between different methods. Boxplots show
   the statistical powers (i.e., TPRs) of different methods and calculated
   over 30 simulations for the simPickrell, simHammer, and simArab
   datasets under the alternative hypothesis in non-underdispersion and
   underdispersion scenarios. nud, non-underdispersion; ud,
   underdispersion.

   The ROC curve was constructed using the TPR to FPR ratio for each
   method used for DE analysis. Theoretically, the method with the
   stronger statistical power at identifying DEGs should exhibit a ROC
   curve with a higher TPR relative to other methods at the same FPR
   level. Figure [153]S4 shows that NBPSeq and TSPM had lower TPRs when
   the FPR threshold was ~0.05 in each scenario, whereas the ROC curves of
   the other methods were very similar. Additionally, we found that ROC
   curves associated with the simHammer dataset were steeper than those
   for the simPickrell and simArab datasets, suggesting that the
   performance of DEG identification by different methods was strongly
   dependent upon innate data characteristics, such as heterogeneity.

   AUC is a relative measure of the quality of a DEG test, where a higher
   AUC indicates relatively better performance. To quantify the
   performances of different methods in detecting DEGs, AUCs of different
   methods were calculated. The result showed that the AUCs of DREAMSeq.DP
   and DREAMSeq.Mix were higher than those of DREAMSeq.NB, edgeR, DESeq,
   DESeq2, and NBPSeq in most of the situations, except slightly lower
   than DESeq2 when analyzing simHammer and simArab underdispersed data
   (Figure [154]3). Together with the above FPR, TPR, and ROC results,
   these findings clearly demonstrated that both DREAMSeq.DP and
   DREAMSeq.Mix were able to control type I error rates well while
   maintaining a relatively higher statistical power in detecting DEGs.

Figure 3.

   [155]Figure 3
   [156]Open in a new tab

   Comparison of AUCs between different methods. Boxplots show the AUCs of
   different methods and calculated over 30 simulations for the
   simPickrell, simHammer, and simArab datasets in non-underdispersion and
   underdispersion scenarios. nud, non-underdispersion; ud,
   underdispersion.

   PRC curve shows the precision for corresponding recall (TPR). Similar
   to the ROC curve, the PRC curve is also an important performance
   indicator used to evaluate different methods at identifying DEGs.
   Figure [157]S5 shows that all methods, except TSPM, had higher
   precision over the entire range of recall rates, regardless of dataset
   or dispersion. Additionally, we found that all methods exhibited their
   best predictive performance using the simHammer dataset, but did not
   predict very accurately using the simPickrell dataset in an
   underdispersion situation, which might also be related to the dataset
   itself.

   We also compared the identified DEG numbers of different methods, with
   the results showing that both DREAMSeq.DP and DREAMSeq.Mix generally
   detected a larger number of DEGs (except in the case of simHammer
   non-underdispersed data) than the other methods (except for TSPM, which
   displayed poor FDR control) when analyzing non-underdispersed or
   underdispersed data from three simulated datasets, respectively,
   (Figure [158]4).

Figure 4.

   [159]Figure 4
   [160]Open in a new tab

   Comparison of the number of DEGs identified by different methods.
   Boxplots show the number of DEGs identified by different methods and
   calculated over 30 simulations for the simPickrell, simHammer, and
   simArab datasets in non-underdispersion and underdispersion scenarios.
   nud, non-underdispersion; ud, underdispersion.

Analysis of the foxtail dataset

   Our comprehensive evaluations showed that edgeR, DESeq, DESeq2, and
   DREAMSeq.Mix generally performed better as analyzing different
   simulated RNA-seq datasets; therefore, these methods were chosen to
   test their abilities to detect DEGs, especially underdispersed DEGs,
   using a real Foxtail dataset. A total of 128 non-underdispersed and 17
   underdispersed DEGs were identified by at least one of the four methods
   (Figure [161]5 and Tables [162]S2–[163]S5). Overall, the number of DEGs
   identified by DREAMSeq.Mix was much higher than that by DESeq but lower
   than that by edgeR and DESeq2 (Figure [164]5A). However, DREAMSeq.Mix
   identified 15 underdispersed DEGs, whereas edgeR identified 12, and
   DESeq2 identified 9 underdispersed DEGs. We defined DEGs detected only
   by one method as unique DEGs. Notably, DREAMSeq.Mix detected the
   highest number of unique DEGs in underdispersion scenarios, whereas
   DESeq did not identify any unique DEGs in either non-underdispersion or
   underdispersion scenarios (Figures [165]5B,C). Consistent with previous
   reports (Seyednasrollah et al., [166]2013; Tang et al., [167]2015), all
   of the DEGs found by DESeq were also found by edgeR (Figures
   [168]5B,C), possibly because these two methods use the same statistical
   model (i.e., the NB model) and hypothesis testing procedure (i.e., the
   Robinson and Smyth exact test) (Robinson and Smyth, [169]2008; Anders
   and Huber, [170]2010; Robinson et al., [171]2010). The presence of
   various unique DEGs also suggested the advantage of using more than one
   method to analyze the same RNA-seq data in order to allow maximum
   discovery of DEGs.

Figure 5.

   [172]Figure 5
   [173]Open in a new tab

   eBL-regulated Foxtail millet-root DEGs identified by different methods.
   (A) Bar plot showing the number of eBL-regulated DEGs identified by
   DREAMSeq.Mix, edgeR, DESeq, and DESeq2. (B,C) Venn diagrams showing the
   overlap among the collections of eBL-regulated DEGs identified by
   DREAMSeq.Mix, edgeR, DESeq, and DESeq2 in non-underdispersion (B) and
   underdispersion (C) scenarios. nud, non-underdispersion; ud,
   underdispersion.

   We then used qRT-PCR to validate whether the DEGs identified from the
   Foxtail dataset were “true” DEGs. Because DEGs identified by DESeq were
   also identified by edgeR, the unique DEGs identified by either edgeR,
   DESeq2, or DREAMSeq.Mix and the common DEGs identified simultaneously
   by any two methods were chosen for qRT-PCR analysis (Figure [174]6).
   The results showed that most of the DEGs chosen for validation
   exhibited similar upregulation or downregulation patterns as those
   shown from RNA-seq data analysis. For non-underdispersed DEGs, qRT-PCR
   results verified that 9 of 19 DEGs (47.4%) identified by DREAMSeq.Mix,
   19 of 42 DEGs (45.2%) identified by edgeR, and 23 of 51 DEGs (45.1%)
   identified by DESeq2 were significantly upregulated or downregulated by
   eBL treatment by at least 1.5-fold. Notably, for underdispersed DEGs, 5
   of 8 (62.5%) DEGs identified by DREAMSeq.Mix were validated as “true”
   DEGs. By contrast, only 2 of 5 (40.0%) DEGs identified by edgeR and no
   DEGs identified by DESeq2 were validated as “true” DEGs. These qRT-PCR
   results demonstrated that for non-underdispersed data, the number of
   DEGs identified by DREAMSeq.Mix was lower than those by edgeR and
   DESeq2, but the accuracy was slightly higher; however, for
   underdispersed data, DREAMSeq.Mix exhibited both a higher number of
   identified DEGs and better accuracy than the other two methods,
   demonstrating DREAMSeq.Mix as a powerful RNA-seq data-analysis method,
   especially for situations involving underdispersed data.

Figure 6.

   [175]Figure 6
   [176]Open in a new tab

   qRT-PCR validation of the expression of eBL-regulated Foxtail DEGs
   detected by different methods. Bar plots show the relative expression
   of DEGs detected only by DREAMSeq.Mix (A), edgeR (B), and DESeq2 (C) or
   identified by DREAMSeq.Mix and edgeR (D), DREAMSeq.Mix and DESeq2 (E),
   or edgeR and DESeq2 (F), respectively, in eBL-treated Foxtail millet
   roots. The relative expression levels were normalized to the Foxtail
   millet Actin gene. Data represent the mean ± SE of three independent
   experiments. P-values were calculated using a one-sample t-test. ^*P <
   0.05; ^**P < 0.01. The horizontal dotted lines indicate relative
   expression of 1.5 or 0.67. nud, non-underdispersion; ud,
   underdispersion.

Discussion

   RNA-seq is an increasingly popular method used to analyze global
   changes in gene expression during certain biological processes.
   Identifying DEGs is a key step in mining RNA-seq data and important for
   downstream biological analyses, such as cluster analysis, PCA analysis,
   GO analysis, and Kyoto Encyclopedia of Genes and Genomes enrichment
   analysis. When analyzing RNA-seq data, most current methods focus on
   non-underdispersed data, with less attention given to underdispersed
   data. In this study, we observed that RNA-seq data also includes
   underdispersion characteristics. Additionally, Low et al. ([177]2017)
   found that as the RNA-seq coverage increases, underdispersion becomes
   increasingly obvious. With the development of sequencing technology,
   the read length and RNA-seq coverage have increased significantly.
   Therefore, to take full advantage of RNA-seq data, it is important to
   explore both non-underdispersed and underdispersed data. However, most
   widely used DE-analysis methods, such as DESeq and edgeR, are based on
   the NB model. Due to the limitations of this model, underdispersed data
   are often overestimated, leading to conservative results in the
   determination of DEGs. In comparison, the DP model is capable of
   capturing not only non-underdispersion but also underdispersion.
   Considering the potential advantages of these two models, we developed
   a novel RNA-seq data-mining method (DREAMSeq.Mix) that combines the DP
   and NB models.

   Using simulated datasets generated from three real RNA-seq experiments,
   we compared the performance of DREAMSeq.Mix at detecting DEGs with five
   other commonly used RNA-seq data-analysis methods. To provide a more
   comprehensive conclusion, we also added DREAMSeq.DP and DREAMSeq.NB
   methods, which were developed using only a DP model or an NB model,
   respectively, into the comparison. We found that DESeq, NBPSeq, and
   DREAMSeq.NB were often conservative, whereas TSPM, edgeR, and DESeq2
   were more liberal in detecting DEGs. The poor performance of TSPM in
   our study might be due to the limited number of replicates in the
   RNA-seq datasets used (Auer and Doerge, [178]2011; Kvam et al.,
   [179]2012; Soneson and Delorenzi, [180]2013). In comparison,
   DREAMSeq.DP and DREAMSeq.Mix often outperformed the other methods in
   terms of TPR, AUC, and the number of DEGs detected (Figures
   [181]2–[182]4). The following reasons suggest that DREAMSeq.Mix
   provided unique and important outcomes more advantageous than current
   RNA-seq data-mining methods.

   First, DREAMSeq incorporates a more flexible DP model to fit highly
   complex and variable RNA-seq data. The dispersion parameter of the DP
   model is not subject to the same restrictions as the NB model when it
   is estimated in underdispersion situations. As a result, logarithmic
   dispersion estimated using the DP model (Figure [183]S3) showed a
   better normality than that acquired using the NB model (Figure 1 in
   Landau and Liu, [184]2013). This demonstrated that the DP model was
   able to accurately fit a widely range of read-count data without
   artificial intervention in RNA-seq data analysis. Therefore,
   DREAMSeq.DP and DREAMSeq.Mix often outperformed the other methods,
   especially in underdispersion situations, in simulation studies.
   Moreover, in terms of identifying the “true” underdispersed DEGs,
   DREAMSeq.Mix outperformed edgeR, DESeq, and DESeq2 according to qRT-PCR
   validation.

   Second, DREAMSeq incorporates strategies, such as MoMs, MLE, and EB,
   which are used in the edgeR, DESeq, DSS, and DESeq2 methods, to obtain
   reliable dispersion estimation. Importantly, to avoid bias, DREAMSeq
   used a “window scan” approach to estimate dispersion and enhance
   DREAMSeq's robustness in analyzing a wider range of RNA-seq data. This
   enabled all DREAMSeq approaches maintain a higher AUC across different
   simulated datasets in either non-underdispersion or underdispersion
   scenarios.

   Third, in multiple scenarios, DREAMSeq.Mix performed slightly better
   than DREAMSeq.DP, although the difference was small. This indicated
   that the efficiency and robustness of DREAMSeq.Mix was improved by
   taking full potential of the advantages of the DP and NB models to fit
   RNA-seq data.

   Recently, single-cell RNA-seq (scRNA-seq) has rapidly become a powerful
   tool for analyzing gene-expression heterogeneity at the individual cell
   level and been widely applied to diverse fields of biological research,
   including stem cell differentiation, embryogenesis, and whole-tissue
   analysis (Saliba et al., [185]2014). However, scRNA-seq data displays
   typical features of bimodality (the NB model cannot capture bimodality)
   (Vu et al., [186]2016), making such data less efficient for mining
   using common RNA-seq data-analysis methods. Additionally, Choo-Wosoba
   et al. ([187]2016) reported that genomic next-generation sequencing
   data also involves underdispersion. The increased accuracy and
   robustness displayed in finding “true” DEGs with higher confidence and
   its better performance at exploring underdispersed data make DREAMSeq a
   potentially valuable tool for mining sequencing data generated from
   many other high-throughput platforms, such as scRNA-seq and genomic
   sequencing.

   During our analysis, we found that none of the eight tested methods
   consistently outperformed other methods under all situations, because
   different methods are capable of identifying specific groups of DEGs.
   Although some DEGs can be identified by all methods, the existence of
   unique DEGs suggested that different methods exhibited specific
   preferences during DEG detection. Additionally, our study showed that