Graphical abstract

   graphic file with name fx1.jpg
   [32]Open in a new tab

Highlights

     * •
       QRscore detects both variance and mean shifts in gene expression
     * •
       It controls FDR and achieves high power under noise and zero
       inflation
     * •
       Tissue-specific, age-related differentially dispersed genes are
       identified from GTEx data
     * •
       Cell-type-specific variance shifts associated with aging are
       revealed from scRNA-seq data

Motivation

   The increasing availability of large-scale RNA-seq datasets from both
   bulk and single-cell sources increases our power to detect biologically
   meaningful shifts in mean gene expression and expression variability.
   However, existing methods often prioritize mean shifts or rely on
   strong parametric assumptions that may not hold across heterogeneous
   populations. This has created a gap in methods that adapt to the
   alternative hypothesis of interest while also controlling type I and
   type II errors without requiring parametric assumptions to formulate
   the null hypothesis. We thus develop QRscore, an R/Bioconductor
   software that leverages non-parametric statistics and asymptotic theory
   to obtain powerful and false discovery rate (FDR)-controlled flexible
   tests of differential expression.
     __________________________________________________________________

   Zhou et al. present QRscore, a robust non-parametric framework for
   statistical hypothesis testing that detects shifts in both the mean and
   variance of gene expression. QRscore maintains high power and strict
   FDR control, even under noise and zero inflation. It reveals
   age-related variance changes in both bulk RNA-seq and pseudo-bulked
   single-cell RNA-seq data.

Introduction

   In both bulk and single-cell genomics, differential expression analysis
   is a routine procedure for identifying genes whose expressions differ
   under varying conditions or treatments. To detect changes in average
   expression, widely used methods in bulk RNA sequencing (RNA-seq)
   differential analysis treat the dispersion of gene expression as a
   nuisance parameter, implementing techniques to shrink and stabilize the
   estimation of variance. The widespread use of these approaches can
   result in missed signals of gene expression variability that capture
   important biological mechanisms. In humans and C. elegans, variability
   indices are associated with gene regulation[33]^1^,[34]^2 and cell
   development,[35]^3 potentially explaining many biological phenomena,
   including adaptation to changing environments, disease susceptibility,
   and incomplete penetrance.[36]^4^,[37]^5 Recent studies also
   demonstrate that variability indices provide insights into
   aging,[38]^6^,[39]^7^,[40]^8 therefore underscoring the crucial need to
   incorporate changes in variability into tests of differential gene
   expression.

   To detect expression variability, previous studies on microarray
   datasets[41]^9^,[42]^10 applied traditional hypothesis testing methods,
   such as the F-test and Bartlett’s test. These tests are based on normal
   distribution assumptions of two samples and are designed to identify
   differences in standard deviation. To improve robustness, one approach
   is to analyze different measures for variability other than the
   standard deviation. Such measures include the median absolute
   deviation[43]^10 and the coefficient of variation.[44]^2 Another
   approach to ensure the robustness of statistical tests is to employ
   methods that are more robust to deviations from normality for two
   samples, such as Levene’s test and permutation tests.[45]^10 The method
   diffVar[46]^11 implements a modified version of Levene’s test that
   utilizes an empirical Bayes estimator instead of variance estimation
   for t statistics to obtain a more powerful test. However, this method
   maintains its parametric nature, making it susceptible to reduced
   performance due to deviations from the underlying assumptions.
   Additionally, non-parametric methods like the Kolmogorov-Smirnov (KS)
   test are not commonly used in the analysis of RNA-seq data because of
   their limited statistical power, especially when working with small
   sample sizes.

   Because bulk RNA-seq data often show overdispersion, particularly in
   gene expression, where the variance for most genes exceeds the mean,
   they are commonly modeled using the negative binomial (NB) or
   zero-inflated NB (ZINB) distribution. One method for detecting
   differences in dispersion applies generalized additive models for
   location, scale, and shape (GAMLSS),[47]^12 allowing the addition of
   regressions for dispersion parameters in the NB distribution into the
   main regression model and targeting changes in dispersion parameters
   through likelihood ratio tests.[48]^4 Another method, MDSeq,[49]^13
   utilizes a reparameterization of the ZINB model to test the mean and
   dispersion separately from generalized linear models (GLMs).
   DiffDist[50]^14 uses a hierarchical Bayesian model based on the NB
   distribution, which can provide tests for differential dispersion and
   distribution for RNA-seq data.

   Although these current methods are based on NB distributions and are
   widely applied in scenarios of small sample sizes, recent research has
   shown that parametric methods are prone to inflated false discovery
   rates (FDRs) in both studies with small sample sizes[51]^15 and
   population-level RNA-seq studies with large sample sizes.[52]^16
   Additionally, they are highly sensitive to preprocessing
   procedures[53]^17 and are not robust to model misspecification.[54]^18

   Here, we introduce QRscore (quantile rank score), a flexible rank-based
   method for conducting robust and powerful two-sample tests. QRscore
   generalizes the Mann-Whitney test and capitalizes on the adaptability
   of the chosen test statistic to yield substantial statistical power
   across various scenarios, including the detection of shifts in mean and
   variance. QRscore offers various approaches for designing test
   statistics that are effective in detecting shifts in mean and
   dispersion. These include a versatile naive approach applicable to
   various cases beyond count data, as well as methods designed
   specifically for RNA-seq count data with either a low occurrence of
   zeros or a higher proportion of inflated zeros. Also, we extend QRscore
   to K-sample tests, where
   [MATH: <mrow><mi>K</mi><mo linebreak="goodbreak"
   linebreakstyle="after">≥</mo><mn>3</mn></mrow> :MATH]
   . Through simulations that extensively compare QRscore with existing
   differential variability tests and differential centrality tests, we
   find that QRscore controls FDRs across diverse model specifications
   while simultaneously obtaining high statistical power. We use QRscore
   to detect both differentially dispersed genes (DDGs) with variance
   shifts and differentially expressed genes (DEGs) with mean shifts
   across multiple tissues in the Genotype-Tissue Expression (GTEx)
   project,[55]^19 showing that differential variability characterizes
   functionally distinct sets of genes from differential centrality. We
   also apply QRscore to pseudo-bulked single-cell RNA-seq (scRNA-seq)
   data, highlighting its potential to uncover cell-type-specific variance
   shifts that are obscured in bulk RNA-seq analyses. Our method
   contributes to the field of bioinformatics tools for genomics and can
   be adapted to other applications that require flexible non-parametric
   tests.

Results

Overview of QRscore

   QRscore is a rank-based hypothesis testing method that achieves robust
   false positive control through its non-parametric foundation while
   gaining power from parametric model-informed weighting ([56]Figure 1).

Figure 1.

   [57]Figure 1
   [58]Open in a new tab

   Overview of QRscore

   (1) Prefiltering and normalization: this initial data processing stage
   ensures quality and comparability of data. (2) Pooling and sorting: for
   gene g, expression levels
   [MATH: <mrow><msub><mi>X</mi><mi>g</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mrow><mo
   stretchy="true">{</mo><mrow><msub><mi>X</mi><mrow><mn>1</mn><mi>g</mi><
   /mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>X</mi><mrow><mi>k<
   /mi><mi>g</mi></mrow></msub></mrow><mo
   stretchy="true">}</mo></mrow></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>Y</mi><mi>g</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mrow><mo
   stretchy="true">{</mo><mrow><msub><mi>Y</mi><mrow><mn>1</mn><mi>g</mi><
   /mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>Y</mi><mrow><mi>n<
   /mi><mi>g</mi></mrow></msub></mrow><mo
   stretchy="true">}</mo></mrow></mrow> :MATH]
   are combined into a single ordered list. The rank of
   [MATH:
   <mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub></mrow>
   :MATH]
   , denoted as
   [MATH:
   <mrow><msub><mi>R</mi><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow>
   </msub></msub></mrow> :MATH]
   , is determined after pooling and sorting the expression values of gene
   g. (3) Constructing the test statistic: a test statistic t is
   calculated using a weighted function based on modified score test
   functions, which enhances test power through parametric estimation and
   incorporates a generalized K-sample test. (4) Hypothesis and inference:
   this stage involves evaluating the null hypothesis for gene
   [MATH:
   <mrow><mi>g</mi><mo>,</mo><msub><mi>H</mi><mrow><mn>0</mn><mi>g</mi></m
   row></msub></mrow> :MATH]
   ,
   [MATH: <mrow><msub><mi>P</mi><mi>g</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><msub><mi>Q</mi><mi>g</mi></msub></mrow>
   :MATH]
   , where
   [MATH: <mrow><msub><mi>P</mi><mi>g</mi></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>Q</mi><mi>g</mi></msub></mrow> :MATH]
   represent the distributions of gene expression samples
   [MATH: <mrow><msub><mi>X</mi><mi>g</mi></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>Y</mi><mi>g</mi></msub></mrow> :MATH]
   , respectively. p values are computed and adjusted using the
   Benjamini-Hochberg procedure to control the false discovery rate. (5)
   Identification of DDGs and DEGs: genes with significant
   expression-level differences between the conditions are identified, as
   represented in the QRscore-Var and QRscore-Mean diagrams. Details of
   QRscore-Var and QRscore-Mean are provided in the [59]STAR Methods and
   [60]Methods S1.

   Our null hypothesis posits that the two samples are drawn from the same
   distribution. Specifically, given a gene g, let
   [MATH:
   <mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub></mrow>
   :MATH]
   and
   [MATH:
   <mrow><msub><mi>Y</mi><mrow><mi>j</mi><mi>g</mi></mrow></msub></mrow>
   :MATH]
   denote the normalized gene expression for individuals i and j in the
   first and second conditions, respectively, where
   [MATH: <mrow><mi>i</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><m
   i>k</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>j</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>1</mn><mo>,</mo><mo>…</mo><mo>,</mo><m
   i>n</mi></mrow> :MATH]
   . With samples
   [MATH: <mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow>
   </msub><mo
   stretchy="true">}</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow
   ><mi>k</mi></msubsup><mover accent="true"><mo>∼</mo><mrow><mspace
   width="0.25em"></mspace><mi mathvariant="normal">i</mi><mo>.</mo><mi
   mathvariant="normal">i</mi><mo>.</mo><mi
   mathvariant="normal">d</mi><mo>.</mo><mspace
   width="0.25em"></mspace></mrow></mover><msub><mi>P</mi><mi>g</mi></msub
   ></mrow> :MATH]
   and
   [MATH: <mrow><msubsup><mrow><mo
   stretchy="true">{</mo><msub><mi>Y</mi><mrow><mi>j</mi><mi>g</mi></mrow>
   </msub><mo
   stretchy="true">}</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow
   ><mi>n</mi></msubsup><mover accent="true"><mo>∼</mo><mrow><mspace
   width="0.25em"></mspace><mi mathvariant="normal">i</mi><mo>.</mo><mi
   mathvariant="normal">i</mi><mo>.</mo><mi
   mathvariant="normal">d</mi><mo>.</mo><mspace
   width="0.25em"></mspace></mrow></mover><msub><mi>Q</mi><mi>g</mi></msub
   ></mrow> :MATH]
   , the null and alternative hypotheses for gene g are
   [MATH:
   <msub><mi>H</mi><mrow><mn>0</mn><mi>g</mi></mrow></msub><mo>:</mo><msub
   ><mi>P</mi><mi>g</mi></msub><mo>=</mo><msub><mi>Q</mi><mi>g</mi></msub>
   <mspace width="1em"></mspace><mtext>and</mtext><mspace
   width="1em"></mspace><msub><mi>H</mi><mrow><mn>1</mn><mi>g</mi></mrow><
   /msub><mo>:</mo><msub><mi>P</mi><mi>g</mi></msub><mo>≠</mo><msub><mi>Q<
   /mi><mi>g</mi></msub><mtext>.</mtext> :MATH]

   Our rank-based test statistic,
   [MATH: <mrow><msubsup><mi>T</mi><mi>g</mi><mi>w</mi></msubsup></mrow>
   :MATH]
   , is computed as the normalized sum of weighted ranks of the
   [MATH:
   <mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub></mrow>
   :MATH]
   observations using the formula
   [MATH: <mrow><msubsup><mi>T</mi><mi>g</mi><mi>w</mi></msubsup><mo
   linebreak="badbreak">=</mo><mfrac><mn>1</mn><msqrt><mrow><mi>n</mi><msp
   ace width="0.25em"></mspace><mo linebreak="badbreak">+</mo><mspace
   width="0.25em"></mspace><mi>k</mi></mrow></msqrt></mfrac><munderover><m
   o>∑</mo><mrow><mi>i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>k</mi></munderover><msu
   b><mi>w</mi><mi>g</mi></msub><mrow><mo
   stretchy="true">(</mo><mfrac><msub><mi>R</mi><msub><mi>X</mi><mrow><mi>
   i</mi><mi>g</mi></mrow></msub></msub><mrow><mi>n</mi><mspace
   width="0.25em"></mspace><mo linebreak="badbreak">+</mo><mspace
   width="0.25em"></mspace><mi>k</mi></mrow></mfrac><mo
   stretchy="true">)</mo></mrow><mtext>,</mtext></mrow> :MATH]
   (Equation 1)

   where
   [MATH:
   <mrow><msub><mi>R</mi><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow>
   </msub></msub></mrow> :MATH]
   denotes the rank of
   [MATH:
   <mrow><msub><mi>X</mi><mrow><mi>i</mi><mi>g</mi></mrow></msub></mrow>
   :MATH]
   after pooling and sorting observations from both groups. Here,
   [MATH: <mrow><msub><mi>w</mi><mi>g</mi></msub></mrow> :MATH]
   is a weight function chosen to maximize the test power against
   user-specific alternatives. Its dependence on g indicates that it can
   be chosen differently across different genes.

   While the space of all possible weight functions
   [MATH: <mrow><msub><mi>w</mi><mi>g</mi></msub></mrow> :MATH]
   induces a family of test statistics of the form shown in [61]Equation
   1, particular choices of
   [MATH: <mrow><msub><mi>w</mi><mi>g</mi></msub></mrow> :MATH]
   lead to well-known tests. For instance,
   [MATH:
   <mrow><msub><mi>w</mi><mi>g</mi></msub><mrow><mo>(</mo><mi>x</mi><mo>)<
   /mo></mrow><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mi>x</mi></mrow> :MATH]
   gives the Mann-Whitney test, while
   [MATH:
   <mrow><msub><mi>w</mi><mi>g</mi></msub><mrow><mo>(</mo><mi>x</mi><mo>)<
   /mo></mrow><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><msup><mrow><mo>(</mo><mrow><mi>x</mi><mo>
   −</mo><mn>0.5</mn></mrow><mo>)</mo></mrow><mn>2</mn></msup></mrow>
   :MATH]
   corresponds to Mood’s test.[62]^20

   Focusing on differential expression analysis applications, we design
   weight functions
   [MATH: <mrow><msub><mi>w</mi><mi>g</mi></msub></mrow> :MATH]
   by composing the score function (the gradient of the log likelihood
   with respect to the parameters) with the inverse cumulative
   distribution function (see [63]STAR Methods). We derive these score
   functions selectively from either an NB or a ZINB distribution, with
   QRscore-Var targeting DDGs and QRscore-Mean targeting DEGs. The
   rationale behind this design is to overcome the lower power of
   traditional rank-based models; our approach significantly boosts power
   under correct model specification and achieves efficiency comparable to
   the likelihood ratio test.

   Despite the tailored weight design, our method remains
   distribution-free under the null hypothesis
   [MATH: <msub><mi>H</mi><mrow><mn>0</mn><mi>g</mi></mrow></msub> :MATH]
   , which assumes identical distributions (
   [MATH: <mrow><msub><mi>P</mi><mi>g</mi></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><msub><mi>Q</mi><mi>g</mi></msub></mrow>
   :MATH]
   ). This is achieved because the rank-based inputs are uniformly
   distributed under the null, regardless of the underlying data
   distribution. This distribution-free property provides a significant
   advantage: it inherently reduces the impact of model misspecification
   and helps maintain strict control over false discoveries.

   Moreover, our method also extends to support K-sample tests (see
   [64]STAR Methods), broadening its applicability to multiple sample
   scenarios. Overall, QRscore is a versatile and powerful approach for
   detecting differential mean or dispersion in RNA-seq data, combining
   the robustness of non-parametric tests with the enhanced detection
   power of a special weight design. The details of the test statistic
   construction and its statistical inference are described in [65]STAR
   Methods and [66]Methods S1.

Simulation results

   In this section, we evaluate existing methods for detecting genes with
   changes in dispersion or mean across groups using extensive
   simulations. To benchmark two-sample tests, we use mean and dispersion
   parameters estimated from GTEx whole-blood RNA-seq data as the
   baseline. We simulate fold changes in dispersion or mean across
   randomly selected DDGs, varying the magnitude across datasets. To test
   robustness, we introduce outliers and zero inflation in different
   proportions. We also vary sample sizes to reflect realistic
   experimental designs. Lastly, we compare QRscore’s performance in both
   two-sample and three-sample tests with varied fold changes and sample
   sizes. Simulation details are provided in the [67]STAR Methods.

   Key simulation results are qualitatively summarized in [68]Table 1. The
   hypothesis tests can be categorized into two groups: those that rely on
   specific parametric distributions and those that are non-parametric.
   For DDG analysis, parametric methods include GAMLSS, MDSeq, diffVar,
   and Levene’s test, while Mood and QRscore-Var are distribution free.
   Similarly, for DEG analysis, methods like DESeq2, edgeR, and limma-voom
   rely on parametric models, whereas Mann-Whitney and QRscore-Mean are
   distribution free. To analyze the sensitivity and robustness of these
   methods, metrics such as the area under the precision-recall curve
   (AUPRC), power, and FDR under different simulation settings are
   computed. Non-parametric methods generally showed greater robustness to
   outliers and zeros. Notably, QRscore-Var and QRscore-Mean achieved the
   highest detection power for identifying DDGs and DEGs across various
   scenarios. In contrast, parametric methods were more susceptible to the
   negative effects of outliers and inflated zeros, often leading to an
   inflated FDR. Detailed results are provided in the following
   subsections.

Table 1.

   Summary table for simulation results
   Method Assumption Targeted direction Efficient in NB model Robust to
   outliers Robust to zeros
   QRscore non-parametric mean/variance yes yes yes
   GAMLSS NB variance yes no no
   MDSeq ZINB mean/variance no no yes
   Mood non-parametric variance no yes yes
   diffVar Gaussian variance yes yes no
   Levene Gaussian variance yes yes no
   DESeq2 NB mean yes no yes
   edgeR NB mean yes no yes
   limma-voom Gaussian mean yes yes no
   Mann-Whitney non-parametric mean yes yes yes
   [69]Open in a new tab

QRscore is a solid contender in differential dispersion and differential
expression analysis under correctly specified NB models

   In simulations based on NB distributions, QRscore-Var exhibits strong
   performance, slightly behind GAMLSS (setting 1 in [70]Table 2) and
   achieving comparably high detection power with FDR controlled
   ([71]Figure 2, “no zero inflation or outliers”). This close competition
   highlights QRscore-Var’s effectiveness in scenarios with correctly
   specified NB models. QRscore-Var also demonstrates higher power than
   naive methods like Mood’s test and Levene’s test ([72]Figure S1A). In
   smaller sample size settings (settings 2–3 in [73]Table 2), QRscore-Var
   consistently ranks as the second-best method after GAMLSS, reinforcing
   its reliability and showing notable improvements as sample sizes
   increase.

Table 2.

   Average area under the AUPRC across all fold changes in all seeds for
   each simulation setting
   Setting
   [MATH: <mrow><msub><mi>n</mi><mn>1</mn></msub></mrow> :MATH]
   [MATH: <mrow><msub><mi>n</mi><mn>2</mn></msub></mrow> :MATH]
   α (outliers) β (zeros) DiffVar GAMLSS Levene MDSeq Mood QRscore-Var
   1 113 249 0 0 0.774 0.798 0.775 0.701 0.749 0.788
   2 50 100 0 0 0.615 0.642 0.610 0.558 0.568 0.634
   3 50 50 0 0 0.518 0.575 0.525 0.516 0.479 0.527
   4 113 249 0.05 0 0.762 0.753 0.763 0.627 0.745 0.783
   5 113 249 0.1 0 0.754 0.714 0.756 0.573 0.745 0.780
   6 113 249 0.2 0 0.728 0.640 0.731 0.481 0.737 0.772
   7 113 249 0 0.05 0.741 0.736 0.753 0.699 0.742 0.783
   8 113 249 0 0.1 0.716 0.689 0.736 0.699 0.741 0.780
   9 113 249 0 0.2 0.650 0.594 0.692 0.698 0.732 0.772
   10 113 249 0.1 0.1 0.694 0.629 0.714 0.569 0.734 0.772
   11 50 100 0.1 0.1 0.529 0.429 0.555 0.399 0.550 0.613
   12 50 50 0.1 0.1 0.427 0.363 0.472 0.367 0.458 0.503
   [74]Open in a new tab

   The simulation results in this table are from scenarios where the
   dispersions of group 1 are greater than those of group 2 for true DDGs.
   For each setting, the greatest value is shown in bold. Sample size for
   group 1:
   [MATH: <mrow><msub><mi>n</mi><mn>1</mn></msub></mrow> :MATH]
   ; sample size for group 2:
   [MATH: <mrow><msub><mi>n</mi><mn>2</mn></msub></mrow> :MATH]
   ; proportion of genes with random outliers: α; and proportion of genes
   with random zeros: β. Sample sizes
   [MATH: <mrow><msub><mi>n</mi><mn>1</mn></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>113</mn></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>n</mi><mn>2</mn></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>249</mn></mrow> :MATH]
   are chosen from GTEx whole-blood tissue data, corresponding to the
   40–49 and 60–69 age groups, respectively. When perturbations are
   introduced, QRscore outperforms the others, and it ranks second when
   model assumptions are perfectly met. For a table that shows cases where
   the dispersions of group 2 are greater than those of group 1 for true
   DDGs, see [75]Table S1. AUPRC, area under the precision-recall curve.

Figure 2.

   [76]Figure 2
   [77]Open in a new tab

   Simulation results for QRscore-Var benchmarked against other methods
   for detecting DDGs

   The true DDGs have greater dispersion in group 1 than in group 2.

   (A) Line plots of power when true false discovery proportions are less
   than 0.05 across different dispersion fold changes.

   (B) Line plots of FDR when adjusted p values are less than 0.05 across
   different dispersion fold changes.

   Both plots use
   [MATH: <mrow><msub><mi>n</mi><mn>1</mn></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>113</mn></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>n</mi><mn>2</mn></msub><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>249</mn></mrow> :MATH]
   (GTEx derived). We present four different settings in the plot:
   baseline NB model, 10% genes with outliers, 10% genes with zero
   inflation, and a combination of both. For additional methods benchmark,
   see [78]Figure S1.

   Likewise, as shown in [79]Table S3, QRscore-Mean shows performance
   comparable to DESeq2 and edgeR in differential mean analysis, with a
   marginal difference in the AUPRC of about 1%. This slight difference in
   power and FDR is further illustrated in setting 1 of [80]Figure S2.

QRscore-Var leads in robustness and effectiveness in gene expression
variability analysis with outliers and excess zeros

   Next, we simulate data with random outliers (settings 4–6 in [81]Table
   2), where QRscore-Var emerges as the best-performing method based on
   the average AUPRC. In contrast, methods like GAMLSS and MDSeq show much
   lower AUPRC scores, even with a small proportion of outliers, due to
   their reliance on the NB assumption, which makes them vulnerable to
   model misspecification. For instance, in setting 4 with 5% random
   outliers, the AUPRC for GAMLSS drops by 5.6% (from 0.798 to 0.753),
   while QRscore-Var experiences only a 0.6% drop (from 0.788 to 0.783).
   As the outlier proportion increases to 20%, GAMLSS’s score declines by
   19.8% (to 0.640), whereas QRscore-Var remains stable, with only a 2.0%
   drop (to 0.772). MDSeq also shows considerable performance degradation,
   and both GAMLSS and MDSeq exhibit inflated FDRs in the presence of 10%
   outliers, as shown in [82]Figure 2B (“outliers").

   In simulations with random zeros (settings 7–9 in [83]Table 2),
   QRscore-Var again demonstrates robust performance. GAMLSS and diffVar,
   by contrast, show inflated FDRs and reduced power under zero-inflated
   conditions ([84]Figure 2, “zero inflation"). Despite MDSeq being
   designed to accommodate technical zeros, QRscore-Var consistently
   outperforms it across all fold changes. Among non-parametric methods,
   QRscore-Var also achieves the highest power in zero-inflated settings
   ([85]Figure S1, “zero inflation”). Furthermore, QRscore maintains
   reliable power and FDR control under both up- and down-regulation of
   variance, underscoring its robustness across different shift directions
   (see [86]Tables 2 and [87]S1).

   To assess QRscore-Var’s robustness in settings involving small samples,
   we perform additional simulations with group sizes ranging from 5 to
   15, incorporating noise through randomly inserted outliers and excess
   zeros. We observe that parametric methods such as MDSeq and GAMLSS
   exhibit severe FDR inflation as sample sizes decrease and fail to
   provide reliable discoveries with controlled FDRs. In contrast,
   QRscore-Var consistently maintains strict FDR control across all
   scenarios and achieves the highest power when group sizes are 8 or
   higher ([88]Figure S3A). We find that all methods have low power below
   group sizes of 8. Therefore, we recommend using at least 8 samples per
   group to ensure true discoveries while maintaining FDR control.

   Finally, to assess the robustness of QRscore-Var under additional model
   misspecification, we conduct simulations where gene expression follows
   bimodal distributions—an important scenario given that real RNA-seq
   data can exhibit multimodal patterns due to heterogeneous cell
   populations or mixed regulatory states. As shown in [89]Table S2,
   QRscore-Var consistently achieves the highest average AUPRC across all
   scenarios with outliers and excess zeros. These results demonstrate
   that QRscore-Var maintains strong FDR control and detection power even
   when the data distribution departs from standard unimodal assumptions.

QRscore-Mean leads in robustness and effectiveness in differential gene
expression analysis with outliers and excess zeros

   QRscore-Mean, similar to QRscore-Var, also shows the strongest
   performance in differential gene expression analysis under high
   proportions of random outliers (setting 6 in [90]Table S3). While edgeR
   and DESeq2 show marginally higher AUPRC scores with fewer outliers,
   they suffer from inflated FDRs, making their detection results less
   reliable in practice ([91]Figure S2B, “outliers"). Among the methods
   that control the FDR, QRscore-Mean performs the best in terms of the
   average AUPRC and detection power.

   QRscore-Mean outperforms other methods in differential gene expression
   analysis when there are excess zeros (settings 7–9 in [92]Table S3).
   This improved performance is due to its robustness in handling data
   distribution irregularities, particularly those characterized by excess
   zeros. While random zeros weaken the signal without inflating FDR, this
   loss of power impacts methods like edgeR and DESeq2, which lack
   explicit handling of zero inflation.

   We also conduct additional “small sample" simulations with group sizes
   ranging from 5 to 15, incorporating realistic noise through randomly
   inserted outliers and excess zeros. QRscore-Mean and the Mann-Whitney
   test preserve FDR control across all scenarios. Although DESeq2 and
   edgeR show slightly higher AUPRC values at very small sample sizes,
   their inflated FDR limits practical utility, as shown in [93]Figure
   S3B. In contrast, QRscore-Mean consistently offers more accurate
   results. Similar to variance shift testing, we recommend a minimum
   sample size of 8 per group to ensure sufficient detection power.

Three-sample tests show superiority compared to pairwise two-sample tests in
diverse simulations for QRscore-Var

   We investigated the performance of the three-sample QRscore test (see
   [94]STAR Methods) compared to the commonly used approach of sequential
   pairwise testing followed by p value aggregation. In practice, the user
   may prefer to test the heterogeneity of gene expression across multiple
   groups or conditions and obtain a single, well-calibrated p value for
   each gene. Results from [95]Figure S4 demonstrate that the three-sample
   QRscore test outperforms the pairwise two-sample tests in terms of the
   AUPRC, particularly when the sample sizes for different groups are not
   equal. Additional simulations explore scenarios where the three samples
   share the same sample size and cases with varying degrees of dispersion
   among all three groups. These simulations, which also include the
   presence of random outliers and zeros, show that the three-sample tests
   consistently provide superior AUPRC results, demonstrating better
   adaptability and precision even under model perturbations. Based on the
   simulation results, we will apply three-sample tests to GTEx data in
   [96]application to GTEx.

Application to GTEx

   Having demonstrated QRscore’s error control capabilities and enhanced
   power, we now apply it to the GTEx dataset. The GTEx project provides
   RNA-seq data from bulk tissue samples, gathered from nearly 1,000
   deceased donors across various age groups and both sexes. This dataset
   covers a wide array of tissue types, enabling tissue-specific gene
   expression analyses.

   The flexibility of QRscore allows for the detection of tissue-specific
   DDGs, as well as distinct gene sets that differ in mean expression
   across tissues. We also show that these gene sets exhibit unique
   functional effects.

QRscore-Var detects tissue-specific genes with changes in dispersion

   We apply QRscore-Var across 33 tissues to identify DDGs across three
   age groups: 20–39 (younger), 40–59 (middle), and 60–79 (older). We
   detected distinct trends in variance shifts from younger to older age
   groups ([97]Figure 3A). In tibial artery and skeletal muscle tissues,
   most DDGs show an increase in variance from the middle to the older age
   group. In contrast, in whole-blood tissue, we observe more DDGs showing
   an increase in variance from the younger to the middle age group,
   followed by a decrease from the middle to the older group. Our latter
   finding agrees with a recent study on gene expression in blood samples
   from twins, which reported a general decline in expression variability
   with age.[98]^6 In total, 10,362 unique tissue-specific DDGs were
   detected.

Figure 3.

   [99]Figure 3
   [100]Open in a new tab

   QRscore-Var results on GTEx data

   (A) A stacked bar plot illustrates the quantity of genes identified by
   QRscore-Var, categorized by variance trends. For instance, the
   “decrease-decrease" label indicates a progressive reduction in gene
   expression variance from the youngest age group (20–39) through the
   middle age group (40–59) and into the oldest age group (60–79).

   (B) Histogram of the number of tissues in which each DDG detected only
   by QRscore-Var is found. The red bars indicate DDGs found from more
   than 3 tissues.

   (C) UpSet plot showing overlaps of DDGs from the three-sample test and
   corresponding pairwise comparisons.

   (D) Bar plots display the log counts of DDGs identified in various
   tissues from female and subsampled male individuals, respectively.

   To better interpret these patterns, we perform pairwise two-sample
   tests between age groups ([101]Figure 3C). In the tibial artery,
   variance shifts are primarily driven by differences between the
   youngest group and the other two. In the transverse colon, the oldest
   group contributes most to the detected differences. These patterns
   reflect how age range and tissue context shape dispersion trends.

   The majority of DDGs are tissue specific, where each gene is detected
   in only one tissue ([102]Figure 3B). However, a few genes show
   differential dispersion across multiple tissues ([103]Table S5). For
   example, EDA2R, a gene previously known to be associated with aging in
   the lung[104]^21 and skeletal muscle,[105]^22 is detected as a DDG in
   16 tissues. Additionally, PTCHD4, previously associated with aging in
   the thyroid gland,[106]^23 is found to have differential dispersion in
   16 tissues.

   Sex can significantly influence the detection of variance shifts, and
   the GTEx dataset includes more male than female samples. To address
   this potential confounding factor, we stratify the data by sex and
   subsample male samples to match the number of females, aiming for a
   fair comparison ([107]Figure 3D). In tissues such as the tibial artery,
   transverse colon, and whole blood, the numbers of DDGs identified in
   male and female samples are similar, and there are sex-specific DDGs
   discovered in each tissue. Interestingly, within adipose visceral
   tissue, even after subsampling to equalize sample sizes, DDGs are
   identified in male samples but not in female samples. These findings
   provide insights into sex-specific DDGs during aging.

QRscore-Var identifies different sets of genes from QRscore-Mean

   In parallel with QRscore-Mean, designed to identify genes with mean
   expression changes using tailored weights, we used QRscore-Var for its
   capacity to detect variance shifts. Although QRscore-Var detects fewer
   genes overall compared to QRscore-Mean, it identifies 2,652 unique DDGs
   across at least one tissue that QRscore-Mean does not capture (that is,
   DDGs that are not also DEGs; see [108]Figure 4A), indicating its
   ability to uncover different aspects of gene expression changes.

Figure 4.

   [109]Figure 4
   [110]Open in a new tab

   Analysis of age-related differential gene expression across human
   tissues using QRscore-Var and QRscore-Mean

   (A) Pie charts visualizes the proportion of gene biotypes across
   various tissues, analyzed through the QRscore-Var and QRscore-Mean
   methods. The slices represent protein-coding genes, long noncoding RNAs
   (lncRNAs), and small noncoding RNAs (sncRNAs), with immunoglobulin (Ig)
   genes separated into their own category due to their substantial
   proportion in some tissues. Pseudogenes are excluded. A white circle at
   the center indicates the total gene count. Tissues are ranked by the
   number of DDGs unique to QRscore-Var.

   (B) Bar plots for selected tissues, highlighting the top 20 significant
   genes only detected by QRscore-Var. These plots juxtapose the
   significance levels (
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">−</mo><msub><mi>log</mi><mn>10</mn></msub><mrow>
   <mo>(</mo><mrow><mi>p</mi><mspace
   width="0.25em"></mspace><mtext>value</mtext></mrow><mo>)</mo></mrow></m
   row> :MATH]
   ) of identified genes between the QRscore-Var and QRscore-Mean
   analyses, with a focus on protein-coding genes.

   A notable number of DDGs are identified only by QRscore-Var, but not
   QRscore-Mean, in several tissues, in line with studies emphasizing
   variance shifts in the transverse colon and thyroid.[111]^24 While most
   DDGs are protein-coding genes, immunoglobulin (Ig) genes are also
   prominently detected, especially in tissues like esophagus mucosa,
   transverse colon, and lung ([112]Figure 4A). Their contribution
   suggests a link between age-related variability and immune-response
   alterations. These results indicate that variance shifts can reveal
   biologically meaningful changes that complement mean-based analyses in
   understanding aging.

   We summarize the top 20 protein-coding DDGs with age-associated
   variance shifts that are not classified as DEGs across six tissues (see
   [113]Figure 4B). Variance shifts in genes like C-C motif chemokine
   ligand 2 (CCL2) in whole blood and SAA1 in the transverse colon
   underscore their roles in shaping the inflammatory landscape during
   aging. CCL2 is pivotal in chronic immune responses during
   aging,[114]^25 while SAA1 contributes to an inflammatory
   microenvironment that varies across individuals, leading to diverse
   impacts on epithelial cell function via MMP9 up-regulation in aging
   tissues.[115]^26 In the tibial artery, ADRA2B regulates vascular tone,
   with its variance shifts suggesting its importance in maintaining
   vascular health with age. BNIP3 in skeletal muscle indicates diverse
   mitochondrial balance and autophagy responses, signaling varied
   susceptibility to age-related muscle degradation.[116]^27 Meanwhile,
   SERPINB1 in the lung safeguards against protease-induced tissue damage,
   with its variance shifts revealing intricate age-related changes that
   impact lung maintenance and contribute to chronic lung
   conditions.[117]^28 These gene expression shifts reveal the intricate,
   tissue-specific regulatory changes that contribute to aging and its
   associated dysfunctions.

   We compile lists of genes related to human aging from three databases,
   GenAge,[118]^29 LongevityMap,[119]^30 and CellAge[120]^31 to annotate
   age-related genes. We rank the genes based on p values for each tissue
   and age group, using QRscore-Var or QRscore-Mean tests, and apply
   one-sided Wilcoxon rank-sum tests to evaluate the methods’
   effectiveness at prioritizing aging-related genes. In [121]Table S4,
   QRscore-Var shows statistical significance (
   [MATH: <mrow><mi>p</mi><mo linebreak="goodbreak"
   linebreakstyle="after"><</mo><mn>0.05</mn></mrow> :MATH]
   ) in ranking aging-related genes in twelve out of fourteen tissues.
   However, the Wilcoxon test p values are less significant than
   QRscore-Mean in most tissues and are not significant in lung and lower
   leg skin (sun exposed). This suggests that, as the existing databases
   are mostly annotated by mean expression shifts, they may miss
   age-related DDGs that QRscore-Var can identify, and QRscore-Var
   provides a complementary perspective on the genetic basis of aging.

QRscore-Var and QRscore-Mean identify genes that are functionally distinct

   To investigate the biological functions of genes identified by QRscore,
   we performed Gene Ontology (GO) enrichment analysis for the top 500
   DDGs and DEGs ranked by QRscore-Var and QRscore-Mean, respectively. In
   whole blood, while QRscore-Mean highlights fundamental processes such
   as mRNA processing and RNA splicing, QRscore-Var reveals enrichment in
   pathways related to chemotaxis, angiogenesis, cell motility, and
   regulation of the ERK1/2 cascade—reflecting broader regulatory changes
   beyond mean expression shifts ([122]Figures 5A and 5B). Specifically,
   genes with large variance shifts, such as HRG, ANGPTL3, and SERPINE1,
   are linked to angiogenesis, while FGB and FGG are involved in immune
   signaling and blood coagulation ([123]Figure 5C). Additionally, the
   analysis reveals notable variance shift among age groups in the CCL
   gene family, such as CCL14, CCL18, and CCL20, which are enriched in the
   chemokine-mediated signaling pathway. This further highlights the
   intricate gene expression dynamics involved in immune responses and
   inflammation.

Figure 5.

   [124]Figure 5
   [125]Open in a new tab

   Comparative analysis of GO enrichment in whole-blood tissue based on
   significant genes identified by QRscore-Var and QRscore-Mean

   (A) A dot plot illustrates the enriched GO terms within the biological
   process ontology for the top 500 genes significantly detected by
   QRscore-Var.

   (B) Similar enrichment analysis results for genes identified by
   QRscore-Mean.

   (C) A heat plot, generated using ClusterProfiler, shows the fold change
   in expression variance between the 20–39 and 40–59 age groups for
   significant genes associated with the top 5 enriched GO terms
   identified by the QRscore-Var analysis.

   (D and E) A dot plot (D) and a density plot (E) further investigate the
   expression patterns of two highly coexpressed genes identified by
   QRscore-Var as being involved in angiogenesis.

   We further identify the top QRscore-Var genes associated with
   significantly enriched GO terms across 20 tissues and compare their
   QRscore-Mean rankings (see [126]Table S6). In whole blood, CCL2 is
   uniquely identified as significant by QRscore-Var and exhibits
   non-monotonic changes in dispersion with age. Among the top 25 DDGs
   contributing to enriched GO terms in this tissue, 23 are ranked within
   the top 200 by QRscore-Var but fall outside the top 1,000 under
   QRscore-Mean, underscoring the distinct insights gained from analyzing
   expression variability. We also identify coexpressed DDG pairs linked
   to key biological processes—for example, SERPINE1 and ANGPTL4 are
   strongly coexpressed (Spearman’s
   [MATH: <mrow><mi>ρ</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mn>0.808</mn></mrow> :MATH]
   ) and are implicated in angiogenesis and blood coagulation
   ([127]Figures 5D and 5E).

   Across other tissues—including the tibial artery, transverse colon, and
   skeletal muscle—QRscore-Var reveals biological processes and genes with
   significant expression variability during aging. In the tibial artery,
   variance shifts in DDGs highlight tissue-specific processes such as
   sprouting angiogenesis and the positive regulation of cell migration.
   Beyond these processes, SEMA6A, identified as a DDG but not a DEG, is
   associated with the semaphorin-plexin signaling pathway and the
   negative regulation of axon extension involved in axon guidance,
   pointing to additional molecular mechanisms at play in the tibial
   artery. In the transverse colon, DDGs are enriched for processes such
   as digestion, steroid and hormone metabolism, and xenobiotic response,
   highlighting their roles in maintaining gastrointestinal function,
   regulating metabolic activity, and responding to environmental stimuli.
   In skeletal muscle, coexpressed DDGs like COMP and FMOD underscore
   their importance in collagen fibril organization and maintaining the
   structural integrity of muscle tissue. The DDGs, enriched pathways, and
   coexpression patterns described here are shown in [128]Tables S6 and
   [129]S7.

Application to pseudo-bulked scRNA-seq data

   Here, we apply QRscore to data from the Asian Immune Diversity Atlas
   (AIDA),[130]^32 a large-scale scRNA-seq resource profiling peripheral
   blood mononuclear cells (PBMCs) from healthy donors across Asia. Our
   analysis focuses on individuals of East Asian identity to minimize
   confounding due to population heterogeneity (see [131]STAR Methods).
   The single-cell resolution of AIDA enables detailed exploration of
   age-related regulatory changes in specific immune cell types, which are
   often obscured in bulk tissue analyses. To increase detection power,
   stabilize variance estimates, and enable robust population-level
   statistical testing, we aggregate single-cell profiles into pseudo-bulk
   samples by donor, age group, and cell type. This approach preserves
   cell-type resolution while capturing inter-individual variability,
   aligning with our goal of characterizing gene expression variance
   across the population. Applying QRscore-Var across 25 immune cell types
   and three age groups (20–34, 35–49, and 50+), we detect both
   cell-type-specific and shared patterns of variance shifts, highlighting
   distinct regulatory programs across the immune system.

   We identify widespread shifts in gene expression variability with
   aging. Among the analyzed cell types, naive thymus-derived
   CD4-positive, alpha-beta T cells; central memory CD4-positive,
   alpha-beta T cells; and CD16-positive, CD56-dim natural killer cells
   exhibit the highest DDG counts, suggesting greater transcriptional
   reprogramming in these cell types during aging ([132]Figure S5A). Most
   DDGs exhibit monotonic variance trajectories across age groups, with
   the majority showing either progressive increases (increase-increase)
   or decreases (decrease-decrease) in dispersion from younger to older
   individuals. A smaller subset of DDGs display non-monotonic patterns,
   with predominantly increase-decrease trajectories and a minority
   showing decrease-increase variance shifts. Although the majority of
   DDGs are specific to individual cell types ([133]Figure S5B), several
   genes, such as RBM38, RPL27A, and BRD4, are recurrently identified
   across multiple immune cell-type populations. Notably, RBM38, a
   p53-regulated RNA-binding protein involved in mRNA stability and
   cellular senescence, is linked to aging and tumor suppression.[134]^33
   Furthermore, pairwise comparisons reveal that variance shifts are often
   strongest between the youngest and oldest groups ([135]Figure S5C). In
   the top three cell types with the highest DDG counts, two-sample tests
   show that more DDGs exhibit significant variance shifts in the 20–34
   vs. 50+ and 35–49 vs. 50+ comparisons, indicating that the oldest age
   group displays the most pronounced and distinct variance patterns.

   Next, we identify genes with significant variance shifts detected in
   both GTEx whole-blood and pseudo-bulked scRNA-seq analyses, with
   overlaps of 11%–32% across immune cell types ([136]Figure S6A). Among
   the shared genes, key regulators such as RBM38, RGS1, and the NR4A
   family (NR4A1/2/3) emerge as recurrent across multiple cell types,
   suggesting their involvement in a core aging-related regulatory
   program. These genes are well-established mediators of immune
   activation, stress responses, and inflammatory regulation—processes
   tightly linked to immunosenescence.

   To explore the functional relevance of these shared DDGs, we perform
   ontology enrichment analysis on the overlap between pseudo-bulk and
   bulk results for CD14-positive monocytes. The analysis reveals
   enrichment for inflammatory cytokine responses, including pathways
   related to interleukin-1, tumor necrosis factor (TNF), and chemokine
   signaling—key processes in age-associated inflammation ([137]Figure
   S6B). Genes such as ZC3H12A, CXCL8, CCL4, and CCL20 exemplify how
   transcriptional variability in monocytes is linked to enhanced
   responsiveness to pro-inflammatory cytokines and immune aging
   ([138]Figure S6C).

   While some genes are shared between the GTEx whole-blood and
   pseudo-bulked scRNA-seq analyses, many DDGs are uniquely identified in
   the pseudo-bulk analysis, highlighting the added resolution provided by
   cell-type-specific variance analysis. [139]Figure S5D shows bar plots
   of the top 20 DDGs for the three cell types with the highest DDG
   counts, including cell-type-specific DDGs shown in black. In naive
   thymus-derived CD4-positive, alpha-beta T cells, FURIN and EXT1 suggest
   modulation of cytokine signaling and extracellular immune interactions.
   In CD16-positive, CD56-dim natural killer cells, variance shifts in
   NFIL3 and IL1B reflect dynamic regulation of development and
   inflammatory responses. In central memory CD4-positive, alpha-beta T
   cells, CSNK1E and KLHL11 exhibit variability changes linked to immune
   rhythm and tolerance. These genes highlight the unique contributions of
   cell-type-specific analysis in uncovering regulatory mechanisms with
   age-related variability that were missed in our bulk analysis.

Discussion

   We introduced QRscore, a suite of rank-based non-parametric tests for
   identifying DDGs and DEGs from bulk RNA-seq data. We showed, through
   extensive simulations, that QRscore is robust at detecting both mean
   and dispersion shifts, particularly under challenging and realistic
   conditions where datasets have inflated zeros and outliers. Notably,
   QRscore controls false discovery well and achieves high power even in
   such settings. Applying QRscore to GTEx data, we find different DDGs
   and DEGs in each tissue that are also functionally distinct,
   implicating novel candidates for further biological investigation.

   Model misspecification is a critical issue in RNA-seq data analysis
   that potentially explains the inflation of false discoveries observed
   in practice. For instance, a study[140]^34 showed that methods relying
   on specific distributional assumptions, like NB or ZINB, identify
   different sets of DEGs across mouse strains and human tissues, with
   elevated FDRs in the presence of outliers. As a rank-based
   non-parametric method, QRscore is less sensitive to data distribution
   and effectively handles technical zeros and outliers. By being less
   prone to model misspecification, QRscore yields more reliable and
   replicable results, improving the likelihood of downstream experimental
   validation.

   As pointed out in the [141]introduction, gene expression variability
   plays an important role in aging, adaptation to changing environments,
   and disease susceptibility. Focusing on aging in our present work,
   senescence-related changes in gene expression variability are often
   driven by shifting regulatory efficiency and the accumulation of
   epigenetic modifications,[142]^35 which may go unnoticed when focusing
   solely on mean expression levels. QRscore fills this gap by detecting
   variance shifts, revealing biologically relevant DDGs that highlight
   aging-related processes like angiogenesis in whole blood. These
   findings reinforce the value of analyzing gene expression variability
   to uncover insights beyond mean-level changes.

   We further demonstrated that QRscore can be applied to pseudo-bulked
   scRNA-seq data, enabling analysis of cell-type-specific regulatory
   variability. This is important because bulk tissue data, like those of
   GTEx, aggregate signals from mixed cell populations, potentially
   obscuring variability specific to a particular lineage. In our
   analysis, many DDGs with cell-type-specific roles were uniquely
   identified in pseudo-bulk data. Additionally, QRscore identified shared
   signals across GTEx and pseudo-bulked scRNA-seq analyses, despite
   differences in sequencing technology, cell population composition, cell
   numbers per sample, and population ancestry. Together, these results
   highlight QRscore’s ability to uncover meaningful, cell-type-specific
   insights and robust signals of gene expression variability across
   diverse contexts.

   The future of differential expression analysis lies in detecting
   diverse expression changes while accounting for confounders and
   technical noise. Flexible methods like QRscore support this goal by
   capturing distributional shifts. As tools improve, biologists will gain
   deeper insight into how gene expression varies across myriad contexts.

Limitations of the study

   One limitation of QRscore is that it tends to be conservative when the
   sample size is small. This is an inherent constraint of rank-based
   tests, which generally require larger sample sizes to ensure reliable
   decisions. As large-scale datasets like GTEx become more common, future
   work will extend QRscore to even larger and more diverse populations.

Resource availability

Lead contact

   Requests for further information and resources should be directed to
   and will be fulfilled by the lead contact, Yun S. Song
   (yss@berkeley.edu).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       This paper analyzes existing, publicly available data. The
       accession numbers for the datasets are listed in the [143]key
       resources table. The data used to reproduce the figures in this
       paper are available at
       [144]https://doi.org/10.5281/zenodo.16416084.
     * •
       The R package for QRscore is available at
       [145]https://github.com/songlab-cal/QRscore and Bioconductor. The
       code for reproducing the figures in this paper is available at
       [146]https://doi.org/10.5281/zenodo.16416084.
     * •
       Any additional information required to reanalyze the data reported
       in this work paper is available from the lead contact upon request.

Acknowledgments