Graphical abstract
graphic file with name fx1.jpg
[32]Open in a new tab
Highlights
* •
QRscore detects both variance and mean shifts in gene expression
* •
It controls FDR and achieves high power under noise and zero
inflation
* •
Tissue-specific, age-related differentially dispersed genes are
identified from GTEx data
* •
Cell-type-specific variance shifts associated with aging are
revealed from scRNA-seq data
Motivation
The increasing availability of large-scale RNA-seq datasets from both
bulk and single-cell sources increases our power to detect biologically
meaningful shifts in mean gene expression and expression variability.
However, existing methods often prioritize mean shifts or rely on
strong parametric assumptions that may not hold across heterogeneous
populations. This has created a gap in methods that adapt to the
alternative hypothesis of interest while also controlling type I and
type II errors without requiring parametric assumptions to formulate
the null hypothesis. We thus develop QRscore, an R/Bioconductor
software that leverages non-parametric statistics and asymptotic theory
to obtain powerful and false discovery rate (FDR)-controlled flexible
tests of differential expression.
__________________________________________________________________
Zhou et al. present QRscore, a robust non-parametric framework for
statistical hypothesis testing that detects shifts in both the mean and
variance of gene expression. QRscore maintains high power and strict
FDR control, even under noise and zero inflation. It reveals
age-related variance changes in both bulk RNA-seq and pseudo-bulked
single-cell RNA-seq data.
Introduction
In both bulk and single-cell genomics, differential expression analysis
is a routine procedure for identifying genes whose expressions differ
under varying conditions or treatments. To detect changes in average
expression, widely used methods in bulk RNA sequencing (RNA-seq)
differential analysis treat the dispersion of gene expression as a
nuisance parameter, implementing techniques to shrink and stabilize the
estimation of variance. The widespread use of these approaches can
result in missed signals of gene expression variability that capture
important biological mechanisms. In humans and C. elegans, variability
indices are associated with gene regulation[33]^1^,[34]^2 and cell
development,[35]^3 potentially explaining many biological phenomena,
including adaptation to changing environments, disease susceptibility,
and incomplete penetrance.[36]^4^,[37]^5 Recent studies also
demonstrate that variability indices provide insights into
aging,[38]^6^,[39]^7^,[40]^8 therefore underscoring the crucial need to
incorporate changes in variability into tests of differential gene
expression.
To detect expression variability, previous studies on microarray
datasets[41]^9^,[42]^10 applied traditional hypothesis testing methods,
such as the F-test and Bartlett’s test. These tests are based on normal
distribution assumptions of two samples and are designed to identify
differences in standard deviation. To improve robustness, one approach
is to analyze different measures for variability other than the
standard deviation. Such measures include the median absolute
deviation[43]^10 and the coefficient of variation.[44]^2 Another
approach to ensure the robustness of statistical tests is to employ
methods that are more robust to deviations from normality for two
samples, such as Levene’s test and permutation tests.[45]^10 The method
diffVar[46]^11 implements a modified version of Levene’s test that
utilizes an empirical Bayes estimator instead of variance estimation
for t statistics to obtain a more powerful test. However, this method
maintains its parametric nature, making it susceptible to reduced
performance due to deviations from the underlying assumptions.
Additionally, non-parametric methods like the Kolmogorov-Smirnov (KS)
test are not commonly used in the analysis of RNA-seq data because of
their limited statistical power, especially when working with small
sample sizes.
Because bulk RNA-seq data often show overdispersion, particularly in
gene expression, where the variance for most genes exceeds the mean,
they are commonly modeled using the negative binomial (NB) or
zero-inflated NB (ZINB) distribution. One method for detecting
differences in dispersion applies generalized additive models for
location, scale, and shape (GAMLSS),[47]^12 allowing the addition of
regressions for dispersion parameters in the NB distribution into the
main regression model and targeting changes in dispersion parameters
through likelihood ratio tests.[48]^4 Another method, MDSeq,[49]^13
utilizes a reparameterization of the ZINB model to test the mean and
dispersion separately from generalized linear models (GLMs).
DiffDist[50]^14 uses a hierarchical Bayesian model based on the NB
distribution, which can provide tests for differential dispersion and
distribution for RNA-seq data.
Although these current methods are based on NB distributions and are
widely applied in scenarios of small sample sizes, recent research has
shown that parametric methods are prone to inflated false discovery
rates (FDRs) in both studies with small sample sizes[51]^15 and
population-level RNA-seq studies with large sample sizes.[52]^16
Additionally, they are highly sensitive to preprocessing
procedures[53]^17 and are not robust to model misspecification.[54]^18
Here, we introduce QRscore (quantile rank score), a flexible rank-based
method for conducting robust and powerful two-sample tests. QRscore
generalizes the Mann-Whitney test and capitalizes on the adaptability
of the chosen test statistic to yield substantial statistical power
across various scenarios, including the detection of shifts in mean and
variance. QRscore offers various approaches for designing test
statistics that are effective in detecting shifts in mean and
dispersion. These include a versatile naive approach applicable to
various cases beyond count data, as well as methods designed
specifically for RNA-seq count data with either a low occurrence of
zeros or a higher proportion of inflated zeros. Also, we extend QRscore
to K-sample tests, where
[MATH: K≥3 :MATH]
. Through simulations that extensively compare QRscore with existing
differential variability tests and differential centrality tests, we
find that QRscore controls FDRs across diverse model specifications
while simultaneously obtaining high statistical power. We use QRscore
to detect both differentially dispersed genes (DDGs) with variance
shifts and differentially expressed genes (DEGs) with mean shifts
across multiple tissues in the Genotype-Tissue Expression (GTEx)
project,[55]^19 showing that differential variability characterizes
functionally distinct sets of genes from differential centrality. We
also apply QRscore to pseudo-bulked single-cell RNA-seq (scRNA-seq)
data, highlighting its potential to uncover cell-type-specific variance
shifts that are obscured in bulk RNA-seq analyses. Our method
contributes to the field of bioinformatics tools for genomics and can
be adapted to other applications that require flexible non-parametric
tests.
Results
Overview of QRscore
QRscore is a rank-based hypothesis testing method that achieves robust
false positive control through its non-parametric foundation while
gaining power from parametric model-informed weighting ([56]Figure 1).
Figure 1.
[57]Figure 1
[58]Open in a new tab
Overview of QRscore
(1) Prefiltering and normalization: this initial data processing stage
ensures quality and comparability of data. (2) Pooling and sorting: for
gene g, expression levels
[MATH: Xg={X1g<
/mrow>,…,Xk<
/mi>g} :MATH]
and
[MATH: Yg={Y1g<
/mrow>,…,Yn<
/mi>g} :MATH]
are combined into a single ordered list. The rank of
[MATH:
Xig
:MATH]
, denoted as
[MATH:
RXig
:MATH]
, is determined after pooling and sorting the expression values of gene
g. (3) Constructing the test statistic: a test statistic t is
calculated using a weighted function based on modified score test
functions, which enhances test power through parametric estimation and
incorporates a generalized K-sample test. (4) Hypothesis and inference:
this stage involves evaluating the null hypothesis for gene
[MATH:
g,H0g :MATH]
,
[MATH: Pg=Qg
:MATH]
, where
[MATH: Pg :MATH]
and
[MATH: Qg :MATH]
represent the distributions of gene expression samples
[MATH: Xg :MATH]
and
[MATH: Yg :MATH]
, respectively. p values are computed and adjusted using the
Benjamini-Hochberg procedure to control the false discovery rate. (5)
Identification of DDGs and DEGs: genes with significant
expression-level differences between the conditions are identified, as
represented in the QRscore-Var and QRscore-Mean diagrams. Details of
QRscore-Var and QRscore-Mean are provided in the [59]STAR Methods and
[60]Methods S1.
Our null hypothesis posits that the two samples are drawn from the same
distribution. Specifically, given a gene g, let
[MATH:
Xig
:MATH]
and
[MATH:
Yjg
:MATH]
denote the normalized gene expression for individuals i and j in the
first and second conditions, respectively, where
[MATH: i=1,…,k :MATH]
and
[MATH: j=1,…,n :MATH]
. With samples
[MATH: {Xig
}i=1k∼i.i.d.Pg :MATH]
and
[MATH: {Yjg
}j=1n∼i.i.d.Qg :MATH]
, the null and alternative hypotheses for gene g are
[MATH:
H0g:Pg=Qg
andH1g<
/msub>:Pg≠Q<
/mi>g. :MATH]
Our rank-based test statistic,
[MATH: Tgw
:MATH]
, is computed as the normalized sum of weighted ranks of the
[MATH:
Xig
:MATH]
observations using the formula
[MATH: Tgw=1n+k∑i=1kwg(RX
ign+k), :MATH]
(Equation 1)
where
[MATH:
RXig
:MATH]
denotes the rank of
[MATH:
Xig
:MATH]
after pooling and sorting observations from both groups. Here,
[MATH: wg :MATH]
is a weight function chosen to maximize the test power against
user-specific alternatives. Its dependence on g indicates that it can
be chosen differently across different genes.
While the space of all possible weight functions
[MATH: wg :MATH]
induces a family of test statistics of the form shown in [61]Equation
1, particular choices of
[MATH: wg :MATH]
lead to well-known tests. For instance,
[MATH:
wg(x)<
/mo>=x :MATH]
gives the Mann-Whitney test, while
[MATH:
wg(x)<
/mo>=(x
−0.5)2
:MATH]
corresponds to Mood’s test.[62]^20
Focusing on differential expression analysis applications, we design
weight functions
[MATH: wg :MATH]
by composing the score function (the gradient of the log likelihood
with respect to the parameters) with the inverse cumulative
distribution function (see [63]STAR Methods). We derive these score
functions selectively from either an NB or a ZINB distribution, with
QRscore-Var targeting DDGs and QRscore-Mean targeting DEGs. The
rationale behind this design is to overcome the lower power of
traditional rank-based models; our approach significantly boosts power
under correct model specification and achieves efficiency comparable to
the likelihood ratio test.
Despite the tailored weight design, our method remains
distribution-free under the null hypothesis
[MATH: H0g :MATH]
, which assumes identical distributions (
[MATH: Pg=Qg
:MATH]
). This is achieved because the rank-based inputs are uniformly
distributed under the null, regardless of the underlying data
distribution. This distribution-free property provides a significant
advantage: it inherently reduces the impact of model misspecification
and helps maintain strict control over false discoveries.
Moreover, our method also extends to support K-sample tests (see
[64]STAR Methods), broadening its applicability to multiple sample
scenarios. Overall, QRscore is a versatile and powerful approach for
detecting differential mean or dispersion in RNA-seq data, combining
the robustness of non-parametric tests with the enhanced detection
power of a special weight design. The details of the test statistic
construction and its statistical inference are described in [65]STAR
Methods and [66]Methods S1.
Simulation results
In this section, we evaluate existing methods for detecting genes with
changes in dispersion or mean across groups using extensive
simulations. To benchmark two-sample tests, we use mean and dispersion
parameters estimated from GTEx whole-blood RNA-seq data as the
baseline. We simulate fold changes in dispersion or mean across
randomly selected DDGs, varying the magnitude across datasets. To test
robustness, we introduce outliers and zero inflation in different
proportions. We also vary sample sizes to reflect realistic
experimental designs. Lastly, we compare QRscore’s performance in both
two-sample and three-sample tests with varied fold changes and sample
sizes. Simulation details are provided in the [67]STAR Methods.
Key simulation results are qualitatively summarized in [68]Table 1. The
hypothesis tests can be categorized into two groups: those that rely on
specific parametric distributions and those that are non-parametric.
For DDG analysis, parametric methods include GAMLSS, MDSeq, diffVar,
and Levene’s test, while Mood and QRscore-Var are distribution free.
Similarly, for DEG analysis, methods like DESeq2, edgeR, and limma-voom
rely on parametric models, whereas Mann-Whitney and QRscore-Mean are
distribution free. To analyze the sensitivity and robustness of these
methods, metrics such as the area under the precision-recall curve
(AUPRC), power, and FDR under different simulation settings are
computed. Non-parametric methods generally showed greater robustness to
outliers and zeros. Notably, QRscore-Var and QRscore-Mean achieved the
highest detection power for identifying DDGs and DEGs across various
scenarios. In contrast, parametric methods were more susceptible to the
negative effects of outliers and inflated zeros, often leading to an
inflated FDR. Detailed results are provided in the following
subsections.
Table 1.
Summary table for simulation results
Method Assumption Targeted direction Efficient in NB model Robust to
outliers Robust to zeros
QRscore non-parametric mean/variance yes yes yes
GAMLSS NB variance yes no no
MDSeq ZINB mean/variance no no yes
Mood non-parametric variance no yes yes
diffVar Gaussian variance yes yes no
Levene Gaussian variance yes yes no
DESeq2 NB mean yes no yes
edgeR NB mean yes no yes
limma-voom Gaussian mean yes yes no
Mann-Whitney non-parametric mean yes yes yes
[69]Open in a new tab
QRscore is a solid contender in differential dispersion and differential
expression analysis under correctly specified NB models
In simulations based on NB distributions, QRscore-Var exhibits strong
performance, slightly behind GAMLSS (setting 1 in [70]Table 2) and
achieving comparably high detection power with FDR controlled
([71]Figure 2, “no zero inflation or outliers”). This close competition
highlights QRscore-Var’s effectiveness in scenarios with correctly
specified NB models. QRscore-Var also demonstrates higher power than
naive methods like Mood’s test and Levene’s test ([72]Figure S1A). In
smaller sample size settings (settings 2–3 in [73]Table 2), QRscore-Var
consistently ranks as the second-best method after GAMLSS, reinforcing
its reliability and showing notable improvements as sample sizes
increase.
Table 2.
Average area under the AUPRC across all fold changes in all seeds for
each simulation setting
Setting
[MATH: n1 :MATH]
[MATH: n2 :MATH]
α (outliers) β (zeros) DiffVar GAMLSS Levene MDSeq Mood QRscore-Var
1 113 249 0 0 0.774 0.798 0.775 0.701 0.749 0.788
2 50 100 0 0 0.615 0.642 0.610 0.558 0.568 0.634
3 50 50 0 0 0.518 0.575 0.525 0.516 0.479 0.527
4 113 249 0.05 0 0.762 0.753 0.763 0.627 0.745 0.783
5 113 249 0.1 0 0.754 0.714 0.756 0.573 0.745 0.780
6 113 249 0.2 0 0.728 0.640 0.731 0.481 0.737 0.772
7 113 249 0 0.05 0.741 0.736 0.753 0.699 0.742 0.783
8 113 249 0 0.1 0.716 0.689 0.736 0.699 0.741 0.780
9 113 249 0 0.2 0.650 0.594 0.692 0.698 0.732 0.772
10 113 249 0.1 0.1 0.694 0.629 0.714 0.569 0.734 0.772
11 50 100 0.1 0.1 0.529 0.429 0.555 0.399 0.550 0.613
12 50 50 0.1 0.1 0.427 0.363 0.472 0.367 0.458 0.503
[74]Open in a new tab
The simulation results in this table are from scenarios where the
dispersions of group 1 are greater than those of group 2 for true DDGs.
For each setting, the greatest value is shown in bold. Sample size for
group 1:
[MATH: n1 :MATH]
; sample size for group 2:
[MATH: n2 :MATH]
; proportion of genes with random outliers: α; and proportion of genes
with random zeros: β. Sample sizes
[MATH: n1=113 :MATH]
and
[MATH: n2=249 :MATH]
are chosen from GTEx whole-blood tissue data, corresponding to the
40–49 and 60–69 age groups, respectively. When perturbations are
introduced, QRscore outperforms the others, and it ranks second when
model assumptions are perfectly met. For a table that shows cases where
the dispersions of group 2 are greater than those of group 1 for true
DDGs, see [75]Table S1. AUPRC, area under the precision-recall curve.
Figure 2.
[76]Figure 2
[77]Open in a new tab
Simulation results for QRscore-Var benchmarked against other methods
for detecting DDGs
The true DDGs have greater dispersion in group 1 than in group 2.
(A) Line plots of power when true false discovery proportions are less
than 0.05 across different dispersion fold changes.
(B) Line plots of FDR when adjusted p values are less than 0.05 across
different dispersion fold changes.
Both plots use
[MATH: n1=113 :MATH]
and
[MATH: n2=249 :MATH]
(GTEx derived). We present four different settings in the plot:
baseline NB model, 10% genes with outliers, 10% genes with zero
inflation, and a combination of both. For additional methods benchmark,
see [78]Figure S1.
Likewise, as shown in [79]Table S3, QRscore-Mean shows performance
comparable to DESeq2 and edgeR in differential mean analysis, with a
marginal difference in the AUPRC of about 1%. This slight difference in
power and FDR is further illustrated in setting 1 of [80]Figure S2.
QRscore-Var leads in robustness and effectiveness in gene expression
variability analysis with outliers and excess zeros
Next, we simulate data with random outliers (settings 4–6 in [81]Table
2), where QRscore-Var emerges as the best-performing method based on
the average AUPRC. In contrast, methods like GAMLSS and MDSeq show much
lower AUPRC scores, even with a small proportion of outliers, due to
their reliance on the NB assumption, which makes them vulnerable to
model misspecification. For instance, in setting 4 with 5% random
outliers, the AUPRC for GAMLSS drops by 5.6% (from 0.798 to 0.753),
while QRscore-Var experiences only a 0.6% drop (from 0.788 to 0.783).
As the outlier proportion increases to 20%, GAMLSS’s score declines by
19.8% (to 0.640), whereas QRscore-Var remains stable, with only a 2.0%
drop (to 0.772). MDSeq also shows considerable performance degradation,
and both GAMLSS and MDSeq exhibit inflated FDRs in the presence of 10%
outliers, as shown in [82]Figure 2B (“outliers").
In simulations with random zeros (settings 7–9 in [83]Table 2),
QRscore-Var again demonstrates robust performance. GAMLSS and diffVar,
by contrast, show inflated FDRs and reduced power under zero-inflated
conditions ([84]Figure 2, “zero inflation"). Despite MDSeq being
designed to accommodate technical zeros, QRscore-Var consistently
outperforms it across all fold changes. Among non-parametric methods,
QRscore-Var also achieves the highest power in zero-inflated settings
([85]Figure S1, “zero inflation”). Furthermore, QRscore maintains
reliable power and FDR control under both up- and down-regulation of
variance, underscoring its robustness across different shift directions
(see [86]Tables 2 and [87]S1).
To assess QRscore-Var’s robustness in settings involving small samples,
we perform additional simulations with group sizes ranging from 5 to
15, incorporating noise through randomly inserted outliers and excess
zeros. We observe that parametric methods such as MDSeq and GAMLSS
exhibit severe FDR inflation as sample sizes decrease and fail to
provide reliable discoveries with controlled FDRs. In contrast,
QRscore-Var consistently maintains strict FDR control across all
scenarios and achieves the highest power when group sizes are 8 or
higher ([88]Figure S3A). We find that all methods have low power below
group sizes of 8. Therefore, we recommend using at least 8 samples per
group to ensure true discoveries while maintaining FDR control.
Finally, to assess the robustness of QRscore-Var under additional model
misspecification, we conduct simulations where gene expression follows
bimodal distributions—an important scenario given that real RNA-seq
data can exhibit multimodal patterns due to heterogeneous cell
populations or mixed regulatory states. As shown in [89]Table S2,
QRscore-Var consistently achieves the highest average AUPRC across all
scenarios with outliers and excess zeros. These results demonstrate
that QRscore-Var maintains strong FDR control and detection power even
when the data distribution departs from standard unimodal assumptions.
QRscore-Mean leads in robustness and effectiveness in differential gene
expression analysis with outliers and excess zeros
QRscore-Mean, similar to QRscore-Var, also shows the strongest
performance in differential gene expression analysis under high
proportions of random outliers (setting 6 in [90]Table S3). While edgeR
and DESeq2 show marginally higher AUPRC scores with fewer outliers,
they suffer from inflated FDRs, making their detection results less
reliable in practice ([91]Figure S2B, “outliers"). Among the methods
that control the FDR, QRscore-Mean performs the best in terms of the
average AUPRC and detection power.
QRscore-Mean outperforms other methods in differential gene expression
analysis when there are excess zeros (settings 7–9 in [92]Table S3).
This improved performance is due to its robustness in handling data
distribution irregularities, particularly those characterized by excess
zeros. While random zeros weaken the signal without inflating FDR, this
loss of power impacts methods like edgeR and DESeq2, which lack
explicit handling of zero inflation.
We also conduct additional “small sample" simulations with group sizes
ranging from 5 to 15, incorporating realistic noise through randomly
inserted outliers and excess zeros. QRscore-Mean and the Mann-Whitney
test preserve FDR control across all scenarios. Although DESeq2 and
edgeR show slightly higher AUPRC values at very small sample sizes,
their inflated FDR limits practical utility, as shown in [93]Figure
S3B. In contrast, QRscore-Mean consistently offers more accurate
results. Similar to variance shift testing, we recommend a minimum
sample size of 8 per group to ensure sufficient detection power.
Three-sample tests show superiority compared to pairwise two-sample tests in
diverse simulations for QRscore-Var
We investigated the performance of the three-sample QRscore test (see
[94]STAR Methods) compared to the commonly used approach of sequential
pairwise testing followed by p value aggregation. In practice, the user
may prefer to test the heterogeneity of gene expression across multiple
groups or conditions and obtain a single, well-calibrated p value for
each gene. Results from [95]Figure S4 demonstrate that the three-sample
QRscore test outperforms the pairwise two-sample tests in terms of the
AUPRC, particularly when the sample sizes for different groups are not
equal. Additional simulations explore scenarios where the three samples
share the same sample size and cases with varying degrees of dispersion
among all three groups. These simulations, which also include the
presence of random outliers and zeros, show that the three-sample tests
consistently provide superior AUPRC results, demonstrating better
adaptability and precision even under model perturbations. Based on the
simulation results, we will apply three-sample tests to GTEx data in
[96]application to GTEx.
Application to GTEx
Having demonstrated QRscore’s error control capabilities and enhanced
power, we now apply it to the GTEx dataset. The GTEx project provides
RNA-seq data from bulk tissue samples, gathered from nearly 1,000
deceased donors across various age groups and both sexes. This dataset
covers a wide array of tissue types, enabling tissue-specific gene
expression analyses.
The flexibility of QRscore allows for the detection of tissue-specific
DDGs, as well as distinct gene sets that differ in mean expression
across tissues. We also show that these gene sets exhibit unique
functional effects.
QRscore-Var detects tissue-specific genes with changes in dispersion
We apply QRscore-Var across 33 tissues to identify DDGs across three
age groups: 20–39 (younger), 40–59 (middle), and 60–79 (older). We
detected distinct trends in variance shifts from younger to older age
groups ([97]Figure 3A). In tibial artery and skeletal muscle tissues,
most DDGs show an increase in variance from the middle to the older age
group. In contrast, in whole-blood tissue, we observe more DDGs showing
an increase in variance from the younger to the middle age group,
followed by a decrease from the middle to the older group. Our latter
finding agrees with a recent study on gene expression in blood samples
from twins, which reported a general decline in expression variability
with age.[98]^6 In total, 10,362 unique tissue-specific DDGs were
detected.
Figure 3.
[99]Figure 3
[100]Open in a new tab
QRscore-Var results on GTEx data
(A) A stacked bar plot illustrates the quantity of genes identified by
QRscore-Var, categorized by variance trends. For instance, the
“decrease-decrease" label indicates a progressive reduction in gene
expression variance from the youngest age group (20–39) through the
middle age group (40–59) and into the oldest age group (60–79).
(B) Histogram of the number of tissues in which each DDG detected only
by QRscore-Var is found. The red bars indicate DDGs found from more
than 3 tissues.
(C) UpSet plot showing overlaps of DDGs from the three-sample test and
corresponding pairwise comparisons.
(D) Bar plots display the log counts of DDGs identified in various
tissues from female and subsampled male individuals, respectively.
To better interpret these patterns, we perform pairwise two-sample
tests between age groups ([101]Figure 3C). In the tibial artery,
variance shifts are primarily driven by differences between the
youngest group and the other two. In the transverse colon, the oldest
group contributes most to the detected differences. These patterns
reflect how age range and tissue context shape dispersion trends.
The majority of DDGs are tissue specific, where each gene is detected
in only one tissue ([102]Figure 3B). However, a few genes show
differential dispersion across multiple tissues ([103]Table S5). For
example, EDA2R, a gene previously known to be associated with aging in
the lung[104]^21 and skeletal muscle,[105]^22 is detected as a DDG in
16 tissues. Additionally, PTCHD4, previously associated with aging in
the thyroid gland,[106]^23 is found to have differential dispersion in
16 tissues.
Sex can significantly influence the detection of variance shifts, and
the GTEx dataset includes more male than female samples. To address
this potential confounding factor, we stratify the data by sex and
subsample male samples to match the number of females, aiming for a
fair comparison ([107]Figure 3D). In tissues such as the tibial artery,
transverse colon, and whole blood, the numbers of DDGs identified in
male and female samples are similar, and there are sex-specific DDGs
discovered in each tissue. Interestingly, within adipose visceral
tissue, even after subsampling to equalize sample sizes, DDGs are
identified in male samples but not in female samples. These findings
provide insights into sex-specific DDGs during aging.
QRscore-Var identifies different sets of genes from QRscore-Mean
In parallel with QRscore-Mean, designed to identify genes with mean
expression changes using tailored weights, we used QRscore-Var for its
capacity to detect variance shifts. Although QRscore-Var detects fewer
genes overall compared to QRscore-Mean, it identifies 2,652 unique DDGs
across at least one tissue that QRscore-Mean does not capture (that is,
DDGs that are not also DEGs; see [108]Figure 4A), indicating its
ability to uncover different aspects of gene expression changes.
Figure 4.
[109]Figure 4
[110]Open in a new tab
Analysis of age-related differential gene expression across human
tissues using QRscore-Var and QRscore-Mean
(A) Pie charts visualizes the proportion of gene biotypes across
various tissues, analyzed through the QRscore-Var and QRscore-Mean
methods. The slices represent protein-coding genes, long noncoding RNAs
(lncRNAs), and small noncoding RNAs (sncRNAs), with immunoglobulin (Ig)
genes separated into their own category due to their substantial
proportion in some tissues. Pseudogenes are excluded. A white circle at
the center indicates the total gene count. Tissues are ranked by the
number of DDGs unique to QRscore-Var.
(B) Bar plots for selected tissues, highlighting the top 20 significant
genes only detected by QRscore-Var. These plots juxtapose the
significance levels (
[MATH: −log10
(pvalue) :MATH]
) of identified genes between the QRscore-Var and QRscore-Mean
analyses, with a focus on protein-coding genes.
A notable number of DDGs are identified only by QRscore-Var, but not
QRscore-Mean, in several tissues, in line with studies emphasizing
variance shifts in the transverse colon and thyroid.[111]^24 While most
DDGs are protein-coding genes, immunoglobulin (Ig) genes are also
prominently detected, especially in tissues like esophagus mucosa,
transverse colon, and lung ([112]Figure 4A). Their contribution
suggests a link between age-related variability and immune-response
alterations. These results indicate that variance shifts can reveal
biologically meaningful changes that complement mean-based analyses in
understanding aging.
We summarize the top 20 protein-coding DDGs with age-associated
variance shifts that are not classified as DEGs across six tissues (see
[113]Figure 4B). Variance shifts in genes like C-C motif chemokine
ligand 2 (CCL2) in whole blood and SAA1 in the transverse colon
underscore their roles in shaping the inflammatory landscape during
aging. CCL2 is pivotal in chronic immune responses during
aging,[114]^25 while SAA1 contributes to an inflammatory
microenvironment that varies across individuals, leading to diverse
impacts on epithelial cell function via MMP9 up-regulation in aging
tissues.[115]^26 In the tibial artery, ADRA2B regulates vascular tone,
with its variance shifts suggesting its importance in maintaining
vascular health with age. BNIP3 in skeletal muscle indicates diverse
mitochondrial balance and autophagy responses, signaling varied
susceptibility to age-related muscle degradation.[116]^27 Meanwhile,
SERPINB1 in the lung safeguards against protease-induced tissue damage,
with its variance shifts revealing intricate age-related changes that
impact lung maintenance and contribute to chronic lung
conditions.[117]^28 These gene expression shifts reveal the intricate,
tissue-specific regulatory changes that contribute to aging and its
associated dysfunctions.
We compile lists of genes related to human aging from three databases,
GenAge,[118]^29 LongevityMap,[119]^30 and CellAge[120]^31 to annotate
age-related genes. We rank the genes based on p values for each tissue
and age group, using QRscore-Var or QRscore-Mean tests, and apply
one-sided Wilcoxon rank-sum tests to evaluate the methods’
effectiveness at prioritizing aging-related genes. In [121]Table S4,
QRscore-Var shows statistical significance (
[MATH: p<0.05 :MATH]
) in ranking aging-related genes in twelve out of fourteen tissues.
However, the Wilcoxon test p values are less significant than
QRscore-Mean in most tissues and are not significant in lung and lower
leg skin (sun exposed). This suggests that, as the existing databases
are mostly annotated by mean expression shifts, they may miss
age-related DDGs that QRscore-Var can identify, and QRscore-Var
provides a complementary perspective on the genetic basis of aging.
QRscore-Var and QRscore-Mean identify genes that are functionally distinct
To investigate the biological functions of genes identified by QRscore,
we performed Gene Ontology (GO) enrichment analysis for the top 500
DDGs and DEGs ranked by QRscore-Var and QRscore-Mean, respectively. In
whole blood, while QRscore-Mean highlights fundamental processes such
as mRNA processing and RNA splicing, QRscore-Var reveals enrichment in
pathways related to chemotaxis, angiogenesis, cell motility, and
regulation of the ERK1/2 cascade—reflecting broader regulatory changes
beyond mean expression shifts ([122]Figures 5A and 5B). Specifically,
genes with large variance shifts, such as HRG, ANGPTL3, and SERPINE1,
are linked to angiogenesis, while FGB and FGG are involved in immune
signaling and blood coagulation ([123]Figure 5C). Additionally, the
analysis reveals notable variance shift among age groups in the CCL
gene family, such as CCL14, CCL18, and CCL20, which are enriched in the
chemokine-mediated signaling pathway. This further highlights the
intricate gene expression dynamics involved in immune responses and
inflammation.
Figure 5.
[124]Figure 5
[125]Open in a new tab
Comparative analysis of GO enrichment in whole-blood tissue based on
significant genes identified by QRscore-Var and QRscore-Mean
(A) A dot plot illustrates the enriched GO terms within the biological
process ontology for the top 500 genes significantly detected by
QRscore-Var.
(B) Similar enrichment analysis results for genes identified by
QRscore-Mean.
(C) A heat plot, generated using ClusterProfiler, shows the fold change
in expression variance between the 20–39 and 40–59 age groups for
significant genes associated with the top 5 enriched GO terms
identified by the QRscore-Var analysis.
(D and E) A dot plot (D) and a density plot (E) further investigate the
expression patterns of two highly coexpressed genes identified by
QRscore-Var as being involved in angiogenesis.
We further identify the top QRscore-Var genes associated with
significantly enriched GO terms across 20 tissues and compare their
QRscore-Mean rankings (see [126]Table S6). In whole blood, CCL2 is
uniquely identified as significant by QRscore-Var and exhibits
non-monotonic changes in dispersion with age. Among the top 25 DDGs
contributing to enriched GO terms in this tissue, 23 are ranked within
the top 200 by QRscore-Var but fall outside the top 1,000 under
QRscore-Mean, underscoring the distinct insights gained from analyzing
expression variability. We also identify coexpressed DDG pairs linked
to key biological processes—for example, SERPINE1 and ANGPTL4 are
strongly coexpressed (Spearman’s
[MATH: ρ=0.808 :MATH]
) and are implicated in angiogenesis and blood coagulation
([127]Figures 5D and 5E).
Across other tissues—including the tibial artery, transverse colon, and
skeletal muscle—QRscore-Var reveals biological processes and genes with
significant expression variability during aging. In the tibial artery,
variance shifts in DDGs highlight tissue-specific processes such as
sprouting angiogenesis and the positive regulation of cell migration.
Beyond these processes, SEMA6A, identified as a DDG but not a DEG, is
associated with the semaphorin-plexin signaling pathway and the
negative regulation of axon extension involved in axon guidance,
pointing to additional molecular mechanisms at play in the tibial
artery. In the transverse colon, DDGs are enriched for processes such
as digestion, steroid and hormone metabolism, and xenobiotic response,
highlighting their roles in maintaining gastrointestinal function,
regulating metabolic activity, and responding to environmental stimuli.
In skeletal muscle, coexpressed DDGs like COMP and FMOD underscore
their importance in collagen fibril organization and maintaining the
structural integrity of muscle tissue. The DDGs, enriched pathways, and
coexpression patterns described here are shown in [128]Tables S6 and
[129]S7.
Application to pseudo-bulked scRNA-seq data
Here, we apply QRscore to data from the Asian Immune Diversity Atlas
(AIDA),[130]^32 a large-scale scRNA-seq resource profiling peripheral
blood mononuclear cells (PBMCs) from healthy donors across Asia. Our
analysis focuses on individuals of East Asian identity to minimize
confounding due to population heterogeneity (see [131]STAR Methods).
The single-cell resolution of AIDA enables detailed exploration of
age-related regulatory changes in specific immune cell types, which are
often obscured in bulk tissue analyses. To increase detection power,
stabilize variance estimates, and enable robust population-level
statistical testing, we aggregate single-cell profiles into pseudo-bulk
samples by donor, age group, and cell type. This approach preserves
cell-type resolution while capturing inter-individual variability,
aligning with our goal of characterizing gene expression variance
across the population. Applying QRscore-Var across 25 immune cell types
and three age groups (20–34, 35–49, and 50+), we detect both
cell-type-specific and shared patterns of variance shifts, highlighting
distinct regulatory programs across the immune system.
We identify widespread shifts in gene expression variability with
aging. Among the analyzed cell types, naive thymus-derived
CD4-positive, alpha-beta T cells; central memory CD4-positive,
alpha-beta T cells; and CD16-positive, CD56-dim natural killer cells
exhibit the highest DDG counts, suggesting greater transcriptional
reprogramming in these cell types during aging ([132]Figure S5A). Most
DDGs exhibit monotonic variance trajectories across age groups, with
the majority showing either progressive increases (increase-increase)
or decreases (decrease-decrease) in dispersion from younger to older
individuals. A smaller subset of DDGs display non-monotonic patterns,
with predominantly increase-decrease trajectories and a minority
showing decrease-increase variance shifts. Although the majority of
DDGs are specific to individual cell types ([133]Figure S5B), several
genes, such as RBM38, RPL27A, and BRD4, are recurrently identified
across multiple immune cell-type populations. Notably, RBM38, a
p53-regulated RNA-binding protein involved in mRNA stability and
cellular senescence, is linked to aging and tumor suppression.[134]^33
Furthermore, pairwise comparisons reveal that variance shifts are often
strongest between the youngest and oldest groups ([135]Figure S5C). In
the top three cell types with the highest DDG counts, two-sample tests
show that more DDGs exhibit significant variance shifts in the 20–34
vs. 50+ and 35–49 vs. 50+ comparisons, indicating that the oldest age
group displays the most pronounced and distinct variance patterns.
Next, we identify genes with significant variance shifts detected in
both GTEx whole-blood and pseudo-bulked scRNA-seq analyses, with
overlaps of 11%–32% across immune cell types ([136]Figure S6A). Among
the shared genes, key regulators such as RBM38, RGS1, and the NR4A
family (NR4A1/2/3) emerge as recurrent across multiple cell types,
suggesting their involvement in a core aging-related regulatory
program. These genes are well-established mediators of immune
activation, stress responses, and inflammatory regulation—processes
tightly linked to immunosenescence.
To explore the functional relevance of these shared DDGs, we perform
ontology enrichment analysis on the overlap between pseudo-bulk and
bulk results for CD14-positive monocytes. The analysis reveals
enrichment for inflammatory cytokine responses, including pathways
related to interleukin-1, tumor necrosis factor (TNF), and chemokine
signaling—key processes in age-associated inflammation ([137]Figure
S6B). Genes such as ZC3H12A, CXCL8, CCL4, and CCL20 exemplify how
transcriptional variability in monocytes is linked to enhanced
responsiveness to pro-inflammatory cytokines and immune aging
([138]Figure S6C).
While some genes are shared between the GTEx whole-blood and
pseudo-bulked scRNA-seq analyses, many DDGs are uniquely identified in
the pseudo-bulk analysis, highlighting the added resolution provided by
cell-type-specific variance analysis. [139]Figure S5D shows bar plots
of the top 20 DDGs for the three cell types with the highest DDG
counts, including cell-type-specific DDGs shown in black. In naive
thymus-derived CD4-positive, alpha-beta T cells, FURIN and EXT1 suggest
modulation of cytokine signaling and extracellular immune interactions.
In CD16-positive, CD56-dim natural killer cells, variance shifts in
NFIL3 and IL1B reflect dynamic regulation of development and
inflammatory responses. In central memory CD4-positive, alpha-beta T
cells, CSNK1E and KLHL11 exhibit variability changes linked to immune
rhythm and tolerance. These genes highlight the unique contributions of
cell-type-specific analysis in uncovering regulatory mechanisms with
age-related variability that were missed in our bulk analysis.
Discussion
We introduced QRscore, a suite of rank-based non-parametric tests for
identifying DDGs and DEGs from bulk RNA-seq data. We showed, through
extensive simulations, that QRscore is robust at detecting both mean
and dispersion shifts, particularly under challenging and realistic
conditions where datasets have inflated zeros and outliers. Notably,
QRscore controls false discovery well and achieves high power even in
such settings. Applying QRscore to GTEx data, we find different DDGs
and DEGs in each tissue that are also functionally distinct,
implicating novel candidates for further biological investigation.
Model misspecification is a critical issue in RNA-seq data analysis
that potentially explains the inflation of false discoveries observed
in practice. For instance, a study[140]^34 showed that methods relying
on specific distributional assumptions, like NB or ZINB, identify
different sets of DEGs across mouse strains and human tissues, with
elevated FDRs in the presence of outliers. As a rank-based
non-parametric method, QRscore is less sensitive to data distribution
and effectively handles technical zeros and outliers. By being less
prone to model misspecification, QRscore yields more reliable and
replicable results, improving the likelihood of downstream experimental
validation.
As pointed out in the [141]introduction, gene expression variability
plays an important role in aging, adaptation to changing environments,
and disease susceptibility. Focusing on aging in our present work,
senescence-related changes in gene expression variability are often
driven by shifting regulatory efficiency and the accumulation of
epigenetic modifications,[142]^35 which may go unnoticed when focusing
solely on mean expression levels. QRscore fills this gap by detecting
variance shifts, revealing biologically relevant DDGs that highlight
aging-related processes like angiogenesis in whole blood. These
findings reinforce the value of analyzing gene expression variability
to uncover insights beyond mean-level changes.
We further demonstrated that QRscore can be applied to pseudo-bulked
scRNA-seq data, enabling analysis of cell-type-specific regulatory
variability. This is important because bulk tissue data, like those of
GTEx, aggregate signals from mixed cell populations, potentially
obscuring variability specific to a particular lineage. In our
analysis, many DDGs with cell-type-specific roles were uniquely
identified in pseudo-bulk data. Additionally, QRscore identified shared
signals across GTEx and pseudo-bulked scRNA-seq analyses, despite
differences in sequencing technology, cell population composition, cell
numbers per sample, and population ancestry. Together, these results
highlight QRscore’s ability to uncover meaningful, cell-type-specific
insights and robust signals of gene expression variability across
diverse contexts.
The future of differential expression analysis lies in detecting
diverse expression changes while accounting for confounders and
technical noise. Flexible methods like QRscore support this goal by
capturing distributional shifts. As tools improve, biologists will gain
deeper insight into how gene expression varies across myriad contexts.
Limitations of the study
One limitation of QRscore is that it tends to be conservative when the
sample size is small. This is an inherent constraint of rank-based
tests, which generally require larger sample sizes to ensure reliable
decisions. As large-scale datasets like GTEx become more common, future
work will extend QRscore to even larger and more diverse populations.
Resource availability
Lead contact
Requests for further information and resources should be directed to
and will be fulfilled by the lead contact, Yun S. Song
(yss@berkeley.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
* •
This paper analyzes existing, publicly available data. The
accession numbers for the datasets are listed in the [143]key
resources table. The data used to reproduce the figures in this
paper are available at
[144]https://doi.org/10.5281/zenodo.16416084.
* •
The R package for QRscore is available at
[145]https://github.com/songlab-cal/QRscore and Bioconductor. The
code for reproducing the figures in this paper is available at
[146]https://doi.org/10.5281/zenodo.16416084.
* •
Any additional information required to reanalyze the data reported
in this work paper is available from the lead contact upon request.
Acknowledgments