Graphical abstract graphic file with name fx1.jpg [32]Open in a new tab Highlights * • QRscore detects both variance and mean shifts in gene expression * • It controls FDR and achieves high power under noise and zero inflation * • Tissue-specific, age-related differentially dispersed genes are identified from GTEx data * • Cell-type-specific variance shifts associated with aging are revealed from scRNA-seq data Motivation The increasing availability of large-scale RNA-seq datasets from both bulk and single-cell sources increases our power to detect biologically meaningful shifts in mean gene expression and expression variability. However, existing methods often prioritize mean shifts or rely on strong parametric assumptions that may not hold across heterogeneous populations. This has created a gap in methods that adapt to the alternative hypothesis of interest while also controlling type I and type II errors without requiring parametric assumptions to formulate the null hypothesis. We thus develop QRscore, an R/Bioconductor software that leverages non-parametric statistics and asymptotic theory to obtain powerful and false discovery rate (FDR)-controlled flexible tests of differential expression. __________________________________________________________________ Zhou et al. present QRscore, a robust non-parametric framework for statistical hypothesis testing that detects shifts in both the mean and variance of gene expression. QRscore maintains high power and strict FDR control, even under noise and zero inflation. It reveals age-related variance changes in both bulk RNA-seq and pseudo-bulked single-cell RNA-seq data. Introduction In both bulk and single-cell genomics, differential expression analysis is a routine procedure for identifying genes whose expressions differ under varying conditions or treatments. To detect changes in average expression, widely used methods in bulk RNA sequencing (RNA-seq) differential analysis treat the dispersion of gene expression as a nuisance parameter, implementing techniques to shrink and stabilize the estimation of variance. The widespread use of these approaches can result in missed signals of gene expression variability that capture important biological mechanisms. In humans and C. elegans, variability indices are associated with gene regulation[33]^1^,[34]^2 and cell development,[35]^3 potentially explaining many biological phenomena, including adaptation to changing environments, disease susceptibility, and incomplete penetrance.[36]^4^,[37]^5 Recent studies also demonstrate that variability indices provide insights into aging,[38]^6^,[39]^7^,[40]^8 therefore underscoring the crucial need to incorporate changes in variability into tests of differential gene expression. To detect expression variability, previous studies on microarray datasets[41]^9^,[42]^10 applied traditional hypothesis testing methods, such as the F-test and Bartlett’s test. These tests are based on normal distribution assumptions of two samples and are designed to identify differences in standard deviation. To improve robustness, one approach is to analyze different measures for variability other than the standard deviation. Such measures include the median absolute deviation[43]^10 and the coefficient of variation.[44]^2 Another approach to ensure the robustness of statistical tests is to employ methods that are more robust to deviations from normality for two samples, such as Levene’s test and permutation tests.[45]^10 The method diffVar[46]^11 implements a modified version of Levene’s test that utilizes an empirical Bayes estimator instead of variance estimation for t statistics to obtain a more powerful test. However, this method maintains its parametric nature, making it susceptible to reduced performance due to deviations from the underlying assumptions. Additionally, non-parametric methods like the Kolmogorov-Smirnov (KS) test are not commonly used in the analysis of RNA-seq data because of their limited statistical power, especially when working with small sample sizes. Because bulk RNA-seq data often show overdispersion, particularly in gene expression, where the variance for most genes exceeds the mean, they are commonly modeled using the negative binomial (NB) or zero-inflated NB (ZINB) distribution. One method for detecting differences in dispersion applies generalized additive models for location, scale, and shape (GAMLSS),[47]^12 allowing the addition of regressions for dispersion parameters in the NB distribution into the main regression model and targeting changes in dispersion parameters through likelihood ratio tests.[48]^4 Another method, MDSeq,[49]^13 utilizes a reparameterization of the ZINB model to test the mean and dispersion separately from generalized linear models (GLMs). DiffDist[50]^14 uses a hierarchical Bayesian model based on the NB distribution, which can provide tests for differential dispersion and distribution for RNA-seq data. Although these current methods are based on NB distributions and are widely applied in scenarios of small sample sizes, recent research has shown that parametric methods are prone to inflated false discovery rates (FDRs) in both studies with small sample sizes[51]^15 and population-level RNA-seq studies with large sample sizes.[52]^16 Additionally, they are highly sensitive to preprocessing procedures[53]^17 and are not robust to model misspecification.[54]^18 Here, we introduce QRscore (quantile rank score), a flexible rank-based method for conducting robust and powerful two-sample tests. QRscore generalizes the Mann-Whitney test and capitalizes on the adaptability of the chosen test statistic to yield substantial statistical power across various scenarios, including the detection of shifts in mean and variance. QRscore offers various approaches for designing test statistics that are effective in detecting shifts in mean and dispersion. These include a versatile naive approach applicable to various cases beyond count data, as well as methods designed specifically for RNA-seq count data with either a low occurrence of zeros or a higher proportion of inflated zeros. Also, we extend QRscore to K-sample tests, where [MATH: K3 :MATH] . Through simulations that extensively compare QRscore with existing differential variability tests and differential centrality tests, we find that QRscore controls FDRs across diverse model specifications while simultaneously obtaining high statistical power. We use QRscore to detect both differentially dispersed genes (DDGs) with variance shifts and differentially expressed genes (DEGs) with mean shifts across multiple tissues in the Genotype-Tissue Expression (GTEx) project,[55]^19 showing that differential variability characterizes functionally distinct sets of genes from differential centrality. We also apply QRscore to pseudo-bulked single-cell RNA-seq (scRNA-seq) data, highlighting its potential to uncover cell-type-specific variance shifts that are obscured in bulk RNA-seq analyses. Our method contributes to the field of bioinformatics tools for genomics and can be adapted to other applications that require flexible non-parametric tests. Results Overview of QRscore QRscore is a rank-based hypothesis testing method that achieves robust false positive control through its non-parametric foundation while gaining power from parametric model-informed weighting ([56]Figure 1). Figure 1. [57]Figure 1 [58]Open in a new tab Overview of QRscore (1) Prefiltering and normalization: this initial data processing stage ensures quality and comparability of data. (2) Pooling and sorting: for gene g, expression levels [MATH: Xg={X1g< /mrow>,,Xk< /mi>g} :MATH] and [MATH: Yg={Y1g< /mrow>,,Yn< /mi>g} :MATH] are combined into a single ordered list. The rank of [MATH: Xig :MATH] , denoted as [MATH: RXig :MATH] , is determined after pooling and sorting the expression values of gene g. (3) Constructing the test statistic: a test statistic t is calculated using a weighted function based on modified score test functions, which enhances test power through parametric estimation and incorporates a generalized K-sample test. (4) Hypothesis and inference: this stage involves evaluating the null hypothesis for gene [MATH: g,H0g :MATH] , [MATH: Pg=Qg :MATH] , where [MATH: Pg :MATH] and [MATH: Qg :MATH] represent the distributions of gene expression samples [MATH: Xg :MATH] and [MATH: Yg :MATH] , respectively. p values are computed and adjusted using the Benjamini-Hochberg procedure to control the false discovery rate. (5) Identification of DDGs and DEGs: genes with significant expression-level differences between the conditions are identified, as represented in the QRscore-Var and QRscore-Mean diagrams. Details of QRscore-Var and QRscore-Mean are provided in the [59]STAR Methods and [60]Methods S1. Our null hypothesis posits that the two samples are drawn from the same distribution. Specifically, given a gene g, let [MATH: Xig :MATH] and [MATH: Yjg :MATH] denote the normalized gene expression for individuals i and j in the first and second conditions, respectively, where [MATH: i=1,,k :MATH] and [MATH: j=1,,n :MATH] . With samples [MATH: {Xig }i=1ki.i.d.Pg :MATH] and [MATH: {Yjg }j=1ni.i.d.Qg :MATH] , the null and alternative hypotheses for gene g are [MATH: H0g:Pg=Qg andH1g< /msub>:PgQ< /mi>g. :MATH] Our rank-based test statistic, [MATH: Tgw :MATH] , is computed as the normalized sum of weighted ranks of the [MATH: Xig :MATH] observations using the formula [MATH: Tgw=1n+ki=1kwg(RX ign+k), :MATH] (Equation 1) where [MATH: RXig :MATH] denotes the rank of [MATH: Xig :MATH] after pooling and sorting observations from both groups. Here, [MATH: wg :MATH] is a weight function chosen to maximize the test power against user-specific alternatives. Its dependence on g indicates that it can be chosen differently across different genes. While the space of all possible weight functions [MATH: wg :MATH] induces a family of test statistics of the form shown in [61]Equation 1, particular choices of [MATH: wg :MATH] lead to well-known tests. For instance, [MATH: wg(x)< /mo>=x :MATH] gives the Mann-Whitney test, while [MATH: wg(x)< /mo>=(x0.5)2 :MATH] corresponds to Mood’s test.[62]^20 Focusing on differential expression analysis applications, we design weight functions [MATH: wg :MATH] by composing the score function (the gradient of the log likelihood with respect to the parameters) with the inverse cumulative distribution function (see [63]STAR Methods). We derive these score functions selectively from either an NB or a ZINB distribution, with QRscore-Var targeting DDGs and QRscore-Mean targeting DEGs. The rationale behind this design is to overcome the lower power of traditional rank-based models; our approach significantly boosts power under correct model specification and achieves efficiency comparable to the likelihood ratio test. Despite the tailored weight design, our method remains distribution-free under the null hypothesis [MATH: H0g :MATH] , which assumes identical distributions ( [MATH: Pg=Qg :MATH] ). This is achieved because the rank-based inputs are uniformly distributed under the null, regardless of the underlying data distribution. This distribution-free property provides a significant advantage: it inherently reduces the impact of model misspecification and helps maintain strict control over false discoveries. Moreover, our method also extends to support K-sample tests (see [64]STAR Methods), broadening its applicability to multiple sample scenarios. Overall, QRscore is a versatile and powerful approach for detecting differential mean or dispersion in RNA-seq data, combining the robustness of non-parametric tests with the enhanced detection power of a special weight design. The details of the test statistic construction and its statistical inference are described in [65]STAR Methods and [66]Methods S1. Simulation results In this section, we evaluate existing methods for detecting genes with changes in dispersion or mean across groups using extensive simulations. To benchmark two-sample tests, we use mean and dispersion parameters estimated from GTEx whole-blood RNA-seq data as the baseline. We simulate fold changes in dispersion or mean across randomly selected DDGs, varying the magnitude across datasets. To test robustness, we introduce outliers and zero inflation in different proportions. We also vary sample sizes to reflect realistic experimental designs. Lastly, we compare QRscore’s performance in both two-sample and three-sample tests with varied fold changes and sample sizes. Simulation details are provided in the [67]STAR Methods. Key simulation results are qualitatively summarized in [68]Table 1. The hypothesis tests can be categorized into two groups: those that rely on specific parametric distributions and those that are non-parametric. For DDG analysis, parametric methods include GAMLSS, MDSeq, diffVar, and Levene’s test, while Mood and QRscore-Var are distribution free. Similarly, for DEG analysis, methods like DESeq2, edgeR, and limma-voom rely on parametric models, whereas Mann-Whitney and QRscore-Mean are distribution free. To analyze the sensitivity and robustness of these methods, metrics such as the area under the precision-recall curve (AUPRC), power, and FDR under different simulation settings are computed. Non-parametric methods generally showed greater robustness to outliers and zeros. Notably, QRscore-Var and QRscore-Mean achieved the highest detection power for identifying DDGs and DEGs across various scenarios. In contrast, parametric methods were more susceptible to the negative effects of outliers and inflated zeros, often leading to an inflated FDR. Detailed results are provided in the following subsections. Table 1. Summary table for simulation results Method Assumption Targeted direction Efficient in NB model Robust to outliers Robust to zeros QRscore non-parametric mean/variance yes yes yes GAMLSS NB variance yes no no MDSeq ZINB mean/variance no no yes Mood non-parametric variance no yes yes diffVar Gaussian variance yes yes no Levene Gaussian variance yes yes no DESeq2 NB mean yes no yes edgeR NB mean yes no yes limma-voom Gaussian mean yes yes no Mann-Whitney non-parametric mean yes yes yes [69]Open in a new tab QRscore is a solid contender in differential dispersion and differential expression analysis under correctly specified NB models In simulations based on NB distributions, QRscore-Var exhibits strong performance, slightly behind GAMLSS (setting 1 in [70]Table 2) and achieving comparably high detection power with FDR controlled ([71]Figure 2, “no zero inflation or outliers”). This close competition highlights QRscore-Var’s effectiveness in scenarios with correctly specified NB models. QRscore-Var also demonstrates higher power than naive methods like Mood’s test and Levene’s test ([72]Figure S1A). In smaller sample size settings (settings 2–3 in [73]Table 2), QRscore-Var consistently ranks as the second-best method after GAMLSS, reinforcing its reliability and showing notable improvements as sample sizes increase. Table 2. Average area under the AUPRC across all fold changes in all seeds for each simulation setting Setting [MATH: n1 :MATH] [MATH: n2 :MATH] α (outliers) β (zeros) DiffVar GAMLSS Levene MDSeq Mood QRscore-Var 1 113 249 0 0 0.774 0.798 0.775 0.701 0.749 0.788 2 50 100 0 0 0.615 0.642 0.610 0.558 0.568 0.634 3 50 50 0 0 0.518 0.575 0.525 0.516 0.479 0.527 4 113 249 0.05 0 0.762 0.753 0.763 0.627 0.745 0.783 5 113 249 0.1 0 0.754 0.714 0.756 0.573 0.745 0.780 6 113 249 0.2 0 0.728 0.640 0.731 0.481 0.737 0.772 7 113 249 0 0.05 0.741 0.736 0.753 0.699 0.742 0.783 8 113 249 0 0.1 0.716 0.689 0.736 0.699 0.741 0.780 9 113 249 0 0.2 0.650 0.594 0.692 0.698 0.732 0.772 10 113 249 0.1 0.1 0.694 0.629 0.714 0.569 0.734 0.772 11 50 100 0.1 0.1 0.529 0.429 0.555 0.399 0.550 0.613 12 50 50 0.1 0.1 0.427 0.363 0.472 0.367 0.458 0.503 [74]Open in a new tab The simulation results in this table are from scenarios where the dispersions of group 1 are greater than those of group 2 for true DDGs. For each setting, the greatest value is shown in bold. Sample size for group 1: [MATH: n1 :MATH] ; sample size for group 2: [MATH: n2 :MATH] ; proportion of genes with random outliers: α; and proportion of genes with random zeros: β. Sample sizes [MATH: n1=113 :MATH] and [MATH: n2=249 :MATH] are chosen from GTEx whole-blood tissue data, corresponding to the 40–49 and 60–69 age groups, respectively. When perturbations are introduced, QRscore outperforms the others, and it ranks second when model assumptions are perfectly met. For a table that shows cases where the dispersions of group 2 are greater than those of group 1 for true DDGs, see [75]Table S1. AUPRC, area under the precision-recall curve. Figure 2. [76]Figure 2 [77]Open in a new tab Simulation results for QRscore-Var benchmarked against other methods for detecting DDGs The true DDGs have greater dispersion in group 1 than in group 2. (A) Line plots of power when true false discovery proportions are less than 0.05 across different dispersion fold changes. (B) Line plots of FDR when adjusted p values are less than 0.05 across different dispersion fold changes. Both plots use [MATH: n1=113 :MATH] and [MATH: n2=249 :MATH] (GTEx derived). We present four different settings in the plot: baseline NB model, 10% genes with outliers, 10% genes with zero inflation, and a combination of both. For additional methods benchmark, see [78]Figure S1. Likewise, as shown in [79]Table S3, QRscore-Mean shows performance comparable to DESeq2 and edgeR in differential mean analysis, with a marginal difference in the AUPRC of about 1%. This slight difference in power and FDR is further illustrated in setting 1 of [80]Figure S2. QRscore-Var leads in robustness and effectiveness in gene expression variability analysis with outliers and excess zeros Next, we simulate data with random outliers (settings 4–6 in [81]Table 2), where QRscore-Var emerges as the best-performing method based on the average AUPRC. In contrast, methods like GAMLSS and MDSeq show much lower AUPRC scores, even with a small proportion of outliers, due to their reliance on the NB assumption, which makes them vulnerable to model misspecification. For instance, in setting 4 with 5% random outliers, the AUPRC for GAMLSS drops by 5.6% (from 0.798 to 0.753), while QRscore-Var experiences only a 0.6% drop (from 0.788 to 0.783). As the outlier proportion increases to 20%, GAMLSS’s score declines by 19.8% (to 0.640), whereas QRscore-Var remains stable, with only a 2.0% drop (to 0.772). MDSeq also shows considerable performance degradation, and both GAMLSS and MDSeq exhibit inflated FDRs in the presence of 10% outliers, as shown in [82]Figure 2B (“outliers"). In simulations with random zeros (settings 7–9 in [83]Table 2), QRscore-Var again demonstrates robust performance. GAMLSS and diffVar, by contrast, show inflated FDRs and reduced power under zero-inflated conditions ([84]Figure 2, “zero inflation"). Despite MDSeq being designed to accommodate technical zeros, QRscore-Var consistently outperforms it across all fold changes. Among non-parametric methods, QRscore-Var also achieves the highest power in zero-inflated settings ([85]Figure S1, “zero inflation”). Furthermore, QRscore maintains reliable power and FDR control under both up- and down-regulation of variance, underscoring its robustness across different shift directions (see [86]Tables 2 and [87]S1). To assess QRscore-Var’s robustness in settings involving small samples, we perform additional simulations with group sizes ranging from 5 to 15, incorporating noise through randomly inserted outliers and excess zeros. We observe that parametric methods such as MDSeq and GAMLSS exhibit severe FDR inflation as sample sizes decrease and fail to provide reliable discoveries with controlled FDRs. In contrast, QRscore-Var consistently maintains strict FDR control across all scenarios and achieves the highest power when group sizes are 8 or higher ([88]Figure S3A). We find that all methods have low power below group sizes of 8. Therefore, we recommend using at least 8 samples per group to ensure true discoveries while maintaining FDR control. Finally, to assess the robustness of QRscore-Var under additional model misspecification, we conduct simulations where gene expression follows bimodal distributions—an important scenario given that real RNA-seq data can exhibit multimodal patterns due to heterogeneous cell populations or mixed regulatory states. As shown in [89]Table S2, QRscore-Var consistently achieves the highest average AUPRC across all scenarios with outliers and excess zeros. These results demonstrate that QRscore-Var maintains strong FDR control and detection power even when the data distribution departs from standard unimodal assumptions. QRscore-Mean leads in robustness and effectiveness in differential gene expression analysis with outliers and excess zeros QRscore-Mean, similar to QRscore-Var, also shows the strongest performance in differential gene expression analysis under high proportions of random outliers (setting 6 in [90]Table S3). While edgeR and DESeq2 show marginally higher AUPRC scores with fewer outliers, they suffer from inflated FDRs, making their detection results less reliable in practice ([91]Figure S2B, “outliers"). Among the methods that control the FDR, QRscore-Mean performs the best in terms of the average AUPRC and detection power. QRscore-Mean outperforms other methods in differential gene expression analysis when there are excess zeros (settings 7–9 in [92]Table S3). This improved performance is due to its robustness in handling data distribution irregularities, particularly those characterized by excess zeros. While random zeros weaken the signal without inflating FDR, this loss of power impacts methods like edgeR and DESeq2, which lack explicit handling of zero inflation. We also conduct additional “small sample" simulations with group sizes ranging from 5 to 15, incorporating realistic noise through randomly inserted outliers and excess zeros. QRscore-Mean and the Mann-Whitney test preserve FDR control across all scenarios. Although DESeq2 and edgeR show slightly higher AUPRC values at very small sample sizes, their inflated FDR limits practical utility, as shown in [93]Figure S3B. In contrast, QRscore-Mean consistently offers more accurate results. Similar to variance shift testing, we recommend a minimum sample size of 8 per group to ensure sufficient detection power. Three-sample tests show superiority compared to pairwise two-sample tests in diverse simulations for QRscore-Var We investigated the performance of the three-sample QRscore test (see [94]STAR Methods) compared to the commonly used approach of sequential pairwise testing followed by p value aggregation. In practice, the user may prefer to test the heterogeneity of gene expression across multiple groups or conditions and obtain a single, well-calibrated p value for each gene. Results from [95]Figure S4 demonstrate that the three-sample QRscore test outperforms the pairwise two-sample tests in terms of the AUPRC, particularly when the sample sizes for different groups are not equal. Additional simulations explore scenarios where the three samples share the same sample size and cases with varying degrees of dispersion among all three groups. These simulations, which also include the presence of random outliers and zeros, show that the three-sample tests consistently provide superior AUPRC results, demonstrating better adaptability and precision even under model perturbations. Based on the simulation results, we will apply three-sample tests to GTEx data in [96]application to GTEx. Application to GTEx Having demonstrated QRscore’s error control capabilities and enhanced power, we now apply it to the GTEx dataset. The GTEx project provides RNA-seq data from bulk tissue samples, gathered from nearly 1,000 deceased donors across various age groups and both sexes. This dataset covers a wide array of tissue types, enabling tissue-specific gene expression analyses. The flexibility of QRscore allows for the detection of tissue-specific DDGs, as well as distinct gene sets that differ in mean expression across tissues. We also show that these gene sets exhibit unique functional effects. QRscore-Var detects tissue-specific genes with changes in dispersion We apply QRscore-Var across 33 tissues to identify DDGs across three age groups: 20–39 (younger), 40–59 (middle), and 60–79 (older). We detected distinct trends in variance shifts from younger to older age groups ([97]Figure 3A). In tibial artery and skeletal muscle tissues, most DDGs show an increase in variance from the middle to the older age group. In contrast, in whole-blood tissue, we observe more DDGs showing an increase in variance from the younger to the middle age group, followed by a decrease from the middle to the older group. Our latter finding agrees with a recent study on gene expression in blood samples from twins, which reported a general decline in expression variability with age.[98]^6 In total, 10,362 unique tissue-specific DDGs were detected. Figure 3. [99]Figure 3 [100]Open in a new tab QRscore-Var results on GTEx data (A) A stacked bar plot illustrates the quantity of genes identified by QRscore-Var, categorized by variance trends. For instance, the “decrease-decrease" label indicates a progressive reduction in gene expression variance from the youngest age group (20–39) through the middle age group (40–59) and into the oldest age group (60–79). (B) Histogram of the number of tissues in which each DDG detected only by QRscore-Var is found. The red bars indicate DDGs found from more than 3 tissues. (C) UpSet plot showing overlaps of DDGs from the three-sample test and corresponding pairwise comparisons. (D) Bar plots display the log counts of DDGs identified in various tissues from female and subsampled male individuals, respectively. To better interpret these patterns, we perform pairwise two-sample tests between age groups ([101]Figure 3C). In the tibial artery, variance shifts are primarily driven by differences between the youngest group and the other two. In the transverse colon, the oldest group contributes most to the detected differences. These patterns reflect how age range and tissue context shape dispersion trends. The majority of DDGs are tissue specific, where each gene is detected in only one tissue ([102]Figure 3B). However, a few genes show differential dispersion across multiple tissues ([103]Table S5). For example, EDA2R, a gene previously known to be associated with aging in the lung[104]^21 and skeletal muscle,[105]^22 is detected as a DDG in 16 tissues. Additionally, PTCHD4, previously associated with aging in the thyroid gland,[106]^23 is found to have differential dispersion in 16 tissues. Sex can significantly influence the detection of variance shifts, and the GTEx dataset includes more male than female samples. To address this potential confounding factor, we stratify the data by sex and subsample male samples to match the number of females, aiming for a fair comparison ([107]Figure 3D). In tissues such as the tibial artery, transverse colon, and whole blood, the numbers of DDGs identified in male and female samples are similar, and there are sex-specific DDGs discovered in each tissue. Interestingly, within adipose visceral tissue, even after subsampling to equalize sample sizes, DDGs are identified in male samples but not in female samples. These findings provide insights into sex-specific DDGs during aging. QRscore-Var identifies different sets of genes from QRscore-Mean In parallel with QRscore-Mean, designed to identify genes with mean expression changes using tailored weights, we used QRscore-Var for its capacity to detect variance shifts. Although QRscore-Var detects fewer genes overall compared to QRscore-Mean, it identifies 2,652 unique DDGs across at least one tissue that QRscore-Mean does not capture (that is, DDGs that are not also DEGs; see [108]Figure 4A), indicating its ability to uncover different aspects of gene expression changes. Figure 4. [109]Figure 4 [110]Open in a new tab Analysis of age-related differential gene expression across human tissues using QRscore-Var and QRscore-Mean (A) Pie charts visualizes the proportion of gene biotypes across various tissues, analyzed through the QRscore-Var and QRscore-Mean methods. The slices represent protein-coding genes, long noncoding RNAs (lncRNAs), and small noncoding RNAs (sncRNAs), with immunoglobulin (Ig) genes separated into their own category due to their substantial proportion in some tissues. Pseudogenes are excluded. A white circle at the center indicates the total gene count. Tissues are ranked by the number of DDGs unique to QRscore-Var. (B) Bar plots for selected tissues, highlighting the top 20 significant genes only detected by QRscore-Var. These plots juxtapose the significance levels ( [MATH: log10 (pvalue) :MATH] ) of identified genes between the QRscore-Var and QRscore-Mean analyses, with a focus on protein-coding genes. A notable number of DDGs are identified only by QRscore-Var, but not QRscore-Mean, in several tissues, in line with studies emphasizing variance shifts in the transverse colon and thyroid.[111]^24 While most DDGs are protein-coding genes, immunoglobulin (Ig) genes are also prominently detected, especially in tissues like esophagus mucosa, transverse colon, and lung ([112]Figure 4A). Their contribution suggests a link between age-related variability and immune-response alterations. These results indicate that variance shifts can reveal biologically meaningful changes that complement mean-based analyses in understanding aging. We summarize the top 20 protein-coding DDGs with age-associated variance shifts that are not classified as DEGs across six tissues (see [113]Figure 4B). Variance shifts in genes like C-C motif chemokine ligand 2 (CCL2) in whole blood and SAA1 in the transverse colon underscore their roles in shaping the inflammatory landscape during aging. CCL2 is pivotal in chronic immune responses during aging,[114]^25 while SAA1 contributes to an inflammatory microenvironment that varies across individuals, leading to diverse impacts on epithelial cell function via MMP9 up-regulation in aging tissues.[115]^26 In the tibial artery, ADRA2B regulates vascular tone, with its variance shifts suggesting its importance in maintaining vascular health with age. BNIP3 in skeletal muscle indicates diverse mitochondrial balance and autophagy responses, signaling varied susceptibility to age-related muscle degradation.[116]^27 Meanwhile, SERPINB1 in the lung safeguards against protease-induced tissue damage, with its variance shifts revealing intricate age-related changes that impact lung maintenance and contribute to chronic lung conditions.[117]^28 These gene expression shifts reveal the intricate, tissue-specific regulatory changes that contribute to aging and its associated dysfunctions. We compile lists of genes related to human aging from three databases, GenAge,[118]^29 LongevityMap,[119]^30 and CellAge[120]^31 to annotate age-related genes. We rank the genes based on p values for each tissue and age group, using QRscore-Var or QRscore-Mean tests, and apply one-sided Wilcoxon rank-sum tests to evaluate the methods’ effectiveness at prioritizing aging-related genes. In [121]Table S4, QRscore-Var shows statistical significance ( [MATH: p<0.05 :MATH] ) in ranking aging-related genes in twelve out of fourteen tissues. However, the Wilcoxon test p values are less significant than QRscore-Mean in most tissues and are not significant in lung and lower leg skin (sun exposed). This suggests that, as the existing databases are mostly annotated by mean expression shifts, they may miss age-related DDGs that QRscore-Var can identify, and QRscore-Var provides a complementary perspective on the genetic basis of aging. QRscore-Var and QRscore-Mean identify genes that are functionally distinct To investigate the biological functions of genes identified by QRscore, we performed Gene Ontology (GO) enrichment analysis for the top 500 DDGs and DEGs ranked by QRscore-Var and QRscore-Mean, respectively. In whole blood, while QRscore-Mean highlights fundamental processes such as mRNA processing and RNA splicing, QRscore-Var reveals enrichment in pathways related to chemotaxis, angiogenesis, cell motility, and regulation of the ERK1/2 cascade—reflecting broader regulatory changes beyond mean expression shifts ([122]Figures 5A and 5B). Specifically, genes with large variance shifts, such as HRG, ANGPTL3, and SERPINE1, are linked to angiogenesis, while FGB and FGG are involved in immune signaling and blood coagulation ([123]Figure 5C). Additionally, the analysis reveals notable variance shift among age groups in the CCL gene family, such as CCL14, CCL18, and CCL20, which are enriched in the chemokine-mediated signaling pathway. This further highlights the intricate gene expression dynamics involved in immune responses and inflammation. Figure 5. [124]Figure 5 [125]Open in a new tab Comparative analysis of GO enrichment in whole-blood tissue based on significant genes identified by QRscore-Var and QRscore-Mean (A) A dot plot illustrates the enriched GO terms within the biological process ontology for the top 500 genes significantly detected by QRscore-Var. (B) Similar enrichment analysis results for genes identified by QRscore-Mean. (C) A heat plot, generated using ClusterProfiler, shows the fold change in expression variance between the 20–39 and 40–59 age groups for significant genes associated with the top 5 enriched GO terms identified by the QRscore-Var analysis. (D and E) A dot plot (D) and a density plot (E) further investigate the expression patterns of two highly coexpressed genes identified by QRscore-Var as being involved in angiogenesis. We further identify the top QRscore-Var genes associated with significantly enriched GO terms across 20 tissues and compare their QRscore-Mean rankings (see [126]Table S6). In whole blood, CCL2 is uniquely identified as significant by QRscore-Var and exhibits non-monotonic changes in dispersion with age. Among the top 25 DDGs contributing to enriched GO terms in this tissue, 23 are ranked within the top 200 by QRscore-Var but fall outside the top 1,000 under QRscore-Mean, underscoring the distinct insights gained from analyzing expression variability. We also identify coexpressed DDG pairs linked to key biological processes—for example, SERPINE1 and ANGPTL4 are strongly coexpressed (Spearman’s [MATH: ρ=0.808 :MATH] ) and are implicated in angiogenesis and blood coagulation ([127]Figures 5D and 5E). Across other tissues—including the tibial artery, transverse colon, and skeletal muscle—QRscore-Var reveals biological processes and genes with significant expression variability during aging. In the tibial artery, variance shifts in DDGs highlight tissue-specific processes such as sprouting angiogenesis and the positive regulation of cell migration. Beyond these processes, SEMA6A, identified as a DDG but not a DEG, is associated with the semaphorin-plexin signaling pathway and the negative regulation of axon extension involved in axon guidance, pointing to additional molecular mechanisms at play in the tibial artery. In the transverse colon, DDGs are enriched for processes such as digestion, steroid and hormone metabolism, and xenobiotic response, highlighting their roles in maintaining gastrointestinal function, regulating metabolic activity, and responding to environmental stimuli. In skeletal muscle, coexpressed DDGs like COMP and FMOD underscore their importance in collagen fibril organization and maintaining the structural integrity of muscle tissue. The DDGs, enriched pathways, and coexpression patterns described here are shown in [128]Tables S6 and [129]S7. Application to pseudo-bulked scRNA-seq data Here, we apply QRscore to data from the Asian Immune Diversity Atlas (AIDA),[130]^32 a large-scale scRNA-seq resource profiling peripheral blood mononuclear cells (PBMCs) from healthy donors across Asia. Our analysis focuses on individuals of East Asian identity to minimize confounding due to population heterogeneity (see [131]STAR Methods). The single-cell resolution of AIDA enables detailed exploration of age-related regulatory changes in specific immune cell types, which are often obscured in bulk tissue analyses. To increase detection power, stabilize variance estimates, and enable robust population-level statistical testing, we aggregate single-cell profiles into pseudo-bulk samples by donor, age group, and cell type. This approach preserves cell-type resolution while capturing inter-individual variability, aligning with our goal of characterizing gene expression variance across the population. Applying QRscore-Var across 25 immune cell types and three age groups (20–34, 35–49, and 50+), we detect both cell-type-specific and shared patterns of variance shifts, highlighting distinct regulatory programs across the immune system. We identify widespread shifts in gene expression variability with aging. Among the analyzed cell types, naive thymus-derived CD4-positive, alpha-beta T cells; central memory CD4-positive, alpha-beta T cells; and CD16-positive, CD56-dim natural killer cells exhibit the highest DDG counts, suggesting greater transcriptional reprogramming in these cell types during aging ([132]Figure S5A). Most DDGs exhibit monotonic variance trajectories across age groups, with the majority showing either progressive increases (increase-increase) or decreases (decrease-decrease) in dispersion from younger to older individuals. A smaller subset of DDGs display non-monotonic patterns, with predominantly increase-decrease trajectories and a minority showing decrease-increase variance shifts. Although the majority of DDGs are specific to individual cell types ([133]Figure S5B), several genes, such as RBM38, RPL27A, and BRD4, are recurrently identified across multiple immune cell-type populations. Notably, RBM38, a p53-regulated RNA-binding protein involved in mRNA stability and cellular senescence, is linked to aging and tumor suppression.[134]^33 Furthermore, pairwise comparisons reveal that variance shifts are often strongest between the youngest and oldest groups ([135]Figure S5C). In the top three cell types with the highest DDG counts, two-sample tests show that more DDGs exhibit significant variance shifts in the 20–34 vs. 50+ and 35–49 vs. 50+ comparisons, indicating that the oldest age group displays the most pronounced and distinct variance patterns. Next, we identify genes with significant variance shifts detected in both GTEx whole-blood and pseudo-bulked scRNA-seq analyses, with overlaps of 11%–32% across immune cell types ([136]Figure S6A). Among the shared genes, key regulators such as RBM38, RGS1, and the NR4A family (NR4A1/2/3) emerge as recurrent across multiple cell types, suggesting their involvement in a core aging-related regulatory program. These genes are well-established mediators of immune activation, stress responses, and inflammatory regulation—processes tightly linked to immunosenescence. To explore the functional relevance of these shared DDGs, we perform ontology enrichment analysis on the overlap between pseudo-bulk and bulk results for CD14-positive monocytes. The analysis reveals enrichment for inflammatory cytokine responses, including pathways related to interleukin-1, tumor necrosis factor (TNF), and chemokine signaling—key processes in age-associated inflammation ([137]Figure S6B). Genes such as ZC3H12A, CXCL8, CCL4, and CCL20 exemplify how transcriptional variability in monocytes is linked to enhanced responsiveness to pro-inflammatory cytokines and immune aging ([138]Figure S6C). While some genes are shared between the GTEx whole-blood and pseudo-bulked scRNA-seq analyses, many DDGs are uniquely identified in the pseudo-bulk analysis, highlighting the added resolution provided by cell-type-specific variance analysis. [139]Figure S5D shows bar plots of the top 20 DDGs for the three cell types with the highest DDG counts, including cell-type-specific DDGs shown in black. In naive thymus-derived CD4-positive, alpha-beta T cells, FURIN and EXT1 suggest modulation of cytokine signaling and extracellular immune interactions. In CD16-positive, CD56-dim natural killer cells, variance shifts in NFIL3 and IL1B reflect dynamic regulation of development and inflammatory responses. In central memory CD4-positive, alpha-beta T cells, CSNK1E and KLHL11 exhibit variability changes linked to immune rhythm and tolerance. These genes highlight the unique contributions of cell-type-specific analysis in uncovering regulatory mechanisms with age-related variability that were missed in our bulk analysis. Discussion We introduced QRscore, a suite of rank-based non-parametric tests for identifying DDGs and DEGs from bulk RNA-seq data. We showed, through extensive simulations, that QRscore is robust at detecting both mean and dispersion shifts, particularly under challenging and realistic conditions where datasets have inflated zeros and outliers. Notably, QRscore controls false discovery well and achieves high power even in such settings. Applying QRscore to GTEx data, we find different DDGs and DEGs in each tissue that are also functionally distinct, implicating novel candidates for further biological investigation. Model misspecification is a critical issue in RNA-seq data analysis that potentially explains the inflation of false discoveries observed in practice. For instance, a study[140]^34 showed that methods relying on specific distributional assumptions, like NB or ZINB, identify different sets of DEGs across mouse strains and human tissues, with elevated FDRs in the presence of outliers. As a rank-based non-parametric method, QRscore is less sensitive to data distribution and effectively handles technical zeros and outliers. By being less prone to model misspecification, QRscore yields more reliable and replicable results, improving the likelihood of downstream experimental validation. As pointed out in the [141]introduction, gene expression variability plays an important role in aging, adaptation to changing environments, and disease susceptibility. Focusing on aging in our present work, senescence-related changes in gene expression variability are often driven by shifting regulatory efficiency and the accumulation of epigenetic modifications,[142]^35 which may go unnoticed when focusing solely on mean expression levels. QRscore fills this gap by detecting variance shifts, revealing biologically relevant DDGs that highlight aging-related processes like angiogenesis in whole blood. These findings reinforce the value of analyzing gene expression variability to uncover insights beyond mean-level changes. We further demonstrated that QRscore can be applied to pseudo-bulked scRNA-seq data, enabling analysis of cell-type-specific regulatory variability. This is important because bulk tissue data, like those of GTEx, aggregate signals from mixed cell populations, potentially obscuring variability specific to a particular lineage. In our analysis, many DDGs with cell-type-specific roles were uniquely identified in pseudo-bulk data. Additionally, QRscore identified shared signals across GTEx and pseudo-bulked scRNA-seq analyses, despite differences in sequencing technology, cell population composition, cell numbers per sample, and population ancestry. Together, these results highlight QRscore’s ability to uncover meaningful, cell-type-specific insights and robust signals of gene expression variability across diverse contexts. The future of differential expression analysis lies in detecting diverse expression changes while accounting for confounders and technical noise. Flexible methods like QRscore support this goal by capturing distributional shifts. As tools improve, biologists will gain deeper insight into how gene expression varies across myriad contexts. Limitations of the study One limitation of QRscore is that it tends to be conservative when the sample size is small. This is an inherent constraint of rank-based tests, which generally require larger sample sizes to ensure reliable decisions. As large-scale datasets like GTEx become more common, future work will extend QRscore to even larger and more diverse populations. Resource availability Lead contact Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Yun S. Song (yss@berkeley.edu). Materials availability This study did not generate new unique reagents. Data and code availability * • This paper analyzes existing, publicly available data. The accession numbers for the datasets are listed in the [143]key resources table. The data used to reproduce the figures in this paper are available at [144]https://doi.org/10.5281/zenodo.16416084. * • The R package for QRscore is available at [145]https://github.com/songlab-cal/QRscore and Bioconductor. The code for reproducing the figures in this paper is available at [146]https://doi.org/10.5281/zenodo.16416084. * • Any additional information required to reanalyze the data reported in this work paper is available from the lead contact upon request. Acknowledgments