Abstract
Single-cell RNA sequencing maps gene expression heterogeneity within a
tissue. However, identifying biological signals in this data is
challenging due to confounding technical factors, sparsity, and high
dimensionality. Data factorization methods address this by separating
and identifying signals in the data, such as gene expression programs,
but the resulting factors must be manually interpreted. We developed
Single-Cell Interpretable REsidual Decomposition (sciRED) to improve
the interpretation of scRNA-seq factor analysis. sciRED removes known
confounding effects, uses rotations to improve factor interpretability,
maps factors to known covariates, identifies unexplained factors that
may capture hidden biological phenomena, and determines the genes and
biological processes represented by the resulting factors. We apply
sciRED to multiple scRNA-seq datasets and identify sex-specific
variation in a kidney map, discern strong and weak immune stimulation
signals in a PBMC dataset, reduce ambient RNA contamination in a rat
liver atlas to help identify strain variation, and reveal rare cell type
signatures and anatomical zonation gene programs in a healthy human
liver map. These analyses demonstrate that sciRED is useful for
characterizing diverse biological signals within scRNA-seq data.
Subject terms: Statistical methods, Data mining, Machine learning
Single-cell RNA sequencing maps tissue-level gene expression
heterogeneity but faces challenges in interpreting biological signals
due to noise and technical confounders. Here, the authors present
sciRED, a method that enhances the interpretation of scRNA-seq
factorization by linking factors to known covariates and hidden
biological phenomena.
Introduction
Single-cell RNA sequencing (scRNA-seq) maps heterogeneity in gene
expression in large cell populations. This heterogeneity can be
attributed to various factors, both observed and hidden^[30]1. Sources
of this heterogeneity can be broadly categorized as sample-level
factors, such as experimental conditions or patient age and weight;
cell-level covariates, such as cell type identity, cell cycle stage, and
library size; or gene-level attributes, such as pathways. Each of these
categories can be further classified by its biological or technical
nature and by whether it is known at experimental design, observed in
the data, or inferred from the data. Despite their importance,
identifying and interpreting factors in scRNA-seq data remains
challenging due to its noise, sparsity, and high dimensionality^[31]2.
Matrix factorization (or decomposition) can identify multiple factors
(signals) in a cell by gene scRNA-seq data matrix, each one capturing a
unique pattern of covarying gene expression values that may represent a
covariate, such as a set of genes affected by batch or a biological
gene expression program^[32]3. Many factorization methods
exist^[33]4–[34]13; however, they generally only identify factors and
leave interpretation to the user. Some recent methods, such as
scLVM^[35]14, fscLVM^[36]15, and Spectra^[37]16, can automatically
interpret factors by matching them to pre-annotated gene sets (e.g.
pathways), but do not consider a wider range of covariates important in
single-cell genomics data, such as sample-level attributes. To address
this challenge, we developed Single-Cell Interpretable REsidual
Decomposition (sciRED) to aid in scRNA-seq factor analysis and
interpretation.
sciRED enables factor discovery and interpretation in the context of
known covariates and biological gene expression programs. sciRED
provides an intuitive visualization of the associations between factors
and covariates, along with a set of interpretability metrics for all
factors. These metrics identify clear factor-covariate pairs as well as
factors not matching known covariates but which are potentially
interpretable. Factor-correlated genes and pathways also aid in
interpretation. We apply sciRED to diverse datasets including the
scMixology^[38]17 benchmark and four biological single-cell atlases
that contain known factors. We showcase its application in identifying
cell identity programs and sex-specific variation in a kidney map,
discerning strong and weak immune stimulation signals in a PBMC
dataset, reducing ambient RNA contamination in a rat liver atlas to
unveil strain-related variation, and revealing hidden biology
represented by rare cell type signatures and anatomical zonation gene
programs in a healthy human liver map. These analyses demonstrate the
utility of sciRED for characterizing diverse biological signals within
scRNA-seq datasets.
Results
To improve the efficiency and interpretability of matrix decomposition
of single-cell RNA-seq data, sciRED uses four steps to generate and
characterize the resulting factors. These steps are: preprocessing data
and applying matrix decomposition, evaluating factor-covariate
relationships, examining unexplained factors, and determining
biological interpretation of selected factors.
sciRED begins with the input cell by gene matrix and proceeds to: (1)
Remove known confounding effects, factorize the residual matrix to
identify additional factors not accounted for by confounding effects,
and use rotations to maximize factor interpretability; (2)
Automatically match factors with covariates of interest; (3) Evaluate
unexplained factors that may indicate hidden biological phenomena; (4)
Determine the genes and biological processes represented by factors of
interest (Fig. [39]1A). In the factor discovery phase, sciRED removes
user-defined unwanted technical factors, such as library size and
sample or protocol, as covariates within a Poisson generalized linear
model (GLM)^[40]18. This regresses out the covariates and produces
Pearson residuals. These residuals are then factored using Principal
Component Analysis (PCA)^[41]5 with varimax rotation^[42]19 to enhance
interpretability (Fig. [43]1B). To identify factors that explain a
specific covariate, sciRED attempts to find matching labels (e.g. cell
types, covariates of interest) for each factor using an ensemble
classifier (see methods). Each cell is represented as a vector of
factor weights, which are classified to predict covariate labels (e.g.
“female” or “male” levels of the covariate “biological sex”) using
four machine learning classifiers (logistic regression^[44]20, linear
classifier/area under the curve (AUC)^[45]21,[46]22, decision
tree^[47]23, and extreme gradient boosting (XGB)^[48]24). Each factor
is a feature and feature selection across an ensemble of these
classifiers generates factor-covariate-level association (FCA) scores,
which are visualized as a heatmap. To identify high-scoring
factor-covariate pairs, we use three approaches. First, we use visual
inspection of the FCA heatmap. Second, we compare the association
scores of each factor to the background distribution of values in the
FCA table to automatically highlight significant associations (see
Methods, Fig. [49]1C). Third, we calculate a specificity measure to
determine whether an explained factor captures the gene expression
program related to a unique covariate or is associated with multiple
covariate levels simultaneously. Interpretation is often easier when a
factor matches only one covariate level.
Fig. 1. sciRED overview.
A sciRED comprises four main steps: factor discovery, factor
interpretation, factor evaluation and biological characterization. B In
the factor discovery step, a Poisson generalized linear model is
applied to the data to remove technical covariates, followed by
extraction of residuals and factorization using PCA. The resulting
score and loading matrices are then rotated for enhanced
interpretability. The score matrix represents the projection of the
original data onto the new factor space, illustrating the relationship
between cells and factors. Each entry in this matrix reflects how much
each cell contributes to the factors. The loading matrix contains the
weights or coefficients that define the factors as linear combinations
of the original genes. These weights can be used to rank genes
according to their contribution to each factor, facilitating further
interpretation of the factors. Models and algorithms are color-coded in
blue, while input, output, and intermediate data elements are
represented as white boxes. C Factor interpretation uses an ensemble
classifier to match factors with given covariates, generating a
Factor-Covariate Association (FCA) table. Covariate-matched factors are
identified by thresholding FCA scores based on the distribution of all
FCA scores. Unannotated factors may capture novel biological processes
or other covariates. D Factor-interpretability scores (FIS) are
computed for each factor. E The top genes and enriched pathways
associated with a selected factor are identified for manual
interpretation.
The third step of sciRED evaluates unexplained factors for potential
interpretability using three types of metrics: separability, effect
size, and homogeneity, which are presented as a factor-interpretability
score (FIS) table (Figs. [52]1D, [53]S1, see Methods). For unexplained
factors that should be prioritized for follow-up exploration because
they may contain a hidden signal, ideally: cells should be scored
highly by the factor (measured by effect size); there should be two
populations of cells, one that scores highly and one that does not (a
bimodal distribution, measured by separability); and the potential new
signal should be strong in cells ranked highly by that factor, while
other signals (e.g., known technical covariates) should be evenly
distributed across this ranking (measured by homogeneity). The fourth
step evaluates the factor loadings (gene lists) for biological signals.
This involves manually investigating top genes and enriched pathways
associated with the factor (Fig. [54]1E).
sciRED improves factor discovery
We evaluated sciRED’s factor discovery alongside eight other factor
analysis methods (PCA, ICA^[55]8, NMF^[56]6, scVI^[57]25,
Zinbwave^[58]9, cNMF^[59]12, scCoGAPs^[60]13, and Spectra^[61]16) on
two datasets: the scMixology^[62]17 benchmark dataset and a stimulated
PBMC dataset^[63]26, representing distinct biological and technical
contexts (Figs. [64]S2-[65]S5). The scMixology dataset includes
scRNA-seq profiles from a mixture of three cancer cell lines (H2228,
H1975, HCC827), sequenced across three batches with different library
preparation platforms (Drop-seq, Celseq2, and 10x Genomics)
representing controlled single-cell data with well-defined signals. In
contrast, the stimulated PBMC dataset consists of 10x Genomics
droplet-based scRNA-seq data from eight lupus patients, collected
before and after a 6-hour treatment with interferon (IFN)-β, providing
a real-world example of biological single-cell data.
Four metrics were used to compare performance across methods: (1) the
number of entangled covariates (indicating covariates matched to
multiple factors); (2) the number of factors split across multiple
covariates; (3) the number of covariate levels without an associated
factor; and (4) runtime. Lower values for all metrics indicate stronger
performance.
sciRED outperformed other methods, with the exception of scCoGAPs, in
minimizing both entangled covariates and factors distributed across
multiple covariates in both datasets. All methods showed comparable
results in associating covariate levels with factors, especially in
capturing biological signals. However, scCoGAPs missed identifying the
megakaryocyte cell type in the PBMC dataset. The runtime analysis
highlights sciRED’s scalability: it was among the four fastest methods
on scMixology and second only to PCA on the larger PBMC dataset. In
contrast, scCoGAPs and Zinbwave were particularly slow, each running
over 35 hours on scMixology; on the PBMC dataset, scCoGAPs required
139 hours, whereas Zinbwave crashed after 24 hours due to RAM
limitations. All methods were run single-threaded with default
parameter settings on a workstation with an Intel 3.0 GHz Xeon
E5-2687W CPU and 64 GB of RAM.
Notably, sciRED’s runtime scales linearly with both cell and gene
counts (Fig. [66]S6). This was shown using a human lung transplant
dataset^[67]27 (over 108,000 cells) and subsampling of both cell and
gene counts. Runtime was tested across various cell numbers (20 K,
40 K, 60 K, 80 K, and 100 K cells with genes fixed at 2000) and gene
counts (500, 2000, and 5000 highly variable genes with cells fixed at
40 K).
sciRED finds cell type identity programs and sex-specific processes in a
human healthy kidney map
We applied sciRED to a single-cell map of healthy human kidneys
(Fig. [68]2) obtained from 19 living donors, with nearly equal
contributions from both females and males (10 female, 9 male)^[69]28.
sciRED successfully identified cell type identity programs for diverse
kidney cell types (Fig. [70]2A, B, Fig. [71]S7, Source Data) as well as
a factor representing sex-specific differences. For instance, factors
F1 and F13 capture the identity programs of proximal tubules and
endothelial cells, respectively, while F18 represents sex-specific
differences (Fig. [72]2C). Factor F18 has a high association score with
male and female covariates based on the FCA heatmap, indicating it has
captured sex variation. Further evaluation of the distribution of
factor F18 across different cell types reveals a cell-type dependent
distribution of sex-related variation, with a strong representation in
the proximal tubule cell type population (Fig. [73]2D). The FIS heatmap
indicates that F18 is highly specific and shows a low homogeneity score
across sex covariates, confirming that cells from male and female
individuals are not uniformly distributed along F18 (Fig. [74]2E).
Consistent with the original study, pathway analysis shows an increase
in processes related to aerobic metabolism (such as aerobic
respiration, oxidative phosphorylation, tricarboxylic acid (TCA) cycle,
and electron transport chain) in males (Fig. [75]2F, G). These findings
align with the higher basal respiration and ATP-linked respiration
processes in males, as functionally validated in the original
study^[76]28. This highlights sciRED’s ability to identify cell type
identity signatures and sample covariate-specific variation.
Fig. 2. Deciphering cell type identity programs and sex-specific processes in
a human healthy kidney map.
We applied sciRED to a single-cell healthy human kidney map obtained
from 19 living donors, with similar contributions from both females and
males. A FCA heatmap with covariate levels as rows and associated
factors as columns. Arrow highlights factor F18, which demonstrates a
high association score with female and male covariates. B FCA score
distribution for all factors. The red line indicates the automatically
defined threshold used to identify significant factor-covariate matches
in the heatmap in (A). Distribution of cells over factor F1 and factor
F18 colored based on C sex and D cell type covariates indicates the
predominant representation of sex-related variation within the proximal
tubule cell type. E FIS heatmap representing the interpretability
scores of the selected factors. F Pathway analysis results on the top
positive (female) and negative (male) loaded genes of factor F18. G Top
20 positively and negatively loaded genes of factor F18.
sciRED identifies stimulation signals across lymphoid and myeloid cells in a
stimulated PBMC dataset
We used sciRED to analyze a benchmark dataset comprising 10x Genomics
droplet-based scRNA-seq PBMC data from eight lupus patients before and
after a 6-hour treatment with interferon (IFN)-β^[79]26 (Fig. [80]3).
Our sciRED analysis successfully identifies cell type identity programs
and stimulation-to-control axes of variation (Fig. [81]3A, Source
Data). We identified two factors, F9 and F2, which capture stimulation
signals (Fig. [82]3B). The FIS table indicates a high bimodality
(separability) score for both factors, a high effect size for F2, and
high specificity for F9 (Fig. [83]3C), illustrating the utility of the
FIS table for capturing biological signals. Differential evaluation of
the cell type representation within the two factors reveals a
predominance of lymphoid and myeloid cells for F9 and F2, respectively
(Fig. [84]3D). In particular, F9 represents a stronger overall
stimulation signal and stimulation in lymphoid populations, while F2
captures a stimulation signal in myeloid populations. Examination of
the cell type distribution across F9 and F2 highlights distinct
clustering between the non-stimulated control and stimulated groups
along both factors (Fig. [85]3E, H). Immune response-related genes and
biological processes, including interferon signaling, stress or
pathogen response, and cytokine signaling, are enriched in both factors
(Fig. [86]3F, G, I, J). This indicates sciRED’s ability to identify
cell-type-specific gene expression activity programs, including
stimulated cellular states.
Fig. 3. Identifying interferon-β stimulation signals across lymphoid and
myeloid cells in a PBMC dataset.
We used sciRED to analyze PBMC scRNA-seq from eight lupus patients
before and after an interferon-β (IFN-β) treatment. A FCA heatmap
displaying covariate levels as rows and associated factors as columns.
Arrows highlight factors F9 and F2, which capture stimulation signals.
B Sorted factors based on FCA score values for the stimulated covariate
level. F9 and F2 are the top factors associated with the stimulation
covariate. C FIS heatmap illustrating the interpretability scores of
the selected factors. Red boxes highlight factors capturing IFN-β
stimulation. D Cell distribution over factors F9 and F2, colored based
on cell type covariates. The red dashed line represents the diagonal
line passing through the origin. Solid lines are the regression lines
that fit each cell type. Lines with larger slopes than the diagonal
represent cell types with higher association with F9, while lines with
smaller slopes represent cell types more associated with F2. E
Distribution of cells over factors F9 and F1 colored based on
stimulation/control covariates, revealing distinct clustering between
control and stimulated groups along the F9 axis. F Pathway analysis
based on the top-loaded genes of factor F9. G Top 30 positively loaded
genes of factor F9. H Distribution of cells over factors F1 and F2
colored by stimulation state covariate (“stimulated” or
“control”/non-stimulated). I Pathway analyses based on the top-loaded
genes of factor F2. J Top 30 positively loaded genes of factor F2.
sciRED alleviates ambient RNA contamination for group-based comparison
We applied sciRED to a healthy rat liver atlas mapped in Dark Agouti
(DA) and Lewis (LEW) strains^[89]29. This atlas contains
hepatocyte-derived ambient RNA contamination, a known artifact likely
caused by fragile hepatocytes leaking RNA into the cell homogenate
before sequencing^[90]30. sciRED identified factors with cell type
identities, as expected, along with two factors capturing
strain-associated variation (F6 and F20) (Fig. [91]4A–F and [92]S8A,
Source Data). The FIS table shows that factor F6 has high specificity
and separability, as well as low strain homogeneity (Fig. [93]4B).
Factor F20 is the second strongest factor following F6 (Fig. [94]4B,
C). F6 and F20 represent strain differences within the hepatocyte
(Fig. [95]4D) and myeloid cell types (Figs. [96]4E, F, [97]S8B and C),
respectively. Standard differential expression analysis is not able to
identify this signal due to strong ambient RNA contamination^[98]29
(Fig. [99]4G). For instance, four hepatocyte genes—Fabp1, Tmsb4x, Fth1,
Apoc1—are among the top differentially expressed genes within the
myeloid cell type of both DA and LEW strains and are estimated to be
ambient RNA by SoupX^[100]31 (Fig. [101]4G, Source Data, see Methods).
However, the top 50 myeloid strain-associated genes identified by
sciRED F20 are free of such contaminants (Fig. [102]4H, Source Data).
These myeloid-specific strain variations were experimentally validated
in the original study. This shows the utility of sciRED in deconvolving
biological signals from contamination to facilitate factor
interpretation and group-based comparisons.
Fig. 4. sciRED identifies strain-based variation despite ambient RNA
contamination in a rat liver map.
We applied sciRED to a healthy atlas of the rat liver from two rat
strains, Dark Agouti (DA) and Lewis (LEW), containing hepatocyte-derived
ambient RNA contamination. A FCA heatmap displaying covariate levels as
rows and associated factors as columns. B FIS heatmap illustrating the
interpretability scores of the selected factors. C Factors F6 and F20
are most associated with strain variation. D Distribution of cells over
factors F6 vs. F1 colored by strain and E factors F20 vs. F1 colored by
strain and F by cell type, indicating that factor F20 captures strain
variation within the myeloid population. G Volcano plot of differential
expression between strains within myeloid cells. Red dots are
hepatocyte-derived ambient RNA transcripts as estimated by SoupX. Four
hepatocyte genes—Fabp1, Tmsb4x, Fth1, and Apoc1—are labeled among the
top differentially expressed genes within the myeloid cell type of both
DA and LEW strains. These genes are among the top 50 ambient RNA
transcripts derived from SoupX in all four rat liver samples (see
Source Data). H Top 50 myeloid strain-associated genes identified by
sciRED factor 20, free of contamination from hepatocyte-derived ambient
RNA.
Exploring hidden biology in the healthy human liver atlas captured by
unannotated factors
We applied sciRED to a healthy human liver atlas^[105]32 and explored
the unannotated biological signals (Fig. [106]S9, Source Data).
sciRED’s FCA heatmap shows that most signals correspond to liver cell
type identity programs (Fig. [107]5A). For example, F3 most strongly
captures the cell identity of liver sinusoidal endothelial cells
(LSECs), while F4 captures non-inflammatory macrophages. The FCA
heatmap highlights nine factors (1, 10, 19, 20, 22, 26, 28, 29, 30)
that are not associated with annotated cell types (Fig. [108]5B). To
discern whether these factors represent technical or biological
signals, we calculated the correlation between each factor and three
major technical covariates—library size, number of expressed genes, and
percentage of mitochondrial gene expression—as well as cell cycle (S
and G2M) phase scores (Fig. [109]5C). Out of the nine factors, F1, F20,
and F22 are correlated with technical covariates (R > 0.45). The
remaining factors include five (F10, F19, F26, F28, and F30) that may
represent unannotated biological signals, as well as F29, which shows a
strong correlation with the cell cycle signature, indicating it might
have captured a proliferative cell population. Evaluating the FIS table
reveals that, among the six factors analyzed, all except F19 are
well-mixed based on the sample covariate, suggesting a possible
sample-specific effect for F19. F10 and F29 stand out with higher
bimodality scores, indicating they might more effectively separate a
subpopulation of cells. Factors F26 and F28 exhibit a low effect size,
indicating weak signals (Fig. [110]5D). Factor F10 exhibits significant
enrichment of a subpopulation of cells within the cholangiocyte cluster
(Fig. [111]5E). The top loaded genes within this factor include MUC5A,
MUC1, TFF1, LGALS2, SLPI, TFF2, TFF3, KRT19, LGALS4, and PIGR. The
enrichment of biological processes such as ion export, regulation of
transport activity, localization, and response to ER stress strongly
suggests that F10 has captured the cellular program of a rare
population of mucus-producing cholangiocytes that was not identified in
the original study but was reported in a more comprehensive subsequent
human liver single cell map^[112]33. Factor F30 is enriched within a
subpopulation of cells labeled as plasma cells in the original map
(Fig. [113]5F). The top genes in the loading vector include IgK+IgG+ B
cell marker genes, including IGHG4, IGHG1, IGHG3, IGKV1-12, HERPUD1,
IGKV1-18, IGKC, SSR4, IGKV3-20, and IGKV3-21 (Fig. [114]5F). Pathway
enrichment analysis highlights biological processes such as protein
folding and maturation, antibody-mediated complement activation,
protein transport, and ER to cytosol transport (Fig. [115]5F). These
findings suggest that F30 may represent an antibody-secreting IgK+IgG+
B cell^[116]33 gene expression program, not described in the original
study due to its low frequency, but later captured in an expanded human
liver single-cell map^[117]33. Factor F19 is enriched in two hepatocyte
clusters, Hep1 and Hep2, which were annotated as pericentrally zoned
hepatocytes in the original study (Fig. [118]5G). The presence of
pericentral markers such as CYP3A4, GLUL, and OAT, along with
enrichment in biological processes such as xenobiotic metabolic
processes, small molecule metabolism, and lipid and fatty acid
metabolism, suggests that F19 captures the anatomical pericentral
signature within the hepatic
lobules^[119]30,[120]32,[121]34(Fig. [122]5G). Factor F29 is correlated
with cell cycle signatures and inversely enriched within a population
of γδ T cells (Fig. [123]5H). The top loaded genes for F29 include
cell-cycle-related regulatory genes UBE2C^[124]35, TOP2A^[125]36,
KPNA2^[126]37, CKS2^[127]38, and BIRC5^[128]39, and pathway analysis
reveals enrichment in cell cycle and cell division, RNA splicing and
processing, and nuclear division (Fig. [129]5H). Together, these
results suggest that this factor captures the cell cycle state within
the γδ T cell subset. We could not clearly interpret factors F26 and
F28. In conclusion, sciRED is able to identify weak signals, such as
rare cell types and subtle cell states, that were overlooked by
standard single-cell analysis pipelines in the original study.
Fig. 5. Exploring hidden biology in the healthy human liver atlas using
unannotated factors.
We demonstrate how sciRED facilitates the identification of hidden
biology in a healthy human liver single-cell transcriptomic atlas. A
sciRED’s FCA heatmap reveals signals corresponding to human liver cell
type identity gene expression programs. B Distribution of the number of
matched covariate levels per factor identifies nine unannotated factors
(F1, F10, F19, F20, F22, F26, F28, F29, F30) not associated with any
given covariate. C Correlation analysis between unannotated factors and
technical covariates (library size, number of expressed genes,
percentage of mitochondrial gene expression) and cell cycle (S and G2M)
phase scores. D FIS heatmap indicating the interpretability scores of
unexplained factors uncorrelated with technical covariates. E–H The
first row shows the distribution of selected factors on the atlas tSNE
plot, where each dot represents a cell, and colors indicate factor
score values. The second row presents boxplots of factor scores across
cell types, with the box representing the interquartile range (IQR),
the line indicating the median, whiskers extending to 1.5 × IQR, and
dots representing individual cell factor scores identified as outliers.
The third row displays the top-loaded genes, and the fourth row
provides pathway analysis for factors (E) F10, (F) F30, (G) F19, and (H)
F29. E Factor 10 exhibits significant enrichment within a subpopulation
of cells within the cholangiocyte cluster, suggesting the capture of a
rare population of mucus-producing cholangiocytes. F Factor 30
demonstrates positive enrichment within a subpopulation of cells
labeled as plasma cells in the original map, suggesting that it
captures antibody-secreting IgK+IgG+ B cells. G Factor 19 is positively
enriched in pericentrally zoned hepatocytes (clusters Hep1 and Hep2),
capturing an anatomical pericentral gene expression signature within
hepatic lobules. H Factor 29 is correlated with cell cycle signatures
and inversely enriched within a population of γδ T cells, suggesting
capture of cell cycle state within this T cell subset. Pathway
enrichment analysis of the top 200 genes from ordered loadings was
performed using gProfiler in “ranked” mode with default Gene Ontology
Biological Process and Reactome databases.
Discussion
We developed sciRED, a novel single-cell transcriptomics data
interpretation method that combines unsupervised factor analysis and
supervised covariate modeling to identify biological and technical
signals within single-cell transcriptomics data. We introduce new
metrics to assess factor interpretability to help characterize known
and unknown sources of variation in the data. We also showed that
regressing out known technical factors before factorization aids in
data interpretation.
To test sciRED, we analyzed a series of datasets with diverse known
covariates and showed that sciRED could recover factors associated with
these covariates. This evaluation strategy works well, but may miss
uninterpreted signals in the original data. Although many single-cell transcriptomics
simulation tools exist^[132]9,[133]40–[134]45 to assist in benchmarking
analysis methods, there has been little focus on factor simulation. We
simulated factors to evaluate interpretability metrics, but our
approach was simplistic and can be improved. sciRED uses PCA coupled
with varimax rotation as a high-performing matrix decomposition method,
though many other factor analysis methods exist^[135]46. Our
benchmarking results on the controlled scMixology dataset and a
biologically complex PBMC dataset indicate that sciRED generally
outperforms other methods in terms of combined interpretability and
runtime efficiency. Tools dedicated to simulating loading and score
matrices from single-cell transcriptomics, ranging from simple to
complex structures, could aid in benchmarking the performance of matrix
factorization methods across different scenarios. Such tools could also
help answer questions such as determining the minimal sample size
required to recover a factor with weak loadings. These simulation
methods could draw inspiration from factor simulation techniques used
in psychometric studies^[136]47–[137]50.
We present a suite of metrics designed to evaluate the interpretability
of factors derived from a matrix factorization method within a single
dataset. sciRED’s modular design enables flexible integration with
external data processing and factorization methods. Preprocessed data
can enter directly at the factor identification stage, and previously
derived factors and covariates can be imported for factor-covariate
association, bypassing the initial steps if matrix factorization has
already been applied using another method. Both FCA and FIS heatmaps
can provide guidance for interpreting factors derived from other matrix
factorization methods (Fig. [138]S10). These metrics help identify
which of the K factors generated by a given factorization algorithm are
likely to be interpretable. However, factor distributions can exhibit
various patterns that the provided metrics may not fully capture. For
unexplained factors, even one high metric value can be sufficient to
highlight them as potentially interesting. In such cases, further
interpretation should rely on enriched genes and pathways. Moreover,
the optimal factorization method and its specific settings—such as the
number of genes included—may vary depending on the input dataset.
Variables like sample complexity (including diversity of cell types,
presence of rare populations, and number of covariates in the
experimental design), signal-to-noise ratio (including contamination
level and dropout rate), number of batches, and variation in
dissociation protocols and capture technologies in multi-sample studies
can influence the choice of factorization method. Developing metrics to
compare and identify the optimal factor analysis method and its
parameter settings to generally improve interpretability across
different methods would be beneficial but remains a challenge.
In sciRED we focused on matrix factorization. However, a limitation of
this approach is the assumption that cells can be represented as linear
combinations of gene expression signatures, which may limit our ability
to capture non-linear patterns. Alternative approaches, such as deep
learning-based latent variable models like variational autoencoders
(VAEs), can incorporate non-linearities and interactions between latent
variables^[139]25 in the factorization step. While the interpretability
measures we developed would be applicable to VAE-derived latent spaces,
non-linear patterns may inherently be more challenging to interpret.
Thus, there remains a trade-off between interpretability and the
ability to model complex biological phenomena effectively. The
proportion of cellular programs that can be effectively captured by
linear versus non-linear approaches remains uncertain for any given
dataset. Enhancing the interpretability of deep learning-based models,
such as VAEs, while maintaining their ability to capture complex
biological phenomena is an active research area^[140]51,[141]52.
Here we only consider applying sciRED to scRNA-seq data; however, it
could also be applied to bulk RNA-seq datasets with tens to hundreds of
samples. We do not recommend using the Poisson GLM step in sciRED for
full-length, plate-based datasets such as Smart-seq2, as these data
typically follow a different statistical distribution than UMI-based
data^[142]53,[143]54. However, users can apply their own preprocessing
approach and then input these data into sciRED’s factor identification
step, which would allow sciRED to be used on Smart-seq2 data.
sciRED could also be extended to spatial and multi-omics data, given
multi-sample datasets, appropriate factor analysis methods, and a good
understanding of how technical variation affects these data.
Analysis of one spatial transcriptomics dataset shows that sciRED can
identify spatially distinct gene expression patterns within complex
tissue structures (Figs. [144]S11 and [145]S12). However, further
refinement may be needed to enhance its handling of spatial data and
other emerging data types, particularly in terms of selecting
appropriate error models, and optimizing the metrics used to assess
factor quality across diverse experimental technologies. We expect that
as single-cell and multi-omics technologies evolve, integrating and
comparing factors extracted from many datasets and various data
modalities will deepen our understanding of cellular systems in healthy
and diseased tissue.
Methods
sciRED factor discovery framework
We model the read counts for $n$ cells and $g$ genes (i.e., $Y_{n \times g}$)
as a combination of annotated and unannotated factors as follows:

$$Y_{n \times g} = C_{n \times p}\,\beta_{p \times g} + F_{n \times f}\,A_{f \times g} + U_{n \times u}\,H_{u \times g} + \varepsilon_{n \times g}$$

where:
     * $Y_{n \times g}$ is the observed data matrix with dimensions $n \times g$, where $n$ is the number of samples or cells and $g$ is the number of variables or genes
     * $C_{n \times p}$ is the matrix of technical covariates with dimensions $n \times p$, where $p$ is the number of covariates (e.g., library size, batch ID, cell cycle stage)
     * $\beta_{p \times g}$ is the coefficient matrix for technical covariates with dimensions $p \times g$
     * $F_{n \times f}$ is the matrix of annotated factors with dimensions $n \times f$, where $f$ is the number of annotated factors
     * $A_{f \times g}$ is the matrix of loadings for annotated factors with dimensions $f \times g$
     * $U_{n \times u}$ is the matrix of unannotated factors with dimensions $n \times u$, where $u$ is the number of unannotated factors
     * $H_{u \times g}$ is the matrix of loadings for unannotated factors with dimensions $u \times g$
     * $\varepsilon_{n \times g}$ is the error term with dimensions $n \times g$
Annotated factors are those matched with known (given) covariates as
indicated in the factor-covariate association (FCA) table. The solution
to the above equation is reached through a two-step process.
1. Derivation of residuals
The first step aims to remove the effect of technical covariates from
the data.
Input counts, represented by the matrix $Y_{n \times g}$, are modeled
with a Poisson generalized linear model (GLM) to account for the
matrix’s distributional properties^[146]11,[147]55,[148]57. Technical
covariates $C_{n \times p}$ are incorporated into the model to capture
their effects on the count data. The statsmodels package^[149]58
(version 0.11.0) is used for the GLM implementation.
Pearson residuals from this model are computed as the difference
between observed and predicted counts divided by the square root of the
predicted counts:

$$r_i = \frac{y_i - \hat{y}_i}{\sqrt{\mathrm{VAR}(\hat{y}_i)}}$$

where $\mathrm{VAR}$ is the variance. For the Poisson GLM,
$\mathrm{VAR}(\hat{y}_i) = \hat{y}_i$.
The Pearson residuals are used for subsequent matrix decomposition.
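As an illustration of this step, the following minimal sketch (not the sciRED implementation itself) fits one Poisson GLM per gene with statsmodels and collects the Pearson residuals; the inputs counts and covariates are placeholders.

import numpy as np
import statsmodels.api as sm

def pearson_residuals(counts, covariates):
    """Fit a Poisson GLM per gene and collect the Pearson residual matrix.

    counts: (n_cells, n_genes) raw count matrix
    covariates: (n_cells, n_covariates) technical covariates, e.g. log library size
    """
    X = sm.add_constant(covariates)            # add an intercept column
    residuals = np.zeros(counts.shape, dtype=float)
    for j in range(counts.shape[1]):           # one GLM per gene
        fit = sm.GLM(counts[:, j], X, family=sm.families.Poisson()).fit()
        residuals[:, j] = fit.resid_pearson    # (y - yhat) / sqrt(yhat) for Poisson
    return residuals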
Two additional residual types were evaluated, but not used in sciRED.
Response residuals represent the difference between the observed count
and the predicted mean count for each observation.
Response residuals:

$$\mathrm{resid}_i = y_i - \hat{y}_i$$

Deviance residuals represent the individual contributions of samples to
the deviance $D$, calculated as the signed square roots of the unit
deviances. Deviance residuals:

$$d_i = \mathrm{sign}(y_i - \hat{y}_i) \times \sqrt{2 \left( y_i \log\!\left( y_i / \hat{y}_i \right) - \left( y_i - \hat{y}_i \right) \right)}$$
Pearson residuals are used in sciRED because they are established in
other single-cell analysis methods^[150]55,[151]56. However, sciRED
results are robust to the choice of residual (among Pearson, response,
and deviance; see Methods section “Classifier ensemble”).
2. Residual decomposition
After obtaining Pearson residuals from the count data, a matrix
factorization technique is employed to uncover underlying patterns,
including the annotated and unannotated factors. PCA factors are
calculated using Singular Value Decomposition (SVD)^[152]59 as
implemented in scikit-learn^[153]60 (version 0.22.1). Following factor
decomposition, we apply varimax rotation to enhance the
interpretability of the principal axes in the feature space. To achieve
this, we reimplemented the optimized estimation procedure as described
below.
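The factorization of the residual matrix can be sketched with scikit-learn's PCA as follows; the rotation described in the next subsection is then applied to the resulting loadings and scores, and n_factors=30 is an illustrative choice rather than a prescribed default.

from sklearn.decomposition import PCA

def factorize_residuals(residuals, n_factors=30):
    """PCA (via SVD) of the Pearson residual matrix.

    Returns the cell-by-factor score matrix and the gene-by-factor loading matrix.
    """
    pca = PCA(n_components=n_factors, svd_solver="full")
    scores = pca.fit_transform(residuals)   # (n_cells, n_factors) score matrix
    loadings = pca.components_.T            # (n_genes, n_factors) loading matrix
    return scores, loadings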
Rotation types
Factor analysis typically comprises two sequential stages. Initially,
loadings are computed to best approximate the observed variances within
the data. However, these initial loadings may lack interpretability;
thus, we apply rotation to generate a revised set that is more easily
interpretable. There are two primary rotation types in factor analysis:
orthogonal and oblique rotations. Orthogonal rotation, such as varimax,
seeks to produce orthogonal factors with zero correlations.
Intuitively, varimax rotation aims to identify factors associated with
a limited number of variables, thereby promoting the detection of
distinct factors rather than factors affecting all variables evenly.
Mathematically, interpretability is achieved by maximizing the variance
of the squared loadings along each principal component axis. Varimax
rotation seeks to maximize the Kaiser criterion:

$$v = \sum_j v_j = \sum_j \left[ g \sum_i \left( l_{ij}^{2} / h_i^{2} \right)^{2} - \left( \sum_i l_{ij}^{2} / h_i^{2} \right)^{2} \right] / g^{2}$$

where $v$ is the Kaiser criterion, $l_{ij}$ is the loading value of the
$i$th gene on the $j$th factor, and $g$ is the total number of genes.
The communality $h_i^2$ is the sum of the squared loading values for
each gene:

$$h_i^{2} = \sum_j l_{ij}^{2}$$
To explicitly specify the rotation matrix, we can reformulate the
Kaiser criterion using the following notation. Let $L$ be a $g \times k$
loading matrix (eigenvectors), and let $R$ denote a rotation matrix such
that $R^{T}R = I_k$, where $I_k$ is the $k \times k$ identity matrix.
Additionally, let $R_{ij}$ represent the scalar element in the $i$th row
and $j$th column of matrix $R$. The varimax rotation can now be
described as follows:

$$R_{\mathrm{varimax}} = \arg\max_{R} \left[ \sum_{j=1}^{k} \sum_{i=1}^{g} (LR)_{ij}^{4} - \frac{1}{g} \sum_{j=1}^{k} \left( \sum_{i=1}^{g} (LR)_{ij}^{2} \right)^{2} \right]$$

where $R_{\mathrm{varimax}}$ denotes the resulting rotation matrix.
The rotation matrix is computed using an iterative method relying on
SVD to achieve sparsity in the loadings. Subsequently, the rotated
loadings and score matrices are derived by multiplying the original
loadings and score matrices with the rotation matrix, respectively. The
optimization algorithm is elaborated on in detail in Appendix A of
Stegmann et al.^[154]61. We re-implemented the base R varimax rotation
function in Python.
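A minimal Python sketch of this iterative SVD-based update is shown below; it follows the standard varimax algorithm rather than sciRED's exact code and omits the optional Kaiser normalization of the loadings.

import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Varimax rotation of a (genes x factors) loading matrix via iterative SVD."""
    g, k = loadings.shape
    R = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # gradient of the varimax criterion with respect to the rotation
        grad = loadings.T @ (L ** 3 - (L @ np.diag((L ** 2).sum(axis=0))) / g)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break                              # converged
        criterion = new_criterion
    return loadings @ R, R                     # rotated loadings and rotation matrix

# The rotated scores are obtained by applying the same rotation to the PCA scores:
# rotated_scores = scores @ R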
Oblique rotations, such as Promax^[155]62, allow factors to be
correlated, thereby relaxing the orthogonality assumption^[156]63. This
flexibility can be beneficial when factors are expected to be
correlated within the underlying structure of the data. Promax
initially applies the varimax method to generate a set of orthogonal
results. It then constructs an idealized oblique solution by
exaggerating the varimax-rotated gene-loading matrix (raising its
entries to a higher power). Finally, it rotates the orthogonal results
to achieve a least-squares fit with this ideal solution.
We define a pattern matrix $P = (p_{ij})$ as follows (for $k > 1$):

$$p_{ij} = \frac{\left| l_{ij}^{\,k+1} \right|}{l_{ij}}$$

Each element of $P$ is the $k$th power (typically the 3rd power) of the
corresponding element in the row-column normalized varimax loading
matrix. Next, the least-squares fit of the orthogonal matrix of factor
loadings to the pattern matrix is computed:

$$R_{\mathrm{Promax}} = \arg\min_{R} \left\| P - L_{\mathrm{rot}} R \right\|^{2}$$

$$R_{\mathrm{Promax}} = \left( L_{\mathrm{rot}}^{\prime} L_{\mathrm{rot}} \right)^{-1} L_{\mathrm{rot}}^{\prime} P$$

where $R_{\mathrm{Promax}}$ is the unnormalized rotation matrix,
$L_{\mathrm{rot}}$ is the matrix of varimax-rotated loadings, and $P$ is
the pattern matrix defined above. The columns of $R_{\mathrm{Promax}}$
are normalized such that their sums of squares equal unity. We
reimplemented the base R promax function in Python.
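A compact sketch of this construction, reusing the varimax helper above, assuming a power of 3, and omitting the row-column normalization and zero-loading handling for brevity:

import numpy as np

def promax(loadings, power=3):
    """Promax oblique rotation built on the varimax solution (simplified sketch)."""
    L_rot, _ = varimax(loadings)                    # orthogonal starting point
    # pattern matrix: |l|^(power+1) / l, a sign-preserving power of the loadings
    P = np.abs(L_rot) ** (power + 1) / L_rot
    # least-squares fit of the varimax-rotated loadings to the pattern matrix
    R = np.linalg.solve(L_rot.T @ L_rot, L_rot.T @ P)
    # normalize the columns of the rotation matrix to unit sum of squares
    R = R / np.sqrt((R ** 2).sum(axis=0))
    return L_rot @ R, R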
To evaluate the impact of factor rotations, we applied PCA, sciRED
(varimax-based^[157]19), and promax-rotated^[158]62 PCA to the Pearson
residual of the scMixology dataset after regressing out protocol and
library size. The factor-covariate association scores reveal a high
correlation between sciRED and Promax. Both methods enhance specificity
and achieve one-to-one association between factors and cell line
covariates, outperforming unrotated PCA (Figure [159]S13). We used
varimax as the default rotation method for sciRED, as it has been
established to perform well for interpretable factor
analysis^[160]64,[161]65.
Factor identification benchmarking
For both scMixology and PBMC datasets, PCA and ICA were applied to
normalized (library-regressed) data, while NMF, scVI, Zinbwave, cNMF,
and scCoGAPS were run on filtered raw counts. All methods used the top
2000 highly variable genes, following sciRED’s standard pipeline. The
number of factors (K) was set to 30 across all methods; for additional
comparison, K was also set to 10 for the scMixology dataset.
Spectra was applied exclusively to the PBMC dataset, as it relies on
the Cytopus knowledge base, which contains gene sets tailored for
standard scRNA-seq data but lacks coverage for cancer cell
line-specific gene sets. For PBMC, cell type labels were standardized
to CD4-T, mono, cDC, NK, CD8-T, and B, with an “all-cells” category
used for cell types without specific annotations. The number of factors
per cell type was set to five to approximate the total number of
factors used in the other methods.
All methods were run in single-threaded mode with default parameters on
a workstation with an Intel 3.0 GHz Xeon E5-2687W CPU and 64 GB of RAM.
Data pre-processing and handling of batch effects with sciRED
sciRED takes raw count (not batch corrected) data as input, removing
cells and genes with zero total read count and retaining only the top
highly variable genes, typically the top 2000 as in standard scRNA-seq
workflows. While using all genes is an option, the first step—fitting
the Poisson GLM—becomes more time-consuming with larger gene sets;
thus, a smaller set of highly variable genes is used. The minimum
required covariate is library size (total read count per cell), which
serves as an equivalent normalization step.
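For illustration, this preprocessing could be written with scanpy as below; scanpy and the seurat_v3 highly-variable-gene flavor are our assumptions here, not stated sciRED dependencies, and adata is an AnnData object holding raw counts.

import numpy as np
import scanpy as sc

def preprocess(adata, n_top_genes=2000):
    """Drop zero-count cells/genes, record library size, keep top highly variable genes."""
    sc.pp.filter_cells(adata, min_counts=1)     # remove cells with zero total reads
    sc.pp.filter_genes(adata, min_counts=1)     # remove genes with zero total reads
    adata.obs["library_size"] = np.asarray(adata.X.sum(axis=1)).ravel()
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes, flavor="seurat_v3")
    return adata[:, adata.var["highly_variable"]].copy()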
Additional covariates can be incorporated into the GLM model based on
specific analysis goals; however, technical covariates should be
removed with caution to prevent the unintentional exclusion of
biologically relevant signals. For instance, in analyses focused on
sex-specific variations, adjusting for sample IDs as covariates can
inadvertently eliminate sex-associated signals if sample and sex
effects are confounded.
In multi-sample analyses, we recommend merging all samples without
batch correction. Batch correction is generally unnecessary with
sciRED, as our analyses demonstrate robust batch effect handling
without directly correcting the data matrix or latent embedding. To
evaluate batch effect handling, sciRED was tested on two PBMC datasets
profiled using 10x Genomics single-cell 3’ and 5’ gene expression
libraries with strong batch effects (Fig. [162]S14). sciRED was applied
to the combined count matrices, with library and sample IDs regressed
as covariates in the Poisson GLM step. The UMAP projection (Figures
S14AB) reveals batch-related clustering without correction; however,
sciRED analysis (Fig. [163]S14C, D) shows well-integrated cell type
identities, even across assay types. For instance, Factors F1 and F4,
which identify CD14+ monocytes and B cells, respectively, are well
integrated across both assays. The distributions of cells over F1 and
F4, colored by cell type and assay, demonstrate effective batch
integration, and the corresponding box plots and UMAP enrichment
patterns further confirm these results (Fig. [164]S14E–J).
Impact of data sparsity on factor decomposition performance
To assess the impact of data sparsity on sciRED’s decomposition, we
systematically varied the sparsity level (0.01, 0.3, 0.5, 0.7, 0.9,
0.95, 0.99) in the human liver atlas dataset by incrementally replacing
randomly selected gene expression values with zeros (Fig. [165]S15).
For each level, we evaluated the factors based on average and maximum
variance explained, the percentage of matched factors, and the
percentage of matched covariates. Our results indicate that the average
variance explained by all factors remains largely robust across
different sparsity levels, likely due to sciRED’s rotation step, as
this trend is not observed for PCA factors. The maximum variance
explained by factors is also more stable with sciRED’s rotated factors
compared to PCA as sparsity increases. As expected, the percentage of
matched factors decreases with rising sparsity, with a marked drop
around a sparsity threshold of ~70%. At lower sparsity levels (0.01 and
0.5), the FCA heatmaps display a balanced distribution of matched
covariates per factor, while at higher sparsity (0.95 and 0.99), most
covariates align with only a few factors, reducing interpretability.
However, the percentage of matched covariates remains stable, as most
covariates continue to align with at least one factor.
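The sparsity levels can be introduced along these lines (a simplified sketch; the count matrix and seed handling are placeholders):

import numpy as np

def add_sparsity(counts, sparsity_level, seed=0):
    """Randomly replace a given fraction of entries in the count matrix with zeros."""
    rng = np.random.default_rng(seed)
    mask = rng.random(counts.shape) < sparsity_level  # entries selected for zeroing
    sparse_counts = counts.copy()
    sparse_counts[mask] = 0
    return sparse_counts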
Impact of factor number (K) on decomposition results
To evaluate the impact of factor count (K) on sciRED’s decomposition
results, we tested various factor numbers—5, 10, 20, 30, and 50—on the
human liver map (Fig. [166]S16). At lower factor counts, such as K = 5,
each factor showed higher covariate associations (e.g., Factor 1 was
matched to nine covariates). As K increased, each factor aligned with
fewer covariates, and the percentage of matched factors decreased. At
K = 50, 54% of factors had no covariate matches. Correlating factors
between the K = 10 and K = 30 decompositions revealed that individual
factors in K = 10 (e.g., F1, F3, and F7) mapped to multiple factors in
K = 30, suggesting that low factor limits may lead to each factor
capturing multiple gene expression programs, complicating
interpretation. Repeating this analysis on the PBMC dataset yielded
similar results: as the number of factors increased, each factor
aligned with fewer covariates, and a higher proportion of factors
remained unmatched at higher K values (Fig. [167]S17).
The ideal factor count varies by dataset and is affected by biological
and technical signal diversity. While a Scree plot is commonly used in
PCA to determine factor counts, sciRED’s rotation step changes the
ordering by variance explained, making this approach less effective.
Empirically, a moderately high K (e.g., 30) has yielded interpretable
results across datasets. We recommend starting with a higher K
value and excluding factors with no covariate matches or low factor
interpretability scores (FIS).
Feature importance calculation
We evaluated a range of classifiers for inclusion in the sciRED
ensemble classifier. Each classifier is trained on individual levels of
each covariate separately (e.g., “female” and “male” for a “biological
sex” covariate). Each classifier uses a different approach to estimate
feature importance:
1. Logistic regression: feature importance is the magnitude of the
coefficient for each factor, which represents the change in the
log-odds of belonging to a covariate level per unit of factor
weight.
2. Decision trees, random forest^[168]66 and extreme gradient
boosting: feature importance scores represent the decrease in
covariate mixing (e.g. Gini impurity or entropy) when the feature
is used within a tree averaged across all trees.
3. K-nearest neighbor^[169]67 (KNN): feature importance is estimated
as the decrease in predictive accuracy when the values for that
feature are randomly permuted. This value is calculated as the
average across five permutations for each factor based on the
default scikit-learn package implementation.
4. Linear classifier (AUC): feature importance for a linear classifier
(i.e. fixed threshold) is calculated as the area under (AUC) the
receiver-operating characteristic (ROC) curve. The AUC for
one-dimensional data is equivalent to the Wilcoxon or Mann-Whitney
U test statistic with the relation:
$$\mathrm{AUC} = \frac{U}{n_0 \times n_1}$$

where $U$ is the Mann-Whitney U statistic, and $n_0$ and $n_1$ are the
sample sizes of the two groups being compared. The Mann-Whitney
U test is a non-parametric test used to assess whether two independent
samples are selected from populations having the same mean rank. Here,
samples are defined as factor scores for the target group (cells
labeled with the covariate level of interest) and the non-target group.
In the context of feature importance, a higher AUC value indicates that
the factor is better at separating the classes, while a lower AUC value
suggests less discriminatory power. The scikit-learn package is used to
implement decision tree, random forest, logistic regression and KNN
classifiers with default parameters. XGBoost (version 1.5.0) and
SciPy^[170]68 (version 1.4.1) packages are used for XGB and the
Mann-Whitney U test, respectively.
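For example, the AUC-based importance of a single factor for one covariate level can be computed directly from the Mann-Whitney U relation above (a sketch; factor_scores and is_target are placeholder arrays, and the U statistic is derived from ranks):

import numpy as np
from scipy.stats import rankdata

def auc_importance(factor_scores, is_target):
    """AUC of a one-factor 'linear classifier' via the Mann-Whitney U relation."""
    factor_scores = np.asarray(factor_scores, dtype=float)
    is_target = np.asarray(is_target, dtype=bool)
    n1 = is_target.sum()                          # target-group size
    n0 = (~is_target).sum()                       # non-target-group size
    ranks = rankdata(factor_scores)               # ranks over all cells (ties averaged)
    u = ranks[is_target].sum() - n1 * (n1 + 1) / 2  # U statistic for the target group
    return u / (n0 * n1)                          # AUC = U / (n0 * n1)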
Classifier ensemble
We optimized the sciRED classifier ensemble by evaluating different
classifier combinations on four independent datasets: a healthy human
kidney map, a healthy human liver map, a PBMC atlas, and the scMixology
benchmark dataset (Fig. [171]S18). For each experiment, we randomly
shuffled the covariate labels to generate a null distribution of
classifier association scores and calculated the average number of
significant associated factors (p < 0.05) per covariate level
(Fig. [172]S18A). We defined a one-to-one association between factors
and covariates as the optimally interpretable result. This analysis
shows that the sciRED classifier (ensemble of logistic regression,
linear classifier/area under the curve (AUC), decision tree, and
extreme gradient boosting (XGB)) outperforms or matches the performance
of the individual classifiers, depending on the data set. Initially,
six classifiers—AUC, K-Nearest Neighbors (KNN), logistic regression,
decision tree, random forest, and XGB—were compared on the scMixology
benchmark dataset. Due to KNN’s poor classification performance and
random forest’s inferior scalability, they were excluded from the
sciRED ensemble model (Fig. [173]S18B, C). Benchmarking on the four
independent datasets described above demonstrates sciRED’s superior
performance relative to single classifiers (Fig. [174]S18D–G).
To optimize the ensemble score, we also tested all combinations of
three different scaling methods and two mean calculations
(Fig. [175]S19). Specifically, we considered standardization, min-max
scaling, or rank-based scaling combined with arithmetic or geometric
means^[176]69. By comparing scores for the real scMixology data vs.
permuted data labels, we identified standardization followed by
arithmetic mean calculation as the optimal scoring method
(Fig. [177]S19).
Standardization, min-max scaling, and rank-based scaling are defined as
follows:
Given a set of feature importance scores $x = \{x_1, x_2, \ldots, x_n\}$
with one score per factor $i$, and their ascending ranks
$r = \{r_1, r_2, \ldots, r_n\}$, where $n$ is the total number of
features (factors):

$$\mathrm{standardized}_i = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)}$$

$$\mathrm{min\text{-}max\ scaled}_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}$$

$$\mathrm{rank\text{-}scaled}_i = \frac{r_i}{n}$$
Rank scaling assigns a value between 0 and 1 to each data point based
on its rank within the dataset. Lower values in the original dataset
receive lower rank-based scaled values, while higher values receive
higher rank-based scaled values.
Arithmetic and geometric means across classifiers are calculated as
follows:
$$\mathrm{Arithmetic\ mean} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

where $n$ is the number of values.

$$\mathrm{Geometric\ mean} = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$$
The geometric mean is the n^th root of the product of a set of values.
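Putting these pieces together, the selected scoring scheme (standardization followed by the arithmetic mean across classifiers) can be sketched as follows; importance_matrix is a hypothetical classifiers-by-factors array of raw importances for a single covariate level.

import numpy as np

def ensemble_fca_score(importance_matrix):
    """Combine per-classifier importances into one FCA score per factor.

    importance_matrix: (n_classifiers, n_factors) raw feature-importance values
    computed for a single covariate level.
    """
    # standardize each classifier's scores across factors
    mu = importance_matrix.mean(axis=1, keepdims=True)
    sd = importance_matrix.std(axis=1, keepdims=True)
    standardized = (importance_matrix - mu) / sd
    # arithmetic mean across classifiers gives the ensemble FCA score per factor
    return standardized.mean(axis=0)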
The effect of the choice of residual (Pearson, response, deviance) on
sciRED performance was also evaluated, indicating that sciRED results
are robust to the choice of residual (Fig. [178]S20).
Determining significant factor-covariate associations
FCA scores are binarized into significant and non-significant
associations based on a threshold. This threshold is automatically
determined using Otsu thresholding^[179]70. Otsu thresholding iterates
through all potential threshold values and computes a separability
measure in each iteration. The threshold value that maximizes this
separability measure is chosen as the optimal threshold to partition
the FCA score distributions into significant and non-significant
associations. The scikit-image package^[180]71 (version 0.23.2) is used
for implementation.
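A minimal sketch of this binarization with scikit-image (fca_scores is a placeholder array of FCA scores):

import numpy as np
from skimage.filters import threshold_otsu

def binarize_fca(fca_scores):
    """Split FCA scores into significant / non-significant using Otsu's threshold."""
    scores = np.asarray(fca_scores, dtype=float)
    threshold = threshold_otsu(scores)      # threshold maximizing class separability
    return scores > threshold, threshold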
Benchmarking covariate-factor association metrics based on permutation
results
We used a permutation test to evaluate the significance of each
covariate-factor association. This process entailed randomly shuffling
cell covariate labels 500 times and recalculating factor-covariate
association scores for each permutation to create an empirical null
distribution. A shuffled dataset should result in lower-scaled
factor-covariate association scores compared to the original dataset,
given that the cell labels are randomized after each permutation.
Empirical p-values were calculated as the number of permuted FCA scores
that are as or more extreme than the observed association value,
divided by the number of permutations. The number of significant
associations for each covariate was determined using p-value < 0.05. We
expect one significant factor association per covariate level in the
ideal case. High values of this metric likely indicate false positive
associations (low specificity) and zero values highlight false negative
results (low sensitivity).
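A sketch of this permutation scheme, where score_fn is a placeholder for whichever routine recomputes the FCA score of one factor from a (possibly shuffled) label vector:

import numpy as np

def permutation_pvalue(observed_score, score_fn, labels, n_perm=500, seed=0):
    """Empirical p-value for one factor-covariate association.

    score_fn(labels) recomputes the FCA score for the factor given a label vector.
    """
    rng = np.random.default_rng(seed)
    null_scores = np.array([score_fn(rng.permutation(labels)) for _ in range(n_perm)])
    # fraction of permuted scores at least as extreme as the observed score
    return np.mean(null_scores >= observed_score)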
To evaluate model performance without relying on predefined thresholds
(such as p-value = 0.05), we employed the Gini index, an inequality
measure ranging from 0 to 1. A Gini index of 0 signifies perfect
equality, where all values are identical, while a score of 1 indicates
perfect inequality, with one value dominating the distribution. In our
context, we aimed to assess the extent of association between factors
and covariates across various levels of pre- and post-permutation.
Following label shuffling, we expect the factor-covariate association
scores to exhibit uniformity across all covariate levels (Gini score
closer to 0). In cases of non-random baseline labels, sparse
associations between factors and covariates would result in a Gini
score closer to 1. To calculate the Gini index for a given FCA matrix,
the Gini coefficient is computed for each covariate level separately
and the average Gini index across all covariate levels is reported:
G_k = \frac{1}{2 n^2 \bar{x}} \sum_{i=1}^{n} \sum_{j=1}^{n} |x_i - x_j|
Where G_k is the Gini coefficient for covariate level k, n is the
number of factors, and x̄ is the mean FCA score for that covariate
level. We then calculate the global Gini index as:
G = \frac{1}{K} \sum_{k=1}^{K} G_k
Where K is the total number of covariate levels.
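A minimal sketch of the per-level and global Gini calculation, assuming
the FCA scores are stored as a factors-by-covariate-levels array:

```python
# Gini coefficient per covariate level and its average over levels (illustrative).
import numpy as np

def gini(x):
    """Gini coefficient of one covariate level's FCA scores across factors."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.abs(x[:, None] - x[None, :]).sum() / (2 * n**2 * x.mean())

def global_gini(fca):
    """Average Gini coefficient across covariate levels (columns of the FCA matrix)."""
    return np.mean([gini(fca[:, k]) for k in range(fca.shape[1])])
```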
Factor evaluation
We defined four categories of metrics: separability, effect size,
specificity, and homogeneity to measure factor interpretability. Two
are label-free (separability, effect size) and the other two are
label-dependent (specificity and homogeneity). Metrics for each
category are organized into a table of factor interpretability scores
(FIS), with metrics arranged as rows and factors as columns.
Subsequently, the FIS undergoes row-wise scaling and is visualized as a
heatmap to support the comparison of interpretability scores between
factors.
sciRED uses the Silhouette score and bimodality index to assess
separability. Homogeneity is measured using the arithmetic mean of the
average scaled variance (ASV) table. The Simpson index is employed to
evaluate factor specificity, while factor variance serves as an effect
size measure (See Methods section “Correlation between interpretability
metrics and overlap values”).
Separability (bimodality)
We use bimodality scores as an indicator for measuring the separability
of factors. We use existing metrics developed for identifying genes
with bimodal expression distributions, which are typically used to
assess genes with prognostic value in discriminating patient subgroups.
We selected metrics in accordance with Hellwig et al.’s comparative
assessment of these scores based on survival times in a breast cancer
dataset^[181]72.
Cluster-based bimodality measures
Cluster-based methods group observations into clusters, then calculate
various statistics to measure the degree of distinctiveness of these
clusters. Clusters were generated using a standard k-means^[182]73
algorithm (k = 2) implemented using the scikit-learn^[183]60 package.
The following cluster-based bimodality measures are included in sciRED:
* Variance Ratio Score (VRS), also known as Calinski–Harabasz index
(CHI)^[184]74 assesses the proportion of variance reduction when
splitting the data into two clusters. We can decompose the total
sum of squares (TSS) to between-cluster sum of squares (BSS) and
within-cluster sum of squares (WSS). VRS is then defined as the
ratio of BSS and WSS.
* Weighted Variance Ratio score (WVRS) is similar to VRS but measures
the variance reduction independent of the cluster sample sizes. In
both cases, higher values indicate better separation between
clusters and reflect bimodality.
* Silhouette score^[185]75 measures the cohesion and separation of
clusters, ranging from -1 to 1, where higher values indicate
better-defined clusters. Silhouette is calculated based on mean
intra-cluster distance (ICD) and the mean nearest-cluster distance
(NCD). For each data point, the Silhouette is then given as
(NCD − ICD) / max(ICD, NCD).
* Davies-Bouldin index^[186]76 computes the average similarity
between each cluster and its most similar cluster. It measures
similarity as the ratio of within-cluster distances to
between-cluster distances. The minimum value is zero with lower
values indicating better cluster separation. To ensure consistency
with other metrics, we scale the inverse Davies-Bouldin Index.
Calinski–Harabasz, Silhouette, and Davies-Bouldin Indices were
implemented using scikit-learn.
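A minimal sketch of these cluster-based scores for a single factor,
using the scikit-learn implementations named above (the inversion of
the Davies-Bouldin index follows the convention described here):

```python
# Cluster-based bimodality scores for one factor (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

def cluster_bimodality(scores):
    x = np.asarray(scores, dtype=float).reshape(-1, 1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x)
    return {
        "silhouette": silhouette_score(x, labels),
        "variance_ratio": calinski_harabasz_score(x, labels),        # VRS / CHI
        "davies_bouldin_inv": 1.0 / davies_bouldin_score(x, labels), # inverted so higher = better
    }
```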
Bimodality index
The Bimodality Index^[187]77 is another bimodality metric which was
initially introduced to rank bimodal gene expression signatures within
cancer gene expression datasets. We adopt the bimodality index to
assess the bimodal nature of factor scores by modeling them as a
mixture of two normal distributions using a Gaussian Mixture Model
(GMM). Let μ_1 and μ_2 represent the means of the two normal
distributions, and σ their common standard deviation. Given the equal
variance assumption, the standardized distance δ between the
distributions is calculated as:
\delta = \frac{|\mu_1 - \mu_2|}{\sigma}
Given the GMM parameter estimates, we can calculate the proportion of
observations in the first component, π. We then compute the bimodality
index (BI) as:
BI = \sqrt{\pi(1 - \pi)} \times \delta
A higher BI indicates a stronger bimodal distribution, aiding in the
identification of bimodal factor signatures. Scikit-learn was used to
fit GMMs.
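A minimal sketch of the bimodality index for one factor, fitting a
two-component GMM with a shared (tied) covariance to match the
equal-variance assumption:

```python
# Bimodality index from a two-component Gaussian mixture (illustrative sketch).
import numpy as np
from sklearn.mixture import GaussianMixture

def bimodality_index(scores):
    x = np.asarray(scores, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, covariance_type="tied", random_state=0).fit(x)
    mu1, mu2 = gmm.means_.ravel()
    sigma = float(np.sqrt(gmm.covariances_.ravel()[0]))  # shared standard deviation
    pi = gmm.weights_[0]                                 # proportion in the first component
    delta = abs(mu1 - mu2) / sigma                       # standardized distance
    return np.sqrt(pi * (1 - pi)) * delta
```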
Dip score
Another bimodality evaluation metric is Hartigan’s dip test^[188]78, a
statistical measure used to assess deviations from unimodality in
distributions. It evaluates multimodality by comparing the maximum
difference between the empirical distribution function and the unimodal
distribution function that minimizes this difference. The dip score was
computed for each factor, indicating the degree of bimodality present
in the factor distribution. Higher dip scores suggest stronger evidence
of bimodality, while lower scores indicate unimodality. Implementation
was based on the diptest package.
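As a minimal sketch (the exact function names may differ across diptest
versions), the dip statistic and its p-value can be obtained as:

```python
# Hartigan's dip test on one factor's scores via the diptest package (illustrative).
import numpy as np
import diptest

scores = np.random.randn(1000)         # placeholder factor scores
dip, pval = diptest.diptest(scores)    # dip statistic and p-value against unimodality
```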
Effect size
The variance of factor scores across cells was employed as the measure
of effect size.
Specificity
We assessed the Simpson diversity index and Shannon entropy as measures
of specificity.
Simpson diversity Index
The Simpson diversity index is a measure commonly used in ecology to
quantify the diversity or evenness of species within a
community^[189]79. It assesses the probability of encountering
different species within a community and how evenly distributed these
species are. We adopted the Simpson diversity index to measure
specificity for individual factors within the context of FCA scores. By
applying the Simpson index to each vector of FCA scores, we can
evaluate the extent to which a factor uniquely characterizes a
particular covariate level.
Mathematically, the Simpson diversity index D is expressed as:
D = \sum_{i=1}^{n} p_i^2
Where p_i denotes the scaled factor-covariate level association score
(the probability of a factor being chosen for a given covariate level
i), and n represents the total number of covariate levels. Here, the
Simpson diversity index ranges between 0 and 1, where 0 indicates low
factor specificity (maximum diversity, with all factor association
scores equally distributed) and 1 indicates greater specificity
(minimum diversity) of a factor towards a particular covariate level.
Shannon diversity index
We applied Shannon entropy to measure the specificity of individual
factors within FCA scores. By calculating Shannon entropy for each
vector of FCA scores, we assess how uniquely a factor characterizes a
particular covariate level.
Mathematically, Shannon entropy is defined as:
H = -\sum_{i=1}^{N} p_i \log(p_i)
where p_i is the probability of a factor being associated with a given
covariate level i, and N is the total number of covariate levels. The
Shannon diversity index ranges from 0 (high factor specificity) to
log(N) (low factor specificity).
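A minimal sketch of both specificity measures applied to one factor's
vector of non-negative scaled FCA scores (normalized here to sum to
one):

```python
# Simpson index and Shannon entropy for one factor's FCA scores (illustrative).
import numpy as np

def specificity(fca_row, eps=1e-12):
    p = np.asarray(fca_row, dtype=float)
    p = p / p.sum()                                # treat scaled scores as probabilities
    simpson = float(np.sum(p**2))                  # near 1: specific; small: diverse
    shannon = float(-np.sum(p * np.log(p + eps)))  # 0: specific; log(N): diverse
    return simpson, shannon
```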
Homogeneity
To assess the homogeneity or even distribution of factor scores across
different levels of a covariate, we compute the scaled variance for
each covariate level. This metric quantifies the proportion of variance
in the factor scores observed at a specific covariate level
L relative to the total variance across all covariate
levels^[190]80. Thus, the scaled variance SV for a factor x and a
particular covariate level L is computed as:
SV = \frac{\mathrm{Var}(x_L)}{\mathrm{Var}(x)}
Here, Var(x_L) denotes the variance of the factor scores x
corresponding to the covariate level L, and Var(x) represents the total
variance of the factor scores across all cells.
To establish a unified metric across all levels of a single covariate
(e.g., Batch1, Batch2, etc., representing levels of the covariate
“Batch”), we adopt different approaches based on the covariate’s number
of unique levels. We compute both the geometric and arithmetic means of
the scaled variances (SV) of all covariate levels for each factor. The
arithmetic mean of SV values was chosen based on factor simulation
results showing it performed best (explained below).
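A minimal sketch of the scaled-variance homogeneity summary for one
factor, given per-cell factor scores and the corresponding covariate
labels:

```python
# Scaled variance per covariate level and its arithmetic/geometric means (illustrative).
import numpy as np

def homogeneity(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    total_var = scores.var()
    sv = np.array([scores[labels == lvl].var() / total_var for lvl in np.unique(labels)])
    return sv.mean(), np.exp(np.log(sv.clip(min=1e-12)).mean())  # arithmetic, geometric mean
```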
Evaluating factor interpretability metrics using simulation
Simulating mixture Gaussian distribution
Factors were simulated under the assumption of being generated from a
mixture of Gaussian distributions. Each Gaussian distribution
represents the factor scores of the cells belonging to one covariate
level. This simulation process was implemented using the following
mathematical formulation:
Let X be a random variable representing the simulated factor scores.
We generated n samples (cells) from a mixture of K = 2 normal
distributions, each characterized by its mean μ_k, standard deviation
σ_k, and mixing proportion p_k. The mixture distribution X follows:
\sum_{k=1}^{K} p_k \, N(\mu_k, \sigma_k^2)
where N(μ_k, σ_k^2) denotes the probability density function (PDF) of
the k-th normal distribution.
Given this assumption, 10 factors for 10000 cells were simulated for
100 rounds.
p_k was set to 0.5 for simplicity. μ and σ were set by random sampling
from uniform distributions (parameters: σ_min = 0.5, σ_max = 1,
μ_min = 0, μ_max = 4).
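A minimal sketch of this simulation for a single round, using the
stated parameter ranges:

```python
# Simulate 10 factors for 10,000 cells as two-component Gaussian mixtures (illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_factors = 10_000, 10
mu_min, mu_max, sigma_min, sigma_max = 0.0, 4.0, 0.5, 1.0

factors = np.empty((n_cells, n_factors))
for j in range(n_factors):
    mu = rng.uniform(mu_min, mu_max, size=2)
    sigma = rng.uniform(sigma_min, sigma_max, size=2)
    component = rng.integers(0, 2, size=n_cells)            # p_k = 0.5 for both components
    factors[:, j] = rng.normal(mu[component], sigma[component])
```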
Correlation between interpretability metrics and overlap values
We assessed the efficacy of the proposed factor interpretability
metrics by examining their correlation with the overlap between the two
Gaussian distributions representing each factor. We anticipated that
factors exhibiting greater overlap would demonstrate lower separability
and specificity scores (negative correlation), along with higher
homogeneity values (positive correlation), and vice versa. Our goal was
to identify metrics with higher absolute mean correlation values and
lower variability, as illustrated in Fig. [191]S21A and B. Based on
simulation results (Fig. [192]S21C), Silhouette and bimodality index
indicate high performance for the separability measure. The arithmetic
mean of the average scaled variance (ASV) table shows higher
performance compared to the geometric mean for the homogeneity measure.
Our simulation results indicated a superior performance of Simpson
index compared to Shannon entropy^[193]81 to measure factor
specificity.
Calculating overlap between double Gaussian distributions
The overlap between two Gaussian distributions with means μ_1 and μ_2
and standard deviations σ_1 and σ_2, respectively, is quantified based
on the intersection of the PDFs of the two distributions. For
distributions with unequal means and standard deviations, the overlap O
is computed as:
O = 1 - \frac{1}{2}\left[\mathrm{erf}\left(\frac{c - \mu_1}{\sqrt{2}\,\sigma_1}\right) - \mathrm{erf}\left(\frac{c - \mu_2}{\sqrt{2}\,\sigma_2}\right)\right]
where c represents the intersection point of the two PDFs and erf is
the error function.
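A minimal sketch of this calculation: locate the PDF intersection point
c between the two means numerically, then apply the erf-based formula
above.

```python
# Overlap between two Gaussian distributions via their PDF intersection (illustrative).
import numpy as np
from scipy.optimize import brentq
from scipy.special import erf
from scipy.stats import norm

def gaussian_overlap(mu1, sigma1, mu2, sigma2):
    if mu1 > mu2:                                    # order so that mu1 <= mu2
        mu1, sigma1, mu2, sigma2 = mu2, sigma2, mu1, sigma1
    diff = lambda x: norm.pdf(x, mu1, sigma1) - norm.pdf(x, mu2, sigma2)
    c = brentq(diff, mu1, mu2)                       # intersection point between the means
    return 1 - 0.5 * (erf((c - mu1) / (np.sqrt(2) * sigma1))
                      - erf((c - mu2) / (np.sqrt(2) * sigma2)))
```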
Pathway and gene set enrichment analysis
Gene set and pathway enrichment analysis methods were used to study the
biological signatures represented by each factor. The gene scores
corresponding to the factors of interest were selected from the loading
matrix to order the list of genes from most to least contribution to
the given factor. Pathway enrichment analysis was performed on the top
200 genes of the ordered loadings using the gprofiler2^[194]82 (version
0.2.3) enrichment tool based on the default Gene Ontology Biological
Process and Reactome gene set database sources, and using the “ranked”
mode.
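A minimal sketch of how a ranked gene list for one factor can be
extracted from the loading matrix before enrichment; the loading matrix
and gene names here are placeholders.

```python
# Rank genes by their loading on one factor and keep the top 200 (illustrative).
import numpy as np

rng = np.random.default_rng(0)
loadings = rng.normal(size=(2000, 30))            # placeholder genes-by-factors loading matrix
genes = [f"gene{i}" for i in range(loadings.shape[0])]

factor_idx = 5                                    # hypothetical factor of interest
order = np.argsort(loadings[:, factor_idx])[::-1] # most to least contribution
top_genes = [genes[i] for i in order[:200]]       # gene list passed to the enrichment tool
```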
Interpretation of factors using FCA and FIS tables
The FCA and FIS tables facilitate interpretation of identified factors.
When covariates are available, evaluating the FCA table first can
reveal factors associated with known covariates. Visual inspection of
the FCA heatmap helps identify specific factor-covariate pairs, and
automatic thresholding highlights factor-covariate pairs with scores
that stand out against the background distribution. Ideally, factors of
interest will exhibit high specificity, as summarized in the FIS table.
Specificity measures how uniquely a factor associates with a particular
covariate level (e.g., scRNA-seq technology type), with higher
specificity enhancing interpretability.
The FIS heatmap includes three additional metric scores - bimodality
(separability), effect size, and homogeneity - which capture distinct
aspects of the factor distribution and help prioritize unexplained
factors for further exploration. Bimodality, which is
covariate-independent, reflects the distribution’s modality, where high
values indicate the presence of two distinct cell populations with high
and low factor scores. Effect size, also covariate-independent,
represents the distribution’s spread and variance. Higher effect size
indicates higher signal strength. Homogeneity is covariate-dependent
and measures the relative spread of the data specific to a single
covariate compared to the total spread. Homogeneity values closer to
one suggest cells are well-mixed with respect to the covariate along
the factor’s axis. These metrics guide the identification of
interpretable factors from a set of K factors generated by any
factorization algorithm. However, factor distributions may show
patterns beyond these metrics, and even one high metric value can
justify follow-up interpretation, guided by enriched genes and pathways
or expert knowledge.
Modular design for flexible usage
sciRED’s modular design allows integration with external data
processing and normalization methods. Preprocessed data can be input
directly into step 2 (factor identification). Additionally, if external
matrix factorization has already been performed, the initial steps can
be bypassed, with factors and covariates imported directly for
factor-covariate association analysis.
Data Preprocessing
The scMixology^[195]17 dataset includes three human lung adenocarcinoma
cell lines: HCC827, H1975, and H2228. These cell lines were cultured
individually and subsequently processed. Single cells from each cell
line were combined in equal proportions and libraries were generated
using three protocols: CEL-seq2, Drop-seq, and 10x Genomics Chromium.
The processed count data was obtained using the scPipe package in R and
converted to .h5ad for import into Python. The data underwent log
normalization and standardization using the “StandardScaler” function
from the scikit-learn package.
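A minimal sketch of this preprocessing, assuming a cells-by-genes count
matrix and log1p as the log normalization:

```python
# Log-normalize and standardize a count matrix (illustrative; log1p is an assumption).
import numpy as np
from sklearn.preprocessing import StandardScaler

counts = np.random.poisson(1.0, size=(500, 2000)).astype(float)  # placeholder counts (cells x genes)
log_norm = np.log1p(counts)                                      # log normalization
scaled = StandardScaler().fit_transform(log_norm)                # per-gene standardization
```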
The stimulated PBMC data^[196]26 includes 10x Genomics droplet-based
scRNA-seq PBMC data from eight lupus patients before and after
6 h of treatment with interferon-beta. Count data was extracted and
analyzed using sciRED (number of components k = 30) while modeling
library size as a technical covariate. Three outlier cells were removed
from the sciRED cell-by-factor score matrix, and factors F2 and F9
scores for 29,062 cells were visualized in Fig. [197]3.
The healthy human kidney map was constructed based on 19 living donors
(10 female, 9 male)^[198]28 including the transcriptomes of 27,677
cells. Filtered and normalized data was downloaded and analyzed using
sciRED (k = 30).
The healthy rat total liver homogenate map includes four whole livers
acquired from 8-10-week-old healthy male Dark Agouti (DA) and Lewis
(LEW) strain rats; the resulting total liver homogenates underwent
two-step collagenase digestion and 10x Genomics droplet-based
scRNA-seq^[199]29. Five outlier cells were removed from the score
matrix, and factors F6 and F20 scores for 23,036 cells were visualized
in Fig. [200]4. sciRED (k = 30) was applied to the count data while
modeling library size as a technical covariate. Sample was not included
as a technical covariate to preserve the strain-specific variations.
SoupX^[201]31 software (version: 1.6.2) was used to identify genes with
the greatest contribution to the ambient RNA. We used the default
automatic contamination fraction estimation (Rho) feature in the SoupX
(autoEstCont function) to estimate Rho for each sample included in the
total liver homogenate map of healthy rat livers. Subsequently, we
extracted the estimated ambient RNA profile and identified the top 50
genes contributing the most to the ambient RNA in each sample. Genes
were selected if they ranked among the high-scoring ambient RNA
contributors in at least two samples. These selected genes were
assessed for their presence among the strain-specific myeloid markers
identified using both sciRED and standard differential expression
methods. Differential expression analysis between the DA and LEW
strains within the myeloid population (cluster 5) of the rat liver map
was conducted using Seurat’s FindMarkers function with default
parameters (logfc.threshold = 0.1, min.pct = 0.01, min.cells.feature =
3, and min.cells.group = 3), implementing the non-parametric Wilcoxon
rank-sum test.
The healthy human liver map^[202]32 includes 8,444 parenchymal and
non-parenchymal cells obtained from the fractionation of fresh hepatic
tissue from five human livers. The liver tissue was obtained from
livers procured from deceased donors deemed acceptable for liver
transplantation. sciRED (k = 30) was applied to the filtered count
data while modeling library size as a technical covariate.
The human lung transplant dataset^[203]27 includes donor lung biopsies
from six transplant cases and over 108,000 cells. We performed
subsampling on both genes and cells to systematically assess sciRED’s
runtime across different dataset sizes.
The two filtered 10k PBMC datasets, profiled using 10x Genomics
single-cell 3’ and 5’ gene expression libraries, were downloaded from
the 10x Genomics website and analyzed directly using sciRED.
We applied sciRED to spatial transcriptomic data from Maynard et al.
^[204]83, focusing on identifying spatial gene expression patterns
within the six-layered human dorsolateral prefrontal cortex (DLPFC).
Specifically, we selected Visium samples from two subjects (Br8100 and
Br5292), with four samples per subject. The data were subsetted by
subject. sciRED’s Poisson GLM was applied to adjust for library size.
Reporting summary
Further information on research design is available in the [205]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[206]Supplementary Information^ (4.6MB, pdf)
[207]Reporting Summary^ (349.8KB, pdf)
[208]Transparent Peer Review file^ (844KB, pdf)
Source data
[209]Source Data^ (11.3MB, xlsx)
Acknowledgements