Abstract

   The advent of single-cell RNA sequencing (scRNA-seq) technologies has
   revolutionized transcriptomic studies. However, large-scale integrative
   analysis of scRNA-seq data remains a challenge largely due to unwanted
   batch effects and the limited transferabilty, interpretability, and
   scalability of the existing computational methods. We present
   single-cell Embedded Topic Model (scETM). Our key contribution is the
   utilization of a transferable neural-network-based encoder while having
   an interpretable linear decoder via a matrix tri-factorization. In
   particular, scETM simultaneously learns an encoder network to infer
   cell type mixture and a set of highly interpretable gene embeddings,
   topic embeddings, and batch-effect linear intercepts from multiple
   scRNA-seq datasets. scETM is scalable to over 10^6 cells and confers
   remarkable cross-tissue and cross-species zero-shot transfer-learning
   performance. Using gene set enrichment analysis, we find that
   scETM-learned topics are enriched in biologically meaningful and
   disease-related pathways. Lastly, scETM enables the incorporation of
   known gene sets into the gene embeddings, thereby directly learning the
   associations between pathways and topics via the topic embeddings.

   Subject terms: Functional clustering, Machine learning
     __________________________________________________________________

   Computational single-cell RNA-seq analyses often face challenges in
   scalability, model interpretability, and confounders. Here, we show a
   new model to address these challenges by learning meaningful embeddings
   from the data that simultaneously refine gene signatures and cell
   functions in diverse conditions.

Introduction

   Advances in high-throughput sequencing technologies^[33]1 provide an
   unprecedented opportunity to profile the individual cells’
   transcriptomes across various biological and pathological conditions,
   and have spurred the creation of several atlas projects^[34]2–[35]5.
   Emerged as a key application of single-cell RNA sequencing (scRNA-seq)
   data, unsupervised clustering allows for cell-type identification in a
   data-driven manner. Flexible, scalable, and interpretable computational
   methods are crucial for unleashing the full potential from the wealth
   of single-cell datasets and translating the transcription profiles into
   biological insights. Despite considerable progress made on clustering
   method development for scRNA-seq data analysis^[36]6–[37]16, several
   challenges remain.

   First, compared to bulk RNA-seq, scRNA-seq data commonly exhibit higher
   noise levels and drop-out rates, where the data only captures a small
   fraction of a cell’s transcriptome^[38]17. Changes in gene expression
   due to experimental design, often referred to as batch effects^[39]18,
   can have a large impact on clustering^[40]12,[41]18–[42]20. If not
   properly addressed, these technical artefacts may mask the true
   biological signals in cell clustering.

   Second, the partitioning of the cell population alone is insufficient
   to produce biological interpretation. The annotations of the cell
   clusters require extensive manual literature search in practice and the
   annotation quality may be dependent on users’ domain knowledge^[43]20.
   Therefore, an interpretable and flexible model is needed. In the
   current work, we consider model interpretability as whether the model
   parameters can be directly used to associate the input features with
   latent factors or target outcomes. Latent topic models are a popular
   approach in mining genomic and healthcare data^[44]21–[45]23 and are
   increasingly being used in the scRNA-seq literature^[46]24.
   Specifically, in topic modeling, we infer the topic distribution for
   both the samples and genomic features by decomposing the
   samples-by-features matrix into samples-by-topics and
   topics-by-features matrices, which also be viewed as a probabilistic
   non-negative factorization (NMF)^[47]25. Importantly, the top genes
   under each latent topic can reveal the gene signatures for specific
   cellular programs, which may be shared across cell types or exclusive
   to a particular cell type. Traditionally, the latter are detected via
   differential expression analysis at individual gene levels, which has
   limited statistical power in scRNA-seq data analysis because of the
   sparse gene counts, small number of unique biological samples, and the
   burdens of multiple testings.

   Third, model transferrability is an important consideration. We
   consider a model as transferable if the learned knowledge manifested as
   the model parameters could benefit future data modeling. In the context
   of scRNA-seq data analysis, it translates to learning feature
   representations from one or more large-scale reference datasets and
   applying the learned representations to a target dataset^[48]26,[49]27.
   If the model is not further trained on the target dataset, the learning
   setting called zero-shot transfer learning. A model that can
   successfully separate cells of distinct cell types that are not present
   in the reference dataset implies that the model has learned some
   meaningful abstraction of the cellular programs from the reference
   dataset such that it can generalize to annotating new cell types of
   different kinds. An analogy would be that someone who has learned how
   to distinguish triangles from rectangles may also be able to
   distinguish squares from circles. As the number and size of scRNA-seq
   datasets continue to increase, there is an increasingly high demand for
   efficient exploitation and knowledge transfer from the existing
   reference datasets.

   Several recent methods have attempted to address these challenges.
   Seurat^[50]7 uses canonical correlation analysis to project cells onto
   a common embedding, then identifies, filters, scores, and weights
   anchor cell pairs between batches to perform data integration.
   Harmony^[51]28 iterates between maximum diversity clustering and a
   linear batch correction based on the mixture-of-experts model.
   Scanorama^[52]10 performs all-to-all dataset matching by querying
   nearest neighbors of a cell among all remaining batches, after which it
   merges the batches with a Gaussian kernel to form a single-cell
   panorama. These methods rely on feature (gene) selection and/or
   dimensionality reduction methods; otherwise they can not scale to
   compendium-scale reference^[53]2 or cohort-level single-cell
   transcriptome data^[54]29 or are sensitive to the noise inherent to
   scRNA-seq count data. They are also non-transferable, meaning that the
   knowledge learned from one dataset cannot be easily transferred through
   model parameter sharing to benefit the modeling of another dataset. NMF
   approaches such as UNCURL^[55]30 works only with one scRNA-seq dataset.
   LIGER^[56]9 uses integrative NMF to jointly factorize multiple
   scRNA-seq matrices across conditions using genes as the common axis,
   linking cells from different conditions by a common set of latent
   factors also known as metagenes. LIGER is weakly transferable in the
   sense that the global metagenes-by-genes matrix can be recycled as
   initial parameters when modeling a new target dataset, whereas the
   cells-by-metagenes and the final metagenes-by-genes must be further
   computed and updated by iterative numerical optimization to fit the
   target dataset.

   Deep-learning approaches, especially autoencoders, have demonstrated
   promising performance in scRNA-seq data modeling. scAlign^[57]15 and
   MARS^[58]31 encode cells with non-linear embeddings using autoencoders,
   which is naturally transferable across datasets. While scAlign
   minimizes the distance between the pairwise cell similarity at the
   embedding and original space, MARS looks for latent landmarks from
   known cell types to infer cell types of unknown cells. Variational
   autoencoders (VAE)^[59]32 is an efficient Bayesian framework for
   approximating intractable posterior distribution using proposed
   distribution parameterized by neural networks. Several recent studies
   have tailored the original VAE framework towards modeling single-cell
   data. Single-cell variational inference (scVI)^[60]6 models library
   size and takes into account batch effect in generating cell embeddings.
   scVAE-GM^[61]11 changed the prior distribution of the latent variables
   in the VAE from Gaussian to Gaussian mixture model, adding a
   categorical latent variable that clusters cells. Lotfollahi et
   al.^[62]26 developed a VAE model called scGen to infer the expression
   difference due to perturbation conditions by latent space
   interpolation. A key drawback for these VAE models is the lack of
   interpretability—post hoc analyses are needed to decipher the learned
   model parameters and distill biological meaning from the learned
   network parameters. To improve interpretability, Svensson et al.^[63]14
   developed a linear decoded VAE (hereafter referred to as scVI-LD) as a
   part of the scVI software.

   In this paper, we present single-cell Embedded Topic Model (scETM), a
   generative topic model that facilitates integrative analysis of
   large-scale single-cell transcriptomic data. Our key contribution is
   the utilization of a transferable neural-network-based encoder while
   having an interpretable linear decoder via a matrix tri-factorization.
   The scETM simultaneously learns the encoder network parameters and a
   set of highly interpretable gene embeddings, topic embeddings, and
   batch-effect linear intercepts from scRNA-seq data. The flexibility and
   expressiveness of the encoder network enable scETM to model extremely
   large scRNA-seq datasets without the need of feature selection or
   dimension reduction. By tri-factorizing cells-genes matrix into
   cells-by-topics, topics-by-embeddings, and embeddings-by-genes, we are
   able to incorporate existing pathway information into gene embeddings
   during the model training to further improve interpretability. This is
   a salient feature compared to related methods such as scVI-LD. It
   allows scETM to simultaneously discover interpretable cellular
   signatures and gene markers while integrating scRNA-seq data across
   conditions, subjects and/or experimental studies.

   We demonstrate that scETM offers state-of-the-art performance in
   clustering cells into known cell types across a diverse range of
   datasets with desirable runtime and memory requirements. We also
   demonstrate scETM’s capability of effective knowledge transfer between
   different sequencing technologies, between different tissues, and
   between different species. We then use scETM to discover biologically
   meaningful gene expression signatures indicative of known cell types
   and pathophysiological conditions. We analyze scETM-inferred topics and
   show that several topics are enriched in cell-type-specific or
   disease-related pathways. Finally, we directly incorporate known
   pathway-gene relationships (pathway gene set databases) into scETM in
   the form of gene embeddings, and use the learned pathway-topic
   embedding to show the pathway-informed scETM (p-scETM)’s capability of
   learning biologically meaningful information.

Results

scETM model overview

   We developed scETM to model scRNA-seq data across experiments or
   studies, which we term as batches (Fig. [64]1a and Supplementary
   Fig. [65]1a). Adapted from the Embedded Topic Model (ETM)^[66]33, scETM
   inherits the benefits of topic models, and is effective for handling
   large and heavy-tailed distribution of word frequency. In the context
   of scRNA-seq data analysis, each sampled single-cell transcriptome is
   provided as a vector of normalized gene counts to a two-layer
   fully-connected neural-network (i.e., encoder; see detailed
   architecture in Supplementary Fig. [67]1c), which infers the topic
   mixing proportions of the cell. The trained encoder on a reference
   scRNA-seq data can be used to infer topic mixture of unseen scRNA-seq
   data collected from different tissues or species (Fig. [68]1b).

Fig. 1. scETM model overview.

   [69]Fig. 1
   [70]Open in a new tab

   a scETM training. Given as input the scRNA-seq data matrices across
   multiple experiments or studies (i.e., batches), scETM models the
   single-cell transcriptomes using an embedded topic-modeling approach.
   Each scRNA-seq profile serves as an input to a variational autoencoder
   (VAE) as the normalized gene counts. The encoder network produces a
   stochastic sample of the latent topic mixture (θ[s,d] for batch
   s = 1, …, S and cell d = 1, …, N[s]), which can be used for clustering
   cells (see panel b). The linear decoder learns topic embedding and gene
   embedding, which can be used to analyze cellular programs via
   enrichment analyses (see panel c). b Workflow used to perform zero-shot
   transfer learning. The trained scETM-encoder on a reference scRNA-seq
   dataset is used to infer the cell topic mixture θ^* from an unseen
   scRNA-seq dataset without training them. The resulting cell mixtures
   are then visualized via UMAP visualization and evaluated by standard
   unsupervised clustering metrics using the ground-truth cell types. c
   Exploring gene embeddings and topic embeddings. As the genes and topics
   share the same embedding space, we can explore their connections via
   UMAP visualization or annotate each topic via enrichment analyses using
   known pathways.

   For interpretability, we use a linear decoder with the gene and topic
   embeddings as the learnable parameters. Specifically, we factorize the
   cells-by-genes count matrix into a cells-by-topics matrix θ (inferred
   by the encoder), topics-by-embedding α, and embedding-by-genes ρ
   matrices (Supplementary Fig. [71]1b). This tri-factorization design
   allows for exploring the relations among cells, genes, and topics in a
   highly interpretable way. To account for biases across conditions or
   subjects, we introduce an optional batch correction parameter λ, which
   acts as a linear intercept term in the categorical softmax function to
   alleviate the burden of modeling batch effects from the encoder to let
   it focus on inferring biologically meaningful cell topic mixture θ[d].
   The encoder and embedding learning is performed by an amortized
   variational inference algorithm to maximize the evidence lower bound
   (ELBO) of the marginal categorical likelihood of the scRNA-seq
   counts^[72]32. Compared to scVI-LD^[73]14, the linear decoder component
   that learns a common embeddings for both topics and genes offers more
   flexibility and interpretability and overall better performance as we
   demonstrate next (Fig. [74]1c). Details of the scETM algorithm and
   implementation are described in Methods.

Data integration

   We benchmarked scETM, along with seven state-of-the-art single-cell
   clustering or integrative analysis methods—scVI^[75]6, scVI-LD^[76]14,
   Seurat v3^[77]7, scVAE-GM^[78]11, Scanorama^[79]10, Harmony^[80]28 and
   LIGER^[81]9, on six published datasets, namely Mouse Pancreatic Islet
   (MP)^[82]34, Human Pancreatic Islet (HP)^[83]7, Tabula Muris
   (TM)^[84]3, Alzheimer’s Disease dataset (AD)^[85]35, Major Depressive
   Disorder dataset (MDD)^[86]29, and Mouse Retina (MR)^[87]36,[88]37
   (Supplementary Methods). Across all datasets, scETM stably delivered
   competitive results especially among the transferable and interpretable
   models, while others methods fluctuate across different datasets in
   terms of Adjusted Rand Index (ARI) and Normalized Mutual Information
   (NMI) (Table [89]1; Supplementary Table [90]1). Overall, Harmony and
   Seurat have slightly higher ARIs than scETM, with trade-offs of model
   transferrability, interpretability, and/or scalability, which we
   investigate in the following sections.

Table 1.

   Model properties and unsupervised clustering performance on data
   integration tasks.
               Transferable Interpretable   MP    HP    TM    AD    MDD    MR
  Harmony                                 0.969  0.955 0.705 0.994 0.784 0.763
  Scanorama                               0.915  0.859 0.542 0.997 0.587 0.780
  Seurat                                  0.944  0.968 0.676 0.991 0.550 0.781
  scVAE-GM     ✓                          0.805  NA    NA    0.997 0.563 0.778
  scVI         ✓                          0.932  0.759 0.670 0.991 0.541 0.783
  LIGER        ✓            ✓             0.914  0.911 0.591 0.894 0.704 0.714
  scVI-LD      ✓            ✓             0.875  0.656 0.608 0.989 0.666 0.718
  scETM        ✓            ✓             0.946  0.943 0.761 0.996 0.717 0.859
  scETM − λ    ✓            ✓             0.851  0.474 0.629 0.996 0.719 0.773
  scETM + adv  ✓            ✓             0.944  0.946 0.704 0.993 0.717 0.772
  Batch Effect                            Strain Tech. Tech. Ind.  Ind.  Studies
   [91]Open in a new tab

   The clustering performance is measured by adjusted rand index (ARI)
   between ground-truth cell types and Leiden^[92]75 clusters. scETM
   performances with or without the linear batch correction (scETM,
   scETM − λ) are both reported. scETM + adv is scETM plus adversarial
   network loss to further correct batch effects. Batch variables include
   strain, sequencing technologies (Tech.) and individuals (Ind.). NA is
   reported for models that did not converge. Experimental details are
   described in the “Methods” section. We ran Seurat on all datasets with
   integration turned on whenever applicable.

   We further experimented the same scETM without the batch correction
   term, namely scETM-λ. Compared to the λ-ablated model, the full scETM
   model confers higher ARI in 3 out of the 5 datasets (Table [93]1) and
   higher NMI in 4 out of the 5 datasets (Supplementary Table [94]1).
   Improvement over the Human Pancreas (HP) dataset is remarkably high,
   implying an effective correction of the confounder due to the scRNA-seq
   technology differences. We observe no improvement in the AD dataset in
   terms of both ARI and NMI and small improvement in MDD only in terms of
   NMI. This implies a lesser concern of batch effects from only the
   individual brain sample donors, with all data being collected by the
   same technology in a single study.

   We also evaluated the batch mixing aspect of scETM and other methods
   using k-nearest-neighbor Batch-Effect Test (kBET)^[95]18 (Table [96]2)
   and examined to what extent scETM’s batch mixing performance can be
   improved by introducing an adversarial loss term to scETM
   (Methods)^[97]38. Briefly, we used a discriminator network (a two-layer
   feed-forward network) to predict batch labels using the cell topic
   mixture embeddings generated by the encoder network, and directed the
   encoder network to fool the discriminator. We observe notable
   improvement on kBET with similar ARI and NMI scores (Table [98]1 and
   Supplementary Table [99]1, row "scETM+adv") at the cost of up to 50%
   more running time. This shows the expandability of scETM. For the
   subsequent analyses, we opted to use the results from scETM (without
   the adversarial loss but with the linear batch correction λ) because of
   its simpler design, scalability, comparable ARI scores, and less
   aggressive batch correction (see below).

Table 2.

   Batch correction performance on data integration tasks. The batch
   correction performance is measured by kBET^[100]18. More details are
   described in Table [101]1 caption and Methods.
                  MP       HP         TM         AD        MDD       MR
   Harmony      0.390  0.342      0.148      0.518      0.440      0.063
   Scanorama    0.439  0.155      0.001      0.286      0.202      0.063
   Seurat       0.538  0.381      0.281      0.195      0.085      0.053
   scVAE-GM     0.233  NA         NA         0.073      0.051      0.033
   scVI         0.516  0.140      0.056      0.434      0.313      0.073
   LIGER        0.602  0.660      0.374      0.771      0.716      0.176
   scVI-LD      0.148  0.034      0.069      0.476      0.277      0.022
   scETM+adv    0.546  0.443      0.144      0.627      0.570      0.141
   scETM        0.270  0.163      0.096      0.278      0.228      0.066
   scETM − λ    0.217  0.000      0.000      0.122      0.066      0.017
   Batch Effect Strain Technology Technology Individual Individual Studies
   [102]Open in a new tab

   scETM is also robust to architectural and hyperparameter changes,
   requiring very few or no architecture adaptation or hyperparameter
   tuning efforts when applied to unseen datasets (Supplementary
   Table [103]2). As a result, we used the same architecture and
   hyperparameters for all datasets in Table [104]1. We also performed a
   comprehensive ablation analysis to validate our model choices. The
   ablation experiment demonstrates the necessity of the key model
   components, such as the batch-effect correction factors λ and the batch
   normalization technique used in training the encoder. Normalizing gene
   expression scRNA-seq counts as the input to the encoder also improves
   the performance (Supplementary Table [105]3).

   Clustering agreement metrics are not the only metrics for evaluating
   scRNA-seq methods, and are not available to unannotated datasets.
   Therefore, we also evaluated the negative log-likelihood (NLL) on
   held-out samples, which is a principled way for model selection without
   labels. We computed the held-out (10%) NLL. We found that scETM is
   robust to different architectures in terms of the NLL (Supplementary
   Table [106]2). We also found that ARI and NLL are modestly negatively
   correlated on the TM dataset (Supplementary Fig. [107]2), implying an
   agreement between the two metrics although this might not be always the
   case since it highly depends on the cell type labels and the data
   quality.

   To further verify the clustering performance and validate our
   evaluation metrics, we visualized the cell embeddings using Uniform
   Manifold Approximation and Projection (UMAP)^[108]39 for some of the
   datasets (Supplementary Figs. [109]3,[110]4,[111]5,[112]6,[113]7,[114]8
   and Fig. [115]2). Altogether, these results support that scETM
   effectively captures cell-type-specific information, while accounting
   for artefacts arising from individual or technological variations.

Fig. 2. Integration and batch correction on the Mouse Retina dataset.

   [116]Fig. 2
   [117]Open in a new tab

   Each panel shows the Mouse Retina cell clusters using UMAP based on the
   cell embeddings obtained by each of the 9 methods. The cells are
   colored by cell types in the first two rows and by batches, which are
   the two source studies, in the last two rows. The ARI and kBET scores
   of each method are shown below each plot. UMAP visualization for the
   other 5 datasets are illustrated in Supplementary Figs. [118]3, [119]4,
   [120]5, [121]6, [122]7.

Batch overcorrection analysis

   Some methods may risk over-correcting batch effects and fail to capture
   some aspect of biological variations. In the above analysis, we observe
   that some methods such as LIGER conferred competitive kBET but low ARI,
   suggesting potential overcorrection of batch effects. To experiment the
   extent of batch overcorrection by each method, we conducted two
   experiments using two datasets, namely the Human Pancreas (HP)
   dataset^[123]7 and the Mouse Retina (MR) dataset^[124]36,[125]37.

   For the HP data, we manually removed beta cells from all 5 batches
   except for batch CelSeq2, resulting in the cell type distributions
   shown in Supplementary Table [126]4. We expect that, if a method is
   guilty of batch-effect overcorrection, it would assign beta cells to
   other non-beta cell clusters in the latent space by forcing the
   alignment of different batches. Consequently, such methods would have
   poor clustering scores. We evaluated all of the methods on this dataset
   using 3 metrics: ARI, kBET, and average silhouette width (ASW)^[127]40.
   Briefly, higher ASW indicates larger distances between cell types and
   lower distance within cell type in the cell embedding space (Methods).
   We measured the overall ASW as well as the ASW for only the beta cells
   (i.e., ASW-beta). The results are summarized in Supplementary
   Table [128]5. We observed that scETM struck a good balance between
   discriminating cell types (ARI: 0.9298; ASW: 0.3525; ASW-beta: 0.5370)
   and integrating different batches (kBET: 0.1247). Adding the
   adversarial loss to the scETM (i.e., scETM+adv) increased kBET from
   0.1247 to 0.3445 (while maintaining ARI above 0.92) but greatly
   compromised ASW-beta (0.0045), suggesting a more aggressive
   overcorrection. Similarly, LIGER performed the best in kBET (0.5978)
   but conferred a much lower ARI (0.8476) and low ASW-beta (0.0912),
   indicating a more severe overcorrection of the batch effects (i.e.,
   mixing beta cells with other cells).

   We then visualized the clustering by UMAP to examine how the beta cells
   are assigned to different clusters (Supplementary Fig. [129]9). We
   found that beta cells (colored in red) were clustered separately by
   scETM from other cell types. In contrast, beta cells were mixed up with
   other cell types by methods including Harmony and LIGER, which
   overcorrected the batch effects when integrating the five batches.
   Visually, we also observe that scETM+adv method moves the beta cell
   cluster closer to the alpha cell cluster, confirming a higher level of
   overcorrection compared to scETM.

   The MR dataset is a collection of two independent studies on mouse
   retina^[130]36,[131]37. Here we consider the two source studies as two
   batches, hereafter referred to as the Macosko batch and the Shekhar
   batch. Many cell types are uniquely present in the Macosko batch
   (Supplementary Table [132]6). There is also a large difference in the
   cell proportion between the two batches. In particular, rods is only
   0.35% in Shekhar but 65% in Macosko. In this scenario, we expect that
   methods that overcorrect the batch effects would tend to mix rods with
   cells of other cell types from the Shekhar batch, resulting in low ARI
   and high kBET. On the contrary, a desirable integration method would
   strike a balance between the ARI (or ASW) and kBET on this combined
   dataset. Therefore, this setup imposes a great challenge on the
   integration methods.

   Overall, scETM achieved the highest ARI (0.859), reasonable ASW
   (0.2873), and modest kBET (0.0656), indicating its ability to capture
   the true biology from the data without over-correcting the batch
   effects (Supplementary Table [133]7). In contrast, LIGER is more
   aggressive in its batch correction, resulting in the highest kBET score
   of 0.176 but lowest ARI score of 0.714. We further investigated the
   extent of improving kBET while maintaining a high ARI score with
   scETM+adv. Indeed, scETM+adv conferred an increased kBET of 0.1410 and
   a reasonably high ARI score of 0.7720. Visualizing the clustering of
   each method using UMAP (Fig. [134]2) confirms the quantitative
   clustering results.

   Incidentally, we also notice that scETM is not sensitive to the 669
   doublelets or contaminants, all of which were from the Shekhar batch
   (Supplementary Table [135]8). In contrast, if we did not filter out the
   doublets/contaminants, the performance of LIGER and Seurat degrades
   drastically possibly due to batch overcorrection or failing to
   integrate the same cell types from different batches together.

Scalability

   A key advantage of scETM is its high scalability and efficiency. We
   demonstrated this by comparing the run-time, memory usage, and
   clustering performance of the state-of-the-art models using their
   recommended pipelines when integrating a merged dataset consisting of
   cells from the MDD and AD datasets (Methods). Because of the simple
   model design and efficient implementation (e.g., sparse matrix
   representation, multi-threaded data retrieval, etc; Discussion), scETM
   achieved the shortest run-time among all deep-learning based models
   (Fig. [136]3a). Specifically, on the largest dataset (148,247 cells),
   scETM ran 3–4 times faster than scVI and scVI-LD, and over 10 times
   faster than scVAE-GM. We note that the run-time largely depends on the
   implementation rather than the network architectures and loss functions
   in these deep-learning methods. Harmony and Scanorama were the only
   methods faster than scETM, yet they both operate on a hundred principal
   components at most. Although for comparison purpose we used the top
   3000 most variable genes for all of the methods, scETM can easily scale
   to all of the genes, which is more desirable because the resulting
   model can generalize to other datasets.

Fig. 3. Benchmark of the efficiency and scalability of the seven scRNA-seq
clustering algorithms.

   [137]Fig. 3
   [138]Open in a new tab

   The line styles in the plot indicate model inputs. The number of genes
   was fixed to 3000 in this experiment. We increased the number of cells
   randomly sampled from the combined AD and MDD dataset and evaluated the
   performance of each method in terms of: a runtime, b memory usage, and
   c adjusted rand index (ARI). The run-time of scETM on CPU is annotated
   on the left panel.

   Because of the amortized stochastic variational
   inference^[139]32,[140]41,[141]42, scETM in principle takes linear
   run-time and constant memory with respect to the sample size per
   training epoch. The use of multi-threaded data loader to streamline the
   random minibatch retrieval and loading further speed-up the training
   process in practice. In contrast, the memory requirement of Seurat
   increases rapidly with the number of cells, due to the vast numbers of
   plausible anchor cell pairs in the two brain datasets (Fig. [142]3b).
   In terms of clustering accuracy, scETM consistently confers competitive
   performance, whereas Harmony and Scanorama perform unstably as dataset
   sizes vary (Fig. [143]3c). UMAP visual inspection of scVAE embeddings
   suggests that scVAE likely suffers from under-correction of batch
   effects (Supplementary Fig. [144]8). The sudden drop of LIGER’s
   clustering performance in the largest benchmark dataset may be due to
   batch overcorrection.

   Although it has been widely accepted by the deep-learning community
   that computing using Graphical Processing Units (GPUs) results
   in ~ 10 × speed-up over computing using CPU, the adoption of GPUs in
   the computational biology community is beginning to catch up. In our
   non-exhaustive experiment on the Mouse Pancreas dataset, training scETM
   for 1000 steps on the 6-core Core i7 10750H CPU requires 650 s
   (Fig. [145]3a), while with an Nvidia RTX 2070 laptop GPU it only takes
   50 s—a 13 × speed-up over the CPU computer.

Transfer learning across single-cell datasets

   A prominent feature of scETM is that its parameters, hence the
   knowledge of modeling scRNA-seq data, are transferable across datasets.
   Specifically, as part of the scETM, the encoder trained on a reference
   scRNA-seq dataset can be applied to infer cell topic mixture of a
   target scRNA-seq dataset (Fig. [146]1b), regardless of whether the two
   datasets share the same cell types. As an example, we trained an scETM
   model on the Tabula Muris FACS dataset (TM (FACS)) (which is a subset
   of the TM dataset) from a multi-organ mouse single-cell atlas, and
   evaluated it using the MP data, which only contains mouse pancreatic
   islet cells. Although the two datasets were obtained using different
   sequencing technologies in two independent studies, the model yielded
   an encouragingly high ARI score of 0.941, considering that a model
   directly trained on MP achieves ARI 0.946. Interestingly, in the UMAP
   plot, the TM (FACS)-pretrained model placed B cells, T cells and
   macrophages far away from other clusters and separated B cells and T
   cells from macrophages, which is not observed in the model directly
   trained on MP (Supplementary Figs. [147]3,[148]4,[149]5,[150]6,[151]7;
   Supplementary Table [152]9). We repeated the same experiment three
   times with different random seeds and observed consistently that B and
   T cells are close to each other and distant from macrophages
   (Supplementary Fig. [153]10). We also experimented transfer learning by
   first training scETM on TM (FACS) with pancreas removed and then
   applied to MP dataset. The performance decreased but is still
   reasonably good (Supplementary Table [154]9), demontrasting scETM’s
   ability to transfer knowledge across tissues.

   Encouraged by the above results, we then performed a comprehensive set
   of cross-tissue and cross-species transfer-learning analysis with 6
   tasks (Methods): (1) Transfer between the TM (FACS) and the MP dataset
   (including MP → TM (FACS)); (2) Transfer between the Human Pancreas
   (HP) dataset and the Mouse Pancreas (MP) dataset; (3) Transfer between
   the Human primary motor cortex (M1C) (HumM1C) dataset and the Mouse
   primary motor area (MusMOp) dataset both obtained from the Allen Brain
   Map data portal^[155]43. We chose to transfer between the human M1C and
   mouse MOp because of the high number of shared cell types between the
   brain regions of the two species. The batches for HumM1C are the two
   post-mortem human brain M1 specimens and the two mice for MusMOp. Note
   that in these transfer-learning tasks (A → B) we only corrected batch
   effects during the training on the source data A but not during the
   transfer to the target data B.

   As a comparison, we evaluated and visualized the clustering results in
   all 6 transfer-learning tasks using scETM, scVI-LD, and scVI
   (Fig. [156]4; Supplementary Tables [157]10 and [158]11). Overall, scETM
   achieved the highest ARI across all tasks and competitive kBET scores.
   In particular, scETM trained on TM (FACS) on heterogeneous tissues
   clustered much better the MP cells (ARI: 0.941; kBET: 0.339) than scVI
   (ARI: 0.484; kBET: 0.257) and scVI-LD (ARI: 0.398; kBET: 0.256).
   Remarkably, scETM trained only on the MP dataset can cluster reasonably
   well the much larger TM single-cell data, which were collected from
   diverse primary tissues including pancreas. This implies that scETM
   does not merely learn cell-type-specific signatures but also the
   underlying transcriptional programs that are generalizable to unseen
   tissues.

Fig. 4. Cross-tissue and cross-species zero-shot transfer learning.

   [159]Fig. 4
   [160]Open in a new tab

   Each panel displays the UMAP visualization of cells by training on
   dataset A and the applied to dataset B (i.e., A → B). In total, we
   performed 6 transfer-learning tasks: TM (FACS) ↔ MP, HP (inDrop) ↔ MP,
   MusMOp ↔ HumM1C. For each task, we evaluated scETM, scVI-LD, or scVI,
   which are the rows in the above figure. The cells are colored by
   tissues or cell types as indicated in the legend. The corresponding ARI
   and kBET are indicated below each panel. For MP → TM (FACS), we
   evaluated the ARI based on the 92 cell types although we colored the
   cells by the 20 tissues of origin instead of their cell types because
   of the large number of cell types. Abbreviations: TM (FACS): tabula
   muris sequenced with fluorescence-activated single-cell sorting; MP:
   mouse pancreas; HP (inDrop): human pancreas sequenced with InDrop
   technology; HumM1C: human primary motor cortex (from Allen Brain map);
   MusMOp: mouse primary motor area (from Allen Brain map).

   In cross-species transfer learning between HP and MP, scETM captured
   better the conserved pancreas functions compared to scVI and scVI-LD
   (Fig. [161]4; Supplementary Table [162]10). On the other hand,
   cross-species transfer between MusMOp and HumM1C is a much more
   challenging task due to the evolutionarily divergent functions of the
   brains between the two species. Nonetheless, scETM conferred a much
   higher ARI of 0.696 for the MusMOp → HumM1C transfer and ARI of 0.167
   for the HumM1C → MusMOp transfer. In contrast, scVI-LD and scVI did not
   work well on these tasks with ARI scores lower than 0.1. Since they
   cannot separate cells by cell types, all cells are mixed together,
   leading to a high kBET score. The improvements achieved by scETM over
   scVI(-LD) are possibly attributable to the simpler linear batch
   correction on the source data, jointly learning topic and gene
   embedding, and the topic-modeling formalism, which together lead to an
   encoder network that is better at capturing the transferable cellular
   programs.

Pathway enrichment analysis of scETM topics

   We next investigated whether the scETM-inferred topics are biologically
   relevant in terms of known gene pathways in human. One approach would
   be to arbitrarily choose a number of top genes under each topic and
   test for pathway enrichment using hypergeometric tests. This approach
   works well when there are asymptotic p-values at the individual gene
   level. In our case, each gene is characterized by the topic scores,
   making it difficult to systematically choose the number of top genes
   per topic. To this end, we resorted to Gene Set Enrichment Analysis
   (GSEA)^[163]44. Briefly, we calculated the maximum running sum of the
   enrichment scores with respect to a query gene set by going down the
   gene list that is sorted in the decreasing order by a given topic
   distribution β[k] (Methods). For each dataset, we trained a scETM with
   100 topics.

   For the HP dataset, each topic detected many significantly enriched
   pathways with Benjamini–Hochberg False Discover Rate (FDR) < 0.01
   (Fig. [164]5a). Many of them are relevant to pancreas functions,
   including insulin processing (Fig. [165]5b), insulin receptor
   recycling, insulin glucose pathway, pancreatic cancer, etc
   (Supplementary Table [166]12). As scETM jointly learns both the gene
   embeddings and topic embeddings, we can visualize both the genes and
   topics in the same embedding space via UMAP (Fig. [167]5c). Indeed, we
   observe a strong co-localization of the genes in Insulin Processing
   pathway and the corresponding enriched topic (i.e., Topic 54).

Fig. 5. Gene set enrichment analysis of the Human Pancreas dataset.

   [168]Fig. 5
   [169]Open in a new tab

   a Manhattan plot of the GSEA results on the 100 scETM topics learned
   from the HP dataset. The x-axis is the topic and the y-axis is the -log
   q-value from the permutation test corrected for multiple testings by
   the BH-method. The dots correspond to the tested GSEA gene sets. The
   maximum -log q-value is capped at 5. b Leading edge analysis of Insulin
   Processing (IP) pathway using Topic 54. The genes are ranked by the
   topic score (i.e., the unnormalized topic mixture beta) under Topic 54.
   The running sum enrichment score was calculated by GSEA. The black bars
   in the middle indicate the genes that are found in the IP pathway.
   Topic 54 is significantly enriched in Insulin Processing pathway (GSEA
   permutation test BH-adjusted q-value = 0). c UMAP visualization of the
   gene and topic embeddings learned from the HP dataset. Genes in IP are
   colored in red and Topic 54 in blue. The inset box displays a magnified
   view of the cluster zooming into the IP pathway genes (including
   Insulin (INS)) that are near Topic 54 in the embedding space.

   For the AD dataset, we found topics enriched for Reactome Amyloid Fiber
   Formation, KEGG AD, and Deregulated CDK5 triggers multiple
   neurodegenerative in AD (Supplementary Fig. [170]11 and Supplementary
   Table [171]13). For MDD dataset, we found enrichment for substance/drug
   induced depressive disorder (Supplementary Figs. [172]12, [173]13 and
   Supplementary Table [174]14). The full GSEA enrichment results for all
   3 datasets are listed in Supplementary Data [175]1. As a comparison, we
   also performed GSEA over the 100 gene loadings learned by scVI-LD
   (matching the 100 topics in the scETM) on these 3 datasets but found
   fewer relevant distinct gene sets (Supplementary Tables [176]12,[177]
   13 and Supplementary Table [178]14) or weaker statistical enrichments
   by GSEA (Supplementary Fig. [179]14).

Differential scETM topics in disease conditions and cell types

   We sought to discover scETM topics that are condition-specific or
   cell-type specific. Starting with the AD dataset, we found that the
   scETM-learned topics are highly selective of cell-type marker genes
   (Fig. [180]6a) and highly discriminative of cell types (Fig. [181]6b).
   To detect disease signatures, we separated the cells into the ones
   derived from the 24 AD subjects and the ones from the 24 control
   subjects. We then performed permutation tests to evaluate whether the
   two cell groups exhibit significant differences in terms of their topic
   expression (Methods). Topic 12 and 58 are differentially expressed in
   the AD cells and control cells (Fig. [182]6c, d; permutation test
   p-value = 1e-5). Interestingly, topic 58 is also highly enriched for
   mitochondrial genes. Indeed, it is known that β-amyloids selectively
   build up in the mitochondria in the cells of AD-affected
   brains^[183]45. For the MDD dataset, topics 1, 52, 68, 70, 86 exhibit
   differential expressions between the suicidal group and the healthy
   group (Supplementary Fig. [184]18c) and interesting neurological
   pathway enrichments (Supplementary Table [185]14).

Fig. 6. scETM topic embeddings of the Alzheimer’s Disease snRNA-seq dataset.

   [186]Fig. 6
   [187]Open in a new tab

   a Gene-topic heatmap. The top genes that are known as cell-type marker
   genes based on PanglaoDB are highlighted. For visualization purposes,
   we divided the topic values by the maximum absolute value within the
   same topic. Only the differential topics with respect to cell-type or
   AD were shown. b Topics intensity of cells (n = 10,000) sub-sampled
   from the AD dataset. Topic intensities shown here are the Gaussian mean
   before applying softmax. Only the select topics with the sum of
   absolute values >1500 across all sampled cells are shown. The three
   color bars show disease conditions, cell types, and batch identifiers
   (i.e., subject IDs). c Differential expression analysis of topics
   across the eight cell types and two clinical conditions. Colors
   indicate mean differences between cell groups with and without a
   certain label (cell-type or condition). Asterisks indicate
   Bonferroni-corrected empirical p-value < 0.05 for 100,000 one-sided
   permutation tests of upregulated topics in each cell-type and disease
   labels.

   We also identified several cell-type-specific scETM topics from the HP,
   AD, and MDD datasets. In HP, topics or metagenes 20, 45, 99 are
   upregulated in acinar cells, topic 12 upregulated in macrophage, topic
   52 upregulated in delta, and topics 30 and 37 are upregulated in more
   than one cell types, including endothelial, stellate and others
   (Supplementary Fig. [188]15). In AD, as shown by both the cell topic
   mixture heatmap and the differential expression analysis
   (Fig. [189]6b), topics 19, 35, 50, 69, 97 are upregulated in
   oligodendrocytes, micro/macroglia, astrocytes, endothelial cells, and
   oligodendrocyte progenitor cells (OPCs) respectively (permutation test
   p-value = 1e-5; Fig. [190]6b, Supplementary Fig. [191]16).
   Interestingly, two subpopulations of cells from the oligodendrocytes
   (Oli) and excitatory (Ex) exhibit high expression of topics 12 and 58,
   respectively, and are primarily AD cells (Supplementary Fig. [192]17).
   Among them, there is also a strong enrichment for the female subjects,
   which is consistent with the original finding^[193]35.

   For MDD, topics 1, 20, 59, 64, and 72 are upregulated in astrocytes,
   oligodendrocytes, micro/macroglia, endothelial cells, and OPCs,
   respectively (Supplementary Fig. [194]18c). This is consistent with the
   heatmap pattern (Supplementary Fig. [195]18b). Several topics are
   dominated by long non-coding RNAs (lincRNAs) (Supplementary
   Fig. [196]18a). While previous studies have suggested that lincRNAs can
   be cell-type-specific^[197]46, it remains difficult to interpret
   them^[198]47. We further experimented the enrichment using only the
   protein coding genes, but did not find significantly more marker genes
   among the top 10 genes per topic (Supplementary Fig. [199]19).

Pathway-informed scETM topics

   To further improve topic interpretability, we incorporated the known
   pathway information to guide the learning of the topic embeddings
   (Fig. [200]7a). We denoted this scETM variant as the pathway-informed
   scETM or p-scETM. In particular, we fixed the gene embedding ρ to a
   pathways-by-genes matrix obtained from pathDIP4 database^[201]48. We
   then learned only the topics embedding α, which provides direct
   associations between topics and pathways (Methods). We tested p-scETM
   on the HP, AD and MDD datasets. Without compromising the clustering
   performance (Supplementary Table [202]15), p-scETM learned functionally
   meaningful topic embeddings (Supplementary Fig. [203]20; Supplementary
   Tables [204]16 and [205]17). In the HP topic embeddings, we found
   Insulin Signaling, Nutrient Digestion and Metabolism to be the top
   pathways among several topics (Supplementary Fig. [206]20a). In the MDD
   topic embeddings, the top pathway associated with Topic 40, Beta-2
   Adrenergic Receptor Signaling, was also enriched in a MDD genome-wide
   association studies^[207]49. In the AD topic embeddings, we found the
   association between Topic 9 and Alzheimer Disease-Amyloid Secretase
   pathway.

Fig. 7. Pathway-topics embeddings learned by the pathway-informed scETM
(p-scETM).

   [208]Fig. 7
   [209]Open in a new tab

   a p-scETM overview. Pathways information as pathways-by-genes are
   provided as the gene embedding in the linear decoder. The learned topic
   embedding is the direct association between the topics and pathways. b
   The pathway-topics heatmap of top 5 pathways in selected topics. Here,
   the pathways are the Gene Ontology-Biological Processes (GO-BP) terms.
   For the HP dataset, GO-BP terms whose names include the keywords
   insulin or pancreatic were highlighted. For the AD dataset, GO-BP terms
   whose names include the keywords amyloid or alzheimer were highlighted.
   For MDD—all genes and MDD—coding genes only, GO-BP terms whose names
   include the keywords neuron or G-protein were highlighted.

   To further demonstrate the utility of p-scETM, we also used 7481 gene
   ontology biological process (GO-BP) terms^[210]50,[211]51 as the fixed
   gene embedding, which learns the topics-by-GOs topic embedding from
   each dataset. Under each topic, we selected the top 5 high-scoring
   GO-BP terms to examine their relevance to the target tissue or disease
   (Fig. [212]7b and Supplementary Data [213]2). For the HP dataset,
   Negative Regulation of Type B Pancreatic Cell (GO:2000675) and
   Regulation of Pancreatic Juice Secretion (GO:0090186) are among the top
   GO-BP terms for Topics 27 and 68, respectively. For the AD dataset,
   Amyloid Precursor Protein Biosynthetic Process (GO:0042983) is among
   the top 5 GO-BP terms under Topic 40. For the MDD dataset, similar top
   GO-BP terms were found among topics learned using all of the genes and
   using only the coding gene. Many topics exhibit high embedding scores
   for neuronal functions including Neuronal Signal Transduction
   (GO:0023041), Central Nervous System Projection Neuron Axonogenesis
   (GO:0021952), and Branchiomotor Neuron Axon Guidance (GO:0021785).
   Interestingly, Topic 98 in MDD—coding genes only and Topics 22, 51 in
   MDD—all genes involve Adenylate Cyclase Modulating G-protein Coupled
   Receptors (GPCRs) Signaling (GO:0007188), which is the target of
   several recently-developed antidepressant drugs^[214]52.

Discussion

   As scRNA-seq technologies become increasingly affordable and
   accessible, large-scale datasets have emerged. This challenges
   traditional statistical approaches and calls for transferable,
   scalable, and interpretable representation learning methods to mine the
   latent biological knowledge from the vast amount of scRNA-seq data. To
   address these challenges, we developed scETM and demonstrated its
   state-of-the-art performance on several unsupervised learning tasks
   across diverse scRNA-seq datasets. scETM demonstrates excellent
   capabilities of batch-effect correction and knowledge transfer across
   datasets.

   In terms of integrating multiple scRNA-seq data from different
   technologies, experimental batches, or studies, we introduce a simple
   batch-effect bias term to correct for non-biological effects. This in
   general improves the cell clustering and topic quality. When using the
   original ETM^[215]33, we observed that ubiquitously expressed genes
   such as MALAT1 tended to appear among the top genes in several topics.
   Our scETM corrects the background gene expressions by the
   gene-dependent and batch-dependent intercepts. As a result, the
   ubiquitously expressed genes do not dominate all topics from scETM. We
   also introduced a more aggressive batch correction strategy by
   adversarial network loss, which shows improved kBET with small
   trade-off for the ARI in most datasets.

   In terms of scalability, although scETM is similar to other existing
   VAE models in terms of theoretical time and space complexity, we
   emphasize that implementation is also very important, especially for
   deep-learning models. For example, scVAE-GM^[216]11 is much slower and
   more memory consuming than scVI^[217]6, while they are very similar VAE
   models. One of the main speedups provided by scETM comes from our
   implementation of a multi-threaded data loader for minibatches of
   cells, which does not need to be re-initialized at every training epoch
   as the standard PyTorch DataLoader. Compared to scVI and scVI-LD, the
   normalized counts in both the encoder input and the reconstruction loss
   used by scETM remove the need to infer the cell-specific library size
   variable, and the simpler categorical likelihood choice also helps
   reduce the computational time.

   In terms of transferrability, many existing integration methods require
   running on both reference and query datasets to perform post hoc
   analyses such as joint clustering and label
   transfer^[218]7,[219]9,[220]10,[221]28. In contrast, our method enables
   a direct or zero-shot knowledge transfer of the pretrained encoder
   network parameters learned from a reference dataset in annotating a new
   target dataset without further training. We demonstrated this important
   aspect in cross-technology, cross-tissue, and cross-species
   applications, for which we achieved superior performance compared to
   the state-of-the-art methods.

   In terms of interpretability, our quantitative experiments showed that
   scETM identified more relevant pathways than scVI-LD. Qualitative
   experiments also show that scETM topics preserve cell functional and
   cell-type-specific biological signals implicated in the single-cell
   transcriptomes. By seamlessly incorporating the known pathway
   information in the gene embedding, p-scETM finds biologically and
   pathologically important pathways by directly learning the association
   between the topics with the pathways via the topic embedding. Recently
   proposed by^[222]53, single-cell Hierarchical Poisson Factor (scHPF)
   model applies hierarchical Poisson factorization to discover
   interpretable gene expression signatures in an attempt to address the
   interpretability challenge. However, compared to our model, scHPF lacks
   the flexibility in learning the gene embedding and incorporating
   existing pathway knowledge, and is not designed to account for batch
   effects. Moreover, scETM has the benefits of both flexibility in the
   neural-network encoder and the interpretability in the linear decoder.

   As future work, we will extend scETM in several directions. To further
   improve batch correction, as our current model only considers a single
   categorical batch variable, we can extend it to correct for multiple
   categorical batch variables. For a small number of categorical batch
   variables, we may use several sets of batch intercept terms to model
   them. For hierarchical batch variables, we may use a tree of batch
   intercept terms. For numerical batch effects such as subject age, one
   way is to convert them into categorical batch variables by numerical
   ranges. When the number of batch variables become larger, we consider
   three strategies. First, we can add the batch variables as the
   covariates in the linear regression on the gene expression and fit the
   linear coefficients each corresponding to a sample-dependent batch
   variable. Second, we can factorize the batches-by-genes into
   batches-by-factors and factors-by-genes. Learning the two matrices will
   be similar to the scETM algorithm. Third, we can extend our current
   scETM+adv to correct for both categorical and continuous batch
   variables with a discriminator network, which predicts batch effects
   using the encoder-generated cell topic mixture^[223]38.

   To further improve data integration, we can extend scETM to a
   multi-omic integration method, which can integrate scRNA-seq plus other
   omics such as protein expression measured in the same cells as
   scRNA-seq^[224]54 or scATAC-seq measured in different cells but the
   same biological system^[225]7. In these applications, multi-modality
   over different omics will need to be considered to capture the
   intrinsic technical and biological variance of each omic while
   borrowing information among them.

   To further improve interpretability, the original ETM used pretrained
   word embedding from word2vec^[226]55 on a larger reference corpus such
   as Wikipedia to improve topic quality on modeling the target
   documents^[227]33. Similarly, although we demonstrated the use of
   existing pathway information in p-scETM, we can also pretrain our gene
   embeddings on PubMed articles, gene regulatory network, protein-protein
   interactions, or Gene Ontology graph using either gene2vec^[228]56 or
   more general graph embedding approaches^[229]57,[230]58^[231]58. We
   expect that the gene embedding pretrained from these (structured)
   knowledge graphs will further improve the efficiency and
   interpretability of scETM.

   Together, scETM serves as a unified and highly scalable framework for
   integrative analysis of large-scale single-cell transcriptomes across
   multiple datasets. Compared to existing methods, scETM offers
   consistently competitive performance in data integration, transfer
   learning, scalability, and interpretability. The simple Bayesian model
   design in scETM also provides a highly expandable framework for future
   developments.

Methods

scETM data generative process

   To model scRNA-seq data distribution, we take a topic-modeling
   approach^[232]59. In our framework, each cell is considered as a
   document, each scRNA-seq read (or UMI) as a token in the document, and
   the gene that gives rise to the read (or UMI) is considered as a word
   from the vocabulary of size V. We assume that each cell can be
   represented as a mixture of latent cell types, which are commonly
   referred to as the latent topics. The original LDA model^[233]59
   defines a fixed set of K independent Dirichlet distributions β over a
   vocabulary of size V. Following the ETM model^[234]33, here we
   decompose the unnormalized topic distribution
   [MATH: <msup><mrow><mi
   mathvariant="bold-italic">β</mi></mrow><mrow><mo>*</mo></mrow></msup><m
   o>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>K</mi><mo>×</mo><mi>
   V</mi></mrow></msup> :MATH]
   into the topic embedding
   [MATH: <mi mathvariant="bold-italic">α</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>K</mi><mo>×</mo><mi>
   L</mi></mrow></msup> :MATH]
   and gene embedding
   [MATH: <mi mathvariant="bold-italic">ρ</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>L</mi><mo>×</mo><mi>
   V</mi></mrow></msup> :MATH]
   , where L denotes the size of the embedding space. Therefore, the
   unnormalized probability of a gene belonging to a topic is proportional
   to the dot product between the topic embedding matrix and the gene
   embedding matrix. Formally, the data generating process of each
   scRNA-seq profile d is:
    1. Draw a latent cell type proportion θ[d] for a cell d from logistic
       normal
       [MATH: <msub><mrow><mi
       mathvariant="bold-italic">θ</mi></mrow><mrow><mi>d</mi></mrow></msu
       b><mo>~</mo><mi class="MJX-tex-caligraphic"
       mathvariant="script">LN</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</
       mo><mi mathvariant="bold">I</mi></mrow><mo>)</mo></mrow> :MATH]
       :

   [MATH: <msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>~</mo><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><m
   i mathvariant="bold">I</mi></mrow><mo>)</mo></mrow><mo>,</mo><mspace
   width="1.0em"></mspace><msub><mrow><mi
   mathvariant="bold-italic">θ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>=</mo><mspace
   width="0.25em"></mspace><mstyle><mtext>softmax</mtext></mstyle><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>d</mi></mrow></msub></
   mrow><mo>)</mo></mrow><mo>=</mo><mfrac><mrow><mi>exp</mi><mrow><mo>(</m
   o><mrow><msub><mrow><mi>δ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</m
   i></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><msubsup><mrow><mo
   >∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>K</m
   i></mrow></msubsup><mi>exp</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ<
   /mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow></msub></mrow><mo
   >)</mo></mrow></mrow></mfrac> :MATH]
       1
    2. For each gene read (or UMI) w[i,d] in cell d, draw gene g from a
       categorical distribution Cat(r[d,⋅]):

   [MATH:
   <msub><mrow><mi>w</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>d</mi></mrow
   ></msub><mo>~</mo><munder><mrow><mo>∏</mo></mrow><mrow><mi>g</mi></mrow
   ></munder><msubsup><mrow><mi>r</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi
   >g</mi></mrow><mrow><mrow><mo>[</mo><mrow><msub><mrow><mi>w</mi></mrow>
   <mrow><mi>i</mi><mo>,</mo><mi>d</mi></mrow></msub><mo>=</mo><mi>g</mi><
   /mrow><mo>]</mo></mrow></mrow></msubsup><mo>,</mo><mspace
   width="1.0em"></mspace><msub><mrow><mi>y</mi></mrow><mrow><mi>d</mi><mo
   >,</mo><mi>g</mi></mrow></msub><mo>=</mo><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><msub><mrow><mi>N</mi></mrow><mrow><mi>d</mi></mro
   w></msub></mrow></munderover><mrow><mo>[</mo><mrow><msub><mrow><mi>w</m
   i></mrow><mrow><mi>i</mi><mo>,</mo><mi>d</mi></mrow></msub><mo>=</mo><m
   i>g</mi></mrow><mo>]</mo></mrow> :MATH]
       2

   Here N[d] is the library size of cell d, w[i,d] is the index of the
   gene that gives rise to the i^th read (or UMI) in cell d (i.e.,
   [w[i,d] = g]), and y[d,g] is the total counts of gene g in cell d. The
   transcription rate r[d,g] is parameterized as follows:
   [MATH:
   <msub><mrow><mi>r</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>g</mi></mrow
   ></msub><mo>=</mo><mfrac><mrow><mi>exp</mi><mrow><mo>(</mo><mrow><msub>
   <mrow><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>d</mi><mo>,</mo><mi>g</mi></mrow></msub></mrow><mo>)</mo>
   </mrow></mrow><mrow><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><msup><mrow><mi>g</mi></mrow><mrow><m
   o>′</mo></mrow></msup></mrow></msub><mi>exp</mi><mrow><mo>(</mo><mrow><
   msub><mrow><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>d</mi><mo>,</mo><msup><mrow><mi>g</mi></mrow><mrow><mo>′<
   /mo></mrow></msup></mrow></msub></mrow><mo>)</mo></mrow></mrow></mfrac>
   <mo>,</mo><mspace width="1.0em"></mspace><msub><mrow><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>d</mi><mo>,</mo><mi>g</mi></mrow></msub><mo>=</mo><msub><
   mrow><mi
   mathvariant="bold-italic">θ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   i mathvariant="bold-italic">α</mi><msub><mrow><mi
   mathvariant="bold-italic">ρ</mi></mrow><mrow><mi>g</mi></mrow></msub><m
   o>+</mo><msub><mrow><mi>λ</mi></mrow><mrow><mi>s</mi><mrow><mo>(</mo><m
   row><mi>d</mi></mrow><mo>)</mo></mrow><mo>,</mo><mi>g</mi></mrow></msub
   > :MATH]
   3

   Here θ[d] is the 1 × K cell topic mixture for cell d, α is the global
   K × L cell topic embedding, ρ[g] is a L × 1 gene-specific embedding,
   and λ[s(d),g] is the batch-dependent and gene-specific scalar effect,
   where s(d) indicates the batch index for cell d. Notably, to model the
   sparsity of gene expression in each cell (i.e., only a small fraction
   of the genes have non-zero expression), we use the softmax function to
   normalize the transcription rate over all of the genes.

scETM model inference

   In scETM, we treat the latent cell topic mixture δ[d] for each cell d
   as the only latent variable. We treat the topic embedding α, the
   gene-specific transcriptomic embedding ρ, and the batch-effect λ as
   point estimates. Let Y be the D × V gene expression matrix for D cells
   and V genes. The posterior distribution of the latent variables p(δ∣Y)
   is intractable. Hence, we took a variational inference approach using a
   proposed distribution q(δ[d]) to approximate the true posterior.
   Specifically, we define the following proposed distribution: q(δ∣y) =
   ∏[d]q(δ[d]∣y[d]), where
   [MATH: <mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>∣</mo><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>d</mi></mrow></msub></mrow><m
   o>)</mo></mrow><mo>=</mo><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>+</mo><mspace
   width="0.25em"></mspace><mstyle><mtext>diag</mtext></mstyle><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">σ</mi></mrow><mrow><mi>d</mi></mrow></msub></
   mrow><mo>)</mo></mrow><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><m
   i>I</mi></mrow><mo>)</mo></mrow> :MATH]
   and
   [MATH: <mrow><mo>[</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>,</mo><mi mathvariant="normal">log</mi><mspace
   width="0.25em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">σ</mi></mrow><mrow><mi>d</mi></mrow><mrow><mn
   >2</mn></mrow></msubsup></mrow><mo>]</mo></mrow><mo>=</mo><mi
   mathvariant="normal">NNET</mi><mrow><mo>(</mo><mrow><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mo>~</mo></mrow></mover></mrow><
   mrow><mi>d</mi></mrow></msub><mo>;</mo><msub><mrow><mi
   mathvariant="bold">W</mi></mrow><mrow><mi>θ</mi></mrow></msub></mrow><m
   o>)</mo></mrow> :MATH]
   . Here
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mo>~</mo></mrow></mover></mrow><
   mrow><mi>d</mi></mrow></msub> :MATH]
   is the normalized counts for each gene as the raw count of the gene
   divided by the total counts in cell d. The function NNET(v; W) is a
   two-layer feed-forward neural-network used to estimate the sufficient
   statistics of the proposed distribution for the cell topic mixture
   δ[d].

   To learn the above variational parameters W[θ], we optimize the
   evidence lower bound (ELBO) of the log-likelihood, which is equivalent
   to minimizing the Kullback-Leibler (KL) divergence between the true
   posterior and the proposed distribution:
   [MATH: <mi mathvariant="normal">ELBO</mi><mo>=</mo><msub><mrow><mi
   mathvariant="double-struck">E</mi></mrow><mrow><mi>q</mi></mrow></msub>
   <mrow><mo>[</mo><mrow><mi mathvariant="normal">log</mi><mspace
   width="0.25em"></mspace><mi>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">Y</mi><mo>∣</mo><mi
   mathvariant="normal">Θ</mi></mrow><mo>)</mo></mrow></mrow><mo>]</mo></m
   row><mo>−</mo><mi mathvariant="normal">KL</mi><mspace
   width="0.33em"></mspace><mrow><mo>[</mo><mrow><mi>q</mi><mrow><mo>(</mo
   ><mrow><mi mathvariant="normal">Θ</mi><mo>∣</mo><mi
   mathvariant="bold">Y</mi></mrow><mo>)</mo></mrow><mo>∣</mo><mo>∣</mo><m
   i>p</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">Θ</mi></mrow><mo>)</mo></mrow></mrow><mo>]</mo></m
   row> :MATH]
   . The Bayesian learning is carried out by maximizing the reconstruction
   likelihood with regularization in the form of KL divergence of the
   proposed distribution (
   [MATH: <mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>∣</mo><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>d</mi></mrow></msub></mrow><m
   o>)</mo></mrow><mo>=</mo><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>,</mo><mspace
   width="0.25em"></mspace><mstyle><mtext>diag</mtext></mstyle><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">σ</mi></mrow><mrow><mi>d</mi></mrow></msub></
   mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   ) from the prior (
   [MATH: <mi>p</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>d</mi></mrow></msub></
   mrow><mo>)</mo></mrow><mo>=</mo><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mn>0</mn><mo>,</mo><m
   i mathvariant="bold">I</mi></mrow><mo>)</mo></mrow> :MATH]
   ). For computational efficiency, we optimize ELBO with respect to the
   variational parameters by amortized variational
   inference^[235]32,[236]41,[237]42. Specifically, we draw a sample of
   the latent variables from q(δ∣y) for a minibatch of cells from
   reparameterized Gaussian proposed distribution q(δ∣y)^[238]32, which
   has the mean and variance determined by the NNET functions. We then use
   those draws as the noisy estimates of the variational expectation for
   the ELBO. The optimization is then carried out by back-propagating the
   gradients into the encoder weights and the topic and gene embeddings.

Details of training scETM

   We chose the encoder for inferring the cell topic mixture to be a
   2-layer neural-network, with hidden sizes of 128, ReLU
   activations^[239]60, 1D batch normalization^[240]61, and 0.1 drop-out
   rate between layers. We set the gene embedding dimension to 400, and
   the number of topics to 50. We optimize our model with Adam Optimizer
   and a 0.005 learning rate. To prevent over-regularization, we start
   with zero weight penalty on the KL divergence and linearly increase the
   weight of the KL divergence in the ELBO loss to 10^−7 during the first
   [MATH: <mfrac><mrow><mn>1</mn></mrow><mrow><mn>3</mn></mrow></mfrac>
   :MATH]
   epochs. With a minibatch size of 2000, scETM typically needs 5k-20k
   training steps to converge. We show that our model is robust to changes
   in the above hyperparameters (Supplementary Table [241]2). During the
   evaluation, we used the variational mean of the unnormalized topic
   mixture μ[d] in
   [MATH: <mi>q</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">δ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>∣</mo><msub><mrow><mi
   mathvariant="bold">y</mi></mrow><mrow><mi>d</mi></mrow></msub></mrow><m
   o>)</mo></mrow><mo>=</mo><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">μ</mi></mrow><mrow><mi>d</mi></mrow></msub><m
   o>,</mo><mspace
   width="0.25em"></mspace><mstyle><mtext>diag</mtext></mstyle><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">σ</mi></mrow><mrow><mi>d</mi></mrow></msub></
   mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   as the scETM cell topic mixture for cell d.

scETM+adv: adversarial loss for further batch correction

   In the scETM+adv variant, we added a discriminator network (a two-layer
   fully-connected network) to predict batch labels using the unnormalized
   cell topic mixture embedding δ generated by the encoder network. This
   discriminator helps batch correction in an adversarial fashion.
   Specifically, in each training iteration, we first update the scETM
   parameters once by maximizing the ELBO plus the batch prediction
   cross-entropy loss from the discriminator, with a hyperparameter
   controlling the weight of the latter term. It should be noted that by
   maximizing the prediction loss the encoder network learns to produce
   batch agnostic cell embeddings. Then we update the discriminator
   network eight times by trying to minimize the cross-entropy loss in
   predicting the batch labels.

scETM software

   We implemented scETM using the PyTorch library^[242]62. Our initial
   implementation was based on the ETM code from GitHub (adjidieng/ETM)
   by^[243]33. Since then, we completely revamped the code to
   substantially improve the scalability and to integrate it into the
   Python ecosystem. In particular, we packaged and released our code on
   PyPI so one can easily install the package by entering pip install
   scETM in the terminal. The package is integrated with scanpy^[244]16
   and tensorboard^[245]63. Users can view the cell, gene and topic
   embeddings interactively via tensorboard. For example, one can easily
   train a scETM as follows:
from scETM import scETM, UnsupervisedTrainermodel = scETM(adata.n_vars, adata.ob
s.batch_indices.nunique())trainer = UnsupervisedTrainer(model, adata)trainer.tra
in(save_model_ckpt = False)model.get_all_embeddings_and_nll(adata)

   The above code snippet will instantiate an scETM model object, train
   the model, infer the unnormalized cell topics mixture of adata and
   store them in adata.obsm[‘delta’]. We can also access the gene and
   topic embeddings via adata.varm[‘rho’] and adata.uns[‘alpha’].

Transfer learning with scETM

   When transferring from a reference dataset to a target dataset, we
   operate on the genes common to both datasets. For cross-species
   transfer, the orthologous genes based on the Mouse Genome Informatics
   database^[246]64,[247]65 are considered as common genes. The trained
   scETM encoder can be directly applied to an unseen target dataset, as
   long as the genes in the target dataset are aligned to the genes in the
   reference dataset. In the main text, for example, we trained scETM on a
   reference dataset and evaluated the scETM-encoder on a target dataset
   in six transfer-learning tasks (Fig. [248]4).

Pathway enrichment analysis

   To assess whether a topic is enriched in any known pathway, one common
   way is to test for Over Representation Analysis (ORA)^[249]66. However,
   ORA requires choosing a subset of genes (e.g., from differential
   expression analysis). While we could choose the top genes scored by
   each topic, it requires some arbitrary threshold to select those genes.
   To avoid thresholding genes, we employed Gene Set Enrichment Analysis
   (GSEA)^[250]44. GSEA calculates a running sum of enrichment scores (ES)
   by going down the gene list that is sorted in the decreasing order by
   their association statistic with a phenotype.

   In our context, we treated the gene scores under each topic from the
   genes-by-topics matrix (i.e., β) as the association statistic. The ES
   for a gene set S is the maximum difference between P_hit(S,i) and
   P_miss(S,i), where P_hit(S,i) is the fraction of genes in S weighted by
   their topic scores up to gene index i in the sorted list and
   P_miss(S,i) is the fraction of genes not in S weighted by their topic
   scores up to gene index i in the sorted list. The enrichment p-value
   for each gene set is computed by permutation tests by randomly
   shuffling the gene symbols on the sorted list (while keeping the
   gene-topic scores in the decreasing order) 1000 times to compute the
   null distribution of the ES for each gene set and each topic. The
   empirical p-value was calculated as (N’ + 1)/(N + 1), where N’ is the
   number of permutation trials in which ES is greater than the observed
   ES, and N is the total number of trials (i.e., 1000). We then corrected
   the p-values for multiple testing using Benjamini–Hochberg (BH)
   method^[251]67.

   For AD and HP datasets, we used the MSigDB Canonical Pathways gene
   sets^[252]68 as the gene set database in GSEA; and for MDD, we used
   PsyGeNET database^[253]69 in order to find psychiatric disease-specific
   associations. We also run GSEA for scVI-LD gene loadings for
   comparison. The detailed pathway enrichment statistic can be found in
   Supplementary Data [254]1.

Differential analysis of topic expression

   We aimed to identify topics that are differentially associated with
   known cell type labels or disease conditions. For topic k and cell
   label j (i.e., cell type or disease condition), we first calculated the
   difference of the average topic activities between the cells with label
   j and the cells without label j. For each permutation trial, we
   randomly shuffled the label assignments among cells and recalculated
   the difference of average topic activities from the resulting
   permutation. The empirical p-value was calculated as (N’ + 1)/(N + 1),
   where N’ is the number of permutation trials in which the difference is
   greater than the observed difference, and N is the total number of
   trials. To account for multiple hypotheses, we applied Bonferroni
   correction by multiplying the p-value by the product of the topic
   number and the number of labels. We performed N = 100,000 permutations.

   We determined a topic to be differentially expressed (DE) if the
   Bonferroni-corrected q-value is lower than 0.01 and the mean difference
   is greater than 2 for cell-type DE topcis or 0.2 for disease DE topics.
   Supplementary Table [255]18 summarizes the number of DE topics we
   identified for each cell type and disease conditions from the AD and
   MDD data. We use the PanglaoDB database^[256]70 to find the overlap
   between top genes of cell-type-specific DE topics and known cell type
   markers.

Incorporation of pathway knowledge into the gene embeddings in p-scETM

   We downloaded the pathDIP4 pathway database from^[257]48, and the Gene
   Ontology (Biological Processes) (GO-BP) dataset from MSigDB v7.2
   Release^[258]68. Pathway gene sets or GO-BP terms containing fewer than
   five genes were removed. We represented the pathway knowledge as a
   pathways-by-genes ρ matrix, where ρ[ij] = 1 if gene set i contains gene
   j, and ρ[ij] = 0 otherwise. We standardized each column (i.e., gene) of
   this matrix for numerical stability. During training the p-scETM, we
   fixed the gene embedding matrix ρ to the pathways-by-genes matrix.

Clustering performance benchmark and visualization

   We assessed the performance of each method by three metrics: Adjusted
   Rand Index (ARI)^[259]71, Normalized Mutual Information (NMI) and
   k-nearest-neighbor Batch-Effect Test (kBET)^[260]18. ARI and NMI are
   widely-used representatives of two families of clustering agreement
   measures, pair-counting and information theoretic measures,
   respectively. A high ARI or NMI indicates a high degree of agreement
   for a given clustering result against the ground-truth cell type
   labels. We calculated ARI and NMI using the Python library
   scikit-learn^[261]72.

   kBET measures how well mixed the batches are based on the local batch
   label distribution in randomly sampled nearest-neighbor cells compared
   against the global batch label distribution. Average silhouette width
   (ASW)^[262]40 indicates clustering quality using cell type labels.
   Silhouette width (SW) of a cell i is the distance of cell i from all of
   the cells within the same cluster subtracted by the distance of cell i
   from cells in a nearest but different cluster, normalized by the
   maximum of these two values. ASW is the averaged SW over all the cells
   in a dataset and larger values indicate better clustering quality.
   Therefore, larger ASW indicates the higher distances between cell types
   and lower distance within cell type in the cell embedding space. We
   choose the distance function to be the euclidean distance. We adapted
   the Pegasus implementation^[263]73 for kBET calculation, and set the k
   to 15.

   All embedding plots were generated using the Python scanpy
   package^[264]16. We use UMAP^[265]39 to reduce the dimension of the
   embeddings to 2 for visualization, and Leiden^[266]74 to cluster cells
   by their cell embeddings produced by each method in comparison. During
   the clustering, we tried multiple resolution values and reported the
   result with the highest ARI for each method.

   For reproducibility, the evaluation and the plotting steps were
   implemented in a single evaluate function in the scETM package, which
   takes in an AnnData object with cell embeddings and returns a Figure
   object for the ARI, NMI, kBET, and embedding plot. For consistency, we
   used this function to evaluate all methods, including those written in
   R, where we used the reticulate package^[267]75 to call our evaluate
   function.

   We ran all methods under their recommended pipeline settings
   (Supplementary Methods), and we use batch correction option whenever
   applicable to account for batch effects. All results are obtained on a
   compute cluster with Intel Gold 6148 Skylake CPUs and Nvidia V100 GPUs.
   We limit each experiment to use 8 CPU cores, 192 GB RAM and 1 GPU.

Efficiency and scalability benchmark of the existing methods

   To create a benchmark dataset for evaluating the run-time of each
   method, we merged MDD and AD, keeping the genes that appear in both
   datasets. We then selected the 3000 most variable genes using scanpy’s
   highly_variable_genes(n_top_genes=3000, flavor=‘seurat_v3’) function,
   and randomly sampled 28,000, 14,000, 70,000, and 148,247 (all) cells to
   create our benchmark datasets. The memory requirements reported in
   Fig. [268]3 were obtained by reading the rss attribute of the named
   tuple returned by calling Process().memory_info() from the psutil
   Python package^[269]76. For methods based on R, we use the reticulate
   package^[270]75 to call the above Python function for consistency. We
   used the same settings (RAM size, number of GPUs, etc) as described in
   the Clustering performance benchmark and visualization section
   throughout the experiments.

Reporting summary

   Further information on research design is available in the [271]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [272]Supplementary Information^ (30MB, pdf)
   [273]Peer Review File^ (8.8MB, pdf)
   [274]41467_2021_25534_MOESM3_ESM.pdf^ (386.7KB, pdf)

   Description of Additional Supplementary Files
   [275]Supplementary Data 1^ (34.5KB, xlsx)
   [276]Supplementary Data 2^ (44.4KB, xlsx)
   [277]Reporting Summary^ (254.2KB, pdf)

Acknowledgements