Graphical abstract

   graphic file with name fx1.jpg
   [37]Open in a new tab

Highlights

     * •
       autoCell imputes heterogeneous and sparse sc/snRNA-seq data
     * •
       autoCell improves the performance of capturing cell developmental
       trajectories
     * •
       autoCell captures disease-relevant cellular pathobiology in latent
       space
     * •
       autoCell identifies cell-type-specific gene networks in Alzheimer’s
       disease

Motivation

   Single-cell RNA sequencing (scRNA-seq) enables researchers to study
   gene expression at cellular resolution. However, noise caused by
   amplification and dropout may hamper precise data analyses. It is
   urgent to develop scalable denoising methods to deal with the
   increasingly large, but sparse, scRNA-seq data. Here, we present
   autoCell, a graph-embedded Gaussian mixture variational autoencoder
   network algorithm for scRNA-seq dropout imputation and feature
   extraction. Our autoCell provides a deep-learning toolbox for
   end-to-end analysis of large-scale single-cell/nucleus RNA-seq data,
   including visualization, clustering, imputation, and disease-specific
   gene network identification.
     __________________________________________________________________

   Xu et al. develop a graph-embedded Gaussian mixture variational
   autoencoder network algorithm (termed autoCell) for end-to-end analyses
   of single-cell/nuclei RNA-seq data, including visualization,
   clustering, imputation, and cell-type-specific gene network
   identification. autoCell offers a useful tool for large-scale
   single-cell genomic data analyses to accelerate translational biology
   and disease discoveries.

Introduction

   Single-cell technology is a revolutionary breakthrough, allowing us to
   study the genome, transcriptome, and multi-omics systems of each cell,
   in each state, in tissues.[38]^1^,[39]^2 Combined with technologies
   such as fluorescent labeling and microdissection, it can also determine
   spatial attributes and cell-cell communication. These technologies have
   been widely used, leading to a revolution in basic and translational
   medicine.

   Single-cell or single-nucleus RNA sequencing (sc/snRNA-seq) is
   important for identifying biological and disease-relevant cell types
   and subpopulations from heterogeneous
   cells.[40]^3^,[41]^4^,[42]^5^,[43]^6 Low-dimensional analysis of
   expression in different cell states can also be highly effective in
   reconstructing the cell developmental
   trajectory.[44]^7^,[45]^8^,[46]^9^,[47]^10^,[48]^11^,[49]^12 However,
   the amount of mRNA in a single cell is small, which necessitates a
   nearly million-fold amplification. Although the measurement technology
   has been greatly improved, technical factors still cause considerable
   noise in data generated in scRNA-seq experiments, including
   amplification deviation, library size difference, and extremely low
   capture rate. In particular, the extremely low RNA capture rate leads
   to undetectable, albeit expressed, genes, namely “dropout”
   events.[50]^13 An essential difference is found between the “false”
   zero count caused by the “dropout” event and the true zero count. Given
   the sparse expression metrics, traditional analytic tools cannot
   achieve scientific rigor, and they lack high data
   reproducibility.[51]^14

   Deep-learning algorithms have shown compelling performance in
   high-dimensional data processing, including sparse genomic data.
   DeepImute, an interpolation algorithm based on a deep neural network,
   is one such example. It uses the dropout layer to learn the patterns in
   the data to achieve accurate imputation.[52]^15 Another denoising model
   is to denoise scRNA-seq datasets through a depth-counting autoencoder
   (DCA) network. A DCA takes the count distribution, overdispersion, and
   sparsity of the data into account using a negative binomial noise model
   with zero inflation.[53]^16 scScope, a scalable deep-learning-based
   method, can accurately and quickly identify cell-type composition from
   millions of heterogeneous single-cell transcriptomic profiles.[54]^9
   Compared with DCA and scScope, scVI[55]^17 uses a variational
   autoencoder (VAE) to reduce the dimensionality of scRNA-seq data.
   However, the standard VAE implemented by scVI[56]^17 only uses a single
   isotropic multi-variate Gaussian distribution on the latent variable,
   which is not generally suitable for representing multi-category data
   such as scRNA-seq data containing multiple types of cells/nuclei. scVAE
   is another method based on VAEs, which uses Gaussian mixture models
   (GMMs) as prior distributions and introduces Poisson or negative
   binomial distributions to obtain latent representations of
   cells.[57]^18 However, the model considers the mean and variance of the
   GMM as random variables, and it is approximated by a neural network
   with a parameter
   [MATH: <mrow><mi>β</mi></mrow> :MATH]
   , which makes model optimization challenging. As the number of
   measurable cells/nuclei and additional emerging sc/snRNA-seq data
   analysis challenges increase, the demand for faster and scalable
   estimation methods becomes a pressing need.

   With the rapid development of graph neural networks, graph autoencoders
   can learn low-dimensional representations of graph topology and train
   node relationships with a global, entire graph view. Increasingly,
   exploiting a graph neural network framework in sc/snRNA-seq data
   analysis can be considered. Single-cell graph neural network
   (scGNN)[58]^19 establishes cell-cell relationships through graphical
   neural networks and uses left-truncated mixed Gaussian models to model
   heterogeneous gene expression patterns. It also integrates three
   iterative multi-mode autoencoders. However, the iterative autoencoder
   framework requires more computing resources, which is more time
   consuming.

   In this study, we present a deep-learning framework, namely autoCell,
   for dropout imputation and feature extraction from scRNA-seq data. The
   accuracy index indicates that autoCell outperformed six
   state-of-the-art published imputation methods in simulated datasets and
   biologically relevant sc/snRNA-seq datasets with varying degrees of
   human diseases. Therefore, autoCell is a scalable and accurate
   scRNA-seq data processing method that is superior to other scRNA-seq
   data analysis tools.

Results

Overview of the autoCell framework

   The overview of autoCell is shown in [59]Figure 1. It is a VAE network
   that combines graph embedding and GMMs to model the distribution of
   high-dimensional, sparse scRNA-seq data. autoCell architecture can use
   biological representations of cells and genes to perform different
   scRNA-seq data analysis tasks. By integrating GMMs, autoCell can better
   estimate data distribution. We apply graph embedding to deal with
   sc/snRNA-seq data. Capturing the graphical information of the local
   data structure is a good complement to deep GMMs, making the network
   learn a global model with local structure constraints. Recent studies
   have showed that zero-inflated negative binomial (ZINB) distribution
   for modeling is an appropriate tool to solve the “dropout” event for
   scRNA-seq data. In reducing the impact of dropout events in highly
   sparse and over-dispersed count data, we introduce the ZINB
   distribution model, thereby denoising scRNA-seq data (see [60]STAR
   Methods).

Figure 1.

   [61]Figure 1
   [62]Open in a new tab

   Overview of the autoCell framework

   autoCell uses a Gaussian mixture model (GMM) and a deep neural network
   (DNN) to model the process of data generation. It uses zero-inflated
   negative binomial (ZINB) loss to process “dropout” events in scRNA-seq.
   The encoder and decoder are a two-layer neural network (128–128) with
   10-dimensional latent variables (features) directly connected to the
   output. The cell-cell network is used to constrain the latent feature
   [MATH: <mrow><mi>Z</mi></mrow> :MATH]
   ; thus, similar cells have similar latent features and cluster
   assignments. The yellow nodes depict the mean of the negative binomial
   distribution, which is the main output of the method representing
   denoised data, whereas the purple and blue nodes represent the other
   two parameters of the ZINB distribution, namely, dispersion and
   dropout.

autoCell effectively imputes scRNA-seq data

   We first applied autoCell to simulated scRNA-seq data to assess its
   imputation performance.[63]^20 We simulated two datasets with three
   cell types, including 1,000 cells and 2,000 genes. For the simulation
   of the two datasets, 60% and 71% of the data values were set to the
   zero matrix to simulate dropout events in real data. We divided the
   entry of the simulated raw expression data into zero and non-zero
   space. Based on the density plot of the estimated and real values, the
   recovery values of DCA and autoCell are closer to the true expression
   values, and scGNN is at a medium level. MAGIC,[64]^21 SAVER,[65]^22 and
   SAUCIE[66]^23 always tend to underestimate the original value. We also
   calculated the median L1 distance, root-mean-square error (RMSE)
   scores, and cosine similarity (see [67]STAR Methods) score between the
   real expression value and restored expression value to measure the
   estimation accuracy. As shown in [68]Figure 2, autoCell achieves
   overall better performance than the others. In particular, autoCell
   ranks second in the median L1 distance of gene expression restoration
   on the two simulated datasets and ranks second in the cosine similarity
   score on the simulated dataset with a synthetic dropout rate of 71%. In
   addition, the dropout event will increase the noise, muddling the
   identity of cell types, which can be restored through interpolation by
   value recovery algorithms. We also found that autoCell outperforms
   existing methods in the two simulation datasets ([69]Figures S1A and
   S1B).

Figure 2.

   [70]Figure 2
   [71]Open in a new tab

   Performance comparison between autoCell and other state-of-the-art
   methods under 10% synthetic dropout rate

   (A) Density plots of imputed versus original data masked. The x axis
   corresponds to the imputed values, and the y axis represents the true
   values of the masked data points. Each row is a different dataset, and
   each column is a different imputation method. Pearson correlation
   coefficient (PPC; higher is better).

   (B) Comparison of the cosine similarity (higher is better), median L1
   distance (lower is better), and root-mean-square error (RMSE) scores
   (lower is better) between autoCell and the other six imputation tools.

   In evaluating the performance of autoCell in imputing missing values,
   we also selected two biologically relevant sc/snRNA-seq
   datasets[72]^24^,[73]^25 with well-annotated cell types as benchmarks.
   We simulated the dropout effects by randomly flipping 10% non-zero
   entries to the zero matrix. Similarly, three indicators between the
   original dataset and imputed values of these synthetic items are
   calculated as a measure of estimation accuracy. Compared with several
   state-of-the-art algorithms ([74]Figure 2), autoCell achieves the best
   performance evaluated by the median L1 distance, cosine similarity, and
   RMSE at the 10% synthetic dropout rate. Furthermore, autoCell
   imputation is closer to the true expression value based on the density
   plot of the estimated and true values ([75]Figure 2). Collectively,
   autoCell outperforms state-of-the-art methods in sc/snRNA-seq data
   imputation analysis ([76]Figure S2).

autoCell improves the performance of existing tools for capturing cell
developmental trajectories

   Apart from identifying cell types, scRNA-seq facilitates the
   organization of cells by time course or developmental stage (i.e., cell
   trajectory). The transition of cells from one functional state to
   another is a critical event in development. However, transitions are
   difficult to characterize since trapping and purifying cells in between
   stable endpoint states are challenging. Although some models currently
   exist to infer cell developmental trajectories based on scRNA-seq data,
   most inference methods do not address dropout events. We tested the
   accuracy of inferring the cell trajectory of scRNA-seq data after
   interpolation via autoCell. We used a benchmark dataset[77]^26 with
   1,529 single cells with five stages of well-annotated human
   preimplantation embryonic development from embryogenesis embryonic day
   3 (E3) to E7. We reconstructed cell trajectories using monocle3[78]^27
   after various interpolation processes. The interpolation of autoCell
   produced the highest correspondence between the inferred pseudotime and
   the real-time cellular development ([79]Figure S3). Pseudotime order
   score (POS; see [80]STAR Methods) increased from 0.838 to 0.850. On the
   contrary, the POS obtained by other algorithms is lower than the
   original scRNA-seq dataset ([81]Figure S3). Furthermore, we used
   another common trajectory analysis model, slingshot,[82]^28^,[83]^29 to
   test whether autoCell improves trajectory analysis. We found that cell
   development trajectory is well captured by interpolation from autoCell
   ([84]Figure 3). Therefore, autoCell captures more accurate
   transcriptome dynamics and cell developmental trajectories across
   different developmental stages.

Figure 3.

   [85]Figure 3
   [86]Open in a new tab

   autoCell improves pseudotime analysis in the human preimplantation
   embryonic development dataset

   (A and B) Results of slingshot estimated pseudotime using (A) raw data
   as input and (B) processed data from autoCell as input.

   (C) Processed data from DCA as input.

   (D) Processed data from MAGIC as input.

   (E) Processed data from SAUCIE as input.

   (F) Processed data from scVI as input.

   (G) Processed data from SAVER as input.

   (H) Processed data from scGNN as input.

   POS, pseudotime order score (the higher the value the better).
   Pseudotime is a measure of how much progress an individual cell has
   made through a process such as cell differentiation.

autoCell captures cellular pathobiology in latent space

   We also assessed the extent to which the latent space inferred by
   autoCell reflects the biological variability among cells based on the
   previous stratification of cells into biologically important
   subpopulations through unsupervised clustering followed by manual
   inspection and annotation. We applied autoCell to two simulated
   datasets and four biologically relevant scRNA-seq datasets. The zero
   ratio of these six datasets ranges from 60% to 90%. By default,
   autoCell extracted 10 features from the input data. For a fair
   comparison, we further applied common scRNA-seq data dimension
   reduction methods, including scVI, DESC,[87]^30 scVAE, DCA, and SAUCIE,
   to reduce the input data to 10 dimensions. We used uniform manifold
   approximation and projection (UMAP) to visualize the features extracted
   from these tools and the original data. We found that the feature
   embeddings of autoCell were well separated among cell types with closer
   inner-group distances and larger between-group distances. However, the
   embedding and original data of DESC, scVAE, DCA, and SAUCIE overlapped
   among certain cell types. We found that autoCell is the best approach
   to identify all three cell types in the two simulated datasets
   ([88]Figure S1C). For the Klein dataset,[89]^25 scVI, scVAE, and
   autoCell showed better performance, and DCA caused the cell types d0
   and d2 to be closely linked. However, SAUCIE and DESC only separated
   cells with cell type d0 and incorrectly divided cell type d7 into two
   cell types ([90]Figure 4A). For the Zeisel dataset,[91]^24 we found
   that autoCell, scVI, and scVAE still outperformed the other models, and
   autoCell and scVAE achieved a closer intra-group distance
   ([92]Figure 4B).

Figure 4.

   [93]Figure 4
   [94]Open in a new tab

   UMAP visualization of the extracted features using different approaches

   (A and B) We evaluated autoCell with DCA, DESC, scVI, SAUCIE, and scVAE
   using two datasets: (A) Klein and (B) Zeisel datasets. For comparison,
   autoCell, DCA, DESC, scVI, SAUCIE, and scVAE all performed dimension
   reduction to 10 dimensions before applying UMAP.

   (C) Comparison on the effect of clustering on four benchmark datasets.
   Clustering accuracy was evaluated by applying K-means clustering on the
   extracted features to obtain cluster assignments.

   NMI, normalized mutual information (the higher the value the better);
   ARI, adjusted rand index; COM, completeness (the higher the value the
   better); HOM, homogeneity (the higher the value the better).

   We applied K-means clustering on autoCell-extracted latent features and
   assessed the clustering accuracy by comparing it with scVI, DESC,
   scVAE, DCA, and SAUCIE. We found that autoCell displays the best
   performance across all tested scRNA-seq datasets ([95]Figure 4). In the
   Klein dataset,[96]^25 the clustering output using autoCell
   ([97]Figure 4C) was more consistent with the predefined unit-type
   annotation (normalized mutual information [NMI] = 0.882 and adjusted
   Rand index [ARI] = 0.907) than the second ranked model, scVI (NMI =
   0.832, ARI = 0.784). In the Zeisel dataset,[98]^24 the clustering
   performance of autoCell was considerably better than other existing
   tools. Then, we changed K-means to another four clustering algorithms
   (including spectral clustering, affinity propagation, birch, and
   agglomerative clustering). autoCell shows the best performance across
   all tested scRNA-seq datasets ([99]Table S1).

   In addition, we compared the visualization performance using
   principal-component analysis (PCA). As shown in [100]Figure S4, in the
   Klein dataset, the feature embedding of autoCell, scVI, and scVAE was
   well separated among cell types. For the Zeisel dataset, autoCell and
   DESC showed the best performance, although they mixed astrocytes and
   brain endothelial cells, both of which were glial cells located in the
   CNS, but they can effectively separate most cell types. For the Romanov
   dataset, all models identify two glial clusters (astrocytes and brain
   endothelial cells) close to each other in the latent space. However,
   autoCell was the only model that effectively separated the microglia
   ([101]Figure S4).

   Collectively, autoCell presented elevated accuracy in capturing
   cellular pathobiology than existing state-of-the-art approaches for
   simulated and real-world biologically relevant scRNA-seq datasets
   ([102]Figures S4 and [103]S5).

Discovery of cell-type-specific molecular networks by autoCell

   In testing whether the autoCell-inferred cell type can capture specific
   pathobiology of human diseases, we analyzed astrocytes, microglia,
   neurons, and oligodendrocyte progenitor cells (OPCs) using Alzheimer’s
   disease (AD) as a prototypical example. In total, we re-analyzed 13,214
   high-quality nuclei generated from the entorhinal cortex of AD brains
   and healthy controls. Using autoCell, we identified four microglia
   clusters, nine astrocyte clusters, and five OPC clusters
   ([104]Figure 5A). Recent studies using human postmortem brain tissues
   identified disease-associated astrocyte (DAA) involved in crucial roles
   of AD pathogenesis and disease progression of AD. Using 11
   experimentally validated DAA marker genes (four upregulated marker
   genes [GFAP, CD44, HSPB1, and TNS] and seven downregulated marker genes
   [SLC1A2, SLC1A3, GLUL, NRXN1, CADM2, PTN, and GPC5]),[105]^31 we
   identified astrocyte subcluster 4 as a DAA by autoCell
   ([106]Figure S6). Next, we built a DAA-specific molecular network using
   a state-of-the-art network-based algorithm, GPSnet,[107]^32 under the
   human protein-protein interaction (PPI) network model. The DAA-specific
   module network included 50 PPIs connected by 44 proteins, such as APOE,
   MAPT, CD44, FOS, and STAT3 ([108]Figure 5B; [109]Table S2). APOE and
   microtubule-associated protein Tau (MAPT) were two of the most
   well-known risk genes for AD.[110]^33^,[111]^34 CD44 was an
   inflammation-associated protein. The inhibition of CD44 could be a
   potential strategy for AD treatment.[112]^35 In a mouse model study,
   Stat3-deficient and Stat3-deletion astrocytes presented dropped levels
   of β-amyloid and pro-inflammatory cytokine activities.[113]^36 Proteins
   from the DAA-specific molecular network were enriched by multiple
   AD-related pathways, such as cytokine signaling, spinal cord injury,
   and brain-derived neurotrophic factor signaling pathways
   ([114]Figure 5B; [115]Table S3). For example, several proteins in the
   DAA-specific network (STAT3, MAPT, HSPB8, HSPB1, JUNB, and LINGO1) were
   enriched in multiple cytokine signaling pathways, including
   interleukin-5 (IL-5), IL-2, IL-18, IL-3, and IL-4, consistent with the
   important role of neuro-inflammation mediated by microglia in
   AD.[116]^37^,[117]^38 Therefore, using autoCell, we can identify
   disease-associated, cell-type-specific molecular network involved in
   key AD pathobiology.

Figure 5.

   [118]Figure 5
   [119]Open in a new tab

   Discovery of cell-type-specific molecular networks and significant
   ligand-receptor interactions in Alzheimer’s disease (AD) using autoCell

   (A) UMAP visualization of subclusters of astrocytes (including
   disease-associated astrocyte [DAA]), microglia, and OPCs, labeled with
   autoCell clusters.

   (B) Reconstruction of the DAA-specific molecular subnetwork from the
   human protein-protein interactome. The DAA-specific module network
   included 50 protein-protein interactions (PPIs) connected by 44
   proteins (e.g., APOE, MAPT, CD44, FOS, and STAT3). The proteins from
   the DAA-specific molecular network are enriched with multiple
   AD-related pathways, including cytokine signaling pathways. The
   proteins from the DAA-specific molecular network are enriched with
   multiple cytokine signaling pathways.

   (C) Inferred cell-cell interactions using CellChat.

   (D) Top selected significant ligand-receptor pairs among
   autoCell-identified cell types. Ligand-receptor interactions were
   predicted by CellChat (see [120]STAR Methods).

   We also identified significant ligand-receptor interactions involved in
   cell-cell communications in AD. We first inferred cell subpopulations
   using autoCell and the predicted ligand-receptor interaction using
   CellChat.[121]^39 As shown in [122]Figure 5C, we found strong
   ligand-receptor interactions among astrocytes, OPCs, and
   oligodendrocytes compared with the other three cell types (neuron,
   microglia, and endothelial). Two ligand-receptor pairs (NRG3-ERBB4 and
   NRG1-ERBB4) revealed strong interactions across multiple cell-cell
   pairs ([123]Figure 5D; [124]Table S4). Multiple single-nucleotide
   polymorphisms in the NRG3 gene were found to be associated with the
   onset of AD.[125]^40 In addition, the overexpression of ERBB4 in
   neurons was found to be associated with AD neuropathology.[126]^41 A
   recent AD mouse model study found that immunoreactivity of NRG1 and
   ERBB4 was associated with plaques in the hippocampus region.[127]^42
   Using AD as a prototypical example, we demonstrated that the
   disease-associated cell subtype identified by autoCell could identify
   molecular targets and networks (i.e., ligand-receptor interactions)
   involved in AD pathogenesis and provide potential drug targets for AD
   or other human diseases if broadly applied.

autoCell is scalable to large datasets

   The increasing number of cells is the main challenge in scRNA-seq
   analysis. In large projects such as the Human Cell Atlas,[128]^43 the
   number of cells may be hundreds of thousands. In large datasets,
   identifying cell populations is challenging because many existing
   scRNA-seq clustering methods cannot scale up to handle them. Thus, we
   used the Zheng-68k[129]^44 dataset and the Zheng-73k[130]^44 dataset to
   study whether autoCell is suitable for large datasets. These cell types
   were used as references when benchmarking autoCell. autoCell worked