Graphical abstract graphic file with name fx1.jpg [37]Open in a new tab Highlights * • autoCell imputes heterogeneous and sparse sc/snRNA-seq data * • autoCell improves the performance of capturing cell developmental trajectories * • autoCell captures disease-relevant cellular pathobiology in latent space * • autoCell identifies cell-type-specific gene networks in Alzheimer’s disease Motivation Single-cell RNA sequencing (scRNA-seq) enables researchers to study gene expression at cellular resolution. However, noise caused by amplification and dropout may hamper precise data analyses. It is urgent to develop scalable denoising methods to deal with the increasingly large, but sparse, scRNA-seq data. Here, we present autoCell, a graph-embedded Gaussian mixture variational autoencoder network algorithm for scRNA-seq dropout imputation and feature extraction. Our autoCell provides a deep-learning toolbox for end-to-end analysis of large-scale single-cell/nucleus RNA-seq data, including visualization, clustering, imputation, and disease-specific gene network identification. __________________________________________________________________ Xu et al. develop a graph-embedded Gaussian mixture variational autoencoder network algorithm (termed autoCell) for end-to-end analyses of single-cell/nuclei RNA-seq data, including visualization, clustering, imputation, and cell-type-specific gene network identification. autoCell offers a useful tool for large-scale single-cell genomic data analyses to accelerate translational biology and disease discoveries. Introduction Single-cell technology is a revolutionary breakthrough, allowing us to study the genome, transcriptome, and multi-omics systems of each cell, in each state, in tissues.[38]^1^,[39]^2 Combined with technologies such as fluorescent labeling and microdissection, it can also determine spatial attributes and cell-cell communication. These technologies have been widely used, leading to a revolution in basic and translational medicine. Single-cell or single-nucleus RNA sequencing (sc/snRNA-seq) is important for identifying biological and disease-relevant cell types and subpopulations from heterogeneous cells.[40]^3^,[41]^4^,[42]^5^,[43]^6 Low-dimensional analysis of expression in different cell states can also be highly effective in reconstructing the cell developmental trajectory.[44]^7^,[45]^8^,[46]^9^,[47]^10^,[48]^11^,[49]^12 However, the amount of mRNA in a single cell is small, which necessitates a nearly million-fold amplification. Although the measurement technology has been greatly improved, technical factors still cause considerable noise in data generated in scRNA-seq experiments, including amplification deviation, library size difference, and extremely low capture rate. In particular, the extremely low RNA capture rate leads to undetectable, albeit expressed, genes, namely “dropout” events.[50]^13 An essential difference is found between the “false” zero count caused by the “dropout” event and the true zero count. Given the sparse expression metrics, traditional analytic tools cannot achieve scientific rigor, and they lack high data reproducibility.[51]^14 Deep-learning algorithms have shown compelling performance in high-dimensional data processing, including sparse genomic data. DeepImute, an interpolation algorithm based on a deep neural network, is one such example. It uses the dropout layer to learn the patterns in the data to achieve accurate imputation.[52]^15 Another denoising model is to denoise scRNA-seq datasets through a depth-counting autoencoder (DCA) network. A DCA takes the count distribution, overdispersion, and sparsity of the data into account using a negative binomial noise model with zero inflation.[53]^16 scScope, a scalable deep-learning-based method, can accurately and quickly identify cell-type composition from millions of heterogeneous single-cell transcriptomic profiles.[54]^9 Compared with DCA and scScope, scVI[55]^17 uses a variational autoencoder (VAE) to reduce the dimensionality of scRNA-seq data. However, the standard VAE implemented by scVI[56]^17 only uses a single isotropic multi-variate Gaussian distribution on the latent variable, which is not generally suitable for representing multi-category data such as scRNA-seq data containing multiple types of cells/nuclei. scVAE is another method based on VAEs, which uses Gaussian mixture models (GMMs) as prior distributions and introduces Poisson or negative binomial distributions to obtain latent representations of cells.[57]^18 However, the model considers the mean and variance of the GMM as random variables, and it is approximated by a neural network with a parameter [MATH: β :MATH] , which makes model optimization challenging. As the number of measurable cells/nuclei and additional emerging sc/snRNA-seq data analysis challenges increase, the demand for faster and scalable estimation methods becomes a pressing need. With the rapid development of graph neural networks, graph autoencoders can learn low-dimensional representations of graph topology and train node relationships with a global, entire graph view. Increasingly, exploiting a graph neural network framework in sc/snRNA-seq data analysis can be considered. Single-cell graph neural network (scGNN)[58]^19 establishes cell-cell relationships through graphical neural networks and uses left-truncated mixed Gaussian models to model heterogeneous gene expression patterns. It also integrates three iterative multi-mode autoencoders. However, the iterative autoencoder framework requires more computing resources, which is more time consuming. In this study, we present a deep-learning framework, namely autoCell, for dropout imputation and feature extraction from scRNA-seq data. The accuracy index indicates that autoCell outperformed six state-of-the-art published imputation methods in simulated datasets and biologically relevant sc/snRNA-seq datasets with varying degrees of human diseases. Therefore, autoCell is a scalable and accurate scRNA-seq data processing method that is superior to other scRNA-seq data analysis tools. Results Overview of the autoCell framework The overview of autoCell is shown in [59]Figure 1. It is a VAE network that combines graph embedding and GMMs to model the distribution of high-dimensional, sparse scRNA-seq data. autoCell architecture can use biological representations of cells and genes to perform different scRNA-seq data analysis tasks. By integrating GMMs, autoCell can better estimate data distribution. We apply graph embedding to deal with sc/snRNA-seq data. Capturing the graphical information of the local data structure is a good complement to deep GMMs, making the network learn a global model with local structure constraints. Recent studies have showed that zero-inflated negative binomial (ZINB) distribution for modeling is an appropriate tool to solve the “dropout” event for scRNA-seq data. In reducing the impact of dropout events in highly sparse and over-dispersed count data, we introduce the ZINB distribution model, thereby denoising scRNA-seq data (see [60]STAR Methods). Figure 1. [61]Figure 1 [62]Open in a new tab Overview of the autoCell framework autoCell uses a Gaussian mixture model (GMM) and a deep neural network (DNN) to model the process of data generation. It uses zero-inflated negative binomial (ZINB) loss to process “dropout” events in scRNA-seq. The encoder and decoder are a two-layer neural network (128–128) with 10-dimensional latent variables (features) directly connected to the output. The cell-cell network is used to constrain the latent feature [MATH: Z :MATH] ; thus, similar cells have similar latent features and cluster assignments. The yellow nodes depict the mean of the negative binomial distribution, which is the main output of the method representing denoised data, whereas the purple and blue nodes represent the other two parameters of the ZINB distribution, namely, dispersion and dropout. autoCell effectively imputes scRNA-seq data We first applied autoCell to simulated scRNA-seq data to assess its imputation performance.[63]^20 We simulated two datasets with three cell types, including 1,000 cells and 2,000 genes. For the simulation of the two datasets, 60% and 71% of the data values were set to the zero matrix to simulate dropout events in real data. We divided the entry of the simulated raw expression data into zero and non-zero space. Based on the density plot of the estimated and real values, the recovery values of DCA and autoCell are closer to the true expression values, and scGNN is at a medium level. MAGIC,[64]^21 SAVER,[65]^22 and SAUCIE[66]^23 always tend to underestimate the original value. We also calculated the median L1 distance, root-mean-square error (RMSE) scores, and cosine similarity (see [67]STAR Methods) score between the real expression value and restored expression value to measure the estimation accuracy. As shown in [68]Figure 2, autoCell achieves overall better performance than the others. In particular, autoCell ranks second in the median L1 distance of gene expression restoration on the two simulated datasets and ranks second in the cosine similarity score on the simulated dataset with a synthetic dropout rate of 71%. In addition, the dropout event will increase the noise, muddling the identity of cell types, which can be restored through interpolation by value recovery algorithms. We also found that autoCell outperforms existing methods in the two simulation datasets ([69]Figures S1A and S1B). Figure 2. [70]Figure 2 [71]Open in a new tab Performance comparison between autoCell and other state-of-the-art methods under 10% synthetic dropout rate (A) Density plots of imputed versus original data masked. The x axis corresponds to the imputed values, and the y axis represents the true values of the masked data points. Each row is a different dataset, and each column is a different imputation method. Pearson correlation coefficient (PPC; higher is better). (B) Comparison of the cosine similarity (higher is better), median L1 distance (lower is better), and root-mean-square error (RMSE) scores (lower is better) between autoCell and the other six imputation tools. In evaluating the performance of autoCell in imputing missing values, we also selected two biologically relevant sc/snRNA-seq datasets[72]^24^,[73]^25 with well-annotated cell types as benchmarks. We simulated the dropout effects by randomly flipping 10% non-zero entries to the zero matrix. Similarly, three indicators between the original dataset and imputed values of these synthetic items are calculated as a measure of estimation accuracy. Compared with several state-of-the-art algorithms ([74]Figure 2), autoCell achieves the best performance evaluated by the median L1 distance, cosine similarity, and RMSE at the 10% synthetic dropout rate. Furthermore, autoCell imputation is closer to the true expression value based on the density plot of the estimated and true values ([75]Figure 2). Collectively, autoCell outperforms state-of-the-art methods in sc/snRNA-seq data imputation analysis ([76]Figure S2). autoCell improves the performance of existing tools for capturing cell developmental trajectories Apart from identifying cell types, scRNA-seq facilitates the organization of cells by time course or developmental stage (i.e., cell trajectory). The transition of cells from one functional state to another is a critical event in development. However, transitions are difficult to characterize since trapping and purifying cells in between stable endpoint states are challenging. Although some models currently exist to infer cell developmental trajectories based on scRNA-seq data, most inference methods do not address dropout events. We tested the accuracy of inferring the cell trajectory of scRNA-seq data after interpolation via autoCell. We used a benchmark dataset[77]^26 with 1,529 single cells with five stages of well-annotated human preimplantation embryonic development from embryogenesis embryonic day 3 (E3) to E7. We reconstructed cell trajectories using monocle3[78]^27 after various interpolation processes. The interpolation of autoCell produced the highest correspondence between the inferred pseudotime and the real-time cellular development ([79]Figure S3). Pseudotime order score (POS; see [80]STAR Methods) increased from 0.838 to 0.850. On the contrary, the POS obtained by other algorithms is lower than the original scRNA-seq dataset ([81]Figure S3). Furthermore, we used another common trajectory analysis model, slingshot,[82]^28^,[83]^29 to test whether autoCell improves trajectory analysis. We found that cell development trajectory is well captured by interpolation from autoCell ([84]Figure 3). Therefore, autoCell captures more accurate transcriptome dynamics and cell developmental trajectories across different developmental stages. Figure 3. [85]Figure 3 [86]Open in a new tab autoCell improves pseudotime analysis in the human preimplantation embryonic development dataset (A and B) Results of slingshot estimated pseudotime using (A) raw data as input and (B) processed data from autoCell as input. (C) Processed data from DCA as input. (D) Processed data from MAGIC as input. (E) Processed data from SAUCIE as input. (F) Processed data from scVI as input. (G) Processed data from SAVER as input. (H) Processed data from scGNN as input. POS, pseudotime order score (the higher the value the better). Pseudotime is a measure of how much progress an individual cell has made through a process such as cell differentiation. autoCell captures cellular pathobiology in latent space We also assessed the extent to which the latent space inferred by autoCell reflects the biological variability among cells based on the previous stratification of cells into biologically important subpopulations through unsupervised clustering followed by manual inspection and annotation. We applied autoCell to two simulated datasets and four biologically relevant scRNA-seq datasets. The zero ratio of these six datasets ranges from 60% to 90%. By default, autoCell extracted 10 features from the input data. For a fair comparison, we further applied common scRNA-seq data dimension reduction methods, including scVI, DESC,[87]^30 scVAE, DCA, and SAUCIE, to reduce the input data to 10 dimensions. We used uniform manifold approximation and projection (UMAP) to visualize the features extracted from these tools and the original data. We found that the feature embeddings of autoCell were well separated among cell types with closer inner-group distances and larger between-group distances. However, the embedding and original data of DESC, scVAE, DCA, and SAUCIE overlapped among certain cell types. We found that autoCell is the best approach to identify all three cell types in the two simulated datasets ([88]Figure S1C). For the Klein dataset,[89]^25 scVI, scVAE, and autoCell showed better performance, and DCA caused the cell types d0 and d2 to be closely linked. However, SAUCIE and DESC only separated cells with cell type d0 and incorrectly divided cell type d7 into two cell types ([90]Figure 4A). For the Zeisel dataset,[91]^24 we found that autoCell, scVI, and scVAE still outperformed the other models, and autoCell and scVAE achieved a closer intra-group distance ([92]Figure 4B). Figure 4. [93]Figure 4 [94]Open in a new tab UMAP visualization of the extracted features using different approaches (A and B) We evaluated autoCell with DCA, DESC, scVI, SAUCIE, and scVAE using two datasets: (A) Klein and (B) Zeisel datasets. For comparison, autoCell, DCA, DESC, scVI, SAUCIE, and scVAE all performed dimension reduction to 10 dimensions before applying UMAP. (C) Comparison on the effect of clustering on four benchmark datasets. Clustering accuracy was evaluated by applying K-means clustering on the extracted features to obtain cluster assignments. NMI, normalized mutual information (the higher the value the better); ARI, adjusted rand index; COM, completeness (the higher the value the better); HOM, homogeneity (the higher the value the better). We applied K-means clustering on autoCell-extracted latent features and assessed the clustering accuracy by comparing it with scVI, DESC, scVAE, DCA, and SAUCIE. We found that autoCell displays the best performance across all tested scRNA-seq datasets ([95]Figure 4). In the Klein dataset,[96]^25 the clustering output using autoCell ([97]Figure 4C) was more consistent with the predefined unit-type annotation (normalized mutual information [NMI] = 0.882 and adjusted Rand index [ARI] = 0.907) than the second ranked model, scVI (NMI = 0.832, ARI = 0.784). In the Zeisel dataset,[98]^24 the clustering performance of autoCell was considerably better than other existing tools. Then, we changed K-means to another four clustering algorithms (including spectral clustering, affinity propagation, birch, and agglomerative clustering). autoCell shows the best performance across all tested scRNA-seq datasets ([99]Table S1). In addition, we compared the visualization performance using principal-component analysis (PCA). As shown in [100]Figure S4, in the Klein dataset, the feature embedding of autoCell, scVI, and scVAE was well separated among cell types. For the Zeisel dataset, autoCell and DESC showed the best performance, although they mixed astrocytes and brain endothelial cells, both of which were glial cells located in the CNS, but they can effectively separate most cell types. For the Romanov dataset, all models identify two glial clusters (astrocytes and brain endothelial cells) close to each other in the latent space. However, autoCell was the only model that effectively separated the microglia ([101]Figure S4). Collectively, autoCell presented elevated accuracy in capturing cellular pathobiology than existing state-of-the-art approaches for simulated and real-world biologically relevant scRNA-seq datasets ([102]Figures S4 and [103]S5). Discovery of cell-type-specific molecular networks by autoCell In testing whether the autoCell-inferred cell type can capture specific pathobiology of human diseases, we analyzed astrocytes, microglia, neurons, and oligodendrocyte progenitor cells (OPCs) using Alzheimer’s disease (AD) as a prototypical example. In total, we re-analyzed 13,214 high-quality nuclei generated from the entorhinal cortex of AD brains and healthy controls. Using autoCell, we identified four microglia clusters, nine astrocyte clusters, and five OPC clusters ([104]Figure 5A). Recent studies using human postmortem brain tissues identified disease-associated astrocyte (DAA) involved in crucial roles of AD pathogenesis and disease progression of AD. Using 11 experimentally validated DAA marker genes (four upregulated marker genes [GFAP, CD44, HSPB1, and TNS] and seven downregulated marker genes [SLC1A2, SLC1A3, GLUL, NRXN1, CADM2, PTN, and GPC5]),[105]^31 we identified astrocyte subcluster 4 as a DAA by autoCell ([106]Figure S6). Next, we built a DAA-specific molecular network using a state-of-the-art network-based algorithm, GPSnet,[107]^32 under the human protein-protein interaction (PPI) network model. The DAA-specific module network included 50 PPIs connected by 44 proteins, such as APOE, MAPT, CD44, FOS, and STAT3 ([108]Figure 5B; [109]Table S2). APOE and microtubule-associated protein Tau (MAPT) were two of the most well-known risk genes for AD.[110]^33^,[111]^34 CD44 was an inflammation-associated protein. The inhibition of CD44 could be a potential strategy for AD treatment.[112]^35 In a mouse model study, Stat3-deficient and Stat3-deletion astrocytes presented dropped levels of β-amyloid and pro-inflammatory cytokine activities.[113]^36 Proteins from the DAA-specific molecular network were enriched by multiple AD-related pathways, such as cytokine signaling, spinal cord injury, and brain-derived neurotrophic factor signaling pathways ([114]Figure 5B; [115]Table S3). For example, several proteins in the DAA-specific network (STAT3, MAPT, HSPB8, HSPB1, JUNB, and LINGO1) were enriched in multiple cytokine signaling pathways, including interleukin-5 (IL-5), IL-2, IL-18, IL-3, and IL-4, consistent with the important role of neuro-inflammation mediated by microglia in AD.[116]^37^,[117]^38 Therefore, using autoCell, we can identify disease-associated, cell-type-specific molecular network involved in key AD pathobiology. Figure 5. [118]Figure 5 [119]Open in a new tab Discovery of cell-type-specific molecular networks and significant ligand-receptor interactions in Alzheimer’s disease (AD) using autoCell (A) UMAP visualization of subclusters of astrocytes (including disease-associated astrocyte [DAA]), microglia, and OPCs, labeled with autoCell clusters. (B) Reconstruction of the DAA-specific molecular subnetwork from the human protein-protein interactome. The DAA-specific module network included 50 protein-protein interactions (PPIs) connected by 44 proteins (e.g., APOE, MAPT, CD44, FOS, and STAT3). The proteins from the DAA-specific molecular network are enriched with multiple AD-related pathways, including cytokine signaling pathways. The proteins from the DAA-specific molecular network are enriched with multiple cytokine signaling pathways. (C) Inferred cell-cell interactions using CellChat. (D) Top selected significant ligand-receptor pairs among autoCell-identified cell types. Ligand-receptor interactions were predicted by CellChat (see [120]STAR Methods). We also identified significant ligand-receptor interactions involved in cell-cell communications in AD. We first inferred cell subpopulations using autoCell and the predicted ligand-receptor interaction using CellChat.[121]^39 As shown in [122]Figure 5C, we found strong ligand-receptor interactions among astrocytes, OPCs, and oligodendrocytes compared with the other three cell types (neuron, microglia, and endothelial). Two ligand-receptor pairs (NRG3-ERBB4 and NRG1-ERBB4) revealed strong interactions across multiple cell-cell pairs ([123]Figure 5D; [124]Table S4). Multiple single-nucleotide polymorphisms in the NRG3 gene were found to be associated with the onset of AD.[125]^40 In addition, the overexpression of ERBB4 in neurons was found to be associated with AD neuropathology.[126]^41 A recent AD mouse model study found that immunoreactivity of NRG1 and ERBB4 was associated with plaques in the hippocampus region.[127]^42 Using AD as a prototypical example, we demonstrated that the disease-associated cell subtype identified by autoCell could identify molecular targets and networks (i.e., ligand-receptor interactions) involved in AD pathogenesis and provide potential drug targets for AD or other human diseases if broadly applied. autoCell is scalable to large datasets The increasing number of cells is the main challenge in scRNA-seq analysis. In large projects such as the Human Cell Atlas,[128]^43 the number of cells may be hundreds of thousands. In large datasets, identifying cell populations is challenging because many existing scRNA-seq clustering methods cannot scale up to handle them. Thus, we used the Zheng-68k[129]^44 dataset and the Zheng-73k[130]^44 dataset to study whether autoCell is suitable for large datasets. These cell types were used as references when benchmarking autoCell. autoCell worked