Graphical abstract
graphic file with name fx1.jpg
[37]Open in a new tab
Highlights
* •
autoCell imputes heterogeneous and sparse sc/snRNA-seq data
* •
autoCell improves the performance of capturing cell developmental
trajectories
* •
autoCell captures disease-relevant cellular pathobiology in latent
space
* •
autoCell identifies cell-type-specific gene networks in Alzheimer’s
disease
Motivation
Single-cell RNA sequencing (scRNA-seq) enables researchers to study
gene expression at cellular resolution. However, noise caused by
amplification and dropout may hamper precise data analyses. It is
urgent to develop scalable denoising methods to deal with the
increasingly large, but sparse, scRNA-seq data. Here, we present
autoCell, a graph-embedded Gaussian mixture variational autoencoder
network algorithm for scRNA-seq dropout imputation and feature
extraction. Our autoCell provides a deep-learning toolbox for
end-to-end analysis of large-scale single-cell/nucleus RNA-seq data,
including visualization, clustering, imputation, and disease-specific
gene network identification.
__________________________________________________________________
Xu et al. develop a graph-embedded Gaussian mixture variational
autoencoder network algorithm (termed autoCell) for end-to-end analyses
of single-cell/nuclei RNA-seq data, including visualization,
clustering, imputation, and cell-type-specific gene network
identification. autoCell offers a useful tool for large-scale
single-cell genomic data analyses to accelerate translational biology
and disease discoveries.
Introduction
Single-cell technology is a revolutionary breakthrough, allowing us to
study the genome, transcriptome, and multi-omics systems of each cell,
in each state, in tissues.[38]^1^,[39]^2 Combined with technologies
such as fluorescent labeling and microdissection, it can also determine
spatial attributes and cell-cell communication. These technologies have
been widely used, leading to a revolution in basic and translational
medicine.
Single-cell or single-nucleus RNA sequencing (sc/snRNA-seq) is
important for identifying biological and disease-relevant cell types
and subpopulations from heterogeneous
cells.[40]^3^,[41]^4^,[42]^5^,[43]^6 Low-dimensional analysis of
expression in different cell states can also be highly effective in
reconstructing the cell developmental
trajectory.[44]^7^,[45]^8^,[46]^9^,[47]^10^,[48]^11^,[49]^12 However,
the amount of mRNA in a single cell is small, which necessitates a
nearly million-fold amplification. Although the measurement technology
has been greatly improved, technical factors still cause considerable
noise in data generated in scRNA-seq experiments, including
amplification deviation, library size difference, and extremely low
capture rate. In particular, the extremely low RNA capture rate leads
to undetectable, albeit expressed, genes, namely “dropout”
events.[50]^13 An essential difference is found between the “false”
zero count caused by the “dropout” event and the true zero count. Given
the sparse expression metrics, traditional analytic tools cannot
achieve scientific rigor, and they lack high data
reproducibility.[51]^14
Deep-learning algorithms have shown compelling performance in
high-dimensional data processing, including sparse genomic data.
DeepImute, an interpolation algorithm based on a deep neural network,
is one such example. It uses the dropout layer to learn the patterns in
the data to achieve accurate imputation.[52]^15 Another denoising model
is to denoise scRNA-seq datasets through a depth-counting autoencoder
(DCA) network. A DCA takes the count distribution, overdispersion, and
sparsity of the data into account using a negative binomial noise model
with zero inflation.[53]^16 scScope, a scalable deep-learning-based
method, can accurately and quickly identify cell-type composition from
millions of heterogeneous single-cell transcriptomic profiles.[54]^9
Compared with DCA and scScope, scVI[55]^17 uses a variational
autoencoder (VAE) to reduce the dimensionality of scRNA-seq data.
However, the standard VAE implemented by scVI[56]^17 only uses a single
isotropic multi-variate Gaussian distribution on the latent variable,
which is not generally suitable for representing multi-category data
such as scRNA-seq data containing multiple types of cells/nuclei. scVAE
is another method based on VAEs, which uses Gaussian mixture models
(GMMs) as prior distributions and introduces Poisson or negative
binomial distributions to obtain latent representations of
cells.[57]^18 However, the model considers the mean and variance of the
GMM as random variables, and it is approximated by a neural network
with a parameter
[MATH: β :MATH]
, which makes model optimization challenging. As the number of
measurable cells/nuclei and additional emerging sc/snRNA-seq data
analysis challenges increase, the demand for faster and scalable
estimation methods becomes a pressing need.
With the rapid development of graph neural networks, graph autoencoders
can learn low-dimensional representations of graph topology and train
node relationships with a global, entire graph view. Increasingly,
exploiting a graph neural network framework in sc/snRNA-seq data
analysis can be considered. Single-cell graph neural network
(scGNN)[58]^19 establishes cell-cell relationships through graphical
neural networks and uses left-truncated mixed Gaussian models to model
heterogeneous gene expression patterns. It also integrates three
iterative multi-mode autoencoders. However, the iterative autoencoder
framework requires more computing resources, which is more time
consuming.
In this study, we present a deep-learning framework, namely autoCell,
for dropout imputation and feature extraction from scRNA-seq data. The
accuracy index indicates that autoCell outperformed six
state-of-the-art published imputation methods in simulated datasets and
biologically relevant sc/snRNA-seq datasets with varying degrees of
human diseases. Therefore, autoCell is a scalable and accurate
scRNA-seq data processing method that is superior to other scRNA-seq
data analysis tools.
Results
Overview of the autoCell framework
The overview of autoCell is shown in [59]Figure 1. It is a VAE network
that combines graph embedding and GMMs to model the distribution of
high-dimensional, sparse scRNA-seq data. autoCell architecture can use
biological representations of cells and genes to perform different
scRNA-seq data analysis tasks. By integrating GMMs, autoCell can better
estimate data distribution. We apply graph embedding to deal with
sc/snRNA-seq data. Capturing the graphical information of the local
data structure is a good complement to deep GMMs, making the network
learn a global model with local structure constraints. Recent studies
have showed that zero-inflated negative binomial (ZINB) distribution
for modeling is an appropriate tool to solve the “dropout” event for
scRNA-seq data. In reducing the impact of dropout events in highly
sparse and over-dispersed count data, we introduce the ZINB
distribution model, thereby denoising scRNA-seq data (see [60]STAR
Methods).
Figure 1.
[61]Figure 1
[62]Open in a new tab
Overview of the autoCell framework
autoCell uses a Gaussian mixture model (GMM) and a deep neural network
(DNN) to model the process of data generation. It uses zero-inflated
negative binomial (ZINB) loss to process “dropout” events in scRNA-seq.
The encoder and decoder are a two-layer neural network (128–128) with
10-dimensional latent variables (features) directly connected to the
output. The cell-cell network is used to constrain the latent feature
[MATH: Z :MATH]
; thus, similar cells have similar latent features and cluster
assignments. The yellow nodes depict the mean of the negative binomial
distribution, which is the main output of the method representing
denoised data, whereas the purple and blue nodes represent the other
two parameters of the ZINB distribution, namely, dispersion and
dropout.
autoCell effectively imputes scRNA-seq data
We first applied autoCell to simulated scRNA-seq data to assess its
imputation performance.[63]^20 We simulated two datasets with three
cell types, including 1,000 cells and 2,000 genes. For the simulation
of the two datasets, 60% and 71% of the data values were set to the
zero matrix to simulate dropout events in real data. We divided the
entry of the simulated raw expression data into zero and non-zero
space. Based on the density plot of the estimated and real values, the
recovery values of DCA and autoCell are closer to the true expression
values, and scGNN is at a medium level. MAGIC,[64]^21 SAVER,[65]^22 and
SAUCIE[66]^23 always tend to underestimate the original value. We also
calculated the median L1 distance, root-mean-square error (RMSE)
scores, and cosine similarity (see [67]STAR Methods) score between the
real expression value and restored expression value to measure the
estimation accuracy. As shown in [68]Figure 2, autoCell achieves
overall better performance than the others. In particular, autoCell
ranks second in the median L1 distance of gene expression restoration
on the two simulated datasets and ranks second in the cosine similarity
score on the simulated dataset with a synthetic dropout rate of 71%. In
addition, the dropout event will increase the noise, muddling the
identity of cell types, which can be restored through interpolation by
value recovery algorithms. We also found that autoCell outperforms
existing methods in the two simulation datasets ([69]Figures S1A and
S1B).
Figure 2.
[70]Figure 2
[71]Open in a new tab
Performance comparison between autoCell and other state-of-the-art
methods under 10% synthetic dropout rate
(A) Density plots of imputed versus original data masked. The x axis
corresponds to the imputed values, and the y axis represents the true
values of the masked data points. Each row is a different dataset, and
each column is a different imputation method. Pearson correlation
coefficient (PPC; higher is better).
(B) Comparison of the cosine similarity (higher is better), median L1
distance (lower is better), and root-mean-square error (RMSE) scores
(lower is better) between autoCell and the other six imputation tools.
In evaluating the performance of autoCell in imputing missing values,
we also selected two biologically relevant sc/snRNA-seq
datasets[72]^24^,[73]^25 with well-annotated cell types as benchmarks.
We simulated the dropout effects by randomly flipping 10% non-zero
entries to the zero matrix. Similarly, three indicators between the
original dataset and imputed values of these synthetic items are
calculated as a measure of estimation accuracy. Compared with several
state-of-the-art algorithms ([74]Figure 2), autoCell achieves the best
performance evaluated by the median L1 distance, cosine similarity, and
RMSE at the 10% synthetic dropout rate. Furthermore, autoCell
imputation is closer to the true expression value based on the density
plot of the estimated and true values ([75]Figure 2). Collectively,
autoCell outperforms state-of-the-art methods in sc/snRNA-seq data
imputation analysis ([76]Figure S2).
autoCell improves the performance of existing tools for capturing cell
developmental trajectories
Apart from identifying cell types, scRNA-seq facilitates the
organization of cells by time course or developmental stage (i.e., cell
trajectory). The transition of cells from one functional state to
another is a critical event in development. However, transitions are
difficult to characterize since trapping and purifying cells in between
stable endpoint states are challenging. Although some models currently
exist to infer cell developmental trajectories based on scRNA-seq data,
most inference methods do not address dropout events. We tested the
accuracy of inferring the cell trajectory of scRNA-seq data after
interpolation via autoCell. We used a benchmark dataset[77]^26 with
1,529 single cells with five stages of well-annotated human
preimplantation embryonic development from embryogenesis embryonic day
3 (E3) to E7. We reconstructed cell trajectories using monocle3[78]^27
after various interpolation processes. The interpolation of autoCell
produced the highest correspondence between the inferred pseudotime and
the real-time cellular development ([79]Figure S3). Pseudotime order
score (POS; see [80]STAR Methods) increased from 0.838 to 0.850. On the
contrary, the POS obtained by other algorithms is lower than the
original scRNA-seq dataset ([81]Figure S3). Furthermore, we used
another common trajectory analysis model, slingshot,[82]^28^,[83]^29 to
test whether autoCell improves trajectory analysis. We found that cell
development trajectory is well captured by interpolation from autoCell
([84]Figure 3). Therefore, autoCell captures more accurate
transcriptome dynamics and cell developmental trajectories across
different developmental stages.
Figure 3.
[85]Figure 3
[86]Open in a new tab
autoCell improves pseudotime analysis in the human preimplantation
embryonic development dataset
(A and B) Results of slingshot estimated pseudotime using (A) raw data
as input and (B) processed data from autoCell as input.
(C) Processed data from DCA as input.
(D) Processed data from MAGIC as input.
(E) Processed data from SAUCIE as input.
(F) Processed data from scVI as input.
(G) Processed data from SAVER as input.
(H) Processed data from scGNN as input.
POS, pseudotime order score (the higher the value the better).
Pseudotime is a measure of how much progress an individual cell has
made through a process such as cell differentiation.
autoCell captures cellular pathobiology in latent space
We also assessed the extent to which the latent space inferred by
autoCell reflects the biological variability among cells based on the
previous stratification of cells into biologically important
subpopulations through unsupervised clustering followed by manual
inspection and annotation. We applied autoCell to two simulated
datasets and four biologically relevant scRNA-seq datasets. The zero
ratio of these six datasets ranges from 60% to 90%. By default,
autoCell extracted 10 features from the input data. For a fair
comparison, we further applied common scRNA-seq data dimension
reduction methods, including scVI, DESC,[87]^30 scVAE, DCA, and SAUCIE,
to reduce the input data to 10 dimensions. We used uniform manifold
approximation and projection (UMAP) to visualize the features extracted
from these tools and the original data. We found that the feature
embeddings of autoCell were well separated among cell types with closer
inner-group distances and larger between-group distances. However, the
embedding and original data of DESC, scVAE, DCA, and SAUCIE overlapped
among certain cell types. We found that autoCell is the best approach
to identify all three cell types in the two simulated datasets
([88]Figure S1C). For the Klein dataset,[89]^25 scVI, scVAE, and
autoCell showed better performance, and DCA caused the cell types d0
and d2 to be closely linked. However, SAUCIE and DESC only separated
cells with cell type d0 and incorrectly divided cell type d7 into two
cell types ([90]Figure 4A). For the Zeisel dataset,[91]^24 we found
that autoCell, scVI, and scVAE still outperformed the other models, and
autoCell and scVAE achieved a closer intra-group distance
([92]Figure 4B).
Figure 4.
[93]Figure 4
[94]Open in a new tab
UMAP visualization of the extracted features using different approaches
(A and B) We evaluated autoCell with DCA, DESC, scVI, SAUCIE, and scVAE
using two datasets: (A) Klein and (B) Zeisel datasets. For comparison,
autoCell, DCA, DESC, scVI, SAUCIE, and scVAE all performed dimension
reduction to 10 dimensions before applying UMAP.
(C) Comparison on the effect of clustering on four benchmark datasets.
Clustering accuracy was evaluated by applying K-means clustering on the
extracted features to obtain cluster assignments.
NMI, normalized mutual information (the higher the value the better);
ARI, adjusted rand index; COM, completeness (the higher the value the
better); HOM, homogeneity (the higher the value the better).
We applied K-means clustering on autoCell-extracted latent features and
assessed the clustering accuracy by comparing it with scVI, DESC,
scVAE, DCA, and SAUCIE. We found that autoCell displays the best
performance across all tested scRNA-seq datasets ([95]Figure 4). In the
Klein dataset,[96]^25 the clustering output using autoCell
([97]Figure 4C) was more consistent with the predefined unit-type
annotation (normalized mutual information [NMI] = 0.882 and adjusted
Rand index [ARI] = 0.907) than the second ranked model, scVI (NMI =
0.832, ARI = 0.784). In the Zeisel dataset,[98]^24 the clustering
performance of autoCell was considerably better than other existing
tools. Then, we changed K-means to another four clustering algorithms
(including spectral clustering, affinity propagation, birch, and
agglomerative clustering). autoCell shows the best performance across
all tested scRNA-seq datasets ([99]Table S1).
In addition, we compared the visualization performance using
principal-component analysis (PCA). As shown in [100]Figure S4, in the
Klein dataset, the feature embedding of autoCell, scVI, and scVAE was
well separated among cell types. For the Zeisel dataset, autoCell and
DESC showed the best performance, although they mixed astrocytes and
brain endothelial cells, both of which were glial cells located in the
CNS, but they can effectively separate most cell types. For the Romanov
dataset, all models identify two glial clusters (astrocytes and brain
endothelial cells) close to each other in the latent space. However,
autoCell was the only model that effectively separated the microglia
([101]Figure S4).
Collectively, autoCell presented elevated accuracy in capturing
cellular pathobiology than existing state-of-the-art approaches for
simulated and real-world biologically relevant scRNA-seq datasets
([102]Figures S4 and [103]S5).
Discovery of cell-type-specific molecular networks by autoCell
In testing whether the autoCell-inferred cell type can capture specific
pathobiology of human diseases, we analyzed astrocytes, microglia,
neurons, and oligodendrocyte progenitor cells (OPCs) using Alzheimer’s
disease (AD) as a prototypical example. In total, we re-analyzed 13,214
high-quality nuclei generated from the entorhinal cortex of AD brains
and healthy controls. Using autoCell, we identified four microglia
clusters, nine astrocyte clusters, and five OPC clusters
([104]Figure 5A). Recent studies using human postmortem brain tissues
identified disease-associated astrocyte (DAA) involved in crucial roles
of AD pathogenesis and disease progression of AD. Using 11
experimentally validated DAA marker genes (four upregulated marker
genes [GFAP, CD44, HSPB1, and TNS] and seven downregulated marker genes
[SLC1A2, SLC1A3, GLUL, NRXN1, CADM2, PTN, and GPC5]),[105]^31 we
identified astrocyte subcluster 4 as a DAA by autoCell
([106]Figure S6). Next, we built a DAA-specific molecular network using
a state-of-the-art network-based algorithm, GPSnet,[107]^32 under the
human protein-protein interaction (PPI) network model. The DAA-specific
module network included 50 PPIs connected by 44 proteins, such as APOE,
MAPT, CD44, FOS, and STAT3 ([108]Figure 5B; [109]Table S2). APOE and
microtubule-associated protein Tau (MAPT) were two of the most
well-known risk genes for AD.[110]^33^,[111]^34 CD44 was an
inflammation-associated protein. The inhibition of CD44 could be a
potential strategy for AD treatment.[112]^35 In a mouse model study,
Stat3-deficient and Stat3-deletion astrocytes presented dropped levels
of β-amyloid and pro-inflammatory cytokine activities.[113]^36 Proteins
from the DAA-specific molecular network were enriched by multiple
AD-related pathways, such as cytokine signaling, spinal cord injury,
and brain-derived neurotrophic factor signaling pathways
([114]Figure 5B; [115]Table S3). For example, several proteins in the
DAA-specific network (STAT3, MAPT, HSPB8, HSPB1, JUNB, and LINGO1) were
enriched in multiple cytokine signaling pathways, including
interleukin-5 (IL-5), IL-2, IL-18, IL-3, and IL-4, consistent with the
important role of neuro-inflammation mediated by microglia in
AD.[116]^37^,[117]^38 Therefore, using autoCell, we can identify
disease-associated, cell-type-specific molecular network involved in
key AD pathobiology.
Figure 5.
[118]Figure 5
[119]Open in a new tab
Discovery of cell-type-specific molecular networks and significant
ligand-receptor interactions in Alzheimer’s disease (AD) using autoCell
(A) UMAP visualization of subclusters of astrocytes (including
disease-associated astrocyte [DAA]), microglia, and OPCs, labeled with
autoCell clusters.
(B) Reconstruction of the DAA-specific molecular subnetwork from the
human protein-protein interactome. The DAA-specific module network
included 50 protein-protein interactions (PPIs) connected by 44
proteins (e.g., APOE, MAPT, CD44, FOS, and STAT3). The proteins from
the DAA-specific molecular network are enriched with multiple
AD-related pathways, including cytokine signaling pathways. The
proteins from the DAA-specific molecular network are enriched with
multiple cytokine signaling pathways.
(C) Inferred cell-cell interactions using CellChat.
(D) Top selected significant ligand-receptor pairs among
autoCell-identified cell types. Ligand-receptor interactions were
predicted by CellChat (see [120]STAR Methods).
We also identified significant ligand-receptor interactions involved in
cell-cell communications in AD. We first inferred cell subpopulations
using autoCell and the predicted ligand-receptor interaction using
CellChat.[121]^39 As shown in [122]Figure 5C, we found strong
ligand-receptor interactions among astrocytes, OPCs, and
oligodendrocytes compared with the other three cell types (neuron,
microglia, and endothelial). Two ligand-receptor pairs (NRG3-ERBB4 and
NRG1-ERBB4) revealed strong interactions across multiple cell-cell
pairs ([123]Figure 5D; [124]Table S4). Multiple single-nucleotide
polymorphisms in the NRG3 gene were found to be associated with the
onset of AD.[125]^40 In addition, the overexpression of ERBB4 in
neurons was found to be associated with AD neuropathology.[126]^41 A
recent AD mouse model study found that immunoreactivity of NRG1 and
ERBB4 was associated with plaques in the hippocampus region.[127]^42
Using AD as a prototypical example, we demonstrated that the
disease-associated cell subtype identified by autoCell could identify
molecular targets and networks (i.e., ligand-receptor interactions)
involved in AD pathogenesis and provide potential drug targets for AD
or other human diseases if broadly applied.
autoCell is scalable to large datasets
The increasing number of cells is the main challenge in scRNA-seq
analysis. In large projects such as the Human Cell Atlas,[128]^43 the
number of cells may be hundreds of thousands. In large datasets,
identifying cell populations is challenging because many existing
scRNA-seq clustering methods cannot scale up to handle them. Thus, we
used the Zheng-68k[129]^44 dataset and the Zheng-73k[130]^44 dataset to
study whether autoCell is suitable for large datasets. These cell types
were used as references when benchmarking autoCell. autoCell worked