Abstract

   Single-cell RNA-sequencing (scRNA-Seq) is widely used to reveal the
   heterogeneity and dynamics of tissues, organisms, and complex diseases,
   but its analyses still suffer from multiple grand challenges, including
   the sequencing sparsity and complex differential patterns in gene
   expression. We introduce the scGNN (single-cell graph neural network)
   to provide a hypothesis-free deep learning framework for scRNA-Seq
   analyses. This framework formulates and aggregates cell–cell
   relationships with graph neural networks and models heterogeneous gene
   expression patterns using a left-truncated mixture Gaussian model.
   scGNN integrates three iterative multi-modal autoencoders and
   outperforms existing tools for gene imputation and cell clustering on
   four benchmark scRNA-Seq datasets. In an Alzheimer’s disease study with
   13,214 single nuclei from postmortem brain tissues, scGNN successfully
   illustrated disease-related neural development and the differential
   mechanism. scGNN provides an effective representation of gene
   expression and cell–cell relationships. It is also a powerful framework
   that can be applied to general scRNA-Seq analyses.

   Subject terms: Computational models, Machine learning, Software
     __________________________________________________________________

   Single-cell RNA-Seq suffers from heterogeneity in sequencing sparsity
   and complex differential patterns in gene expression. Here, the authors
   introduce a graph neural network based on a hypothesis-free deep
   learning framework as an effective representation of gene expression
   and cell–cell relationships.

Introduction

   Single-cell RNA-sequencing (scRNA-seq) techniques enable
   transcriptome-wide gene expression measurement in individual cells,
   which are essential for identifying cell-type clusters, inferring the
   arrangement of cell populations according to trajectory topologies, and
   highlighting somatic clonal structures while characterizing cellular
   heterogeneity in complex diseases^[43]1,[44]2. scRNA-seq analysis for
   biological inference remains challenging due to its complex and
   un-determined data distribution, which has a large volume and high rate
   of dropout events. Some pioneer methodologies, e.g., Phenograph^[45]3,
   MAGIC^[46]4, and Seurat^[47]5 use a k-nearest-neighbor (KNN) graph to
   model the relationships between cells. However, such a graph
   representation may over-simplify the complex cell and gene
   relationships of the global cell population. Recently, the emerging
   graph neural network (GNN) has deconvoluted node relationships in a
   graph through neighbor information propagation in a deep learning
   architecture^[48]6–[49]8. Compared with other autoencoders used in the
   scRNA-Seq analysis^[50]9–[51]12 for revealing an effective
   representation of scRNA-Seq data via recreating its own input, the
   unique feature of graph autoencoder is in being able to learn a
   low-dimensional representation of the graph topology and train node
   relationships in a global view of the whole graph^[52]13.

   We introduce a multi-modal framework scGNN (single-cell graph neural
   network) for modeling heterogeneous cell–cell relationships and their
   underlying complex gene expression patterns from scRNA-Seq. scGNN
   trains low-dimensional feature vectors (i.e., embedding) to represent
   relationships among cells through topological abstraction based on both
   gene expression and transcriptional regulation information. There are
   three unique features in scGNN: (i) scGNN utilizes GNN with multi-modal
   autoencoders to formulate and aggregate cell–cell relationships,
   providing a hypothesis-free framework to derive biologically meaningful
   relationships. The framework does not need to assume any statistical
   distribution or relationships for gene expression data or dropout
   events. (ii) Cell-type-specific regulatory signals are modeled in
   building a cell graph, equipped with a left-truncated mixture Gaussian
   (LTMG) model for scRNA-Seq data^[53]14. This can improve the
   signal-to-noise ratio in terms of embedding biologically meaningful
   information. (iii) Bottom-up cell relationships are formulated from a
   dynamically pruned GNN cell graph. The entire graph can be represented
   by pooling on learned graph embedding of all nodes in the graph. The
   graph embedding can be used as low-dimensional features with tolerance
   to noises for the preservation of topological relationships in the cell
   graph. The derived cell–cell relationships are adopted as regularizers
   in the autoencoder training to recover gene expression values.

   scGNN has great potential in capturing biological cell–cell
   relationships in terms of cell-type clustering, cell trajectory
   inference, cell lineages formation, and cells transitioning between
   states. In this paper, we mainly focus on discovering its applicative
   power in two fundamental aspects from scRNA-Seq data, i.e., gene
   imputation and cell clustering. Gene imputation aims to solve the
   dropout issue which commonly exists in scRNA-Seq data where the
   expressions of a large number of active genes are marked as
   zeros^[54]15–[55]17. The excess of zero values often needs to be
   recovered or handled to avoid the exaggeration of the dropout events in
   many downstream biological analyses and interpretations. Existing
   imputation methods^[56]18, such as MAGIC^[57]4 and SAVER^[58]19, have
   an issue in generating biased estimates of gene expression and tend to
   induce false-positive and biased gene correlations that could possibly
   eliminate some meaningful biological variations^[59]20,[60]21. On the
   other hand, many studies, including Seurat^[61]5 and Phenograph^[62]3,
   have explored the cell–cell relationships using raw scRNA-seq data, and
   built cell graphs with reduced data dimensions and detected cell
   clusters by applying the Louvain modularity optimization. Accurate
   cell–cell relationships obey the rule that cells are more homogeneous
   within a cell type and more heterogeneous among different cell
   types^[63]22, The scGNN model provides a global perspective in
   exploring cell relationships by integrating cell neighbors on the whole
   population.

   scGNN achieves promising performance in gene imputation and cell
   cluster prediction on four scRNA-Seq data sets with gold-standard cell
   labels^[64]23–[65]26, compared to nine existing imputation and four
   clustering tools (Supplementary Table [66]1). We believe that the
   superior performance in gene imputation and cell cluster prediction
   benefits from (i) our integrative autoencoder framework, which
   synergistically determines cell clusters based on a bottom-up
   integration of detailed pairwise cell–cell relationships and the
   convergence of predicted clusters, and (ii) the integration of both
   gene regulatory signals and cell network representations in hidden
   layers as regularizers of our autoencoders. To further demonstrate the
   power of scGNN in complex disease studies, we applied it to an
   Alzheimer’s disease (AD) data set containing 13,214 single nuclei,
   which elucidated its application power on cell-type identification and
   recovering gene expression values^[67]27. We claim that such a
   GNN-based framework is powerful and flexible enough to have great
   potential in integrating scMulti-Omics data.

Results

The architecture of scGNN comprises stacked autoencoders

   The main architecture of scGNN is used to seek effective
   representations of cells and genes that are useful for performing
   different tasks in scRNA-Seq data analyses (Fig. [68]1 and
   Supplementary Fig. [69]1). It has three comprehensive computational
   components in an iteration process, including gene regulation
   integration in a feature autoencoder, cell graph representation in a
   graph autoencoder, gene expression updating in a set of parallel
   cell-type-specific cluster autoencoders, as well as the final gene
   expression recovery in an imputation autoencoder (Fig. [70]1).

Fig. 1. The architecture of scGNN.

   [71]Fig. 1
   [72]Open in a new tab

   It takes the gene expression matrix generated from scRNA-Seq as the
   input. LTMG can translate the input gene expression data into a
   discretized regulatory signal as the regularizer for the feature
   autoencoder. The feature autoencoder learns a dimensional
   representation of the input as embedding, upon which a cell graph is
   constructed and pruned. The graph autoencoder learns a topological
   graph embedding of the cell graph, which is used for cell-type
   clustering. The cells in each cell type have an individual cluster
   autoencoder to reconstruct gene expression values. The framework treats
   the reconstructed expression as a new input iteratively until
   converging. Finally, the imputed gene expression values are obtained by
   the feature autoencoder regularized by the cell–cell relationships in
   the learned cell graph on the original pre-processed raw expression
   matrix through the imputation autoencoder. LTMG is abbreviated for the
   left-truncated mixed Gaussian model.

   The feature autoencoder intakes the pre-processed gene expression
   matrix after the removal of low-quality cells and genes, normalization,
   and variable gene ranking (Fig. [73]2a). First, the LTMG
   model^[74]14,[75]28 is adopted to the top 2,000 variable genes to
   quantify gene regulatory signals encoded among diverse cell states in
   scRNA-Seq data (see “Methods” section and Supplementary Fig. [76]2).
   This model was built based on the kinetic relationships between the
   transcriptional regulatory inputs and mRNA metabolism and abundance,
   which can infer the expression of multi-modalities across single cells.
   The captured signals have a better signal-to-noise ratio to be used as
   a high-order restraint to regularize the feature autoencoder. The aim
   of this regularization is to treat each gene differently based on their
   individual regulation status through a penalty in the loss function.
   The feature autoencoder learns a low-dimensional embedding by the gene
   expression reconstruction together with the regularization. A cell–cell
   graph is generated from the learned embedding via the KNN graph, where
   nodes represent individual cells and the edges represent neighborhood
   relations among these cells^[77]29,[78]30. Then, the cell graph is
   pruned from selecting an adaptive number of neighbors for each node on
   the KNN graph by removing the noisy edges^[79]3.

Fig. 2. The architecture of scGNN Autoencoders.

   [80]Fig. 2
   [81]Open in a new tab

   a The feature autoencoder takes the expression matrix as the input,
   regularized by LTMG signals. The dimensions of the encoder and decoder
   layers are 512 × 128 and 128 × 512, respectively. The feature
   autoencoder is trained by minimizing the difference between the input
   matrix and the output matrix. b The graph autoencoder takes the
   adjacency matrix of the pruned graph as the input. The encoder consists
   of two layers of GNNs. In each layer, each node of the graph aggregates
   information from its neighbors. The encoder learns a low-dimensional
   presentation (i.e., graph embedding) of the pruned cell graph. The
   decoder reconstructs the adjacency matrix of the graph by dot products
   of the learned graph embedding followed by a sigmoid activation
   function. The graph autoencoder is trained by minimizing the
   cross-entropy loss between the input and the reconstructed graph. Cell
   clusters are obtained by applying k-means and Louvain on the graph
   embedding. c The cluster autoencoder takes a reconstructed expression
   matrix from the feature autoencoder as the input. An individual encoder
   is built on the cells in each of the identified clusters, and each
   autoencoder is trained individually. The concatenation of the results
   from all clusters is treated as the reconstructed matrix.

   Taking the pruned cell graph as input, the encoder of the graph
   autoencoder uses GNN to learn a low-dimensional embedding of each node
   and then regenerates the whole graph structure through the decoder of
   the graph autoencoder (Fig. [82]2b). Based on the topological
   properties of the cell graph, the graph autoencoder abstracts intrinsic
   high-order cell–cell relationships propagated on the global graph. The
   low-dimensional graph embedding integrates the essential pairwise
   cell–cell relationships and the global cell–cell graph topology using a
   graph formulation by regenerating the topological structure of the
   input cell graph. Then the k-means clustering method is used to cluster
   cells on the learned graph embedding^[83]31, where the number of
   clusters is determined by the Louvain algorithm^[84]31 on the cell
   graph.

   The expression matrix in each cell cluster from the feature autoencoder
   is reconstructed through the cluster autoencoder. Using the inferred
   cell-type information from the graph autoencoder, the cluster
   autoencoder treats different cell types specifically and regenerates
   expression in the same cell cluster (Fig. [85]2c). The cluster
   autoencoder helps discover cell-type-specific information for each cell
   type in its individualized learning. Accompanied by the feature
   autoencoder, the cluster autoencoder leverages the inferences between
   global and cell-type-specific representation learning. Iteratively, the
   reconstructed matrix is fed back into the feature autoencoder. The
   iteration process stops until it converges with no change in cell
   clustering and this cell clustering result is recognized as the final
   results of cell-type prediction.

   After the iteration stops, this imputation autoencoder takes the
   original gene expression matrix as input and is trained with the
   additional L1 regularizer of the inferred cell–cell relationships. The
   regularizers (see “Methods” section) are generated based on edges in
   the learned cell graph in the last iteration and their co-occurrences
   in the same predicted cell type. Besides, the L1 penalty term is
   applied to increase the model generalization by squeezing more zeroes
   into the autoencoder model weights. The sparsity brought by the L1 term
   benefits the expression imputation in dropout effects. Finally, the
   reconstructed gene expression values are used as the final imputation
   output.

scGNN can effectively impute scRNA-Seq data and accurately predict cell
clusters

   To assess the imputation and cell clustering performance of scGNN, four
   scRNA data sets (i.e., Chung^[86]26, Kolodziejczy^[87]23, Klein^[88]24,
   and Zeisel^[89]25) with gold-standard cell-type labels are chosen as
   the benchmarks (more performance evaluation on other data sets can be
   found in Supplementary Data [90]1–[91]2). We simulated the dropout
   effects by randomly flipping a number of the non-zero entries to zeros.
   The synthetic dropout simulation was based on the same leave-one-out
   strategy used in scVI^[92]32 (Supplementary Fig. [93]3). Median L1
   distance, cosine similarity, and root-mean-square-deviation (RMSE)
   scores between the original data set and the imputed values for these
   synthetic entries were calculated to compare scGNN with MAGIC^[94]4,
   SAUCIE^[95]10, SAVER^[96]19, scImpute^[97]33, scVI^[98]32, DCA^[99]11,
   DeepImpute^[100]34, scIGANs^[101]35, and netNMF-sc^[102]36 (see
   “Methods” section). scGNN achieves the best results in recovering gene
   expressions in terms of median L1 distance, and RMSE at the 10 and 30%
   synthetic dropout rate, respectively. While the cosine similarity score
   of scGNN ranks at the top place for 10% rate and the third place for
   30% rate. (Fig. [103]3a and Supplementary Data [104]1). Furthermore,
   scGNN can recover the underlying gene–gene relationships missed in the
   raw expression data due to the sparsity of scRNA-Seq. For example, two
   pluripotency epiblast gene pairs, Ccnd3 versus Pou5f1 and Nanog versus
   Trim28, are lowly correlated in the original raw data but show strong
   correlations relations, which are differentiated by time points after
   scGNN imputation and, therefore, perform with a consistency leading to
   the desired results sought in the original paper^[105]24
   (Fig. [106]3b). The recovered relations of four more gene pairs are
   also showcased in Supplementary Figure [107]4. scGNN amplifies
   differentially expressed genes (DEGs) signals with a higher fold change
   than the original, using an imputed matrix to confidently depict the
   cluster heterogeneity (Fig. [108]3c). We also compared the DEG signal
   changes before and after imputation using other imputation tools. As an
   example, 744 DEGs (logFC > 0.25) identified in Microglia (benchmark
   cell label) of Zeisel data were compared logFC value change before and
   after imputation (Supplementary Fig. [109]5). The result turned out
   that scGNN is the only tool that increases all most all DEG signals in
   Microglia with the strongest Pearson’s correlation coefficient to the
   original data. Other tools showed weaker coefficients and signals in
   some of the genes were decreased, indicating imputation bias in these
   tools. Our results indicate that scGNN can accurately restore
   expression values, capture true gene–gene relations, and increase DEG
   signals, without inducing additional noises.

Fig. 3. Comparison of the imputation performance.

   [110]Fig. 3
   [111]Open in a new tab

   a Comparison of the cosine similarity, median L1 distance, and RMSE
   scores between scGNN and other nine imputation tools under 10 and 30%
   synthetic dropout rate. Darker color indicates better performances. The
   highest score in each column is highlighted with the yellow box. RMSE
   scores were scaled by multiplying by 100. b Co-expression patterns can
   be addressed more explicitly after applying scGNN on the Klein data. No
   clear gene pair relationship of Ccnd3 versus Pou5f1 (upper panel) and
   Nanog versus Trim28 (lower panel) is observed in the raw data (left)
   compared to the observation of unambiguous correlations within each
   cell type after scGNN imputation (right). c Comparison of DEG logFC
   scores using the original expression value (x axis) and the scGNN
   imputed expression values (y axis) identified in day 1 cells of the
   Klein data (up) and microglial cells of the Zeisel data (bottom). The
   differentiation signals are amplified after imputation.

   Besides the artificial dropout benchmarks, we continued to evaluate the
   clustering performance of scGNN and the nine imputation tools on the
   same two data sets. The predicted cell labels were systematically
   evaluated using 10 criteria including an adjusted Rand index
   (ARI)^[112]37, Silhouette^[113]38, and eight other criteria
   (Fig. [114]4a and Supplementary Data [115]2). By visualizing cell
   clustering results on UMAPs^[116]39, one can observe more apparent
   closeness of cells within the same cluster and separation among
   different clusters when using scGNN embeddings compared to the other
   nine imputation tools (Fig. [117]4b). We also observed that compared to
   the tSNE^[118]40 and PHATE^[119]41 visualization methods, UMAP showed
   better display results with closer inner-group distance and larger
   between-group distances (Supplementary Fig. [120]6). The expression
   patterns show heterogeneity along with embryonic stem cell development.
   In the case of Klein’s time-series data, scGNN recovered a complex
   structure that was not well represented by the raw data, showing a
   well-aligned trajectory path of cell development from Day 1 to Day 7
   (Fig. [121]4c). Moreover, scGNN showed significant enhancement in cell
   clustering compared to the existing scRNA-Seq analytical framework
   (e.g., Seurat using the Louvain method) when using the raw data
   (Supplementary Fig. [122]7). We hypothesized that the cell–cell graph
   constructed from scGNN can reflect cell–cell communications based on
   ligand–receptor pairs. Using CellChat^[123]42 and curated
   receptor–ligand pairs, we proved that aggregated interaction
   probability of cell pairs defined in an scGNN cell–cell graph is
   significantly higher than randomly selected cell pairs, which strongly
   indicates the capability of scGNN in capturing the real cell–cell
   communications and interactions (Supplementary Fig. [124]8).

Fig. 4. Cell clustering and trajectory evaluations.

   [125]Fig. 4
   [126]Open in a new tab

   a Comparison of ARI and Silhouette scores among scGNN and nine tools
   using Klein and Zeisel data sets. b Comparison of UMAP visualizations
   on the same two data sets, indicating that when scGNN embeddings are
   utilized, cells are more closely grouped within the same cluster but
   when other tools are used, cells are more separated between clusters.
   Cells were clustered via the Louvain method and visualized using UMAP.
   c Pseudotime analysis using the raw expression matrix and scGNN imputed
   matrix of the Klein data set via Monocle. d Justification of using the
   graph autoencoder, the cluster autoencoder, and the top 2000 variable
   genes on the Klein data set in the scGNN framework, in terms of ARI.
   scGNN CA- shows the results of the graph autoencoder’s ablation, CA-
   shows the results of the cluster autoencoder’s ablation, and AG shows
   the results after using all genes in the framework.

   On top of that, to address the significance of using the graph
   autoencoder and cluster autoencoder in scGNN, we performed ablation
   tests to bypass each autoencoder and compare the ARI results on the
   Klein data set (Fig. [127]4d and Supplementary Fig. [128]9). The
   results showed that removing either of these two autoencoders
   dramatically decreased the performance of scGNN in terms of cell
   clustering accuracy. Another test using all genes rather than the top
   2,000 variable genes also showed poor performance in the results and
   doubled the runtime of scGNN, indicating that those low variable genes
   may reduce the signal-to-noise ratio and negatively affect the accuracy
   of scGNN. The design and comprehensive results of the ablation studies
   on both clustering and imputation are detailed in Supplementary
   Methods [129]1–[130]4, Supplementary Table [131]2, and Supplementary
   Data [132]3–[133]8. We also extensively studied the parameter selection
   in Supplementary Data [134]9–[135]12.

scGNN illustrates AD-related neural development and the underlying regulatory
mechanism

   To further demonstrate the applicative power of scGNN, we applied it to
   a scRNA-Seq data set (GEO accession number [136]GSE138852) containing
   13,214 single nuclei collected from six AD and six control
   brains^[137]27. scGNN identifies 10 cell clusters, including microglia,
   neurons, oligodendrocyte progenitor cells (OPCs), astrocytes, and six
   sub-clusters of oligodendrocytes (Fig. [138]5a). Specifically, the
   proportions of these six oligodendrocyte sub-clusters differ between AD
   patients (Oligos 2, 3, and 4) and healthy controls (Oligos 1, 5, and 6)
   (Fig. [139]5b). Moreover, the difference between AD and the control in
   the proportion of astrocyte and OPCs is observed, indicating the change
   of cell population in AD patients compared to healthy controls
   (Fig. [140]5b). We then combined these six oligodendrocyte sub-clusters
   into one to discover DEGs. Since scGNN can significantly increase true
   signals in the raw data set, DEG patterns are more explicit
   (Supplementary Fig. [141]10). Among all DEGs, we confirmed 22 genes as
   cell-type-specific markers for astrocytes, OPCs, oligodendrocytes, and
   neurons, in that order^[142]43 (Fig. [143]5c). A biological pathway
   enrichment analysis shows several highly positive enrichments in AD
   cells compared to control cells among all five cell types. These
   enrichments include oxidative phosphorylation and pathways associated
   with AD, Parkinson’s disease, and Huntington disease^[144]44
   (Fig. [145]5d and Supplementary Fig. [146]11). Interestingly, we
   observed a strong negative enrichment of the MAPK (mitogen-activated
   protein kinase) signaling pathway in the microglia cells, suggesting a
   relatively low MAPK regulation in microglia than other cells.

Fig. 5. Alzheimer’s disease data set ([147]GSE138852) analysis based on
scGNN.

   [148]Fig. 5
   [149]Open in a new tab

   a Cell clustering UMAP. Labeled with scGNN clusters (left) and
   AD/control samples (right). b Comparison of cell proportions in
   AD/control samples (left) and each cluster (right). c Heatmap of DEGs
   (logFC > 0.25) in each cluster. Six oligodendrocyte sub-clusters are
   merged as one to compare with other cell types. Marker genes identified
   in DEGs are listed on the right. d Selected AD-related enrichment
   pathways in each cell type in the comparison between AD and control
   cells. e Underlying TFs are responsible for the cell-type-specific gene
   regulations identified by IRIS3.

   In order to investigate the regulatory mechanisms underlying the
   AD-related neural development, we applied the imputed matrix of scGNN
   to IRIS3 (an integrated cell-type-specific regulon inference server
   from single-cell RNA-Seq) and identified 21 cell-type-specific regulons
   (CTSR) in five cell types^[150]45 (Fig. [151]5e and Supplementary
   Data [152]13; IRIS3 job ID: 20200626160833). Not surprisingly, we
   identified several AD-related transcription factors (TFs) and target
   genes that have been reported to be involved in the development of AD.
   SP2 is a common TF identified in both oligodendrocytes and astrocytes.
   It has been shown to regulate the ABCA7 gene, which is an IGAP
   (International Genomics of Alzheimer’s Project) gene that is highly
   associated with late-onset AD^[153]46. We also observed an SP2 CTSR in
   astrocytes that regulate APOE, AQP4, SLC1A2, GJA1, and FGFR3. All of
   these five targeted genes are marker genes of astrocytes, which have
   been reported to be associated with AD^[154]47,[155]48. In addition,
   the SP3 TF, which can regulate the synaptic function in neurons is
   identified in all cell clusters, and it is highly activated in
   AD^[156]49,[157]50. We identified CTSRs regulated by SP3 in OPCs,
   astrocytes, and neurons suggesting significant SP3-related regulation
   shifts in these three clusters. We observed 26, 60, and 22 genes that
   were uniquely regulated in OPCs, astrocytes, and neurons, as well as 60
   genes shared among the three clusters (Supplementary Data [158]14).
   Such findings provide a direction for the discovery of SP3 function in
   AD studies.

Discussion

   It is still a fundamental challenge to explore cellular heterogeneity
   in high-volume, high-sparsity, and noisy scRNA-Seq data, where the
   high-order topological relationships of the whole-cell graph are still
   not well explored and formulated. The key innovations of scGNN are
   incorporating global propagated topological features of the cells
   through GNNs, together with integrating gene regulatory signals in an
   iterative process for scRNA-Seq data analysis. The benefits of GNN are
   its intrinsic learnable properties of propagating and aggregating
   attributes to capture relationships across the whole cell–cell graph.
   Hence, the learned graph embedding can be treated as the high-order
   representations of cell–cell relationships in scRNA-Seq data in the
   context of graph topology. Unlike the previous autoencoder applications
   in scRNA-Seq data analysis, which only captures the top-down
   distributions of the overall cells, scGNN can effectively aggregate
   detailed relationships between similar cells using a bottom-up
   approach. We also observed that the imputation of scGNN can decrease
   batch effects introduced by different sequencing technologies
   (Supplementary Fig. [159]12), which makes scGNN a good choice for data
   imputation prior to multiple scRNA-Seq data integration^[160]51.
   Furthermore, scGNN integrates gene regulatory signals efficiently by
   representing them discretely in LTMG in the feature autoencoder
   regularization. These gene regulatory signals can help identify
   biologically meaningful gene–gene relationships as they apply to our
   framework and eventually, they are proven capable of enhancing
   performance. Technically, scGNN adopts multi-modal autoencoders in an
   iterative manner to recover gene expression values and cell-type
   prediction simultaneously. Notably, scGNN is a hypothesis-free deep
   learning framework on a data-driven cell graph model, and it is
   flexible to incorporate different statistical models (e.g., LTMG) to
   analyze complex scRNA-Seq data sets.

   Some limitations can still be found in scGNN. (i) It is prone to
   achieve better results with large data sets, compared to relatively
   small data sets (e.g., <1000 cells), as it is designed to learn better
   representations with many cells from scRNA-Seq data, as shown in the
   benchmark results, and (ii) Compared with statistical model-based
   methods, the iterative autoencoder framework needs more computational
   resources, which is more time-consuming (Supplementary Data [161]15).
   In the future, we will investigate creating a more efficient scGNN
   model with a lighter and more compressed architecture.

   In the future, we will continue to enhance scGNN by implementing
   heterogeneous graphs to support the integration of single-cell
   multi-omics data (e.g., the intra-modality of Smart-Seq2 and Droplet
   scRNA-Seq data; and the inter-modality integration of scRNA-Seq and
   scATAC-Seq data). We will also incorporate attention mechanisms and
   graph transformer models^[162]52 to make the analyses more explainable.
   Specifically, by allowing the integration of scRNA-Seq and scATAC-Seq
   data, scGNN has the potential to elucidate cell-type-specific gene
   regulatory mechanisms^[163]53. On the other hand, T cell receptor
   repertoires are considered as unique identifiers of T cell ancestries
   that can improve both the accuracy and robustness of predictions
   regarding cell–cell interactions^[164]54. scGNN can also facilitate
   batch effects and build connections across diverse sequencing
   technologies, experiments, and modalities. Moreover, scGNN can be
   applied to analyze spatial transcription data sets regarding spatial
   coordinates as additional regularizers to infer the cell neighborhood
   representation and better prune the cell graph. We plan to develop a
   more user-friendly software system from our scGNN model, together with
   modularized analytical functions in support of standardizing the data
   format, quality control, data integration, multi-functional scMulti-seq
   analyses, performance evaluations, and interactive visualizations.

Methods

Data set preprocessing

   scGNN takes the scRNA-Seq gene expression profile as the input. Data
   filtering and quality control are the first steps of data
   preprocessing. Due to the high dropout rate of scRNA-seq expression
   data, only genes expressed as non-zero in more than 1% of cells, and
   cells expressed as non-zero in more than 1% of genes are kept. Then,
   genes are ranked by standard deviation, i.e., the top 2000 genes in
   variances are used for the study. All the data are log-transformed.

Left-truncated mixed Gaussian (LTMG) modeling

   A mixed Gaussian model with left truncation assumption is used to
   explore the regulatory signals from gene expression^[165]14. The
   normalized expression values of gene X over N cells are denoted as X =
   {x[1],…x[N]}, where x[j] ∈ X is assumed to follow a mixture of k
   Gaussian distributions, corresponding to k possible gene regulatory
   signals (TRSs). The density function of X is:
   [MATH: <mi>p</mi><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>;</mo><mi
   mathvariant="normal">Θ</mi></mrow></mfenced><mo>=</mo><munderover
   accent="false"
   accentunder="false"><mrow><mo>∏</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mi>p</mi><mrow><mo>
   (</mo><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></msub>
   <mo>;</mo><mi
   mathvariant="normal">Θ</mi></mrow><mo>)</mo></mrow><mo>=</mo><munderove
   r accent="false"
   accentunder="false"><mrow><mo>∏</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>k</mi></mrow></munderover><msub><mrow><mi>α</m
   i></mrow><mrow><mi>i</mi></mrow></msub><mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><mo>;</mo><msub><mrow><mi>θ</mi></mrow><mrow><mi>i</mi></mrow></msub
   ></mrow></mfenced><mo>=</mo><munderover accent="false"
   accentunder="false"><mrow><mo>∏</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>k</mi></mrow></munderover><msub><mrow><mi>α</m
   i></mrow><mrow><mi>i</mi></mrow></msub><mfrac><mrow><mn>1</mn></mrow><m
   row><msqrt><mrow><mn>2</mn><mi>π</mi></mrow></msqrt><msub><mrow><mi>σ</
   mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mfrac><mspace
   width="0.25em"></mspace><msup><mrow><mi>e</mi></mrow><mrow><mfrac><mrow
   ><mo>−</mo><msup><mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><mo>−</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub
   ></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><mn
   >2</mn><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn
   >2</mn></mrow></msubsup></mrow></mfrac></mrow></msup><mo>=</mo><mi>L</m
   i><mfenced close=")" open="("><mrow><mi
   mathvariant="normal">Θ</mi><mo>;</mo><mi>X</mi></mrow></mfenced> :MATH]
   1

   where α[i] is the mixing weight, μ[i] and σ[i] are the mean and
   standard deviation of the ith Gaussian distribution, which can be
   estimated by:
   [MATH: <mi
   mathvariant="normal">Θ</mi><mo>*</mo><mo>=</mo><mtable><mtr><mtd
   columnalign="center"><mi mathvariant="normal">arg</mi><mspace
   width="0.25em"></mspace><mi mathvariant="normal">max</mi><mspace
   width="0.25em"></mspace><mi>L</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">Θ</mi><mo>;</mo><mi>X</mi></mrow><mo>)</mo></mrow>
   </mtd></mtr><mtr><mtd columnalign="center"><mi
   mathvariant="normal">Θ</mi></mtd></mtr></mtable> :MATH]
   to model the errors at zero and the low expression values. With the
   left truncation assumption, the gene expression profile is split into
   M, which is a truly measured expression of values, and N − M
   representing left-censored gene expressions for N conditions. The
   parameter Θ maximizes the likelihood function and can be estimated by
   an expectation-maximization algorithm. The number of Gaussian
   components is selected by the Bayesian Information Criterion; then, the
   original gene expression values are labeled to the most likely
   distribution under each cell. In detail, the probability that x[j]
   belongs to distribution i is formulated by:
   [MATH: <mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><mo>∈</mo><mi mathvariant="normal">TRS</mi><mspace
   width="0.25em"></mspace><mi>i</mi><mo>∣</mo><mi>K</mi><mo>,</mo><mi
   mathvariant="normal">Θ</mi><mo>*</mo></mrow></mfenced><mo>∝</mo><mfrac>
   <mrow><msub><mrow><mi>α</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow
   ><mrow><msqrt><mrow><mn>2</mn><mi>π</mi><msubsup><mrow><mi>σ</mi></mrow
   ><mrow><mi>j</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow></msqrt
   ></mrow></mfrac><mspace
   width="0.25em"></mspace><msup><mrow><mi>e</mi></mrow><mrow><mfrac><mrow
   ><mo>−</mo><msup><mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><mo>−</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub
   ></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><mn
   >2</mn><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn
   >2</mn></mrow></msubsup></mrow></mfrac></mrow></msup> :MATH]
   2

   where x[j] is labeled by TRS i if
   [MATH: <mi>p</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><mo>∈</mo><mi mathvariant="normal">TRS</mi><mspace
   width="0.25em"></mspace><mi>i</mi><mo>∣</mo><mi>K</mi><mo>,</mo><mi
   mathvariant="normal">Θ</mi><mo>*</mo></mrow></mfenced><mo>=</mo><munder
   ><mrow><mi>max</mi></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo
   ><mo>…</mo><mo>,</mo><mi>K</mi></mrow></munder><mrow><mo>(</mo><mrow><m
   i>p</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mrow></ms
   ub><mo>∈</mo><mi mathvariant="normal">TRS</mi><mspace
   width="0.25em"></mspace><mi>i</mi><mo>∣</mo><mi>K</mi><mo>,</mo><mi
   mathvariant="normal">Θ</mi><mo>*</mo></mrow></mfenced></mrow><mo>)</mo>
   </mrow> :MATH]
   . Thus, the discrete values (1,2, …, K) for each gene are generated.

Feature autoencoder

   The feature autoencoder is proposed to learn the representative
   embedding of the scRNA expression through stacked two layers of dense
   networks in both the encoder and decoder. The encoder constructs the
   low-dimensional embedding of X′ from the input gene expression X, and
   the encoder reconstructs the expression
   [MATH: <mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   from the embedding; thus,
   [MATH: <mi>X</mi><mo>,</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover><mo
   >∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>N</mi><mo>×</mo><mi>
   M</mi></mrow></msup> :MATH]
   and X′
   [MATH: <mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>N</mi><mo>×</mo><msu
   p><mrow><mi>M</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow></msup>
   :MATH]
   , where M is the number of input genes, M′ is the dimension of the
   learned embedding, and M′ < M. The objective of training the feature
   autoencoder is to achieve a maximum similarity between the original and
   reconstructed through minimizing the loss function, in which
   [MATH: <mo>∑</mo><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup> :MATH]
   is the main term serving as the mean squared error (MSE) between the
   original and the reconstructed expressions.

Regularization

   Regularization is adopted to integrate gene regulation information
   during the feature autoencoder training process. The aim of this
   regularization is to treat each gene differently based on their
   individual gene regulation role through penalizing it in the loss
   function. The MSE is defined as:
   [MATH: <mi>α</mi><mo mathsize="big">∑</mo><mfenced close=")"
   open="("><mrow><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup><mo>∘</mo><mi
   mathvariant="normal">TRS</mi></mrow></mfenced> :MATH]
   3

   where
   [MATH: <mi mathvariant="normal">TRS</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>N</mi><mo>×</mo><mi>
   M</mi></mrow></msup> :MATH]
   ; α is a parameter used to control the strength of gene regulation
   regularization; α ∈ [0,1]. ∘ denotes element-wise multiplication. Thus,
   the loss function of the feature autoencoder is shown as Eq.([166]4).
   [MATH: <mi
   mathvariant="normal">Loss</mi><mo>=</mo><mrow><mo>(</mo><mrow><mn>1</mn
   ><mo>−</mo><mi>α</mi></mrow><mo>)</mo></mrow><mo
   mathsize="big">∑</mo><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup><mo>+</mo><mi>α</mi>
   <mo mathsize="big">∑</mo><mfenced close=")"
   open="("><mrow><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup><mo>∘</mo><mi
   mathvariant="normal">TRS</mi></mrow></mfenced> :MATH]
   4

   In the encoder, the output dimensions of the first and second layers
   are set as 512 and 128, respectively. Each layer is followed by the
   ReLU activation function. In the decoder, the output dimensions of the
   first and second layers are 128 and 512, respectively. Each layer is
   followed by a sigmoid activation function. The learning rate is set as
   0.001. The cluster autoencoder has the same architecture as the feature
   autoencoder, but without gene regulation regularization in the loss
   function.

Cell graph and pruning

   The cell graph formulates the cell–cell relationships using embedding
   learned from the feature autoencoder. As done in the previous
   works^[167]4,[168]55, the cell graph is built from a KNN graph, where
   nodes are individual single cells, and the edges are relationships
   between cells. K is the predefined parameter used to control the scale
   of the captured interaction between cells. Each node finds its
   neighbors within the K shortest distances and creates edges between
   them and itself. Euclidian distance is calculated as the weights of the
   edges on the learned embedding vectors. The pruning process selects an
   adaptive number of neighbors for each node on the original KNN graph
   and keeps a more biologically meaningful cell graph. Here, Isolation
   Forest is applied to prune the graph to detect the outliner in the
   K-neighbors of each node^[169]56. Isolation Forest builds individual
   random forest to check distances from the node to all K-neighbors and
   only disconnects the outliners.

Graph autoencoder

   The graph autoencoder learns to embed and represent the topological
   information from the pruned cell graph. For the input pruned cell
   graph, G = (V, E) with N = |V| nodes denoting the cells and E
   representing the edges. A is its adjacency matrix and D is its degree
   matrix. The node feature matrix of the graph autoencoder is the learned
   embedding X′ from the feature autoencoder.

   The graph convolution network (GCN) is defined as
   [MATH: <mi mathvariant="normal">GCN</mi><mfenced close=")"
   open="("><mrow><msup><mrow><mi>X</mi></mrow><mrow><mo>′</mo></mrow></ms
   up><mo>,</mo><mi>A</mi></mrow></mfenced><mo>=</mo><mi
   mathvariant="normal">ReLU</mi><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover><ms
   up><mrow><mi>X</mi></mrow><mrow><mo>′</mo></mrow></msup><mi>W</mi></mro
   w><mo>)</mo></mrow> :MATH]
   , and W is a weight matrix learned from the training.
   [MATH: <mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover><mo
   >=</mo><msup><mrow><mi>D</mi></mrow><mrow><mo>−</mo><mn>1</mn><mo>/</mo
   ><mn>2</mn></mrow></msup><mi>A</mi><msup><mrow><mi>D</mi></mrow><mrow><
   mo>−</mo><mn>1</mn><mo>/</mo><mn>2</mn></mrow></msup> :MATH]
   is the symmetrically normalized adjacency matrix and activation
   function ReLU(∙) = max(0, ∙). The encoder of the graph autoencoder is
   composed of two layers of GCN, and Z is the graph embedding learned
   through the encoder in Eq.([170]5). W[1] and W[2] are learned weight
   matrices in the first and second layers, and the output dimensions of
   the first and second layers are set at 32 and 16, respectively. The
   learning rate is set at 0.001.
   [MATH: <mi>Z</mi><mo>=</mo><mi
   mathvariant="normal">ReLU</mi><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><mi
   mathvariant="normal">A</mi></mrow><mrow><mo>~</mo></mrow></mover><mi>R<
   /mi><mi>e</mi><mi>L</mi><mi>U</mi><mfenced close=")"
   open="("><mrow><mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover><ms
   up><mrow><mi>X</mi></mrow><mrow><mo>′</mo></mrow></msup><msub><mrow><mi
   >W</mi></mrow><mrow><mn>1</mn></mrow></msub></mrow></mfenced><msub><mro
   w><mi>W</mi></mrow><mrow><mn>2</mn></mrow></msub></mrow><mo>)</mo></mro
   w> :MATH]
   5

   The decoder of the graph autoencoder is defined as an inner product
   between the embedding:
   [MATH: <mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>^</mo></mrow></mover><mo
   >=</mo><mi
   mathvariant="normal">sigmoid</mi><mrow><mo>(</mo><mrow><mi>Z</mi><msup>
   <mrow><mi>Z</mi></mrow><mrow><mi>T</mi></mrow></msup></mrow><mo>)</mo><
   /mrow> :MATH]
   6

   where
   [MATH: <mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   is the reconstructed adjacency matrix of A. sigmoid(∙) = 1/(1 + e^−∙)
   is the sigmoid activation function.

   The goal of learning the graph autoencoder is to minimize the
   cross-entropy L between the input adjacency matrix A and the
   reconstructed matrix
   [MATH: <mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   :
   [MATH: <mi>L</mi><mfenced close=")"
   open="("><mrow><mi>A</mi><mo>,</mo><mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced><mo>=</mo><mo>−</mo><mfrac><mrow><mn>1</mn></mrow><mrow><
   mi>N</mi><mo>×</mo><mi>N</mi></mrow></mfrac><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><munderover
   accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>N</mi></mrow></munderover><mrow><mo>(</mo><mro
   w><msub><mrow><mi>a</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub>
   <mo>*</mo><mi mathvariant="normal">log</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mover
   accent="true"><mrow><mi>a</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow></mfenced><mo>+</mo>
   <mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msub><mrow><mi>a</mi></mrow><mrow><
   mi>i</mi><mi>j</mi></mrow></msub></mrow></mfenced><mo>*</mo><mi
   mathvariant="normal">log</mi><mrow><mo>(</mo><mrow><mn>1</mn><mo>−</mo>
   <msub><mrow><mover
   accent="true"><mrow><mi>a</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow></m
   row><mo>)</mo></mrow> :MATH]
   7

   where a[ij] and
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>a</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>i</mi><mi>j</mi></mrow></msub> :MATH]
   are the elements of the adjacency matrix A and
   [MATH: <mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   in the ith row and the jth column. As there are N nodes as the cell
   number in the graph, N × N is the total number of elements in the
   adjacency matrix.

Iterative process

   The iterative process aims to build the single-cell graph iteratively
   until converging. The iterative process of the cell graph can be
   defined as:
   [MATH: <mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover><mo
   >=</mo><mi>λ</mi><msub><mrow><mi>L</mi></mrow><mrow><mn>0</mn></mrow></
   msub><mo>+</mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mi>λ</mi></mrow></mfenced><mfrac><m
   row><msub><mrow><mi>A</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msu
   b></mrow><mrow><msub><mrow><mo>∑</mo></mrow><mrow><mi>j</mi></mrow></ms
   ub><msub><mrow><mi>A</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub
   ></mrow></mfrac> :MATH]
   8

   where L[0] is the normalized adjacency matrix of the initial pruned
   graph, and
   [MATH:
   <msub><mrow><mi>L</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>=</mo><m
   subsup><mrow><mi>D</mi></mrow><mrow><mn>0</mn></mrow><mrow><mo>−</mo><m
   n>1</mn><mo>/</mo><mn>2</mn></mrow></msubsup><msub><mrow><mi>A</mi></mr
   ow><mrow><mn>0</mn></mrow></msub><msubsup><mrow><mi>D</mi></mrow><mrow>
   <mn>0</mn></mrow><mrow><mo>−</mo><mn>1</mn><mo>/</mo><mn>2</mn></mrow><
   /msubsup> :MATH]
   , where D[0] is the degree matrix. λ is the parameter to control the
   converging speed, λ ∈ [0,1]. Each time in iteration t, two criteria are
   checked to determine whether to stop the iteration: (1) that is, to
   determine whether the adjacency matrix converges, i.e.,
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>t</mi></mrow></msub><mo>−</mo><msub><mrow><mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mi>t</mi><mo>−</mo><mn>1</mn></mrow></msub><mo><</mo><msub><
   mrow><mi>γ</mi></mrow><mrow><mn>1</mn></mrow></msub><msub><mrow><mover
   accent="true"><mrow><mi>A</mi></mrow><mrow><mo>~</mo></mrow></mover></m
   row><mrow><mn>0</mn></mrow></msub><mo>,</mo> :MATH]
   or (2) whether the inferred cell types are similar enough, i.e., ARI <
   γ[2]. ARI is the similarity measurement, which is detailed in the next
   section. In our setting, λ = 0.5 and γ[1], γ[2] = 0.99. The cell-type
   clustering results obtained in the last iteration are chosen as the
   final cell-type results.

Imputation autoencoder

   After the iterative process stops, the imputation autoencoder imputes
   and denoises the raw expression matrix within the inferred cell–cell
   relationship. The imputation autoencoder shares the same architecture
   as the feature autoencoder, but it also uses three additional
   regularizers from the cell graph in Eq. ([171]9), cell types in Eq.
   ([172]10), and the L1 regularizer in Eq. ([173]11):
   [MATH: <msub><mrow><mi>γ</mi></mrow><mrow><mn>1</mn></mrow></msub><mo
   mathsize="big">∑</mo><mrow><mo>(</mo><mrow><mi>A</mi><mo>⋅</mo><msup><m
   row><mrow><mo>(</mo><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mo>)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></mrow><mo>)<
   /mo></mrow> :MATH]
   9

   where
   [MATH: <mi>A</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>N</mi><mo>×</mo><mi>
   N</mi></mrow></msup> :MATH]
   is the adjacency matrix from the pruned cell graph in the last
   iteration. ∙ denotes dot product. Cells within an edge in the pruned
   graph will be penalized in the training:
   [MATH: <mtable><mtr><mtd
   columnalign="center"><msub><mrow><mi>γ</mi></mrow><mrow><mn>2</mn></mro
   w></msub><mo
   mathsize="big">∑</mo><mrow><mo>(</mo><mrow><mi>B</mi><mo>⋅</mo><msup><m
   row><mrow><mo>(</mo><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mo>)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></mrow><mo>)<
   /mo></mrow></mtd></mtr><mtr><mtd
   columnalign="center"><msub><mrow><mi>B</mi></mrow><mrow><mi>i</mi><mi>j
   </mi></mrow></msub><mo>=</mo><mfenced open="{"><mrow><mtable><mtr><mtd
   columnalign="center"><mn>1</mn></mtd><mtd columnalign="center"><mi
   mathvariant="normal">where</mi><mspace
   width="0.25em"></mspace><mi>i</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">and</mi><mspace
   width="0.25em"></mspace><mi>j</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">in</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">same</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">cell</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">type</mi></mtd></mtr><mtr><mtd
   columnalign="center"><mn>0</mn></mtd><mtd columnalign="center"><mi
   mathvariant="normal">
   else</mi></mtd></mtr></mtable></mrow></mfenced></mtd></mtr></mtable>
   :MATH]
   10

   where
   [MATH: <mi>B</mi><mo>∈</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>N</mi><mo>×</mo><mi>
   N</mi></mrow></msup> :MATH]
   is the relationship matrix between cells, and two cells in the same
   cell type have a B[ij] value of 1. Cells within the same inferred cell
   type will be penalized in the training. γ[1], γ[2] are the intensities
   of the regularizers and γ[1], γ[2] ∈ [0,1]. The L1 regularizer is
   defined as
   [MATH: <mi>β</mi><mo mathsize="big">∑</mo><mfenced close="∣"
   open="∣"><mrow><mi>w</mi></mrow></mfenced> :MATH]
   11

   which brings sparsity and increases the generalization performance of
   the autoencoder by reducing the number of non-zero w terms in ∑|w|,
   where β is a hyper-parameter controlling the intensity of the L1 term
   (β ∈ [0,1]). Therefore, the loss function of the imputation autoencoder
   is
   [MATH: <mi mathvariant="normal">Loss</mi><mo>=</mo><mspace
   width="0.16em"></mspace><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mi>α</mi></mrow></mfenced><mo
   mathsize="big">∑</mo><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup><mo>+</mo><mi>α</mi>
   <mo mathsize="big">∑</mo><mfenced close=")"
   open="("><mrow><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup><mo>∘</mo><mi
   mathvariant="normal">TRS</mi></mrow></mfenced><mo>+</mo><mi>β</mi><mo
   mathsize="big">∑</mo><mfenced close="∣"
   open="∣"><mrow><mi>w</mi></mrow></mfenced><mo>+</mo><msub><mrow><mi>γ</
   mi></mrow><mrow><mn>1</mn></mrow></msub><mo
   mathsize="big">∑</mo><mfenced close=")"
   open="("><mrow><mi>A</mi><mo>⋅</mo><msup><mrow><mfenced close=")"
   open="("><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow></mfenced><mo
   >+</mo><msub><mrow><mi>γ</mi></mrow><mrow><mn>2</mn></mrow></msub><mo
   mathsize="big">∑</mo><mrow><mo>(</mo><mrow><mi>B</mi><mo>⋅</mo><msup><m
   row><mrow><mo>(</mo><mrow><mi>X</mi><mo>−</mo><mover
   accent="true"><mrow><mi>X</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mo>)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></mrow><mo>)<
   /mo></mrow> :MATH]
   12

Benchmark evaluation compared to existing tools

Imputation evaluation

   For benchmarking imputation performance, we performed synthetic dropout
   simulation to randomly flip 10% of the non-zero entries to zeros. These
   synthetic dropouts still follow the zero-inflated negative binomial
   (ZINB) distribution with details shown in Supplementary Method [174]5
   and Data [175]16. We evaluated median L1 distance, cosine similarity,
   and root-mean-squared error (RMSE) between the original data set and
   the imputed values for these corrupted entries. For all the flipped
   entries, x is the row vector of the original expression, and y is its
   corresponding row vector of the imputed expression. The L1 distance is
   the absolute deviation between the value of the original and imputed
   expression. A lower L1 distance means a higher similarity.
   [MATH: <mi mathvariant="normal">L</mi><mn>1</mn><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">distance</mi><mo>=</mo><mfenced close="∣"
   open="∣"><mrow><mi>x</mi><mo>−</mo><mi>y</mi></mrow></mfenced><mo>,</mo
   ><mspace width="1.0em"></mspace><mi
   mathvariant="normal">L</mi><mn>1</mn><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">distance</mi><mo>∈</mo><mrow><mo>[</mo><mrow><mn>0
   </mn><mo>,</mo><mo>+</mo><mi>∞</mi></mrow><mo>)</mo></mrow> :MATH]
   13

   The cosine similarity computes the dot products between original and
   imputed expression.
   [MATH: <mi mathvariant="normal">Cosine</mi><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">similarity</mi><mfenced close=")"
   open="("><mrow><mi>x</mi><mo>,</mo><mi>y</mi></mrow></mfenced><mo>=</mo
   ><mfrac><mrow><mi>x</mi><msup><mrow><mi>y</mi></mrow><mrow><mi>T</mi></
   mrow></msup></mrow><mrow><mfenced close="∥"
   open="∥"><mrow><mi>x</mi></mrow></mfenced><mspace
   width="0.25em"></mspace><mfenced close="∥"
   open="∥"><mrow><mi>y</mi></mrow></mfenced></mrow></mfrac><mo>,</mo><msp
   ace width="1.0em"></mspace><mi mathvariant="normal">Cosine</mi><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">similarity</mi><mo>∈</mo><mrow><mo>[</mo><mrow><mn
   >0</mn><mo>,</mo><mn>1</mn></mrow><mo>]</mo></mrow> :MATH]
   14

   The RMSE computes the squared root of the quadratic mean of differences
   between original and imputed expression.
   [MATH: <mi mathvariant="normal">RMSE</mi><mfenced close=")"
   open="("><mrow><mi>x</mi><mo>,</mo><mi>y</mi></mrow></mfenced><mo>=</mo
   ><msqrt><mrow><mfrac><mrow><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>i<
   /mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi></mrow></msubsup><msup><
   mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub><mo>−</mo><msub><mrow><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msub
   ></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><mi
   >N</mi></mrow></mfrac></mrow></msqrt><mo>,</mo><mspace
   width="0.25em"></mspace><mspace width="0.25em"></mspace><mi
   mathvariant="normal">RMSE</mi><mo>∈</mo><mfenced close=")"
   open="["><mrow><mn>0</mn><mo>,</mo><mo>+</mo><mi>∞</mi></mrow></mfenced
   > :MATH]
   15

   The process is repeated three times, and the mean and standard
   deviation were selected as a comparison. The scores are compared
   between scGNN and nine imputation tools (i.e., MAGIC^[176]4,
   SAUCIE^[177]10, SAVER^[178]19, scImpute^[179]33, scVI^[180]32,
   DCA^[181]11, DeepImpute^[182]34, scIGANs^[183]35, and
   netNMF-sc^[184]36), using the default parameters.

Clustering evaluation

   We compared the cell clustering results of scGNN, the same nine
   imputation tools, and four scRNA-Seq analytical frameworks, in terms of
   ten clustering evaluation scores. Noted that, we considered the default
   cell clustering method (i.e., Louvain method^[185]31 in Seurat^[186]5,
   Ward.D2^[187]57 method in CIDR^[188]58, Louvain method in
   Monocle^[189]59, and k-means^[190]60 method in RaceID^[191]61) in each
   of the analytical frameworks to compare the cell clustering performance
   with scGNN. The default parameters are applied in all test tools.
   ARI^[192]37 is used to compute similarities by considering all pairs of
   the samples that are assigned in clusters in the current and previous
   clustering adjusted by random permutation:
   [MATH: <mi mathvariant="normal">ARI</mi><mo>=</mo><mfrac><mrow><mi
   mathvariant="normal">RI</mi><mo>−</mo><mi
   mathvariant="normal">E</mi><mfenced close="]" open="["><mrow><mi
   mathvariant="normal">RI</mi></mrow></mfenced></mrow><mrow><mi
   mathvariant="normal">max</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">RI</mi></mrow><mo>)</mo></mrow><mo>−</mo><mi
   mathvariant="normal">E</mi><mrow><mo>[</mo><mrow><mi
   mathvariant="normal">RI</mi></mrow><mo>]</mo></mrow></mrow></mfrac>
   :MATH]
   16

   where the unadjusted rand index (RI) is defined as:
   [MATH: <mi
   mathvariant="normal">RI</mi><mo>=</mo><mfrac><mrow><mi>a</mi><mo>+</mo>
   <mi>b</mi></mrow><mrow><msubsup><mrow><mi>C</mi></mrow><mrow><mi>n</mi>
   </mrow><mrow><mn>2</mn></mrow></msubsup></mrow></mfrac> :MATH]
   17

   where a is the number of pairs correctly labeled in the same sets, and
   b is the number of pairs correctly labeled as not in the same data set.
   [MATH:
   <msubsup><mrow><mi>C</mi></mrow><mrow><mi>n</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   is the total number of possible pairs. E[RI] is the expected RI of
   random labeling.

   Different from ARI which requires known ground truth labels, the
   Silhouette coefficient score^[193]38 defines how similar an object is
   to its own cluster compared to other clusters. It is defined as:
   [MATH: <mi
   mathvariant="normal">Silhouette</mi><mo>=</mo><mfrac><mrow><mi>b</mi><m
   o>−</mo><mi>a</mi></mrow><mrow><mi>max</mi><mfenced close=")"
   open="("><mrow><mi>a</mi><mo>,</mo><mi>b</mi></mrow></mfenced></mrow></
   mfrac> :MATH]
   18

   where a is the mean distance between a sample and all other points in
   the same class, b is the mean distance between a sample and all other
   points in the next nearest cluster. Silhouette ∈ [−1,1], where 1
   indicates the best clustering results and −1 indicates the worst. We
   calculated the average Silhouette score of all cells in each data set
   to compare the cell clustering results. More quantitative measurements
   are also used in Supplementary Method [194]4.

Statistical validation of cell–cell graph topology based on LRPs

   We used CellChat^[195]42 to predict potential interaction probability
   scores (ranging from 0 to 1; a higher score indicates the two cells are
   more likely to interact with each other) of ligand–receptor pairs (LRP)
   between any two cells. We built a fully connected cell–cell background
   graph (using all the cells) based on Pearson’s correlation of the raw
   expression matrix and compared it with the cell–cell graph generated
   from scGNN. CellChat calculates an aggregated interaction probability
   for each linked cell pair based on the expression level of LRPs. For
   all linked cell pairs in the background graph and scGNN cell–cell
   graph, we performed a Wilcoxon test to evaluate the statistical
   significance between the corresponding aggregated interaction
   probability. Five scRNA-Seq data sets (i.e., Klein, Zeisel, Kolo,
   Chung, and AD) were used in this analysis.

Case study of the AD database

   We applied scGNN on public Alzheimer’s disease (AD) scRNA-Seq data with
   13,214 cells^[196]27. The resolution of scGNN was set to 1.0, KI was
   set to 20, and the remaining parameters were kept as default. The AD
   patient and control labels were provided by the original paper and used
   to color the cells on the same UMAP coordinates generated from scGNN.
   We simply combined cells in six oligodendrocyte subpopulations into one
   cluster, referred to as merged oligo. The DEGs were identified in each
   cell cluster via the Wilcoxon rank-sum test implemented in the Seurat
   package along with adjusted p-values using the Benjamini-Hochberg
   procedure with a nominal level of 0.05. DEGs with logFC > 0.25 or
   <−0.25 were finally selected. We further identified the DEGs between AD
   and control cells in each cluster using the same strategy and applied
   GSEA for pathway enrichment analysis^[197]62. The imputed matrix, which
   resulted from scGNN was then sent to IRIS3 for CTSR prediction, using
   the predicted cell clustering labels with merged
   oligodendrocytes^[198]45. The default parameters were served in
   regulatory analysis in IRIS3.

Software implementation

   Tools and packages used in this paper include: Python version 3.7.6,
   numpy version 1.18.1, torch version 1.4.0, networkx version 2.4, pandas
   version 0.25.3, rpy2 version 3.2.4, matplotlib version 3.1.2, seaborn
   version 0.9.0, umap-learn version 0.3.10, munkres version 1.1.2, R
   version 3.6.1, and igraph version 1.2.5. The IRIS3 website is at
   [199]https://bmbl.bmi.osumc.edu/iris3/index.php.

Reporting summary

   Further information on research design is available in the [200]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [201]Supplementary Information^ (1.8MB, pdf)
   [202]41467_2021_22197_MOESM2_ESM.pdf^ (75.1KB, pdf)

   Description of Additional Supplementary Files
   [203]Supplementary Data 1–16^ (299.1KB, xlsx)
   [204]Reporting Summary^ (68.4KB, pdf)

Acknowledgements