Abstract
Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis
of cellular heterogeneity, but dropout events, where gene expression is
undetected in individual cells, present a significant challenge. We
propose scMASKGAN, which transforms matrix imputation into a pixel
restoration task to improve the recovery of missing gene expression
data. Specifically, we integrate masking, convolutional neural networks
(CNNs), attention mechanisms, and residual networks (ResNets) to
effectively address dropout events in scRNA-seq data. The masking
mechanism ensures the preservation of complete cellular information,
while convolution and attention mechanisms are employed to capture both
global and local features. Residual networks augment feature
representation and effectively mitigate the risk of model overfitting.
Additionally, cell-type labels are incorporated as constraints to guide
the model in learning more accurate cellular features. We evaluated the
model on seven scRNA-seq datasets of different types and on scRNA-seq
data from ten neuroblastoma samples. The results demonstrate that the data
imputed by scMASKGAN not only perform excellently across various
evaluation metrics but also significantly enhance the effectiveness of
downstream analyses, enabling a more comprehensive exploration of
underlying biological information.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12859-025-06138-9.
Keywords: Single-cell RNA sequencing, Data imputation, Deep learning
Introduction
Single-cell RNA sequencing (scRNA-seq) measures gene expression at the
single-cell level, providing valuable insights into cellular
heterogeneity and diversity [[32]1, [33]2]. This technique involves
reverse transcription and amplification of full-length mRNA [[34]3],
but typically focuses on the 5’ and 3’ ends, introducing inherent
limitations [[35]4]. Variability in reverse transcription efficiency
and limited starting RNA contribute to high technical noise, capturing
only a fraction of the transcriptome and generating many zero
expression values [[36]5]. These zeros may represent either true lack
of expression or artificial dropout events caused by technical noise
[[37]6], obscuring the true gene expression landscape. Consequently,
advanced methods are needed to accurately identify and mitigate dropout
events in scRNA-seq analysis.
Early approaches to address this challenge include methods such as
MAGIC [[38]7], SAVER [[39]8], DrImpute [[40]9], VIPER [[41]10],
scImpute [[42]11], ENHANCE [[43]12], and SCRABBLE [[44]13]. MAGIC
imputes missing values by leveraging similarities between cells, while
SAVER uses a Bayesian framework for probabilistic data recovery.
DrImpute groups similar cells via clustering for imputation, and VIPER
employs a non-negative sparse regression model based on neighborhood
information. Additionally, ENHANCE and SCRABBLE utilize principal
component analysis and batch normalization techniques, respectively, to
reduce dimensionality and mitigate batch effects, thereby enhancing
data recovery quality.
Recent years have seen a growing application of deep learning methods
to address dropout events in scRNA-seq data [[45]14]. Several
approaches, including DeepImpute [[46]15], AutoImpute [[47]16], DCA
[[48]17], scVI [[49]18, [50]19], DISC [[51]20], and scIGANs
[[52]21], have made significant strides in this domain. For example,
DeepImpute employs a divide-and-conquer strategy with sub-neural
networks to estimate gene expression, outperforming earlier methods
such as DrImpute and SAVER. However, its reliance on extensive
parameter tuning and its use of 95% of the data for training can lead
to overfitting. AutoImpute learns the
inherent data distribution for imputation but has been observed to
sometimes produce invalid (negative) values. DCA enhances traditional
autoencoders by incorporating negative binomial and zero-inflated noise
models [[53]22], thereby improving both denoising and imputation. scVI,
a hierarchical Bayesian model, addresses batch correction and
imputation by projecting data into latent spaces, although it may
struggle when gene counts exceed cell numbers. DISC improves
reliability by training on datasets with a high proportion of
unexpressed genes (90%), but its efficiency is constrained by the need
for high-performance hardware. scIGANs leverages Generative Adversarial
Networks (GANs)
[[54]23] to frame dropout imputation as a pixel recovery task, thus
avoiding overfitting and maintaining robustness in detecting
low-expression genes, but converting each row of scRNA-seq data into a
square image matrix can introduce significant noise.
Earlier imputation algorithms, such as MAGIC and SAVER, are typically
based on assumptions of cell similarity or gene co-expression networks.
While these methods can enhance signal clarity, they tend to remove
lowly expressed genes and overlook the inherent stochasticity of
natural cellular processes, thereby complicating the imputation of rare
cell populations. More recent deep learning-based approaches (e.g.,
scIALM [[55]24], SAE-Impute [[56]25]) often incorporate mathematical
assumptions and regularization techniques to constrain the imputation
process. Although these methods may improve imputation accuracy or
enhance the visual quality of UMAP clustering plots, they frequently
produce excessively smoothed results that compromise the representation
of true biological variability. In contrast, GAN-based approaches can
effectively circumvent these limitations by reframing imputation as a
generative process. This strategy allows GANs to learn the true data
distribution, generate realistic synthetic data points, and use these
generated samples to estimate missing values, thereby mitigating the
risk of overfitting while better preserving the intrinsic biological
variability.
In this study, we introduce scMASKGAN, a GAN-based framework for
scRNA-seq data imputation that combines masking, convolutional neural
networks (CNN) [[57]26], attention mechanisms [[58]27], and ResNets
[[59]28] to effectively address dropout events. Our main contributions
are summarized below:
* We design scMASKGAN with a novel masking mechanism that preserves
the intrinsic structure of gene expression data without imposing
gene-specific constraints. By integrating CNNs, self-attention, and
ResNets, the model dynamically captures intricate gene-gene and
gene-cell interactions, learning hierarchical representations that
maintain both local and global biological features.
* Instead of directly modifying the original data, scMASKGAN
generates realistic synthetic single-cell data for imputation,
thereby avoiding overfitting to dominant cell types while retaining
features of rare cells. Additionally, an Isolation Forest algorithm
is employed to detect and remove anomalous values during synthesis,
ensuring high biological fidelity in the final imputed dataset.
* We validate scMASKGAN on seven diverse datasets and 10
neuroblastoma samples, demonstrating its superior performance over
existing imputation methods across multiple quantitative metrics,
as well as its ability to restore biologically meaningful gene
expression patterns, including improved gene-gene correlations,
trajectory inference, batch correction, and differential expression
analysis.
Materials and methods
Data preparation
We utilize seven real datasets from Xu et al. [[60]21], encompassing a
diverse array of scRNA-seq data types. These datasets include Human
brain scRNA-seq data,[61]1 ERCC spike-in RNAs scRNA-seq data[62]2,
Mouse ESC scRNA-seq data,[63]3 time-course scRNA-seq data,[64]4 and
three scRNA-seq datasets derived from the sc_Drop-seq,[65]5
sc_CEL-seq2,[66]6 and sc_10X platforms.[67]7 The details are as follows:
* Human brain scRNA-seq data: This dataset provides high-quality
single-cell RNA sequencing data from human brain tissues. It is
used to evaluate the performance of imputation methods in complex
biological systems with diverse cell types and intricate gene
expression patterns.
* ERCC spike-in RNAs scRNA-seq data: This dataset is unique in that
it consists of spike-in RNA molecules, where the number of cells
exceeds the number of genes. It serves as a benchmark for
evaluating the imputation performance on extremely small gene
expression matrix datasets.
* Mouse ESC scRNA-seq data: This dataset includes embryonic stem cell
(ESC) differentiation and large-scale scRNA-seq data. It allows us
to test the imputation method’s effectiveness in developmental
biology studies, where gene expression profiles exhibit dynamic
changes.
* Time-course scRNA-seq data: Temporal sequencing data is crucial for
understanding gene expression dynamics over time. This dataset
helps evaluate how well the imputation method preserves temporal
patterns and recovers missing values in time-dependent scRNA-seq
studies.
* sc_Drop-seq dataset: Drop-seq is a widely used droplet-based
scRNA-seq technology. By including this dataset, we assess the
robustness of scMASKGAN across different sequencing platforms,
ensuring its compatibility with droplet-based data.
* sc_CEL-seq2 dataset: CEL-seq2 is a plate-based sequencing platform
known for its improved sensitivity and accuracy in capturing gene
expression. This dataset helps evaluate the imputation model’s
adaptability to high-precision sequencing techniques.
* sc_10X dataset: The 10X Genomics platform is one of the most widely
used commercial scRNA-seq technologies. Including this dataset
ensures that scMASKGAN can generalize to large-scale single-cell
datasets generated from commercial platforms, demonstrating its
scalability and broad applicability.
By incorporating datasets from different species, sequencing platforms,
data sizes, dropout rates, and experimental conditions, we ensure that
our evaluation of scMASKGAN is comprehensive and reflects its potential
utility in diverse real-world applications. Detailed information on
these datasets is provided in Table [68]1.
Table 1.
Seven scRNA-seq samples

Datasets      Type      Cells  Genes  Accession number  Dropout
Brain         Human     420    22085  [69]GSE67835      81%
ERCC          Spike-in  288    92     E-MTAB-2512       33%
ESC           Mouse     6886   24176  [70]GSE65525      70%
sc_Drop-seq   ...       226    15128  [71]GSE118706     62%
sc_CEL-seq2   ...       275    28205  [72]GSE117617     74%
sc_10X        ...       740    16469  [73]GSE111108     45%
Time-course   ...       759    19190  [74]GSE75748      55%
Additionally, we employ a dataset from Yuan et al. [[76]29] that
comprises 10 neuroblastoma samples, including 5 high-risk
neuroblastomas (NB) and 5 low-risk neuroblastomas, of which 4 are
ganglioneuroblastomas (GNB) and 1 is a ganglioneuroma (GN). These data can
be downloaded from GEO,[77]8 as detailed in Table [78]2. The samples
were obtained from clinical surgical resections and belong to the same
sequencing batch, which facilitates a comparative assessment of model
performance across batches and ensures that the imputation process
mitigates batch effects during data integration. Moreover, because
dropout rates exceed 90% in most of these samples, the dataset allows
rigorous testing of the model’s robustness under extreme sparsity. In
subsequent figures, we primarily use the dropout rates of the
different datasets as legends or titles.
Table 2.
Ten neuroblastoma samples

Datasets        Type  Cells  Genes  Dropout
[79]GSM5768743  NB    960    33514  95%
[80]GSM5768744  NB    768    33514  91%
[81]GSM5768745  NB    445    33514  97%
[82]GSM5768746  NB    357    33514  85%
[83]GSM5768747  NB    639    33514  96%
[84]GSM5768748  GNB   740    33514  97%
[85]GSM5768749  GNB   1052   33514  97%
[86]GSM5768750  GNB   1053   33514  97%
[87]GSM5768751  GNB   360    33514  84%
[88]GSM5768752  GN    551    33514  97%
Methods
Inspired by the remarkable performance of GANs in image restoration and
the introduction of StackGAN [[90]30], a model designed for
text-to-image generation, we propose scMASKGAN, a generative
adversarial network tailored for scRNA-seq imputation. Both image
inpainting and missing value imputation share the fundamental goal of
restoring incomplete data while preserving its structural and
contextual integrity. In image inpainting, GANs learn spatial
structures and content features to generate realistic pixels that
seamlessly blend with surrounding regions. Similarly, in missing value
imputation, GANs leverage gene expression patterns and intercellular
relationships to generate biologically meaningful values that
accurately reflect the underlying data distribution. By converting each
cell’s normalized gene expression profile into a grayscale image,
scMASKGAN extracts essential features and utilizes GANs’ robust image
synthesis capabilities to generate realistic grayscale representations
that impute missing gene data. The detailed framework is shown in
Fig. [91]1, and in this section, we provide a comprehensive description
of the scMASKGAN workflow.
Fig. 1.
The overall framework of scMASKGAN is illustrated across four panels.
Panel A shows the process of preparing data, beginning with the
sequencing-derived gene expression matrix and converting it into a
format suitable for model input. Panel B details the architecture of
the discriminator, designed to distinguish between original and imputed
data, thereby driving the generator to produce more biologically
accurate imputed values. The upsampling module is designed to increase
data dimensionality and capture detailed distribution features. The
downsampling module reduces resolution and utilizes convolution to
extract higher-order features. Residual blocks are employed to enhance
these high-level features, while the attention module is used to
capture global characteristics. Panel C presents the architecture of
the generator, which uses multiple layers and mechanisms to infer and
fill missing values based on the observed data patterns. Panel D
outlines the complete imputation workflow.
Data preprocessing
scMASKGAN reframes the imputation of single-cell RNA-seq data as an
image inpainting (pixel restoration) problem by converting the gene
expression matrix into an image-based representation. As shown in
Fig. [94]1A, each cell’s expression profile (originally a vector of $m$
gene features) is mapped onto a 2D grid of pixels of size $N \times N$
(chosen such that $N \times N \ge m$). Specifically, the gene expression
vector of each cell is zero-padded to a length of $N \times N$ and
reshaped into a two-dimensional array by sequentially arranging the
elements row-wise, thus ensuring consistency with the original matrix.
In this image, each pixel corresponds to a specific gene, and its
intensity represents the expression level of that gene in the given
cell. For a gene $j$ mapped to pixel coordinate $(u, v)$, we have:

$$I_i(u, v) = x_{i,j} \tag{1}$$

where $x_{i,j}$ denotes the expression value of gene $j$ in cell $i$.
Dropout events (genes with undetected expression in a cell) manifest as
pixels with zero intensity in the image, and scMASKGAN applies a masking
mechanism to flag these missing values. A binary mask of the same
$N \times N$ size is generated for each cell, using 0 for dropouts and
1 for observed gene expressions. This masking mechanism serves two critical
functions. First, since gene expression values are initially
normalized, missing entries would otherwise be assigned nonzero values,
potentially introducing noise when extracting features using CNNs and
attention mechanisms. The binary mask effectively mitigates this
interference by distinguishing observed values from imputed ones.
Second, by masking the padded values, the model ensures that they do
not participate in the computation process, preserving the integrity of
the learned representations. This image-based representation allows
missing data imputation to be treated as a pixel-wise image restoration
task, enabling convolutional neural networks with attention mechanisms
to capture both local gene-gene interaction patterns and global
expression structures from the spatial layout, leading to more
effective recovery of missing data.
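The conversion described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `to_image_and_mask` and `from_image` are hypothetical, and choosing the smallest N with N×N ≥ m is an assumption (the text only requires N×N ≥ m). With row-wise filling, gene j lands at pixel (u, v) = (j // N, j % N).

```python
import math


def to_image_and_mask(expr, n=None):
    """Zero-pad one cell's expression vector and reshape it row-wise to N x N.

    Returns the image and a binary mask (1 = observed, 0 = dropout or padding).
    """
    m = len(expr)
    if n is None:
        # Assumption: use the smallest N such that N * N >= m.
        n = math.isqrt(m - 1) + 1 if m > 0 else 1
    if n * n < m:
        raise ValueError("N * N must be at least the number of genes")
    padded = list(expr) + [0.0] * (n * n - m)
    image = [padded[r * n:(r + 1) * n] for r in range(n)]
    # Padded positions are zero, so they are masked out together with dropouts.
    mask = [[1 if value > 0 else 0 for value in row] for row in image]
    return image, mask


def from_image(image, m):
    """Flatten the image row-wise and drop the padded tail (first m genes kept)."""
    flat = [value for row in image for value in row]
    return flat[:m]
```

For a cell with five genes, `to_image_and_mask([3.0, 0.0, 1.5, 2.0, 0.0])` yields a 3×3 image whose last four pixels are padding, and `from_image(image, 5)` recovers the original vector.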
Moreover, cell-type labels ($y$) are extracted from the dataset and
converted into categorical numerical values to represent different cell
types. In scMASKGAN training, each label is transformed into a one-hot
encoded vector and incorporated into both the generator and
discriminator, ensuring class-specific data generation and enhancing
classification performance. In the imputation process, the label is
used to associate each cell with the corresponding generated candidate
set, thereby ensuring that missing values are imputed using
biologically relevant data from the same category.
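As a small illustration of the label handling described above, the following sketch maps cell-type names to integer codes and then to one-hot vectors. The function name `one_hot_labels` and the alphabetical ordering of classes are assumptions made for the example; the paper does not specify how categories are ordered.

```python
def one_hot_labels(cell_types):
    """Map cell-type names to integer codes and one-hot vectors.

    Assumption: classes are ordered alphabetically.
    """
    classes = sorted(set(cell_types))
    index = {name: code for code, name in enumerate(classes)}
    codes = [index[name] for name in cell_types]
    one_hot = [
        [1.0 if k == code else 0.0 for k in range(len(classes))]
        for code in codes
    ]
    return classes, codes, one_hot
```

The number of columns in `one_hot` equals the number of distinct cell types, which is the quantity the generator and discriminator must be configured with.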
Generator
In scMASKGAN, the generator $G_{\theta_G}$ transforms a random noise
vector $z$ into a synthetic gene expression tensor, conditioned on a
one-hot encoded label matrix $y$, through a series of linear and
non-linear operations, as illustrated in Fig. [95]1B. Initially, the
noise vector $z \in \mathbb{R}^{d_z}$ is sampled from a prior
distribution $p_z$ ($z \sim \mathcal{N}(0, 1)$). The generator’s
architecture is defined by the following sequence of transformations:

$$G_{\theta_G}: (z, y) \xrightarrow{F_1} \mathbf{H}^{(1)} \xrightarrow{F_2} \mathbf{H}^{(2)} \xrightarrow{F_3} \cdots \xrightarrow{F_L} \mathbf{H}^{(L)} = G(z, y) \tag{2}$$

where each $F_l$ denotes a parameterized transformation (either linear
or convolutional), followed by a ReLU or Sigmoid activation function to
ensure the non-negativity of scRNA-seq data and the binary
representation of grayscale images. The final output $\mathbf{H}^{(L)}$
constitutes the generated tensor, which represents imputed gene
expression data. The objective of the generator is to produce synthetic
data that conform to the biological context specified by the label
matrix $y$. The generator is optimized by minimizing the following loss
function:

$$\min_G \; \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D\left( G(z \mid y) \mid y \right) \right) \right] \tag{3}$$

where $D$ represents the discriminator and $G(z \mid y)$ is the
generator output conditioned on $y$. This loss encourages the generator
to produce outputs that are
indistinguishable from real data under the given cell-type conditions.
Initially, the generator applies a linear transformation, batch
normalization, and ReLU activation to the input noise vector $z$,
yielding an intermediate feature map $\mathbf{H}^{(1)}$. This feature
map is subsequently reshaped into a tensor of dimensions
$\left(B, cn_1, \frac{img\_size}{4}, \frac{img\_size}{4}\right)$, where
$B$ is the batch size and $cn_1$ denotes the number of channels in the
first convolutional block. Next, a self-attention mechanism is employed
to capture long-range dependencies among features. In this mechanism,
attention scores are computed by learning queries $Q$, keys $K$, and
values $V$, and the attention output $\mathbf{H}^{(att)}$ is calculated
as follows:

$$\mathrm{Attention} = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right), \qquad \mathbf{H}^{(att)} = V \cdot \mathrm{Attention} \tag{4}$$

This output is integrated with the preceding feature map via a residual
connection, modulated by a learnable scaling parameter $\gamma$.
Concurrently, the label matrix $y$ is upsampled to match the spatial
dimensions of the feature map $\mathbf{H}^{(3)}$ and processed through a
convolutional layer to reduce its channel dimensionality. The resulting
label-aligned feature map is then concatenated with $\mathbf{H}^{(3)}$
to form the combined feature map $\mathbf{H}^{(5)}$, which is further
processed by additional convolutional layers. The output image
$G(z, y)$ is obtained by applying a sigmoid activation to the final
convolutional layer’s output. The detailed design of the generator is
as follows:
Algorithm 1.
scMASKGAN Generator
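To make the self-attention step of Eq. 4 and its gamma-scaled residual connection concrete, the following is a minimal pure-Python sketch. It is not the authors' code: the helper names, the plain projection matrices `wq`, `wk`, `wv`, and the (positions × channels) layout are assumptions, and the product is written in the common row-wise form softmax(QKᵀ/√d)·V, which matches Eq. 4 up to the orientation of the feature matrix.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in matrix:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(h, wq, wk, wv, gamma):
    """Return gamma * attention(h) + h for h of shape (positions, channels).

    wv must map back to the channel width of h so the residual sum is defined.
    """
    q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    h_att = matmul(softmax_rows(scores), v)
    return [[gamma * a + x for a, x in zip(row_att, row_h)]
            for row_att, row_h in zip(h_att, h)]
```

With `gamma = 0` the block is an identity mapping, which is why a learnable scale initialised near zero lets the network start from purely convolutional features and add global context gradually.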
Discriminator
The primary function of the discriminator $D$ is to distinguish between
real and synthetic gene expression data. As illustrated in Fig. [98]1C,
the discriminator receives as input an image $x$ representing gene
expression data and a label matrix $y$ encoding cell-type information.
Its optimization objective is defined as

$$\max_D \; \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x \mid y) \right] + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D\left( G(z \mid y) \mid y \right) \right) \right] \tag{5}$$

where $D(x \mid y)$ denotes the probability that the discriminator
assigns to a real sample $x$ under the condition $y$, and
$D(G(z \mid y) \mid y)$ corresponds to the probability assigned to a
generated sample. Specifically, the input image $x$ is first processed
by a preprocessing layer that flattens the image and reconstructs it
into a feature map suitable for convolutional operations. This feature
map is then passed through a downsampling convolutional block to reduce
its spatial dimensions, with residual connections incorporated to
enhance feature propagation and facilitate model learning. Moreover, a
self-attention mechanism is integrated to improve feature extraction by
computing attention scores and generating the corresponding output
$\mathbf{H}^{(att)}$. Simultaneously, the label matrix $y$ is upsampled
and processed analogously to the procedure employed in the generator,
after which it is concatenated with the image features to form a
label-aligned feature map. Finally, a series of upsampling layers is
applied to restore the spatial dimensions of the feature map,
ultimately yielding a probability map that reflects the authenticity of
the input image $x$ under the condition $y$. The detailed design of the
discriminator is as follows:
Algorithm 2.
scMASKGAN Discriminator
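The two adversarial objectives (Eq. 3 for the generator, Eq. 5 for the discriminator) can be estimated from a batch of discriminator outputs as in the sketch below. The function names are hypothetical, and the inputs are assumed to be probabilities strictly between 0 and 1, already conditioned on the label $y$; the sketch uses the minimax form exactly as written in the equations, not the non-saturating variant that many implementations prefer in practice.

```python
import math


def discriminator_objective(d_real, d_fake):
    """Batch estimate of Eq. 5: E[log D(x|y)] + E[log(1 - D(G(z|y)|y))].

    The discriminator is trained to maximise this value.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Batch estimate of Eq. 3: E[log(1 - D(G(z|y)|y))].

    The generator is trained to minimise this value.
    """
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
```

When the discriminator cannot tell real from generated samples and outputs 0.5 everywhere, `discriminator_objective` evaluates to 2·log(0.5); a generator that pushes the discriminator's output on fake samples toward 1 drives `generator_loss` down.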
Overall, the generator and discriminator are trained jointly within an
adversarial framework. The generator produces realistic gene expression
data conditioned on the label matrix $y$, while the discriminator
distinguishes between real and generated data based on the same
cell-type label conditioning. This architecture enables the generation
of realistic gene expression profiles. The training process of
scMASKGAN is as follows:
Algorithm 3.
scMASKGAN Training Process
Imputation method
For a given cell $c_i$ in subgroup $K_{c_i}$, we first generate a
candidate set $A_{K_{c_i}}$ of $n_{can}$ expression profiles using a
well-trained scMASKGAN. Specifically, the one-hot encoded cell-type
vectors are combined with randomly sampled noise and fed into the
pre-trained generator. Through multiple transformations and mapping
processes, the generator produces a candidate set of gene expression
profiles that correspond to the given cell type. We then apply the
Isolation Forest (ISOforest) algorithm [[103]31] (Fig. [104]1D) to
detect and remove outliers by constructing isolation trees and
assigning anomaly scores based on path lengths, where shorter paths
indicate potential outliers. Specifically, each tree recursively
partitions the data by randomly selecting features and split points.
The depth at which a data point is isolated, i.e., the number of splits
required to isolate it, reflects the difficulty of isolating that
point. To quantify the degree of anomaly, an anomaly score $s(x)$ is
computed for each candidate gene expression profile using the formula
$s(x) = 2^{-E(h(x))/c(n)}$, where $E(h(x))$ denotes the average path
length of data point $x$ across all isolation trees, and $c(n)$
represents the average path length for a dataset of size $n$. Since
anomalous points are generally easier to isolate, they tend to have
shorter path lengths and, consequently, higher anomaly scores. Finally,
a threshold $\tau$ is set to identify outliers, where a data point is
classified as an anomaly if $s(x) > \tau$. The threshold $\tau$ can be
dynamically determined based on the distribution of anomaly scores
within the dataset. Moreover, in Sect. [105]4.3 we find that using the
Isolation Forest anomaly detection algorithm to identify and remove
outliers during the imputation of batch data eliminates anomalies
introduced during batch sequencing and thereby
reduces the impact of batch effects.
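The anomaly score can be worked through numerically. The sketch below assumes that the mean path lengths E(h(x)) have already been obtained from a fitted forest, and it takes c(n) from the standard Isolation Forest formulation, c(n) = 2H(n−1) − 2(n−1)/n with H(i) ≈ ln(i) + 0.5772156649, which the text above does not spell out. The function names and the fixed threshold are illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649


def c(n):
    """Normalising constant: average path length of an unsuccessful
    binary-search-tree lookup over n points (standard Isolation Forest form)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length, n):
    """s(x) = 2 ** (-E(h(x)) / c(n)); requires n >= 2."""
    return 2.0 ** (-mean_path_length / c(n))


def filter_candidates(candidates, mean_path_lengths, n, tau):
    """Keep candidates whose score does not exceed tau (s(x) > tau is an outlier)."""
    return [
        x for x, h in zip(candidates, mean_path_lengths)
        if anomaly_score(h, n) <= tau
    ]
```

A point whose mean path length equals c(n) scores exactly 0.5; with n = 256, a candidate isolated after about 2 splits scores roughly 0.87 and is removed at τ = 0.6, while one needing about 12 splits scores roughly 0.44 and is kept.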
Next, we measure the similarity between cells using the Euclidean
distance $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x$ and
$y$ represent the gene expression vectors of two cells, with a
dimension of $n$ (i.e., the number of genes). To ensure computational
accuracy and comparability, all gene expression data undergo
normalization preprocessing before computing the Euclidean distance,
thereby eliminating the influence of varying gene expression ranges.
Furthermore, to enhance the reliability of nearest neighbor selection,
we define a dynamically adjusted similarity threshold $\theta$, which
is determined based on the statistical distribution of Euclidean
distances between all cell pairs. Typically, it is set as the mean
$\mu$ plus a certain multiple of the standard deviation $\sigma$, i.e.,
$\theta = \mu + k\sigma$. To optimize the imputation performance, we
further employ Optuna [[106]32] for hyperparameter tuning, primarily
optimizing the $k$-nearest neighbors (KNN) parameter $k$. The objective
is to search for the optimal $k$ value that minimizes the mean squared
error (MSE) between the original and imputed data, ensuring that
selected neighbors exhibit high similarity while avoiding the influence
of outliers. Ultimately, only cells with a Euclidean distance less than
$\theta$ are retained as candidate neighbors, improving the reliability
and biological interpretability of the imputed data. To estimate the
expression value of gene $j$ in cell $c_i$, we use the following
formula:

$$\hat{c}_{i,j} = \begin{cases} c_{i,j}, & \text{if } c_{i,j} > 0, \\ c_{i,kNN,j}, & \text{otherwise.} \end{cases} \tag{6}$$

Here, $\hat{c}_{i,j}$ denotes the estimated expression value, $c_{i,j}$
is the original expression value from the raw profile, and
$c_{i,kNN,j}$ represents the value estimated from the expression
profiles of the nearest neighbor cells. If the expression value of gene
$j$ in cell $c_i$ is greater than zero, the original value is retained.
Otherwise, the corresponding gene expression value from the nearest
neighbors is used
for imputation.
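The selection-and-fill rule of Eq. 6 can be sketched as follows. Several details here are assumptions made only for illustration: the function name `impute_cell`, computing μ and σ from the distances between the cell and its candidates (the text uses distances between all cell pairs), using the population standard deviation, and averaging the retained neighbours to obtain the nearest-neighbour value (the text does not state how neighbours are aggregated).

```python
import math
import statistics


def euclidean(x, y):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def impute_cell(cell, candidates, k_sigma=1.0, n_neighbors=3):
    """Fill zero entries of `cell` from nearby generated candidates (Eq. 6)."""
    if not candidates:
        return list(cell)
    dists = [euclidean(cell, cand) for cand in candidates]
    # Threshold theta = mu + k * sigma over the observed distances (assumption).
    theta = statistics.mean(dists) + k_sigma * statistics.pstdev(dists)
    ranked = sorted((d, i) for i, d in enumerate(dists) if d < theta)
    nearest = [candidates[i] for _, i in ranked[:n_neighbors]]
    if not nearest:
        return list(cell)  # no candidate is close enough; leave the cell as is
    imputed = []
    for j, value in enumerate(cell):
        if value > 0:
            imputed.append(value)  # observed values are never overwritten
        else:
            # Assumption: neighbour value = mean over the retained neighbours.
            imputed.append(sum(c[j] for c in nearest) / len(nearest))
    return imputed
```

For example, a cell `[2.0, 0.0, 0.0]` with candidates `[2.0, 1.0, 0.0]`, `[2.0, 3.0, 0.0]` and a distant `[9.0, 9.0, 9.0]` keeps its observed first gene, fills the second gene with the mean of the two close candidates, and ignores the distant candidate because its distance exceeds θ.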
Finally, we obtained a matrix consisting of the imputed scRNA-seq data
along with padded zeros. Since we applied a masking mechanism that
ignored these zeros, they were not processed during model training and
imputation. Therefore, by removing the padded zeros, we obtained the
final imputed matrix.
Parameter adjustment strategies
The parameter-tuning strategy involves carefully adjusting the learning
rate (lr), typically starting from values such as 0.001 or 0.0001, and
then employing Adam [[107]33] with weighted iterative refinements based
on observed convergence performance. The parameter $img\_size$, defined
as the dimension $N$, is determined by the gene count: a higher gene
count necessitates larger dimensions. The latent dimension
$latent\_dim$ is consistently set equal to $img\_size$ to preserve
spatial coherence. Cell-type labels ($y$) and the $n_{cls}$ parameter
are adjusted simultaneously to ensure biologically relevant category
assignments and effective imputation. In addition, we present the
parameter settings and convolutional layer configurations of the
generator and discriminator.
Generator parameters and convolutional architecture:
1. Perform linear transformation and reshape the latent noise vector
$z$ through a fully connected layer, generating an initial tensor
$z_n$ of dimension $(32, N, N)$.
2. Conduct convolution on $z_n$ using convolutional layer $GConv_1$
with kernel size $3 \times 3$, obtaining tensor $z_{conv}$ of
dimension $(32, N, N)$.
3. Apply the self-attention mechanism on $z_{conv}$ to derive an
attention-refined tensor.
4. Perform convolution on one-hot encoded cell-type label tensors
using convolutional layer $LConv_1$ with kernel size $3 \times 3$,
producing tensor $L_n$ of dimension $(8, N, N)$.
5. Concatenate the attention-refined $z_{conv}$ and $L_n$ into tensor
$G_{Concat}$ with dimension $(40, N, N)$.
6. Execute sequential convolutional operations $GConv_2^1$ and
$GConv_2^2$, each with kernel size $3 \times 3$, on $G_{Concat}$,
yielding the generator output tensor of dimensions $(1, N, N)$.
Discriminator parameters and convolutional architecture:
1. Perform initial linear transformation and reshape the input tensor
into dimensions $(1, N, N)$.
2. Apply convolutional layer $DConv_1$ with kernel size $3 \times 3$,
followed by max-pooling and a residual block, yielding a tensor of
dimension $(32, N/2, N/2)$.
3. Conduct convolutional layer $DConv_2$ with kernel size
$3 \times 3$, further reducing tensor dimension to $(16, N/2, N/2)$.
4. Integrate the self-attention mechanism on this reduced tensor to
capture inter-feature dependencies.
5. Perform convolution on one-hot encoded cell-type label tensors
using convolutional layer $LConv_2$ with kernel size $3 \times 3$,
generating tensor $L_d$ of dimension $(8, N/2, N/2)$.
6. Concatenate the attention-refined tensor and $L_d$, forming tensor
$D_{Concat}$ with dimension $(24, N/2, N/2)$.
7. Apply fully connected layers and additional convolutional layers
with kernel size $3 \times 3$ to produce the discriminator output.
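The channel bookkeeping in the two lists above can be checked with a short shape trace. The sketch tracks only (channels, height, width) tuples and performs no convolution; the function names are hypothetical, and it assumes that every 3×3 convolution preserves the spatial size and that N is even so that the N/2 downsampling is exact.

```python
def concat_channels(a, b):
    """Channel-wise concatenation of two (C, H, W) shapes."""
    if a[1:] != b[1:]:
        raise ValueError(f"spatial mismatch: {a} vs {b}")
    return (a[0] + b[0],) + a[1:]


def trace_shapes(n):
    """Return the tensor shapes listed for the generator and discriminator."""
    if n % 2:
        raise ValueError("N must be even for the N/2 downsampling step")
    gen = {
        "z_n": (32, n, n),       # after the fully connected layer and reshape
        "z_conv": (32, n, n),    # after GConv_1 (size-preserving, assumed)
        "L_n": (8, n, n),        # label branch after LConv_1
    }
    gen["G_concat"] = concat_channels(gen["z_conv"], gen["L_n"])
    gen["output"] = (1, n, n)
    half = n // 2
    disc = {
        "input": (1, n, n),
        "after_DConv1": (32, half, half),  # conv + max-pooling + residual block
        "after_DConv2": (16, half, half),
        "L_d": (8, half, half),            # label branch after LConv_2
    }
    disc["D_concat"] = concat_channels(disc["after_DConv2"], disc["L_d"])
    return gen, disc
```

For N = 64 this gives a generator concatenation of shape (40, 64, 64) and a discriminator concatenation of shape (24, 32, 32), matching the 32 + 8 and 16 + 8 channel counts stated in the lists.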
The size of the convolutional kernel and the dimensional settings are
adjusted based on the number of genes. The dimension is typically tuned
between 32 and 64, which is sufficient to capture the characteristics
of various types of data. The residual parameter $\gamma$ serves as a
balancing factor within residual blocks, moderating the interplay
between real and generated features to stabilize model training.
Typically, $\gamma$ is adjusted experimentally: higher values emphasize
generated features, whereas lower values prioritize real features. This
parameter is fine-tuned iteratively based on convergence behavior.
Metrics evaluation
The primary objective of data imputation is to construct a “gold
standard” dataset by rectifying false zeros introduced by technical
noise, thereby accurately recovering the true expression levels of
genes. In this process, genes that are genuinely unexpressed remain as
true zeros. To mitigate the impact of zero values on computational
analyses, minimal non-zero values are assigned to lowly expressed
genes, while the expression levels of highly expressed genes remain
largely unaffected.
To rigorously assess the performance of scMASKGAN, we conducted
comprehensive comparisons with 12 state-of-the-art imputation
algorithms (AutoImpute, DCA, DeepImpute, DrImpute, ENHANCE,
MAGIC, SAVER, scImpute, SCRABBLE, VIPER, scIGANs, and scGAIN) across
seven real scRNA-seq datasets. Evaluation metrics included the
following:
* Uniform Manifold Approximation and Projection (UMAP) distribution
plots: Used to visualize the global structure of the data before
and after imputation, ensuring that the overall clustering patterns
remain biologically meaningful.
* Z-score standardized distribution: Examined to verify that imputed
data retains a normalized distribution, facilitating comparability
with the original data and minimizing artificial distortions.
* Coefficient of Variation (CV): Assessed to maintain gene expression
variability, ensuring that important biological signals are not
lost during imputation.
* Wasserstein distance and Jensen-Shannon distance: Employed to
quantify distributional differences between the imputed and
original datasets, providing robust statistical metrics for
evaluating imputation performance.
* Accuracy (ACC): Evaluated to measure the ability of imputation
methods to correctly classify cell types based on reconstructed
expression profiles.
* Area Under the Receiver Operating Characteristic Curve (AUC): Used
to assess the sensitivity and specificity of gene expression
restoration, providing a robust measure of imputation reliability.
* F1 scores: Computed to balance precision and recall, particularly
in identifying dropout events versus true biological zeros.
* Pearson correlation coefficients: Calculated relative to the
original datasets to quantify the fidelity of imputed data,
ensuring that gene-gene relationships remain intact.
These metrics were rigorously analyzed to provide a comprehensive
assessment of the efficacy of scMASKGAN in accurately imputing gene
expression data.
UMAP distribution
UMAP is widely used for dimensionality reduction in scRNA-seq, enabling
the visualization of high-dimensional data in two dimensions and
facilitating data distribution analysis [[108]34]. As illustrated in
Figure [109]2, UMAP projections of imputed Humanbrain datasets reveal
that methods such as MAGIC, AutoImpute, DeepImpute, VIPER, scGAIN, and
scMASKGAN effectively recover data structure. Notably, AutoImpute
exhibits superior cell type separation, clearly distinguishing neurons,
astrocytes, and other cell types with clustering patterns that closely
resemble the original data. scMASKGAN performs similarly well,
maintaining strong alignment with the inherent data structure and
suggesting minimal alteration of the underlying biological
distribution. While DeepImpute and VIPER also achieve reasonable cell
type separation, minor overlaps between clusters are observed. In
contrast, DCA, scImpute, and scGAIN yield substantial cell type
overlap and weaker clustering, reducing the interpretability of the
imputed data. Overall, AutoImpute and scMASKGAN emerge as the
best-performing imputation methods, effectively preserving cell type
structure and distribution for robust downstream analysis.
Fig. 2.
The UMAP projections of the original scRNA-seq data alongside those of
data imputed by 13 different methods, accompanied by cell-type labels
Furthermore, we have presented the distributions of other datasets in
Supplementary Figures 1 to 8. Supplementary Figures 1 and 2 display the
UMAP distributions of the original neuroblastoma data across 10 groups
and the corresponding imputed data generated by scMASKGAN,
demonstrating that even under extremely high dropout rates, scMASKGAN
maintains excellent imputation performance. Additionally, Supplementary
Figures 3 to 8 illustrate the t-SNE and UMAP distributions of the
datasets in Table 1, along with the imputed data distributions from
scMASKGAN, clearly indicating that favorable imputation results are
consistently achieved under various dropout rates.
Coefficient of variation
The Coefficient of Variation (CV) [[112]35] serves as a standardized
metric for quantifying data dispersion: higher CV values denote
increased variability, whereas lower CV values indicate greater
consistency. Fig. [113]3A presents the CV values across datasets with
varying dropout rates. Notably, the SCRABBLE method exhibits
comparatively elevated CV values, indicative of pronounced variability.
The DCA method also shows exceptionally high variability in the Mouse
ESC dataset (70% dropout), which we attribute to potential inaccuracies
in parameter estimation due to the large data volume. Conversely, the
AutoImpute, scMASKGAN, and scIGANs methods yield relatively lower CV
values, reflecting superior imputation performance.
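As a minimal sketch of the metric discussed above, the CV of a gene can be computed as the standard deviation of its expression across cells divided by the mean. The function names and the toy matrix below are illustrative assumptions, not the authors' implementation; the use of the population (rather than sample) standard deviation is also an assumption, since the text does not specify which is used.

```python
import math

def coefficient_of_variation(values):
    """Return std / mean for one gene's expression across cells."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return float("nan")  # CV is undefined when the mean is zero
    # Assumption: population variance (divide by n, not n - 1).
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance) / mean

def per_gene_cv(matrix):
    """matrix: list of cells, each a list of per-gene expression values."""
    genes = zip(*matrix)  # transpose so that each item is one gene
    return [coefficient_of_variation(list(g)) for g in genes]

# Hypothetical toy matrix: 3 cells x 3 genes.
expr = [
    [2.0, 0.0, 10.0],
    [4.0, 0.0, 10.0],
    [6.0, 0.0, 10.0],
]
cvs = per_gene_cv(expr)
```

In this toy example the first gene varies across cells, the second is never expressed (CV undefined), and the third is constant (CV of zero).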
Fig. 3.
Comparison of imputation metrics. Panel A shows the CV coefficients of
different imputation methods across various dropout rates. Panel B
presents the Z-score standardized distribution comparison between
scMASKGAN imputed data and the original data under extreme missingness.
Panel C illustrates the CV coefficients comparison between scMASKGAN
imputed data and the original data under extreme missingness
conditions. Panel D compares multiple imputation methods using the JS
test and Wasserstein distance
Furthermore, to rigorously assess the robustness of scMASKGAN under
extreme missing data conditions, Fig. [116]3C displays the CV values
for scMASKGAN’s imputed results across 10 neuroblastoma dataset groups.
In conjunction with the UMAP distribution analyses, these findings
suggest that scMASKGAN effectively integrates low variability with high
fidelity to the original data distribution, thereby representing an
exemplary imputation approach that is critical for the accuracy of
downstream analyses.
Z-score distribution
Z-score [[117]36] distribution refers to the distribution obtained by
applying Z-score normalization to the data. Specifically, for each data
point
[MATH: x :MATH]
, the Z-score is defined as
[MATH: z = \frac{x - \mu}{\sigma} :MATH]
(7)
where
[MATH: μ :MATH]
is the mean of the data and
[MATH: σ :MATH]
is the standard deviation. This normalization procedure transforms the
data such that the resulting distribution has a mean of 0 and a
standard deviation of 1. We utilize the Z-score distribution to compare
the original data with the imputed data and to detect significant
deviations. A more concentrated Z-score distribution indicates fewer
outliers and superior data quality. As shown in Supplementary Figure 9,
the Z-score distributions for datasets with varying dropout rates
demonstrate that scMASKGAN performs outstandingly. Moreover,
Fig. [118]3B presents the Z-score distributions of both the original
and the scMASKGAN-imputed data under extremely high dropout rates. When
combined with the UMAP distribution and CV coefficient analyses from
the preceding sections, these results indicate that scMASKGAN maintains
excellent performance even under extreme conditions.
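The normalization described in this section can be sketched in a few lines. The helper name `z_scores` and the sample values are hypothetical; mapping a constant input to all zeros and using the population standard deviation are assumptions made here to keep the sketch well defined.

```python
import math

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    if sigma == 0:
        # Assumption: a constant vector is mapped to all zeros.
        return [0.0] * n
    return [(v - mu) / sigma for v in values]

# Hypothetical expression values for one gene.
original = [0.0, 1.0, 2.0, 3.0, 4.0]
z = z_scores(original)

# After standardization, the mean is 0 and the variance is 1.
z_mean = sum(z) / len(z)
z_var = sum(v * v for v in z) / len(z)
```

Applying the same function to the original and the imputed values gives two standardized distributions that can be compared directly, for example by overlaying their histograms.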
Statistical tests
In Fig. [119]3D, we compare the JS distance [[120]37] and the
Wasserstein distance (EMD) [[121]38] between the imputed data generated
by various methods and the original data. The JS distance is
derived from the Jensen-Shannon divergence, which is defined as
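[MATH: \mathrm{JSD}(P \,\|\, Q) = \frac{1}{2} D_{\mathrm{KL}}\left(P \,\middle\|\, \frac{P+Q}{2}\right) + \frac{1}{2} D_{\mathrm{KL}}\left(Q \,\middle\|\, \frac{P+Q}{2}\right) :MATH]
(8)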
where
[MATH: D_{\mathrm{KL}} :MATH]
denotes the Kullback–Leibler divergence. The JS distance is then
obtained as the square root of the Jensen-Shannon divergence:
[MATH: \mathrm{JS\ distance}(P, Q) = \sqrt{\mathrm{JSD}(P \,\|\, Q)} :MATH]
(9)
The Wasserstein distance (also known as the Earth Mover’s Distance,
EMD) is defined as
[MATH: W_1(P, Q) = \inf_{\gamma \in \Gamma(P, Q)} \int_{X \times X} \|x - y\| \, d\gamma(x, y), :MATH]
(10)
where
[MATH: Γ(P,Q) :MATH]
denotes the set of joint distributions with marginals
[MATH: P :MATH]
and
[MATH: Q :MATH]
, and
[MATH: ‖x-y‖ :MATH]
represents the distance between
[MATH: x :MATH]
and
[MATH: y :MATH]
in the space
[MATH: X :MATH]
.
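To make the two definitions concrete, the following sketch computes the JS distance between two discrete distributions and the Wasserstein-1 distance between two one-dimensional samples. It is an illustration under stated assumptions, not the evaluation code used in the study: the natural logarithm is assumed for the KL divergence, inputs to `js_distance` are assumed to be probability vectors over the same bins, and `wasserstein_1d` uses the closed form for equal-size samples on the real line, where the optimal coupling pairs sorted values. In practice, library routines such as `scipy.spatial.distance.jensenshannon` and `scipy.stats.wasserstein_distance` could be used instead.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture (P + Q) / 2
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding

def wasserstein_1d(xs, ys):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Hypothetical histograms over the same three expression bins.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
d_js = js_distance(p, q)

# Hypothetical expression samples; ys is xs shifted by 5.
d_w = wasserstein_1d([0.0, 1.0, 3.0], [5.0, 6.0, 8.0])
```

Both quantities are zero for identical inputs and grow as the imputed distribution drifts away from the original one, which is why lower values are read as better agreement.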
Our results indicate that scMASKGAN exhibits outstanding performance
with respect to both metrics, as evidenced by its low JS distance and
low Wasserstein distance, which suggest a high degree of consistency
between the imputed data and the original data. In contrast, the
AutoImpute, scGAIN, scIGANs, and SCRABBLE methods show only moderate
performance in terms of the Wasserstein distance. Furthermore, scIGANs
and AutoImpute perform poorly in terms of the JS distance. A closer
examination of the imputed data reveals that scIGANs tends to impute
all zero values rather than only dropout events, and consequently
introduces a large number of spurious biological signals.
Additionally, in order to achieve higher
metric scores, AutoImpute introduces negative values into the gene
expression matrix, a practice that is biologically implausible.
The metric calculations also highlight the limitations in adaptability
and stability of both scImpute and DeepImpute across diverse datasets. In
particular, while scImpute can effectively impute data from certain
platforms, its output for other datasets results in missing values when
calculating the JS distance, EMD statistics, and Z-score distributions,
rendering these metrics uncomputable. Meanwhile, DeepImpute produced
usable results solely in the Human Brain dataset. These findings
suggest that some methods may require further optimization or
integration with other approaches to enhance estimation performance and
ensure the reliability of subsequent metric calculations.
Cluster metrics
ACC, AUC, and F1 score are standard classification evaluation metrics
used in clustering, with values ranging from 0 to 1, where higher
values indicate better clustering performance [[122]39]. To further
assess the effectiveness of various imputation algorithms, we employed
the Louvain algorithm to cluster seven datasets and compared the
resulting clustering labels with the corresponding cell types. The
performance in terms of ACC, AUC, and F1 scores is summarized in Tables
[123]3 and [124]4. Table [125]3 details different types of scRNA-seq
data, with "..." indicating that certain algorithms were not applicable
to specific datasets, while Table [126]4 records scRNA-seq data from
different platforms. Figure [127]4 visualizes these results.
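Comparing cluster labels with cell types requires some way of matching arbitrary cluster identifiers to named cell types. The text does not state how this matching was done, so the sketch below assumes a simple majority-vote mapping (each cluster is assigned the most common true cell type among its members) and then computes accuracy and macro-averaged F1. All function names and the toy labels are hypothetical, and AUC is omitted for brevity.

```python
from collections import Counter, defaultdict

def map_clusters_to_types(cluster_ids, cell_types):
    """Assign each cluster the most common true cell type among its cells."""
    members = defaultdict(list)
    for c, t in zip(cluster_ids, cell_types):
        members[c].append(t)
    return {c: Counter(ts).most_common(1)[0][0] for c, ts in members.items()}

def accuracy(pred, true):
    """Fraction of cells whose predicted type equals the true type."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

def macro_f1(pred, true):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in sorted(set(true)):
        tp = sum(p == label and t == label for p, t in zip(pred, true))
        fp = sum(p == label and t != label for p, t in zip(pred, true))
        fn = sum(p != label and t == label for p, t in zip(pred, true))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical cluster assignments and annotated cell types for 8 cells.
clusters = [0, 0, 0, 1, 1, 1, 1, 2]
types = ["neuron", "neuron", "astro", "astro",
         "astro", "astro", "neuron", "neuron"]

mapping = map_clusters_to_types(clusters, types)
pred = [mapping[c] for c in clusters]
acc = accuracy(pred, types)
f1 = macro_f1(pred, types)
```

Majority voting is only one possible matching rule; an optimal one-to-one assignment between clusters and cell types is a common alternative and can give different scores when the number of clusters differs from the number of cell types.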
Table 3.
Clustering metrics for 13 imputation algorithms across various datasets
Method     | Human brain       | ERCC spike-in     | Mouse ESC         | Time-course
           | ACC   AUC   F1    | ACC   AUC   F1    | ACC   AUC   F1    | ACC   AUC   F1
AutoImpute | 0.876 0.745 0.634 | 0.608 0.578 0.453 | 0.518 0.575 0.518 | 0.743 0.669 0.507
DCA        | 0.781 0.669 0.472 | 0.542 0.538 0.434 | 0.502 0.666 0.506 | 0.500 0.667 0.504
DeepImpute | 0.803 0.678 0.492 | 0.559 0.557 0.454 | 0.640 0.730 0.721 | 0.846 0.865 0.874
DrImpute   | 0.834 0.732 0.580 | 0.604 0.61  0.513 | ...   ...   ...   | ...   ...   ...
ENHANCE    | 0.862 0.751 0.626 | 0.746 0.718 0.624 | ...   ...   ...   | ...   ...   ...
MAGIC      | 0.812 0.689 0.512 | 0.563 0.542 0.423 | 0.515 0.670 0.515 | 0.500 0.667 0.494
SAVER      | 0.856 0.804 0.671 | 0.525 0.526 0.425 | ...   ...   ...   | ...   ...   ...
scGAIN     | 0.626 0.562 0.332 | 0.359 0.501 0.491 | 0.749 0.772 0.808 | 0.739 0.792 0.789
scImpute   | 0.857 0.767 0.638 | 0.711 0.734 0.649 | 0.904 0.910 0.966 | 0.977 0.977 0.998
SCRABBLE   | 0.481 0.575 0.367 | 0.540 0.522 0.404 | ...   ...   ...   | ...   ...   ...
VIPER      | 0.768 0.662 0.460 | 0.511 0.507 0.404 | ...   ...   ...   | ...   ...   ...
scIGANs    | 0.802 0.678 0.492 | 0.546 0.525 0.405 | 0.812 0.785 0.940 | 0.542 0.684 0.549
scMASKGAN  | 0.975 0.965 0.964 | 0.54  0.954 0.667 | 0.614 0.704 0.684 | 0.946 0.948 0.949
The bold indicates the best results of the test metrics across
different datasets
Table 4.
Clustering metrics for 13 imputation algorithms across platform
datasets
Method     | sc_10X            | sc_CELseq2        | sc_Dropseq
           | ACC   AUC   F1    | ACC   AUC   F1    | ACC   AUC   F1
AutoImpute | 0.979 0.977 0.969 | 0.577 0.565 0.461 | 0.821 0.800 0.737
DCA        | 0.997 0.977 0.997 | 0.995 0.994 0.993 | 0.935 0.926 0.905
DeepImpute | 0.997 0.997 0.996 | 0.990 0.988 0.985 | 0.994 0.993 0.991
DrImpute   | 0.996 0.995 0.993 | 0.980 0.977 0.970 | 0.909 0.897 0.865
ENHANCE    | 0.791 0.800 0.726 | 0.958 0.976 0.996 | 0.994 0.993 0.991
MAGIC      | 0.953 0.948 0.930 | 0.546 0.507 0.366 | 0.546 0.535 0.429
SAVER      | 0.996 0.995 0.993 | 0.990 0.988 0.985 | 0.994 0.993 0.991
scGAIN     | 0.880 0.872 0.825 | 0.500 0.508 0.424 | 0.481 0.500 0.425
scImpute   | 0.658 0.668 0.503 | 0.940 0.883 0.787 | 0.930 0.910 0.890
SCRABBLE   | 0.990 0.988 0.985 | 0.501 0.500 0.406 | 0.519 0.511 0.409
VIPER      | 0.999 0.998 0.999 | 0.564 0.517 0.365 | 0.923 0.920 0.890
scIGANs    | 0.658 0.668 0.503 | 0.430 0.035 0.601 | 0.412 0.667 0.500
scMASKGAN  | 0.999 0.981 0.966 | 0.969 0.503 0.668 | 0.997 0.836 0.803
The bold indicates the best results of the test metrics across
different datasets
Fig. 4.
The clustering metrics for imputation algorithms compared to cell-type
labels. The figure presents boxplots comparing ACC, AUC, and F1 scores
of 13 imputation methods across different datasets, with each color
representing a different method
scMASKGAN outperforms other methods in terms of ACC, followed by DCA,
DeepImpute, and SAVER, which also perform strongly. In terms of the F1
score, scMASKGAN, DeepImpute, and SAVER exhibit the highest median
values. For AUC, scMASKGAN, DeepImpute, and SAVER again lead, with
scMASKGAN maintaining a high median and the most stable distribution,
as indicated by the boxplot.
These results suggest that scMASKGAN not only excels
across all metrics but also demonstrates superior stability and
generalizability, making it well-suited for diverse scRNA-seq data
types, sizes, and platforms.
Correlation analysis
In addition to comparing clustering performance, we evaluated each
imputation method by analyzing the Pearson correlation between the
imputed and original data [[131]40]. Figure [132]5 presents boxplots of
the Pearson correlation coefficients, where scMASKGAN, SAVER, scImpute,
and SCRABBLE show higher median values, indicating a strong alignment
with the original data and, thus, more reliable imputation. In
contrast, MAGIC, ENHANCE, and VIPER display lower correlation values,
implying that their imputed data deviate more from the observed
values, likely due to the introduction of bias during imputation.
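For reference, the Pearson correlation between an original and an imputed expression vector can be sketched as follows. The function name and the sample vectors are hypothetical, and whether the study computes the coefficient per cell, per gene, or over the flattened matrix is not stated, so this sketch simply correlates two vectors of equal length.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when either vector is constant
    return cov / (sx * sy)

# Hypothetical expression values for one cell before and after imputation.
original = [0.0, 2.0, 0.0, 5.0, 3.0]
imputed = [0.4, 2.1, 0.3, 4.8, 3.2]
r = pearson(original, imputed)
```

A coefficient close to 1 means the imputed values preserve the relative ordering and scale of the observed values, while values near 0 indicate that imputation has largely overwritten the original signal.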
Fig. 5.
Boxplot comparison of Pearson correlation coefficients among 13
imputation methods
In conclusion, integrating both clustering performance and correlation
analyses, scMASKGAN and SAVER emerge as the best-performing imputation
methods, excelling in preserving data consistency and stability.
Conversely, ENHANCE and MAGIC show limitations in both aspects, making
them less suitable for tasks that require high-fidelity imputation.
Cost analysis
To rigorously assess the performance of scMASKGAN, we conducted
benchmarking experiments using a computational platform comprising two
NVIDIA 4090 GPUs (24GB each) and one NVIDIA A4000 GPU (16GB). We
evaluated both the memory consumption and the runtime efficiency of 13
imputation methods, including scMASKGAN, on gene expression datasets
spanning 1,000 to 100,000 cells. As illustrated in Fig. [134]6,
scMASKGAN demonstrates competitive performance: it ranks second in
memory consumption, surpassed only by scIGANs, while its execution
speed is slightly lower than that of the MAGIC and DeepImpute methods.
Overall, these findings underscore the robust computational performance
of scMASKGAN, affirming its efficacy in terms of both memory usage and
runtime efficiency for large-scale single-cell transcriptomic analyses.
Fig. 6.
Comparison of the memory consumption and the runtime efficiency among
13 imputation methods
Downstream analysis
We demonstrated the biological relevance of scMASKGAN for data recovery
through comprehensive validation using several biological datasets.
Specifically, we utilized Mouse ESC scRNA-seq data, temporal data, and
10 neuroblastoma sample datasets. We conducted a series of analyses,
including gene-gene correlation analysis, temporal analysis, gene
enrichment pathway analysis, batch data imputation, and differential
gene expression analysis, to assess the performance of scMASKGAN across
different types of biological data. These analyses not only help
clarify the role of scMASKGAN in biological data recovery but also
provide strong evidence for its potential use in a variety of
biological research applications.
Gene-gene correlation
Gene-gene correlation reveals similarities in expression patterns,
aiding in the identification of regulatory networks and gene functions
[[136]41]. We analyzed this using a Mouse ESC scRNA-seq dataset,
previously applied in scIGANs studies, which comprises 44 cell cycle
genes across over 6,800 cells. Figure [137]7A shows confusion matrices
based on Pearson correlation coefficients for both original and imputed
data. Notably, significant correlations, such as those among Mcm6, Nasp,
Mcm2, and Pcna, as well as between Cdk1, Cenpl, and Top2a, are preserved
post-imputation. Additionally, new correlations between cdc20 and
Cenpa/PLK1, and between Msh2 and Mcm2/Mcm6, align with known
co-expression relationships [[138]42, [139]43]. Hierarchical clustering
further confirms that the imputed data retains the original data’s
structural characteristics. Overall, these results validate that
scMASKGAN effectively restores biologically meaningful gene
correlations while preserving the inherent data structure.
Fig. 7.
Gene-gene correlation analysis for mouse ESC data, trajectory analysis
of Time-course scRNA-seq data, and gene pathway enrichment analysis.
Panel A shows a heatmap of gene-gene correlations for scMASKGAN-imputed
versus original data, with the left color bar representing hierarchical
clustering results and the middle section showing a confusion matrix of
44 gene groups where darker blue indicates stronger correlations. Panel
B displays trajectory analysis plots for original and imputed data.
Panel C shows temporal profiles for six marker genes, with the red line
indicating imputed data and the blue line showing original data. Panel
D features bar and bubble charts of GO enrichment analysis for FCER1A,
used to assess the reasons for its temporal upregulation
Temporal analysis
Temporal analysis is employed in scRNA-seq to infer cellular
trajectories during dynamic processes [[142]44]. In this study, we used
scMASKGAN to impute a time-course scRNA-seq dataset capturing the
differentiation of H1 embryonic stem cells (ESCs) into definitive
endodermal cells (DEC), and then reconstructed trajectories with the
Monocle3 R package [[143]45]. As illustrated in Fig. [144]7B, UMAP
plots of the original data reveal distinct, dispersed clusters due to
technical noise, while the scMASKGAN-imputed data show smoother
transitions and enhanced connectivity between cells. Further temporal
analysis of marker genes (Fig. [145]7C) indicates that genes such as
CD14, CD3D, MS4A1, and PPBP exhibit increased expression along clearer
trajectories, whereas FCER1A, previously affected by dropout events,
demonstrates a gradual expression increase over time. GO enrichment
analysis (Fig. [146]7D) confirms that these genes are primarily
associated with immune functions, suggesting that the progressive
activation of immune-related pathways underlies the differentiation
process from ESC to DEC.
Batch data imputation
We use scRNA-seq data from the same batch to impute and integrate 10
homologous neuroblastoma samples from clinical cases. All data undergo
a complete Scanpy [[147]46] quality control pipeline: ensuring that
each gene is expressed in at least 3 cells, each cell contains at least
200 genes, and cells with mitochondrial gene expression exceeding 5%
are removed. Subsequently, the data are log-transformed and subjected
to PCA for dimensionality reduction. These preprocessing steps enable
us to assess whether the original gene expression patterns are
maintained and to evaluate the restoration of key marker genes.
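The quality-control thresholds described above can be illustrated with a small, dependency-free sketch. This is not the authors' pipeline: the function `qc_filter`, the order of the filtering steps, the `MT-` prefix used to recognize mitochondrial genes, and the toy gene names are all assumptions, and the PCA step is omitted. In Scanpy itself, comparable steps are usually expressed with `sc.pp.filter_cells`, `sc.pp.filter_genes`, `sc.pp.calculate_qc_metrics`, `sc.pp.log1p`, and `sc.pp.pca`, although the source does not list the exact calls used.

```python
import math

def qc_filter(counts, gene_names, min_genes=200, min_cells=3,
              max_mito_pct=5.0):
    """counts: list of cells, each a list of per-gene raw counts."""
    # 1. Keep cells that express at least `min_genes` genes.
    cells = [row for row in counts if sum(v > 0 for v in row) >= min_genes]

    # 2. Keep genes detected in at least `min_cells` of the remaining cells.
    keep = [j for j in range(len(gene_names))
            if sum(row[j] > 0 for row in cells) >= min_cells]
    genes = [gene_names[j] for j in keep]
    cells = [[row[j] for j in keep] for row in cells]

    # 3. Drop cells whose mitochondrial share of counts exceeds the limit.
    #    Assumption: mitochondrial genes carry the prefix "MT-".
    mito = [j for j, g in enumerate(genes) if g.upper().startswith("MT-")]
    kept = []
    for row in cells:
        total = sum(row)
        pct = 100.0 * sum(row[j] for j in mito) / total if total else 0.0
        if pct <= max_mito_pct:
            kept.append(row)

    # 4. Log-transform with log(1 + x).
    logged = [[math.log1p(v) for v in row] for row in kept]
    return logged, genes

# Hypothetical 4-cell x 4-gene count matrix; the thresholds are lowered
# so that the toy data exercises every filter.
names = ["GENE_A", "GENE_B", "MT-CO1", "GENE_C"]
raw = [
    [10, 5, 0, 0],   # kept
    [8, 0, 0, 0],    # removed: only one gene detected
    [20, 30, 1, 0],  # kept: about 2% mitochondrial counts
    [5, 5, 10, 0],   # removed: 50% mitochondrial counts
]
logged, genes = qc_filter(raw, names, min_genes=2, min_cells=2)
```

With the thresholds stated in the text (3 cells, 200 genes, 5%), the same function would be called with its default arguments on a real count matrix.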
As shown in Fig. [148]8A, we present UMAP plots of the original data
alongside datasets imputed by scMASKGAN, DCA, and DeepImpute. The
results indicate that scMASKGAN enhances cell connectivity and
effectively reduces the technical noise present in the original data.
In contrast, although the DCA-imputed data exhibit a more dispersed
clustering that still delineates cell population differences, the
excessive fragmentation may compromise biological relevance. Thus,
following batch effect correction and integration, scMASKGAN better
reflects the underlying cellular states and connectivity, offering
superior imputation performance.
Fig. 8.
The results of batch data imputation and differential gene expression
analysis are illustrated in three panels. Panel A displays UMAP plots
comparing batch-integrated data from three imputation
algorithms (scMASKGAN, DCA, and DeepImpute) to the original dataset.
Panel B shows the expression levels of five marker genes and the
interference gene (NTRK2) in both the original and scMASKGAN-imputed
data, with red indicating original data and blue indicating imputed
data. Panel C presents scatter plots of the corresponding gene
expressions, with the color bar on the right representing gene
expression levels
Additionally, we selected one low-risk neuroblastoma ([151]GSM5768752)
sample to compare the expression of five marker genes and a control
gene (NTRK2, which is highly expressed in high-risk neuroblastoma)
[[152]47] between the scMASKGAN-imputed and original datasets. As shown
in Fig. [153]8B, scMASKGAN-imputed data exhibit minimal
changes in high-expression genes while notably restoring low-expression
genes. Furthermore, Fig. [154]8C demonstrates that,
except for NTRK2, the expression levels of genes with initially low
expression increase, whereas high-expression genes remain largely
unchanged. These observations confirm that scMASKGAN achieves high
accuracy in data recovery without indiscriminately altering gene
expression.
Gene expression analysis
To further validate the effectiveness of the scMASKGAN imputation, we
compared the expression profiles of commonly highly variable genes
across different cell types between the imputed and original data under
various dropout rates, as illustrated in Fig. [155]9. Specifically, we
present the comparison for the Human brain dataset (81.4% dropout)
and observe that the data structure remains intact, with no
significant alterations in the expression of highly variable genes. In
addition, we analyzed datasets with other dropout rates in
Supplementary Figure 10, where scMASKGAN consistently preserved the
complete biological signal. These results further demonstrate the
accuracy and reliability of the scMASKGAN imputation.
Fig. 9.
Comparison of common highly variable genes between imputed data and
original data across different cell types
Discussion
Robustness across diverse datasets and platforms
To comprehensively evaluate the strengths and weaknesses of scMASKGAN,
we compared it with 12 existing imputation methods across various
scRNA-seq datasets. In terms of the CV, scMASKGAN’s performance is
surpassed only by AutoImpute and scIGANs (Fig. [158]3A). For EMD and JS
distance metrics (Fig. [159]3D), most methods exhibit robust
performance, although scImpute and DeepImpute were excluded from some
comparisons due to their inability to impute certain datasets; the
suboptimal performance of scIGANs and AutoImpute relative to other
methods has been discussed in Sect. [160]3.4. Moreover, clustering
assessments (as detailed in Tables [161]3 and [162]4, and Fig. [163]4)
indicate that scMASKGAN, alongside SAVER, yields the most effective
segregation of cellular heterogeneity. Pearson correlation analysis
further confirms that scMASKGAN’s imputed data maintains the highest
correlation with the original data, and the Z-score standardized
distribution (Supplementary Figure 9) demonstrates that its imputation
results exhibit minimal outliers and remarkable stability across
varying dropout rates. Collectively, these findings underscore the
superior efficacy of scMASKGAN as an imputation method.
The results also indicate that scMASKGAN performed exceptionally well
on the ERCC spike-in dataset, Human brain dataset, and Time-course
dataset, demonstrating its advantages in handling highly sparse,
small-scale, and temporal sequencing data. This superior performance
may be attributed to the adversarial training mechanism, which enables
scMASKGAN to effectively learn the latent data distribution and
generate imputed values that closely resemble the true expression
patterns. For the Human brain dataset, scMASKGAN successfully recovered
gene expression patterns while preserving cellular heterogeneity. In
the Time-course dataset, it maintained the dynamic characteristics
inherent in temporal sequencing data, thereby avoiding the loss of
time-related information often encountered in traditional imputation
methods.
On the large-scale Mouse ESC dataset, scMASKGAN demonstrated moderate
performance, without consistently outperforming all comparative
methods. As evidenced by Supplementary Figure 4, although scMASKGAN
effectively delineated three distinct clusters, the corresponding
clustering metrics remained only moderate. This may be attributable to
alterations in the UMAP embedding distribution following cluster
separation, potentially introducing inaccuracies in the clustering
indices. Nevertheless, with respect to other evaluation criteria,
scMASKGAN consistently produced biologically meaningful imputed values
and preserved the overall structural integrity of the dataset.
For the sc_10X and sc_Dropseq datasets, scMASKGAN exhibited robust
performance, underscoring its strong generalization capabilities in
high-throughput sequencing scenarios. These datasets, characterized by
substantial technical noise and sparsity, benefit from scMASKGAN’s
integrated attention mechanism and adversarial training framework,
which collectively enable the effective discrimination of biological
signals from technical artifacts, thereby yielding highly accurate
imputed values. Moreover, in the neuroblastoma dataset with
exceptionally high dropout rates, scMASKGAN demonstrated outstanding
imputation performance, further confirming its capacity to enhance data
quality under extreme conditions.
The performance on the sc_CELseq2 dataset was suboptimal, likely due
to its unique characteristics and high levels of sequencing noise. Our
imputation results did not fully recover the expression profiles of
several key genes, indicating that scMASKGAN could benefit from further
refinement to better accommodate various sequencing platforms.
Moreover, the pronounced technical noise in sc_CELseq2 appears to
destabilize the adversarial training process of the GAN-based model,
resulting in incomplete gene expression recovery.
Enhanced interpretability and downstream analysis
The downstream analyses presented in our study provide comprehensive
evidence of the efficacy and robustness of the scMASKGAN imputation
method. The gene-gene correlation analysis demonstrates that scMASKGAN
accurately restores biologically meaningful relationships between key
cell cycle genes, such as the significant correlations observed among
Mcm6, Nasp, Mcm2, and Pcna, as well as between Cdk1, Cenpl, and Top2a.
Moreover, the emergence of novel correlations that align with known
co-expression patterns further substantiates the ability of scMASKGAN
to recover latent gene-gene interactions while preserving the intrinsic
data structure.
The temporal analysis reinforces these findings by illustrating
scMASKGAN’s capacity to capture dynamic cellular trajectories. As
evidenced by the UMAP plots in Fig. [164]7B, the imputed data exhibit
smoother transitions and enhanced connectivity compared to the original
noisy data, thereby facilitating the reconstruction of continuous
cellular differentiation pathways. Additionally, the temporal
expression profiles of marker genes reveal that dropout-affected genes,
such as FCER1A, regain a gradual expression increase over time, which
is critical for deciphering time-related biological processes. The
corresponding GO enrichment analysis further confirms that these genes
are predominantly involved in immune functions, underscoring the
biological relevance of the imputation.
In the context of batch data imputation (Sect. [165]4.3 and Fig.
[166]8), scMASKGAN not only attenuates technical noise but also
effectively integrates data from multiple samples. Compared to other
methods, UMAP visualizations (Fig. [167]8A) reveal that most algorithms
lead to significant fragmentation in gene expression profiles as a
result of imputation. Furthermore, differential expression analyses of
key marker genes (Fig. [168]8B and C) reveal that scMASKGAN not only
preserves the expression profiles of highly expressed genes and
accurately recovers missing signals, but also avoids introducing
erroneous biological artifacts, thereby maintaining the overall
integrity of the biological signal.
Finally, the gene expression analysis across datasets with varying
dropout rates (Fig. [169]9 and Supplementary Figure 10) further
corroborates the robustness of scMASKGAN. The imputed data retain the
expression profiles of highly variable genes, suggesting that the
method effectively preserves critical biological signals without
introducing significant distortions.
Conclusion
In summary, these results collectively demonstrate that scMASKGAN is a
robust and versatile imputation method. It effectively recovers both
global and local gene expression patterns across diverse scRNA-seq
datasets, ranging from small-scale and highly sparse data to large
heterogeneous datasets and time-course studies. While certain
platform-specific limitations remain, the overall performance of
scMASKGAN underscores its potential as a powerful tool for improving
data quality in downstream single-cell analyses. Future research will
focus on further optimizing the adversarial training strategy and
tailoring the model to the specific characteristics of different
sequencing platforms.
Supplementary Information
[170]Supplementary material 1^ (2.3MB, png)
[171]Supplementary material 2^ (2.4MB, png)
[172]Supplementary material 3^ (70.7KB, pdf)
[173]Supplementary material 4^ (1.4MB, pdf)
[174]Supplementary material 5^ (58.4KB, pdf)
[175]Supplementary material 6^ (69.8KB, pdf)
[176]Supplementary material 7^ (210.5KB, pdf)
[177]Supplementary material 8^ (174.3KB, pdf)
[178]Supplementary material 9^ (1.3MB, pdf)
[179]Supplementary material 10^ (201.9KB, pdf)
[180]Supplementary material 11^ (620B, csv)
[181]Supplementary material 12^ (16.8KB, xlsx)
Acknowledgements