Abstract
Background
The advent of single-cell RNA sequencing (scRNA-seq) has provided
unprecedented insights into cancer cellular diversity, enabling a
comprehensive understanding of cancer at the single-cell level.
However, identifying cancer cells remains challenging due to gene
expression variability caused by tumor or tissue heterogeneity, which
negatively impacts generalization and robustness.
Results
We propose CanCellCap, a multi-domain learning framework that
identifies cancer cells in scRNA-seq data across tissues, cancer
types, and sequencing platforms. Integrating domain adversarial learning and
Mixture of Experts, CanCellCap is able to simultaneously extract common
and specific patterns in gene expression profiles across different
tissues for cancer or normal cells. Moreover, the
masking-reconstruction strategy enables CanCellCap to cope with
scRNA-seq data from different sequencing platforms. CanCellCap achieves
0.977 average accuracy in cancer cell identification across 13 tissue
types, 23 cancer types, and 7 sequencing platforms. It outperforms five
state-of-the-art methods on 33 benchmark datasets. Notably, CanCellCap
maintains high performance on unseen cancer types, tissue types, and
even across species, highlighting its effectiveness in challenging
scenarios. It also excels in spatial transcriptomics by accurately
identifying cancer spots. Furthermore, CanCellCap demonstrates strong
computational efficiency, completing inference on 100,000 cells in a
few minutes. In addition, interpretability analyses reveal critical
biomarkers and pathways, offering valuable biological insights.
Conclusions
CanCellCap provides a robust and accurate framework for identifying
cancer cells across diverse platforms, tissue types, and cancer types.
Its strong generalization to unseen cancers, tissues, and even species,
combined with its adaptability to spatial transcriptomics data,
underscores its versatility for both research and clinical
applications.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12915-025-02337-1.
Keywords: ScRNA-seq, Cancer cell identification, Multi-domain learning,
Domain adversarial learning, Mixture of experts
Background
Single-cell RNA sequencing (scRNA-seq) provides critical insights into
cancer biology at single-cell resolution [[36]1]. Accurate
identification of cancer cells advances the understanding of tumor
heterogeneity and evolution, thereby enabling personalized therapies
and more effective treatment strategies [[37]2].
Traditional approaches to separate cancer cells from normal cells often
rely on detecting copy number variations (CNVs) from single-cell gene
expression data by comparing target cells to reference cells
[[38]3–[39]5]. For instance, InferCNV [[40]4] requires a reference set
of normal cells, while CopyKAT [[41]3] requires the presence of both
cancer and normal cells [[42]6]. Additionally, CNVs are not exclusive
to cancer cells, as normal cells can also exhibit copy number
alterations [[43]7, [44]8], which decrease the reliability of these
methods.
Recent advances use machine learning to capture gene expression
patterns of cancer cells, removing the dependency on reference cells
and CNV inference. ikarus [[45]9] and PreCanCell [[46]10] distinguish
cancer cells based on cancer- and normal-specific genes extracted from
multiple training datasets. However, their accuracy depends heavily on
the selected genes, making them prone to overfitting and poor
generalization to unseen data. CaSee [[47]6] and Cancer-Finder
[[48]11] therefore attempt to improve model generalization. CaSee
employs transfer learning with bulk RNA sequencing (RNA-seq) data to
pre-train capsule network classifiers, but its effectiveness is limited
by different distributions between bulk and single-cell RNA-seq data.
In contrast, Cancer-Finder learns a universal representation of cancer
cells but ignores tissue-specific expression patterns, which might lead
to false positives.
Despite these advancements, identifying cancer cells remains a
significant challenge due to the high dimensionality and inherent
heterogeneity of scRNA-seq data [[49]12]. Cancer cell transcriptional
profiles are highly influenced by their tissues of origin, with
distinct expression patterns observed across different tissues
[[50]13–[51]15]. Moreover, sequencing platform effects, chiefly
dropout events (in which lowly expressed genes go undetected),
introduce noise and variability. These confounding factors complicate
cellular state identification.
A single-cell gene expression profile for a cancer cell can be
considered as the coupling of tissue-common cancer expression patterns
(tissue-common feature), tissue-specific expression patterns
(tissue-specific feature), and sequencing platform effects.
Disentangling these three factors enables the model to generalize to
different tissues and sequencing platforms, enhancing the robustness of
cancer cell identification. Therefore, the model needs to
simultaneously learn tissue-common and tissue-specific features while
eliminating sequencing platform effects. As each tissue exhibits unique
gene expression patterns, they can be treated as distinct domains.
Therefore, in our study, a multi-domain learning framework is designed
to facilitate domain disentanglement, allowing models to effectively
extract both domain-common and domain-specific features, that is,
tissue-common and tissue-specific expression patterns in our case.
In this study, we propose a novel multi-domain learning model,
CanCellCap, to separate cancer cells from normal cells, free from the
effects of tissue type, cancer type, and dropout rates from sequencing
platforms. Considering the gene expression profile from scRNA-seq as
the coupling of tissue-common expression patterns, tissue-specific
expression patterns, and sequencing platforms, CanCellCap imports a
feature masking-reconstruction training strategy and integrates domain
adversarial learning and Mixture of Experts (MoE) in the model
structure to improve generalization and accuracy in cancer cell
identification. The domain adversarial learning module captures
tissue-common features, and the MoE module dynamically selects the most
relevant experts for each tissue to capture tissue-specific features.
Additionally, dropout events, a major platform-specific confounder,
are simulated by randomly masking expression values and reconstructed
during training, thereby mitigating the impact of sequencing
platforms. In comprehensive experiments, CanCellCap
outperforms five state-of-the-art methods across 33 datasets,
accurately identifying cancer cells across diverse cancer types, tissue
types, sequencing platforms, and even in unseen cancer types, tissue
types, sequencing platforms, and across species. It also exhibits
excellent computational efficiency, completing inference on large-scale
datasets with up to 100,000 cells in significantly less time than other
models. Additionally, CanCellCap generalizes well to spatial
transcriptomics data and offers biologically interpretable predictions
by highlighting key genes and pathways with potential relevance to
cancer diagnosis and therapy.
Results
Dataset descriptions and experimental design
We collected 74 human cancer single-cell datasets from Tumor Immune
Single-cell Hub (TISCH) [[52]16], which cover 14 different tissues, 23
cancer types, and 7 sequencing platforms. After preprocessing, a total
of 328,230 cells from 13 tissue types were obtained. For each tissue,
the samples were randomly split into a Training Dataset (80%) and a
Validation Dataset (20%) for model training and performance validation,
respectively. The preprocessing steps follow the approach outlined in
Cancer-Finder [[53]11].
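For illustration, the per-tissue 80/20 split can be sketched as below. This is a minimal sketch assuming a fixed seed and cell-level indices (the paper splits samples within each tissue; the function name is our own):

```python
import numpy as np

def split_per_tissue(tissues, train_frac=0.8, seed=0):
    """Randomly split indices into train/validation within each tissue type."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for t in np.unique(tissues):
        idx = np.flatnonzero(tissues == t)   # all cells of tissue t
        rng.shuffle(idx)
        cut = int(len(idx) * train_frac)     # 80% for training
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(val_idx)
```

Splitting within each tissue keeps all 13 tissue types represented in both the Training and Validation Datasets.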
To evaluate CanCellCap comprehensively, experiments were designed to
cover multiple performance aspects. First, CanCellCap was benchmarked
against five state-of-the-art (SOTA) methods using 33 testing datasets
spanning 15 tissue types to assess its overall performance. Second, its
generalization ability was validated on four unseen cancer types, two
unseen tissue types, and one unseen species, and its robustness was
evaluated across six sequencing platforms. Third, we explored
CanCellCap’s potential applications in the following scenarios: its use
of tissue-specific features was tested through cancer origin
identification, and its applicability in spatial transcriptomics was
demonstrated by identifying cancer spots. Finally, we conducted model
analysis, including cancer/tissue-specific evaluation, ablation
studies, interpretability analyses, and computational complexity
evaluation.
The datasets used in the experiments are summarized below, with
detailed sources provided in an additional file (see Additional file
[54]1: Tables 1–5).
1. Training Dataset and Validation Dataset are sourced from TISCH1 and
include 328,230 cells across 13 tissue types: Blood, Bone, Brain,
Breast, Colorectum, Eye, Head and Neck, Liver, Lung, Nervous,
Pancreas, Pelvic Cavity, and Skin. These data do not overlap with
any of the subsequent datasets.
2. Testing Dataset comprises 33 human testing datasets spanning 15
tissue types, 25 cancer types, and 6 sequencing platforms, totaling
834,788 cells, used for external independent testing.
3. Unseen Cancer/Tissue Type and Species Dataset consists of 113,393
cells from four unseen cancer types, 89,922 cells from two unseen
human tissue types, and 63,734 cells from an unseen species
(mouse). “Unseen” means a tissue type, cancer type, or species
absent from the Training Dataset.
4. Multiple-Sequencing Platforms Dataset includes 5600 cells from
Smart-seq2, 146,387 cells from 10X Genomics, 10,502 cells from
Microwell-seq, 3134 cells from C1 Fluidigm, 21,011 cells from
Drop-seq, and 86,486 cells from GEXSCOPE™. This dataset was used
to validate CanCellCap’s generalization across different sequencing
platforms, especially considering that the C1 Fluidigm, Drop-seq,
and GEXSCOPE™ platforms are not present in the training set.
5. TISCH2 Testing Dataset was constructed from TISCH2 [[56]17] and was
employed to further evaluate CanCellCap for identifying cancer cell
origins. This dataset consists of 122,269 cells from six tissues:
blood [[57]18], brain [[58]19], bone [[59]20], lung [[60]21],
pancreas [[61]22], and eye [[62]23].
6. Simulated Dropout Rate Dataset was generated from the Testing
Dataset by randomly masking data with different dropout rates
(0–98%). This dataset was used to assess CanCellCap’s robustness to
the varying dropout rates introduced by different sequencing
platforms.
7. Spatial Transcriptomics Dataset was obtained from human prostate
cancer samples [[63]24]. It was used to evaluate CanCellCap’s
applicability to spatial transcriptomics analysis.
These datasets and annotations were sourced from the testing sets of
Cancer-Finder [[64]11], TISCH2 [[65]17], the Gene Expression Omnibus
[[66]25], and the Single Cell Portal [[67]26]. Labels distinguishing
cancer and normal cells were curated based on provided annotations to
ensure consistency.
The performance of CanCellCap was compared with five SOTA methods:
CopyKAT [[68]3] and SCEVAN [[69]5], which rely on CNV detection;
ikarus [[70]9] and PreCanCell [[71]10], which leverage specific gene
expression patterns; and Cancer-Finder [[72]11], which focuses on
universal representation learning. The evaluation metrics
include accuracy, F1 score (F1), recall, precision, and area under the
receiver operating characteristic curve (AUROC). Note that since SCEVAN
and CopyKAT do not provide probability scores for cancer cell
classification, AUROC is not available for these methods. All
evaluation metrics were calculated using standard functions from the
scikit-learn library. Additionally, to ensure a statistically rigorous
comparison, we employed the Bonferroni-Dunn test [[73]27] and
visualized the results using the Critical Difference (CD) [[74]28]
diagrams with a 95% confidence level. The CD diagrams allow for
statistical verification of whether the observed performance
differences between methods are significant, following the
implementation outlined in [[75]29]. Detailed performance scores for
all datasets are provided in an additional file (see Additional file
[76]2: Tables 1–8).
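For reference, the five metrics can be computed from true labels and predicted cancer probabilities with scikit-learn's standard functions; the helper name and the 0.5 decision threshold here are our own illustrative choices:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the five reported metrics; y_prob holds cancer probabilities."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        # AUROC needs probabilities, which is why it is 'NA' for SCEVAN/CopyKAT
        "auroc": roc_auc_score(y_true, y_prob),
    }
```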
Model training process
The CanCellCap model was trained with a fixed random seed (0) for 50
epochs, using a batch size of 128 and the stochastic gradient descent
optimizer (learning rate = 1.3 × 10⁻³, momentum = 0.9, weight decay = 1.5 × 10⁻⁴
). The training losses and the corresponding validation accuracy
throughout the training epochs are shown in Fig. [77]1A and B. The five
losses, Total Loss, Identification Loss, Gate Loss, Domain Adversarial
Loss, and Reconstruction Loss, converged after 40 epochs. On the
Validation Dataset, which includes 13 tissue types, CanCellCap achieves
an average accuracy of 0.9777, and 12 of the 13 tissue types exceed
0.9500 accuracy in cancer cell identification, demonstrating its
reliability across various biological contexts. The relatively lower
performance on bone tissue, with an accuracy of 0.8630, may be due to
the high similarity between multiple myeloma cells and normal plasma
cells, which makes them harder to distinguish [[78]30].
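The reported optimizer corresponds to the standard SGD update with momentum and L2 weight decay. A minimal NumPy sketch of a single update step using the paper's hyperparameters (PyTorch-style momentum accumulation is an assumption on our part):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity,
                      lr=1.3e-3, momentum=0.9, weight_decay=1.5e-4):
    """One SGD step: L2 weight decay is folded into the gradient, which is
    accumulated into a momentum buffer before the parameter update."""
    g = grad + weight_decay * w
    velocity = momentum * velocity + g
    w = w - lr * velocity
    return w, velocity
```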
Fig. 1.
Performance evaluation of CanCellCap during the training stage. A Loss
curves of the five losses during the training stage. B Accuracy for
different tissues within the Validation Dataset during the training
stage. C t-SNE visualization of the input expression features X from
the Validation Dataset, colored by cell status. D t-SNE visualization
of the final cell embeddings v_cat from the Validation Dataset,
colored by cell status
The t-distributed stochastic neighbor embedding (t-SNE) visualizations
of both the input expression features X and the final cell embeddings
v_cat extracted by CanCellCap from the Validation Dataset are shown in
Fig. [81]1C and D, respectively, where dots are colored by cancer or
normal cell status. In the t-SNE computed directly from gene
expression, cancer and normal cells are mixed. In contrast, in the
t-SNE on v_cat, cancer and normal cells exhibit a clearer separation.
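The kind of comparison shown in Fig. 1C and D can be reproduced in miniature with scikit-learn's t-SNE; the random matrix below is only a stand-in for the expression features (real inputs have thousands of genes per cell):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 50))    # 60 cells x 50 genes, toy stand-in data

# Project to 2-D for plotting; perplexity must be below the number of cells.
embedding = TSNE(n_components=2, perplexity=10,
                 random_state=0).fit_transform(X_toy)
```

Running the same projection on the raw features and on the learned embeddings, then coloring points by label, reproduces the panels' side-by-side comparison.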
Experiments for performance from various projects
CanCellCap was compared with five SOTA methods on the Testing Dataset,
comprising 855,713 human cells from 33 datasets. As illustrated in the
boxplots (Fig. [82]2A), CanCellCap consistently outperformed the other
methods across most evaluation metrics, including accuracy, F1,
precision, and AUROC. Notably, the interquartile ranges in the boxplots
are smaller for CanCellCap across most metrics, indicating more stable
performance across datasets. This observation is further supported by
the CD diagrams shown in Fig. [83]2B. In the CD diagrams, methods
connected by a line are considered statistically similar, while methods
not connected show significant performance differences. The diagrams
demonstrate that CanCellCap ranks significantly higher than other
methods in accuracy, F1, and precision, with statistical significance.
Recall is the only metric where CanCellCap performs comparably to
Cancer-Finder, both achieving the highest performance. This is likely
due to the MoE module, which enables tissue-specific discrimination. In
tissues with limited training data, experts tend to adopt conservative
decision boundaries, reducing recall in ambiguous cases while improving
generalization.
Fig. 2.
Performance comparison of CanCellCap and five SOTA methods across 33
testing datasets. A Boxplots illustrate the distribution of five
evaluation metrics across all datasets. B CD diagrams display the
average rankings of all methods based on the five metrics. Note that
since SCEVAN and CopyKAT do not provide probability scores for cancer
cell classification, AUROC is not applicable to these methods and is
thus marked as 'NA'
In addition to the overall performance evaluation, we further assessed
the robustness of CanCellCap across different tissue types within the
Testing Dataset. Specifically, we compared CanCellCap with five SOTA
methods across 15 human tissue types. As shown in Fig. [86]3,
CanCellCap demonstrates superior performance across the majority of
tissues. In contrast, methods such as PreCanCell, SCEVAN, CopyKAT, and
ikarus perform well in certain tissues but exhibit substantial
performance degradation in others, indicating limited robustness.
Although Cancer-Finder shows greater cross-tissue stability than the
aforementioned methods, its overall performance remains lower than that
of CanCellCap.
Fig. 3.
Performance comparison of CanCellCap with five SOTA methods across
tissues. Five performance metrics comparison of CanCellCap with five
SOTA methods in cancer cell identification across 15 human tissue types
from the Testing Dataset. Recall, F1, precision, and AUROC are marked
as 'NA' for Head and Neck, which contains only normal cells. Note that
since SCEVAN and CopyKAT do not provide probability scores for cancer
cell classification, AUROC is not applicable to these methods and is
thus marked as 'NA'
Overall, CanCellCap outperforms all other methods in both overall
performance and robustness across diverse tissue types in identifying
cancer cells.
Experiments for generalization for different sequencing platforms
To assess the robustness and generalization performance across
different sequencing platforms, CanCellCap was applied to scRNA-seq
datasets from six different sequencing platforms.
As shown in Fig. [89]4, CanCellCap outperforms all five SOTA methods
across the six sequencing platforms, achieving the highest average
accuracy of 0.9229. Even on unseen sequencing platforms such as C1
Fluidigm, Drop-seq, and GEXSCOPE™, CanCellCap maintains high accuracy,
reaching 0.956, 0.9121, and 0.9393, respectively. In contrast,
Cancer-Finder shows
more unstable performance, particularly struggling on platforms like
10X Genomics, where it achieves a relatively low accuracy of 0.8539.
Overall, CanCellCap exhibits superior adaptability on the datasets from
the six sequencing platforms.
Fig. 4.
Performance comparison of CanCellCap with five SOTA methods across
platforms. Five performance metrics comparison of CanCellCap with five
SOTA methods in cancer cell identification across six different
sequencing platforms. Note that since SCEVAN and CopyKAT do not provide
probability scores for cancer cell classification, AUROC is not
applicable to these methods and is thus marked as 'NA'
Experiment for robustness on unseen cancer/tissue types and species
CanCellCap was applied to datasets from four unseen cancer types
(Cervical Cancer (CC), Hepatoblastoma (HB), Small Cell Lung Cancer
(SCLC), and Cutaneous Melanoma (CMM)), two unseen tissue types (soft
tissue and prostate), and one unseen species (mouse), none of which
are included in the Training Dataset.
As shown in Fig. [92]5A, CanCellCap outperforms all SOTA methods across
all metrics on the unseen cancer types. Specifically, CanCellCap
achieves the highest accuracy of 0.9297, recall of 0.9682, F1 score of
0.9287, precision of 0.9291, and AUROC of 0.9722 on the unseen cancer
types. In contrast, the second-best model, Cancer-Finder, achieves an
accuracy of 0.8705 with an AUROC of 0.8676. Figure [93]5B presents the
performance on the unseen tissue types, soft tissue and prostate. On
these tissues, CanCellCap maintains the highest accuracy of 0.7877, F1
score of 0.7847, precision of 0.8333, and AUROC of 0.8172. In
comparison, Cancer-Finder and ikarus struggle with lower AUROC, while
PreCanCell is competitive but still trails CanCellCap overall.
Fig. 5.
Performance comparison of CanCellCap with five SOTA methods on datasets
from unseen types. A Five metrics on datasets of unseen cancer type. B
Five metrics on datasets of unseen tissue type. C Five metrics on
datasets from mouse. Note that since SCEVAN and CopyKAT do not provide
probability scores for cancer cell classification, AUROC is not
applicable to these methods and is thus marked as 'NA'
To further evaluate generalization across species, the model was
applied to three mouse single-cell datasets. For compatibility with
mouse data, CanCellCap was adapted by mapping gene homologs between
human and mouse using the HomoloGene database [[96]31]. Since ikarus
and PreCanCell lack pipelines for mouse datasets, and given the
similarity in framework between Cancer-Finder and CanCellCap, a
compatible pipeline for Cancer-Finder was developed in this study to
facilitate direct comparison. Accordingly, evaluations focused on
Cancer-Finder, SCEVAN, and CopyKAT. As shown in Fig. [97]5C,
CanCellCap achieved the best overall performance with an accuracy of
0.9112, surpassing the second-best method, Cancer-Finder, which
attained 0.8278.
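The cross-species adaptation amounts to renaming mouse genes to their human homologs before inference. A minimal sketch, assuming a hypothetical `homolog_map` dictionary (e.g., built from HomoloGene) and a gene-to-expression mapping:

```python
def map_mouse_to_human(expr_by_gene, homolog_map):
    """Rename mouse genes to human homologs; genes without a mapping are dropped.

    expr_by_gene: {mouse_gene_symbol: expression_value}
    homolog_map:  {mouse_gene_symbol: human_gene_symbol} (hypothetical table)
    """
    return {homolog_map[g]: v for g, v in expr_by_gene.items() if g in homolog_map}
```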
These results underscore CanCellCap’s exceptional adaptability and its
ability to reliably identify cancer cells across unseen cancer types,
tissue types, and species.
Experiment to identify cancer cell origin
To validate the tissue-specific features captured by CanCellCap, this
experiment aims to identify cancer cell origins on the TISCH2 Testing
Dataset comprising 122,269 cells from 6 tissue origins. The ability to
classify cancer cell origins relies heavily on recognizing
tissue-specific gene expression patterns, which are unique to each
tissue type. This task involved classifying cells as normal or
cancerous and, for cancer cells, identifying their origin.
As shown in Fig. [98]6A, CanCellCap demonstrates strong overall
performance in identifying the origin of cancer cells, achieving an
average accuracy of 0.9521. Since the training set contains 14
categories while the test set only includes 7 categories, some cells
were misclassified into categories not present in the test set. Thus,
the strong performance highlights CanCellCap’s ability to effectively
capture and utilize tissue-specific gene expression patterns.
Fig. 6.
Application of CanCellCap on cancer origin identification and cancer
spot identification. A Confusion matrix of CanCellCap for cancer origin
identification on the TISCH2 Testing Dataset. B Comparison of cancer
spot identification in spatial transcriptomics, where each spot
represents gene expression profiles mapped to spatial coordinates
within the tissue: (i) pathologist annotations, (ii) CanCellCap
predictions, and (iii) confusion matrix for CanCellCap’s cancer spot
identification
Overall, these results demonstrate that CanCellCap excels not only in
distinguishing cancer cells from normal cells but also in leveraging
the extracted tissue-specific features to accurately identify the
cancer cell origin.
Spot-level cancer identification in spatial transcriptomics
Spatial transcriptomics provides critical spatial context, revealing
how cancer cells interact with their microenvironment—key information
for advancing cancer research. To assess CanCellCap’s applicability to
spatial transcriptomics data from cancer tissue sections, CanCellCap,
which was trained on scRNA-seq data, was directly applied to cancer
spot identification on a spatial transcriptomics dataset obtained from
human prostate cancer samples. CanCellCap’s predictions were compared
with pathologist annotations.
Figure [101]6B (i) shows the pathologist’s annotations, indicating
cancer and normal regions, while Fig. [102]6B (ii) presents
CanCellCap’s predictions. The spots predicted by CanCellCap closely
resemble those annotated by the pathologist. Figure [103]6B (iii)
presents the confusion matrix, which further highlights CanCellCap’s
performance in identifying cancer and normal spots. The matrix reveals
a high true positive rate for cancer spots, demonstrating CanCellCap’s
strong ability to detect cancer cells. With an accuracy of 0.7989 and a
recall of 0.9058, CanCellCap shows excellent performance in
distinguishing cancer cells from normal cells.
Overall, CanCellCap’s ability to identify cancer cell distribution at
spot-level granularity underscores its value in spatial
transcriptomics, offering nuanced insights into tumor architecture that
complement pathologist assessments. Notably, despite not being
specifically trained on spatial transcriptomics data, CanCellCap
achieves excellent results.
Model analysis
Evaluation of CanCellCap across diverse cancer and tissue types
To further evaluate the capability of CanCellCap, its cancer cell
identification performance was assessed across various cancer types and
tissue types, based on 33 test datasets. This analysis included both
rare cancer types and under-studied tissue types, highlighting its
robustness in challenging biological contexts.
As shown in Fig. [104]7, CanCellCap achieved over 0.9 accuracy across
most cancer types. Notably, even among rare cancers such as
adamantinomatous craniopharyngioma (ACP), synovial sarcoma (SS),
pleuropulmonary blastoma (PPB), kidney chromophobe (KICH), SCLC, and
gastroenteropancreatic neuroendocrine tumors (GEP-NETs), performance
remained high. Figure [105]7 also demonstrates similarly strong
performance across tissue types, with most exceeding 0.9 accuracy,
including under-studied tissues like Eye, Kidney, and Soft Tissue.
Despite the overall strong performance, certain cancer types such as
prostate adenocarcinoma (PRAD), belonging to prostate tissue, exhibited
relatively lower accuracies, falling below 0.8. This underperformance
is likely attributed to the absence of corresponding samples in the
training dataset. Nonetheless, as discussed in the “[106]Experiment for
robustness on unseen cancer/tissue types and species” section,
CanCellCap still outperforms other models under these conditions.
Fig. 7.
Performance of CanCellCap across cancer types and tissue types.
Abbreviations used in the figure are listed in the Abbreviations
section
Importantly, the ACP dataset was derived from clinical cancer samples
obtained through hospital collaboration. scRNA-seq data were generated
following standard protocols, and cancer cells were annotated
according to CTNNB1 mutation status. On this dataset, CanCellCap achieved an
accuracy of 0.9402, demonstrating its potential for clinical diagnostic
use.
These findings collectively emphasize the generalizability of
CanCellCap across diverse biological contexts, including rare
scenarios.
Ablation study
To further evaluate the contributions of various components of
CanCellCap, an ablation study was conducted, with key modules
systematically removed and their impact on model performance assessed
using the Testing Dataset.
As shown in Fig. [109]8A, compared to using both the domain
adversarial module and the MoE module, using only the domain
adversarial module achieves relatively high recall. However, it
suffers from a significant increase in false positives due to its lack
of specificity, and its performance declines notably on the other
metrics. This highlights the necessity of integrating both common and
specific gene expression patterns across different tissues to improve
generalization. The complete CanCellCap also slightly outperforms the
two ablated variants.
Fig. 8.
Ablation study of the components of CanCellCap. A Ablation study of the
components of CanCellCap on the Testing Dataset. B–F Performance
comparison of CanCellCap and CanCellCap without feature
masking-reconstruction strategy across different dropout rates in
simulated dropout datasets, evaluated using five performance metrics
Overall, the complete CanCellCap achieves the highest scores across all
metrics, confirming that each component plays a meaningful role in
enhancing the overall performance of CanCellCap.
To further investigate the impact of dropout rates in the dataset and
evaluate the effectiveness of the feature masking-reconstruction
strategy, CanCellCap was compared with a variant without this strategy
across different dropout rates in simulated dropout datasets. These
datasets are generated by randomly masking expression values with
varying dropout probabilities (0–98%), based on the Testing Dataset.
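Such simulated data can be produced by independent Bernoulli masking of the expression matrix; a minimal sketch (the exact masking scheme used to build the datasets is an assumption here):

```python
import numpy as np

def simulate_dropout(X, dropout_rate, seed=0):
    """Zero out each expression value independently with probability dropout_rate."""
    rng = np.random.default_rng(seed)
    keep = rng.random(X.shape) >= dropout_rate   # True where the value survives
    return X * keep
```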
As shown in Fig. [112]8B–F, CanCellCap consistently outperforms the
variant without feature masking-reconstruction at all dropout rates.
For instance, at a low dropout rate (5%), CanCellCap achieves an
accuracy of 0.9269, compared to 0.9146 for the variant without this
strategy. As the dropout rate increases, the performance gap widens,
with CanCellCap maintaining superior results in accuracy, recall, F1
score, and precision. Notably, even when the dropout probability
exceeds 50%, CanCellCap’s accuracy remains above 0.9. The observed
increase in recall at higher dropout rates may be due to CanCellCap’s
tendency to classify uncertain cases as cancer cells, indicating a
cautious bias toward identifying potential cancer cells in sparse data.
Gating weights vector analysis
To illustrate how CanCellCap effectively assigns cells from various
tissues to the most suitable combination of experts, the gating weight
vector
[MATH: g(X) :MATH]
is analyzed. Two visualizations were employed to examine the
distribution of gating weight vectors for each cell in the validation
set:
Heatmap of gating weight vectors
A heatmap was created to visualize the gating weight vectors assigned
to each cell. As shown in Fig. [113]9A, the gating weight vectors are
similar within the same tissue type while differing across different
tissues. This confirms that the gate network dynamically adapts to the
specific expression patterns of each tissue, ensuring that cells are
directed to the most appropriate experts.
Fig. 9.
Gating weight vectors analysis. A Heatmap of gating weight vectors
across tissue types. B t-SNE visualization of gating weight vectors
across tissue types, with different colors representing different
tissue types
t-SNE visualization of gating weight vectors
t-SNE was used to visualize the gating weight vectors assigned to each
cell. As shown in Fig. [116]9B, cells from the same tissue type tend to
cluster together, indicating that CanCellCap consistently assigns
similar weights to cells of the same tissue. This clustering behavior
illustrates that the gate network effectively allocates cells to the
most relevant tissue experts.
Together, these visualizations provide compelling evidence that the
gate network of the MoE module is capable of intelligently allocating
cells to the most relevant experts, thereby optimizing performance
across diverse tissue types.
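Conceptually, the gating behavior analyzed above is that of a dense MoE layer: a softmax gate produces the per-cell weight vector g(X), and the layer output is the weighted sum of expert outputs. A minimal NumPy sketch (shapes and parameter names are illustrative, not CanCellCap's actual implementation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gate_weights(h, W):
    """Gating weight vector g(X): softmax over per-expert scores.

    h: cell embeddings (cells x d); W: gate parameters (d x n_experts)."""
    return softmax(h @ W)

def moe_output(h, W, experts):
    """Weighted combination of expert outputs, as in a dense MoE layer."""
    g = gate_weights(h, W)                        # (cells, n_experts)
    outs = np.stack([e(h) for e in experts], 1)   # (cells, n_experts, out_dim)
    return (g[..., None] * outs).sum(axis=1)      # (cells, out_dim)
```

Cells from the same tissue yield similar g(X) vectors, which is what the heatmap and t-SNE views visualize.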
Key genes and pathways for cancer cell identification
To enhance our understanding of CanCellCap’s decision-making process
and its biological relevance, an interpretability experiment was
conducted using SHapley Additive exPlanations (SHAP) and GradientSHAP
[[117]32]. Based on these methods, the top 10 key genes for cancer cell
identification were identified, and Kyoto Encyclopedia of Genes and
Genomes (KEGG) [[118]33] pathway enrichment analysis was subsequently
performed on the overlapping genes between the top 300 genes ranked by
SHAP and those by GradientSHAP to further explore their biological
relevance.
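Given per-cell attribution matrices from the two explainers, the ranking-and-overlap step reduces to averaging absolute attributions per gene and intersecting the two top-k lists; a minimal sketch with illustrative function names:

```python
import numpy as np

def top_genes(attributions, gene_names, k=300):
    """Rank genes by mean absolute attribution (e.g., SHAP values), keep top k."""
    importance = np.abs(attributions).mean(axis=0)   # average over cells
    order = np.argsort(importance)[::-1]             # descending importance
    return [gene_names[i] for i in order[:k]]

def consensus_genes(shap_attr, grad_attr, gene_names, k=300):
    """Genes appearing in the top-k of both SHAP and GradientSHAP rankings."""
    return sorted(set(top_genes(shap_attr, gene_names, k)) &
                  set(top_genes(grad_attr, gene_names, k)))
```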
As shown in Fig. [119]10A and B, SHAP and GradientSHAP were employed to
rank the key genes affecting CanCellCap’s predictions, with higher
values indicating greater importance. Among them, several are
well-established biomarkers, including IFI27 [[120]34], CDKN2A
[[121]35], CXCR4 [[122]36], S100B [[123]37] and PTPRC [[124]38], which
are widely recognized in cancer research. Additionally, other
potentially relevant genes were identified, such as SPINT2 [[125]39],
SH3BGRL3 [[126]40], RGS1 [[127]41], and IGFBP7 [[128]42], suggesting
potential avenues for cancer biomarker discovery.
Fig. 10.
[129]Fig. 10
[130]Open in a new tab
Identification of key genes and pathway enrichment across tissues. A
Top 10 important genes ranked by SHAP in pelvic cavity and skin
tissues. B Top 10 important genes ranked by GradientSHAP in the same
tissues. C KEGG pathway enrichment based on the overlapping genes
between the top 300 ranked by SHAP and the top 300 ranked by
GradientSHAP
As shown in Fig. [131]10C, KEGG pathway enrichment analysis was
conducted on the overlapping genes between the top 300 genes ranked by
SHAP and those by GradientSHAP, to further validate the biological
relevance of CanCellCap’s predictions. The overlap of key genes from
both methods ensures that the identified pathways are consistently
represented across different interpretability techniques, providing
biologically meaningful and consistent insights. In both pelvic cavity
and skin tissues, the enriched pathways were strongly associated with
cancer-related processes. In pelvic cavity tissue, key enriched
pathways included “Pathways in cancer” and “Regulation of actin
cytoskeleton,” both of which play crucial roles in cancer development,
metastasis, and oncogenic signaling [[132]43, [133]44]. Similarly, in
skin tissue, enrichment of pathways such as “Pathways in cancer” and
the “MAPK signaling pathway” was observed, indicating the involvement
of oncogenic mechanisms relevant to skin cancer [[134]45, [135]46].
This interpretability analysis highlights that CanCellCap not only
accurately identifies cancer cells, but also uncovers critical genetic
markers and biologically meaningful pathways that may inform future
cancer research and therapeutic development. Furthermore, the provided
CanCellCap code supports additional interpretability methods, including
DeepLIFT [[136]47] and Integrated Gradients [[137]48]. The
interpretability results for other tissue types can be found in an
additional file (see Additional file [138]3: Fig. S1–S5).
Computational efficiency
To evaluate the computational efficiency of each method, a series of
runtime experiments were conducted on a workstation equipped with an
NVIDIA 2080Ti GPU (12 GB memory) and a 24-core 2.2 GHz CPU, using
datasets of varying sizes.
First, the runtime of CanCellCap was analyzed by comparing the total
runtime (including data loading and inference) to the inference time.
The experiments used scRNA-seq datasets stored in CSV format and loaded
via pandas. As shown in Fig. [139]11A, the results indicate that the
majority of time is spent on data loading, rather than on the actual
model inference. To alleviate this bottleneck, two solutions were
explored: (i) storing data in the binary Parquet format, which
significantly reduces data loading time; and (ii) leveraging Modin
[[140]49], a parallelized alternative to pandas that accelerates CSV
file reading, especially for large datasets. These solutions have been
implemented in our public pipeline.
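The two loading strategies might be wrapped as below. This is a hedged sketch, not the published pipeline code: `load_expression` is a hypothetical helper, and Modin is imported lazily because its startup overhead only pays off on large files.

```python
import pandas as pd

def load_expression(path, use_modin=False):
    """Load a cell-by-gene expression matrix from CSV or Parquet.

    use_modin=True swaps in Modin's parallel, pandas-compatible CSV
    reader; for small files its initialization overhead can outweigh
    the benefit, so it is opt-in.
    """
    reader = pd
    if use_modin:
        import modin.pandas as mpd  # drop-in parallel pandas
        reader = mpd
    if str(path).endswith(".parquet"):
        return reader.read_parquet(path)  # binary format: fastest to load
    return reader.read_csv(path, index_col=0)
```

Because both readers expose the same DataFrame API, downstream inference code does not need to know which format or engine was used.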
Fig. 11.
[141]Fig. 11
[142]Open in a new tab
Computational efficiency analysis. A Evaluation of computational
efficiency and total time (including both data loading and model
inference) across different data formats (Parquet, CSV) and data
loading methods (pandas, Modin) on datasets of varying sizes. B
Comparison of total time on datasets of varying sizes using the CSV
format. The lighter-shaded portion of each bar represents the model
inference time. “NA” denotes times that exceed one day
As demonstrated in Fig. [143]11A, CanCellCap (Parquet), which uses the
binary Parquet format, consistently reduced total runtime across
datasets of varying sizes, making it the preferred option for both
storage and inference. The benefit became increasingly pronounced with
larger datasets, achieving up to a 100× speedup over CanCellCap (CSV
with pandas) on datasets containing 100,000 cells. For pre-existing
large-scale CSV files (e.g., 100,000 cells), enabling Modin provided
additional improvements, with CanCellCap (CSV with Modin) achieving up
to a 10× speedup over the pandas-based implementation. However, for
smaller datasets, the initialization overhead of Modin may outweigh its
benefits.
Additionally, Fig. [144]11B compares the total runtime of models using
CSV-formatted input. CanCellCap demonstrated superior computational
efficiency on datasets exceeding 10,000 cells, with the use of Modin
significantly reducing runtime. For the largest dataset tested (100,000
cells), CanCellCap completed inference in just 396.46 s, which is
substantially faster than the second-best method, Cancer-Finder,
requiring 3,850.88 s. These results highlight CanCellCap's
computational efficiency, enabling it to outperform other methods on
large-scale inference tasks.
Discussion
In this study, we introduce CanCellCap, a multi-domain learning
framework for identifying cancer cells from scRNA-seq data across
diverse tissue types. CanCellCap integrates domain adversarial learning
with a Mixture of Experts module to effectively capture both
tissue-common and tissue-specific features. This dual capability
enhances its generalization across diverse tissue types. To mitigate
the effects of various dropout rates from different sequencing
platforms, CanCellCap employs a feature masking-reconstruction
strategy, enabling robust performance across different sequencing
platforms.
Extensive experiments demonstrate that CanCellCap outperforms five
state-of-the-art methods across 33 testing datasets, accurately
identifying cancer cells across diverse cancer types and tissue types,
including both seen and unseen cancer and tissue types, as well as
sequencing platforms. CanCellCap also successfully identifies cancer
spots in spatial transcriptomics data and cancer cells in mouse
datasets, despite being trained exclusively on human scRNA-seq data.
Ablation studies highlight the essential roles of its core components,
while analysis of gating weight vectors further supports CanCellCap’s
ability to effectively assign input data to the most relevant expert
models. Moreover, CanCellCap accurately identified cancer cells in
clinically derived samples, demonstrating its potential for clinical
translation. Interpretability analysis revealed key genes and pathways,
including both established cancer biomarkers and potential novel
therapeutic targets. These findings offer valuable insights for cancer
biology and biomarker discovery.
Despite its overall strong performance, CanCellCap shows slightly lower
recall, which may stem from the MoE module’s reliance on
tissue-specific experts. In tissues with limited training data, these
experts may adopt conservative decision boundaries. To address this, we
plan to enrich underrepresented tissue types in the training set and
explore adaptive training strategies to better balance recall and
generalization.
Additionally, the current training dataset is based solely on human
data, thereby limiting CanCellCap’s generalization capacity across
species. While early results show promise in identifying mouse cancer
cells, its applicability to non-human species remains constrained.
Expanding the training dataset by incorporating data from multiple
species will be a key priority in future work. This will enhance
CanCellCap’s ability to generalize across diverse biological contexts,
increasing its utility across a broader range of applications.
Another limitation lies in CanCellCap’s memory consumption. Although
its usage is moderate compared to other methods, there is still room
for optimization. Designing a lightweight version of the model would
help reduce both memory usage and runtime, thereby enabling wider
adoption in both research and clinical settings.
Interpretability analysis identified key genes driving CanCellCap’s
predictions, including both well-established cancer biomarkers and
potential novel candidates for diagnosis or therapy. However, only a
subset of these genes has been experimentally validated. To enhance the
clinical relevance of these findings, we have initiated collaborations
with hospitals to validate candidate biomarkers using real-world
clinical datasets.
Conclusions
CanCellCap is introduced as a robust and generalizable framework for
cancer cell identification from single-cell RNA sequencing (scRNA-seq)
data. Leveraging a multi-domain design and feature
masking-reconstruction strategies, CanCellCap achieves high accuracy
across a diverse array of tissues, cancer types, and sequencing
technologies. Importantly, CanCellCap demonstrates strong
generalization to entirely unseen cancer types, tissue types, and even
species, showcasing its adaptability in real-world and cross-species
scenarios. In addition, its biologically interpretable outputs enable
the identification of potential cancer biomarkers, supporting both
translational research and fundamental insights into cancer biology.
Future efforts will be directed toward extending its applicability to
broader biological and clinical contexts, including cross-species
generalization and enhanced computational efficiency for clinical
deployment. Collaborations with clinical researchers are ongoing to
explore CanCellCap’s utility in early cancer detection and personalized
therapeutic strategy development, ultimately contributing to the
advancement of cancer research and precision medicine.
Methods
Model framework
With a gene expression matrix as input, whose rows represent
individual cells from diverse human tissues and whose columns
represent gene expression levels, CanCellCap is composed of the
following four modules, as shown in Fig. [145]12.
1. A feature masking-reconstruction strategy: The expression matrix is
   randomly masked to simulate dropout during sequencing and
   reconstructed by a decoder. The reconstruction loss $L_{recon}$
   ensures the accurate recovery of dropout gene expressions.
2. A domain adversarial learning module: A gradient reversal layer
   (GRL) is introduced to implement domain adversarial learning by
   building a tissue discriminator and a tissue confuser to extract
   tissue-common features. The tissue discriminator predicts the
   tissue source of cells, while the tissue confuser learns to extract
   common gene expression patterns to confuse the discriminator. The
   domain adversarial loss $L_{ad}$ measures the cross-entropy between
   the true and predicted tissue labels of cells to guide the
   adversarial training.
3. A Mixture of Experts module: To capture tissue-specific features,
   CanCellCap employs a Mixture of Experts (MoE) module consisting of
   multiple expert networks and a gating network. The gating network,
   guided by the gate loss $L_{gate}$, dynamically selects the most
   appropriate experts for each tissue type, enabling CanCellCap to
   adapt to tissue-specific expression patterns.
4. An MLP cancer-cell classifier: The tissue-common and
   tissue-specific features are concatenated and input into a
   Multilayer Perceptron (MLP) network for the primary task of cancer
   cell identification. The loss $L_{ide}$ is the cross-entropy
   between the predicted label and the true label.
Fig. 12.
[146]Fig. 12
[147]Open in a new tab
The model structure of CanCellCap. It processes cell-by-gene
expression matrices from various tissues and simulates dropout effects
using random masking. CanCellCap employs domain adversarial learning
with a gradient reversal layer (GRL) to extract tissue-common features
$v_{com}$ and a MoE module to extract tissue-specific features
$v_{spe}$. These feature embeddings are combined for feature
reconstruction, which removes the effect of varying dropout rates
across sequencing platforms, and for cancer cell identification.
CanCellCap is trained using a comprehensive loss function that
enhances generalization and accuracy across diverse tissue types
The overall loss function is the integration of the four losses,
$L = L_{ide} + \alpha_1 L_{ad} + \alpha_2 L_{gate} + \alpha_3 L_{recon}$,
where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weights of the
corresponding loss components.
Data preprocessing
The gene expression is log-transformed and normalized by global
scaling per cell with Seurat [[148]50]. The gene intersection set
across all tissues is extracted by Eq. ([149]1), where $G_d$
represents the gene list in tissue $d$, and $N_t$ is the number of
tissues. The expressions of genes in $G_{selected}$ are extracted to
construct the input expression matrix
$X \in \mathbb{R}^{N_c \times N_g}$, where $N_c$ is the number of
cells and $N_g$ is the number of selected genes.

$$G_{selected} = \bigcap_{d=1}^{N_t} G_d \tag{1}$$
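Equation (1) amounts to a set intersection over the per-tissue gene lists. A minimal sketch, using made-up gene lists (sorting gives a stable column order for the matrix):

```python
from functools import reduce

def intersect_genes(gene_lists):
    """G_selected: genes shared by all N_t tissues (Eq. 1)."""
    return sorted(reduce(set.intersection, (set(g) for g in gene_lists)))

# Hypothetical per-tissue gene lists.
tissues = [["TP53", "CXCR4", "IFI27"],
           ["TP53", "IFI27", "S100B"],
           ["IFI27", "TP53", "PTPRC"]]
print(intersect_genes(tissues))  # → ['IFI27', 'TP53']
```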
Feature masking-reconstruction
A binary random masking matrix $M_{mask}$ is constructed according to
Eq. ([150]2), where $r(i,j)$ is a number randomly sampled from $[0,1]$
for the $i$-th cell and the $j$-th gene, and $p_{mask}$ is the masking
probability with a default value of 0.3. $M_{mask}(i,j)=0$ indicates
that gene $j$ is masked (set to 0) in cell $i$, while
$M_{mask}(i,j)=1$ means that the expression of gene $j$ is retained.

$$M_{mask}(i,j) = \begin{cases} 1, & \text{if } r(i,j) \ge p_{mask} \\ 0, & \text{if } r(i,j) < p_{mask} \end{cases} \tag{2}$$

The masked input matrix $\tilde{X}$ is obtained by element-wise
multiplication of the original matrix $X$ and the masking matrix
$M_{mask}$, as shown in Eq. ([151]3), where $\odot$ denotes
element-wise multiplication. $\tilde{X}$ simulates the dropouts in
sequencing.

$$\tilde{X} = X \odot M_{mask} \tag{3}$$
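Equations (2) and (3) can be sketched in NumPy as follows; this is an illustrative simulation of the masking step only, and `mask_expression` is a hypothetical helper rather than CanCellCap's training code.

```python
import numpy as np

def mask_expression(X, p_mask=0.3, rng=None):
    """Simulate dropouts: zero each entry of X with probability p_mask (Eqs. 2-3)."""
    rng = np.random.default_rng() if rng is None else rng
    r = rng.random(X.shape)                 # r(i, j) sampled uniformly from [0, 1]
    M = (r >= p_mask).astype(X.dtype)       # M_mask(i, j): 1 = keep, 0 = mask
    return X * M, M                         # X_tilde = X ⊙ M_mask
```

Passing an explicit `rng` makes the mask reproducible, which is useful when comparing reconstructions across runs.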
CanCellCap utilizes an encoder-decoder architecture to reconstruct
$X$ from the masked $\tilde{X}$. $\tilde{X}$ is mapped to a latent
embedding $v_i$ by an encoder, which is composed of the domain
adversarial module and the MoE module. The decoder, implemented as a
two-layer MLP, reconstructs the original gene expression matrix $X$
from $v_i$. The reconstruction loss is defined as Eq. ([152]4), where
$X_i$ denotes the original gene expression vector for the $i$-th
cell, and $\mathrm{decoder}(v_i)$ is the reconstructed vector from
the latent embedding $v_i$.

$$L_{recon} = \frac{1}{N_c} \sum_{i=1}^{N_c} ||X_i - \mathrm{decoder}(v_i)||^2 \tag{4}$$
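The loss in Eq. (4) sums the squared error over genes and averages over cells, which is easy to get backwards; a NumPy sketch with toy values:

```python
import numpy as np

def recon_loss(X, X_hat):
    """L_recon (Eq. 4): mean over cells of the squared L2 distance
    between each expression vector X_i and its reconstruction."""
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

X = np.array([[1.0, 2.0], [0.0, 3.0]])       # two cells, two genes
X_hat = np.array([[1.0, 1.0], [0.0, 3.0]])   # toy reconstruction
print(recon_loss(X, X_hat))  # → 0.5
```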
Domain adversarial learning for extracting tissue-common features
The domain adversarial learning module consists of a tissue
discriminator and a tissue confuser, which are trained in an
adversarial manner. The tissue confuser aims to learn common features
across tissues, while the tissue discriminator attempts to predict
tissue labels from these features. The adversarial training encourages
the tissue confuser to generate tissue-common features that make it
difficult for the discriminator to correctly identify tissue types.
The tissue confuser, denoted as $\phi_f$, is a two-layer MLP. It
processes the expression vector $\tilde{X}_i$ of cell $i$ and extracts
the tissue-common feature vector
$v_{com} = \phi_f(\tilde{X}_i, \theta_f)$, where $\theta_f$
represents the parameters of the tissue confuser. The tissue
discriminator, denoted as $\phi_{tc}$, is a single-layer MLP that
takes the tissue-common features $v_{com}$ as input and predicts the
tissue labels.
For adversarial training, a GRL is incorporated between the tissue
confuser and the tissue discriminator. The GRL enables adversarial
training by reversing the gradient during backpropagation. During the
forward pass, the GRL simply passes the input unchanged, as shown in
Eq. ([153]5).

$$\mathrm{GRL}(x) = x \tag{5}$$

During the backward pass, the GRL reverses the gradient direction,
that is, it replaces $\frac{\partial L}{\partial \theta_f}$ with
$-\frac{\partial L}{\partial \theta_f}$.
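Conceptually, the GRL is an identity map whose backward pass flips the gradient's sign. A framework-agnostic sketch of that behavior is below; in PyTorch this would normally be written as a custom `torch.autograd.Function`, so this class is only an illustration.

```python
class GradientReversal:
    """Identity in the forward pass (Eq. 5); negates gradients in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # optional scaling of the reversed gradient

    def forward(self, x):
        return x  # GRL(x) = x

    def backward(self, grad_output):
        # dL/dtheta_f becomes -dL/dtheta_f for everything upstream of the GRL.
        return -self.lam * grad_output

grl = GradientReversal()
print(grl.forward(3.0))    # → 3.0
print(grl.backward(0.25))  # → -0.25
```

Because only the sign of the gradient changes, the discriminator still trains normally while the confuser is pushed in the opposite direction, which is exactly the adversarial setup described above.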
The domain adversarial loss function $L_{ad}$ is defined in Eq.
([154]6), where $t_{i,d}$ is the true tissue label in one-hot
encoding for the $i$-th cell and $d$-th tissue type.

$$L_{ad} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log\left(\phi_{tc}(\mathrm{GRL}(\phi_f(\tilde{X}_i, \theta_f)), \theta_{tc})\right) \tag{6}$$
Utilizing the GRL layer, the tissue confuser aims to minimize this
loss, while the tissue discriminator attempts to maximize it, thus
creating the adversarial setup. Therefore, the parameters of the
tissue confuser $\hat{\theta}_f$ and the tissue discriminator
$\hat{\theta}_{tc}$ are updated through the following optimization
steps:

$$\hat{\theta}_f = \arg\min_{\theta_f} L_{ad}(\theta_f, \hat{\theta}_{tc}) \tag{7}$$

$$\hat{\theta}_{tc} = \arg\max_{\theta_{tc}} L_{ad}(\hat{\theta}_f, \theta_{tc}) \tag{8}$$
This adversarial training setup enables the module to capture
tissue-common features, encouraging CanCellCap to generalize across
various tissue types.
Mixture of experts for extracting tissue-specific features
The MoE module consists of $N_t$ expert networks and a gating
network, where $N_t$ corresponds to the number of tissue types. Each
expert network $E_d$ is responsible for learning specific features
for tissue $d$. The gating network $g(\tilde{X}_i)$ is a single-layer
MLP that takes $\tilde{X}_i$ as input and outputs a set of weights
for the expert networks. Specifically, the gating network assigns a
weight distribution
$g(\tilde{X}_i) = \left(g_1(\tilde{X}_i), g_2(\tilde{X}_i), \cdots, g_{N_t}(\tilde{X}_i)\right)$,
where each weight corresponds to the importance of a particular
expert network for the $i$-th cell. The tissue-specific features
$v_{spe}$ are a weighted sum of the expert network outputs:

$$v_{spe} = \sum_{d=1}^{N_t} g_d(\tilde{X}_i) E_d(\tilde{X}_i) \tag{9}$$
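Equation (9) mixes the expert outputs with the gate's weights. A minimal NumPy sketch with toy linear experts follows; in CanCellCap the experts and the gate are learned MLPs, so the callables and weights here are stand-ins.

```python
import numpy as np

def moe_features(x, experts, gate_weights):
    """v_spe = sum_d g_d(x) * E_d(x)  (Eq. 9).

    experts: list of callables, one per tissue type
    gate_weights: per-expert weights for this cell (should sum to 1)
    """
    outputs = np.stack([E(x) for E in experts])  # shape (N_t, feature_dim)
    return gate_weights @ outputs                # weighted sum over experts

x = np.array([1.0, 2.0])
experts = [lambda v: 2.0 * v, lambda v: v + 1.0]  # toy "expert networks"
g = np.array([0.75, 0.25])                        # toy gating weights
print(moe_features(x, experts, g))  # weighted sum: [2.0, 3.75]
```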
To ensure that the gating network assigns cells from different
tissues to their corresponding expert networks, a gating loss
$L_{gate}$ (Eq. ([155]10)) is minimized, where $t_{i,d}$ is the true
tissue label in one-hot encoding for the $i$-th cell and $d$-th
tissue type.

$$L_{gate} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log g_{i,d}(\tilde{X}_i) \tag{10}$$
During training, tissue labels guide the gating network to assign
cells to the appropriate experts, enabling CanCellCap to learn
tissue-specific gene expression patterns. During testing, tissue
labels are not required, and CanCellCap automatically routes each
cell to the proper experts to identify cancer cells.
An MLP for cancer cell identification
The tissue-common features $v_{com}$ and the tissue-specific features
$v_{spe}$ are concatenated as $v_{cat} = [v_{com} | v_{spe}]$.
$v_{cat}$ is input to a single-layer MLP classifier $\phi_{ci}$ to
distinguish cancer and normal cells. The cancer cell identification
loss $L_{ide}$, defined in Eq. ([156]11), quantifies the
cross-entropy between the predicted output
$\phi_{ci}(v_{cat}, \theta_{ci})$ and the true label $y_i$ for the
$i$-th cell.

$$L_{ide} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \left( y_i \log \phi_{ci}(v_{cat}, \theta_{ci}) + (1 - y_i) \log(1 - \phi_{ci}(v_{cat}, \theta_{ci})) \right) \tag{11}$$
Loss function
CanCellCap is optimized to minimize the overall loss in Eq.
([157]12), which effectively distinguishes cancer cells while
maintaining robustness against the effects of different sequencing
platforms and variability across tissue types. $\alpha_1$,
$\alpha_2$, and $\alpha_3$ are the weights of the loss components,
with default values of 0.2, 0.3, and 0.1.

$$L = -\frac{\alpha_1}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log\left(\phi_{tc}(\mathrm{GRL}(\phi_f(\tilde{X}_i, \theta_f)), \theta_{tc})\right) - \frac{\alpha_2}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log g_{i,d}(\tilde{X}_i) + \frac{\alpha_3}{N_c} \sum_{i=1}^{N_c} ||X_i - \mathrm{decoder}(v_{cat})||^2 - \frac{1}{N_c} \sum_{i=1}^{N_c} \left( y_i \log \phi_{ci}(v_{cat}, \theta_{ci}) + (1 - y_i) \log(1 - \phi_{ci}(v_{cat}, \theta_{ci})) \right) \tag{12}$$
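Once the four component losses are computed, Eq. (12) reduces to a weighted sum; a sketch with the paper's default weights (the scalar inputs here are placeholder values):

```python
def total_loss(l_ide, l_ad, l_gate, l_recon, a1=0.2, a2=0.3, a3=0.1):
    """Overall objective L = L_ide + a1*L_ad + a2*L_gate + a3*L_recon (Eq. 12).

    Defaults a1=0.2, a2=0.3, a3=0.1 follow the paper.
    """
    return l_ide + a1 * l_ad + a2 * l_gate + a3 * l_recon

print(total_loss(0.5, 1.0, 1.0, 2.0))
```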
Supplementary Information
[158]12915_2025_2337_MOESM1_ESM.xlsx^ (30.2KB, xlsx)
Additional file 1: Table 1. Composition and characteristics of human
single-cell RNA-seq datasets for model testing. Table 2. Composition
and characteristics of mouse single-cell RNA-seq datasets used for
cross-species model testing. Table 3. Composition and characteristics
of human spatial transcriptomics datasets used for model testing. Table
4. Summary of human single-cell RNA-seq training datasets. Table 5.
Detailed metadata of human single-cell RNA-seq training datasets
[159]12915_2025_2337_MOESM2_ESM.xlsx^ (51.7KB, xlsx)
Additional file 2: Table 1. Performance comparison of CanCellCap and
five SOTA methods across 33 testing datasets. Table 2. Performance
comparison of CanCellCap with five SOTA methods across platforms. Table
3. Performance comparison of CanCellCap with five SOTA methods on
datasets from unseen types. Table 4. Ablation study of the components
of CanCellCap on the Testing Datasets. Table 5. Performance comparison
of CanCellCap and CanCellCap without feature masking-reconstruction
strategy across different dropout rates in simulated dropout datasets.
Table 6. Computational efficiency analysis. Table 7. Confusion matrix
of CanCellCap for cancer origin identification on the TISCH2 Testing
Dataset. Table 8. Accuracy for different tissue validation datasets of
CanCellCap
[160]12915_2025_2337_MOESM3_ESM.docx^ (5.6MB, docx)
Additional file 3: Fig. S1. Top 10 key genes identified by SHAP across
13 tissue types. Fig. S2. Top 10 key genes identified by GradientSHAP
across 13 tissue types. Fig. S3. Top 10 key genes identified by
Integrated Gradients across 13 tissue types. Fig. S4. Top 10 key genes
identified by DeepLIFT across 13 tissue types. Fig. S5. KEGG pathway
enrichment analysis of the overlapping genes between the top 300 genes
ranked by SHAP and those by GradientSHAP across 13 tissue types. Table
1. Total memory consumption of model inference using csv format. Table
2. Architectural components and layer configurations of the CanCellCap
model. Table 3. Accuracy metrics of CanCellCap and five
state-of-the-art methods on 10 cancer single-cell datasets, with
rejection rates indicated in parentheses for models that abstain from
certain predictions
Acknowledgements