Abstract
Background
   The advent of single-cell RNA sequencing (scRNA-seq) has provided
   unprecedented insights into cancer cellular diversity, enabling a
   comprehensive understanding of cancer at the single-cell level.
   However, identifying cancer cells remains challenging due to gene
   expression variability caused by tumor or tissue heterogeneity, which
   negatively impacts generalization and robustness.
Results
   We propose CanCellCap, a multi-domain learning framework, to identify
   cancer cells in scRNA-seq data suitable for all tissues, cancers, and
   sequencing platforms. Integrating domain adversarial learning and
   Mixture of Experts, CanCellCap is able to simultaneously extract common
   and specific patterns in gene expression profiles across different
   tissues for cancer or normal cells. Moreover, the
   masking-reconstruction strategy enables CanCellCap to cope with
   scRNA-seq data from different sequencing platforms. CanCellCap achieves
   0.977 average accuracy in cancer cell identification across 13 tissue
   types, 23 cancer types, and 7 sequencing platforms. It outperforms five
   state-of-the-art methods on 33 benchmark datasets. Notably, CanCellCap
   maintains high performance on unseen cancer types, tissue types, and
   even across species, highlighting its effectiveness in challenging
   scenarios. It also excels in spatial transcriptomics by accurately
   identifying cancer spots. Furthermore, CanCellCap demonstrates strong
   computational efficiency, completing inference on 100,000 cells in a
   few minutes. In addition, interpretability analyses reveal critical
   biomarkers and pathways, offering valuable biological insights.
Conclusions
   CanCellCap provides a robust and accurate framework for identifying
   cancer cells across diverse platforms, tissue types, and cancer types.
   Its strong generalization to unseen cancers, tissues, and even species,
   combined with its adaptability to spatial transcriptomics data,
   underscores its versatility for both research and clinical
   applications.
Supplementary Information
   The online version contains supplementary material available at
   10.1186/s12915-025-02337-1.
   Keywords: ScRNA-seq, Cancer cell identification, Multi-domain learning,
   Domain adversarial learning, Mixture of experts
Background
   Single-cell RNA sequencing (scRNA-seq) provides critical insights into
   cancer biology at the single-cell resolution [1]. Accurate
   identification of cancer cells advances the understanding of tumor
   heterogeneity and evolution, thereby enabling personalized therapies
   and more effective treatment strategies [2].
   Traditional approaches to separate cancer cells from normal cells often
   rely on detecting copy number variations (CNVs) from single-cell gene
   expression data by comparing target cells to reference cells
   [3–5]. For instance, InferCNV [4] requires a reference set
   of normal cells, while CopyKAT [3] requires the presence of both
   cancer and normal cells [6]. Additionally, CNVs are not exclusive
   to cancer cells, as normal cells can also exhibit copy number
   alterations [7, 8], reducing the reliability of these
   methods.
   Recent advances use machine learning to capture gene expression
   patterns of cancer cells, removing the dependency on reference cells
   and CNV inference. ikarus [9] and PreCanCell [10] distinguish
   cancer cells based on cancer- and normal-specific genes extracted from
   multiple training datasets. However, their accuracy heavily depends on
   the selected genes, making them prone to overfitting and poor
   generalization to unseen data. To improve generalization, CaSee
   [6] and Cancer-Finder [11] take different approaches. CaSee
   employs transfer learning with bulk RNA sequencing (RNA-seq) data to
   pre-train capsule network classifiers, but its effectiveness is limited
   by different distributions between bulk and single-cell RNA-seq data.
   In contrast, Cancer-Finder learns a universal representation of cancer
   cells but ignores tissue-specific expression patterns, which might lead
   to false positives.
   Despite these advancements, identifying cancer cells remains a
   significant challenge due to the high dimensionality and inherent
   heterogeneity of scRNA-seq data [12]. Cancer cell transcriptional
   profiles are strongly shaped by their tissues of origin, with
   distinct expression patterns observed across different tissues
   [13–15]. Moreover, sequencing-platform effects, chiefly dropout
   events (where lowly expressed genes go undetected), introduce noise
   and variability. These confounding factors complicate cellular state
   identification.
   A single-cell gene expression profile for a cancer cell can be
   considered as the coupling of tissue-common cancer expression patterns
   (tissue-common feature), tissue-specific expression patterns
   (tissue-specific feature), and sequencing platform effects.
   Disentangling these three factors enables the model to generalize to
   different tissues and sequencing platforms, enhancing the robustness of
   cancer cell identification. Therefore, the model needs to
   simultaneously learn tissue-common and tissue-specific features while
   eliminating sequencing platform effects. As each tissue exhibits unique
   gene expression patterns, they can be treated as distinct domains.
   Therefore, in our study, a multi-domain learning framework is designed
   to facilitate domain disentanglement, allowing models to effectively
   extract both domain-common and domain-specific features, that is,
   tissue-common and tissue-specific expression patterns in our case.
   In this study, we propose a novel multi-domain learning model,
   CanCellCap, to separate cancer cells from normal cells, free from the
   effects of tissue type, cancer type, and dropout rates from sequencing
   platforms. Considering the gene expression profile from scRNA-seq as
   the coupling of tissue-common expression patterns, tissue-specific
   expression patterns, and sequencing platforms, CanCellCap imports a
   feature masking-reconstruction training strategy and integrates domain
   adversarial learning and Mixture of Experts (MoE) in the model
   structure to improve generalization and accuracy in cancer cell
   identification. The domain adversarial learning module captures
   tissue-common features, and the MoE module dynamically selects the most
   relevant experts for each tissue to capture tissue-specific features.
   Additionally, dropout events, a major effect factor in the sequencing
   platform, are simulated by random masking of expression values and
   reconstructed via the model learning, thereby eliminating the impact of
   sequencing platforms. In comprehensive experiments, CanCellCap
   outperforms five state-of-the-art methods across 33 datasets,
   accurately identifying cancer cells across diverse cancer types, tissue
   types, sequencing platforms, and even in unseen cancer types, tissue
   types, sequencing platforms, and across species. It also exhibits
   excellent computational efficiency, completing inference on large-scale
   datasets with up to 100,000 cells in significantly less time than other
   models. Additionally, CanCellCap generalizes well to spatial
   transcriptomics data and offers biologically interpretable predictions
   by highlighting key genes and pathways with potential relevance to
   cancer diagnosis and therapy.
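   To make the decomposition above concrete, the forward pass can be
   sketched in NumPy. All layer sizes, activations, and parameters below
   are illustrative assumptions rather than the published architecture;
   the adversarial training of the tissue-common branch is only noted in
   a comment, since it is a training-time mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_genes, d, n_experts = 2000, 64, 8

# Randomly initialised toy parameters (illustration only).
W_enc  = rng.normal(0, 0.01, (n_genes, d))          # shared encoder -> tissue-common feature
W_gate = rng.normal(0, 0.01, (n_genes, n_experts))  # gate network
W_exp  = rng.normal(0, 0.01, (n_experts, n_genes, d))  # one expert per tissue pattern

def forward(x, mask_rate=0.3):
    # 1) Masking: simulate dropout events by zeroing random expression values.
    mask = rng.random(x.shape) > mask_rate
    x_masked = x * mask
    # 2) Tissue-common feature (in the real model this branch is trained
    #    adversarially, via gradient reversal, against a tissue discriminator).
    v_common = np.tanh(x_masked @ W_enc)
    # 3) MoE: the gate picks a soft combination of experts per cell.
    g = softmax(x_masked @ W_gate)  # gating weight vector g(X)
    expert_out = np.tanh(np.einsum('bg,egd->bed', x_masked, W_exp))
    v_specific = np.einsum('be,bed->bd', g, expert_out)
    # 4) Concatenate into the final cell embedding v_cat.
    return np.concatenate([v_common, v_specific], axis=1), g

x = rng.poisson(1.0, (4, n_genes)).astype(float)  # 4 toy cells
v_cat, g = forward(x)
print(v_cat.shape)  # (4, 128)
```

   Because the gating weights sum to 1 per cell, the tissue-specific
   feature is a convex combination of expert outputs.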
Results
Dataset descriptions and experimental design
   We collected 74 human cancer single-cell datasets from Tumor Immune
   Single-cell Hub (TISCH) [16], which cover 14 different tissues, 23
   cancer types, and 7 sequencing platforms. After preprocessing, a total
   of 328,230 cells from 13 tissue types were obtained. For each tissue,
   the samples were randomly split into a Training Dataset (80%) and a
   Validation Dataset (20%) for model training and performance validation,
   respectively. The preprocessing steps follow the approach outlined in
   Cancer-Finder [11].
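   A per-tissue 80/20 split of this kind can be sketched with
   scikit-learn; the toy arrays and the stratification by cancer/normal
   status below are assumptions for illustration, not the paper's exact
   procedure.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-ins: 1,000 cells with tissue labels and cancer/normal labels.
cells   = rng.normal(size=(1000, 50))
tissues = rng.choice(["Lung", "Liver", "Breast"], size=1000)
status  = rng.integers(0, 2, size=1000)

train_idx, val_idx = [], []
for t in np.unique(tissues):
    idx = np.where(tissues == t)[0]
    # 80/20 split within each tissue, stratified by cancer/normal status.
    tr, va = train_test_split(idx, test_size=0.2, random_state=0,
                              stratify=status[idx])
    train_idx.extend(tr)
    val_idx.extend(va)

print(len(train_idx), len(val_idx))  # ~800 / ~200
```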
   To evaluate CanCellCap comprehensively, experiments were designed to
   cover multiple performance aspects. First, CanCellCap was benchmarked
   against five state-of-the-art (SOTA) methods using 33 testing datasets
   spanning 15 tissue types to assess its overall performance. Second, its
   generalization ability was validated by testing on four unseen cancer
   types, two unseen tissue types, one unseen species, and its robustness
   was evaluated across six sequencing platforms. Third, we explored
   CanCellCap’s potential applications in the following scenarios: its use
   of tissue-specific features was tested through cancer origin
   identification, and its applicability in spatial transcriptomics was
   demonstrated by identifying cancer spots. Finally, we conducted model
   analysis, including cancer/tissue-specific evaluation, ablation
   studies, interpretability analyses, and computational complexity
   evaluation.
   The datasets used in the experiments are summarized below, with
   detailed sources provided in an additional file (see Additional file
   1: Tables 1–5).
    1. Training Dataset and Validation Dataset are sourced from TISCH1 and
       include 328,230 cells across 13 tissue types: Blood, Bone, Brain,
       Breast, Colorectum, Eye, Head and Neck, Liver, Lung, Nervous,
       Pancreas, Pelvic Cavity, and Skin. It does not overlap with any of
       the subsequent datasets.
    2. Testing Dataset comprises 33 human testing datasets spanning
       15 tissue types, 25 cancer types, and 6 sequencing platforms,
       totaling 834,788 cells, which were employed for external
       independent testing.
    3. Unseen Cancer/Tissue Type and Species Dataset consists of 113,393
       cells from four unseen cancer types, 89,922 cells from two unseen
       human tissue types, and 63,734 cells from an unseen species
       (mouse). “Unseen” means a tissue type, cancer type, or species that
       is absent from the Training Dataset.
    4. Multiple-Sequencing Platforms Dataset includes 5600 cells from
       Smart-seq2, 146,387 cells from 10X Genomics, 10,502 cells from
       Microwell-seq, 3134 cells from C1 Fluidigm, 21,011 cells from
       Drop-seq, and 86,486 cells from GEXSCOPE™. This dataset was used
       to validate CanCellCap’s generalization across different sequencing
       platforms, especially considering that the C1 Fluidigm, Drop-seq,
       and GEXSCOPE™ platforms are not present in the training set.
    5. TISCH2 Testing Dataset was constructed by TISCH2 [17], which
       was employed to further evaluate CanCellCap for identifying cancer
       cell origins. This dataset consists of 122,269 cells from seven
       tissues, including blood [18], brain [19], bone [20], lung [21],
       pancreas [22], and eye [23].
    6. Simulated Dropout Rate Dataset was generated from the Testing
       Dataset by randomly masking data with different dropout rates
       (0–98%). This dataset was used to assess CanCellCap’s robustness
       to the different dropout rates introduced by various sequencing
       platforms.
    7. Spatial Transcriptomics Dataset was obtained from human prostate
       cancer samples [24]. It was used to evaluate CanCellCap’s
       applicability to spatial transcriptomics analysis.
   These datasets and annotations were sourced from the testing sets of
   Cancer-Finder [11], TISCH2 [17], the Gene Expression Omnibus
   [25], and the Single Cell Portal [26]. Labels distinguishing
   cancer and normal cells were curated based on provided annotations to
   ensure consistency.
   The performance of CanCellCap was compared with five SOTA methods:
   CopyKAT [3] and SCEVAN [5], which rely on CNV detection; ikarus [9]
   and PreCanCell [10], which leverage specific gene expression patterns;
   and Cancer-Finder [11], which focuses on universal representation
   learning. The evaluation metrics
   include accuracy, F1 score (F1), recall, precision, and area under the
   receiver operating characteristic curve (AUROC). Note that since SCEVAN
   and CopyKAT do not provide probability scores for cancer cell
   classification, AUROC is not available for these methods. All
   evaluation metrics were calculated using standard functions from the
   scikit-learn library. Additionally, to ensure a statistically rigorous
   comparison, we employed the Bonferroni-Dunn test [27] and
   visualized the results using Critical Difference (CD) diagrams [28]
   with a 95% confidence level. The CD diagrams allow for
   statistical verification of whether the observed performance
   differences between methods are significant, following the
   implementation outlined in [29]. Detailed performance scores for
   all datasets are provided in an additional file (see Additional file
   2: Tables 1–8).
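   The five metrics can be computed with the standard scikit-learn
   functions mentioned above; the toy labels and probabilities below are
   illustrative only.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, recall_score,
                             precision_score, roc_auc_score)

# Toy predictions: 1 = cancer cell, 0 = normal cell.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

scores = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "F1":        f1_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "AUROC":     roc_auc_score(y_true, y_prob),  # needs probabilities,
}                                                # hence 'NA' for CNV methods
print(scores)
```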
Model training process
   The CanCellCap model was trained with a fixed random seed (0) for 50
   epochs, using a batch size of 128 and the stochastic gradient descent
   optimizer (learning rate = 1.3×10⁻³, momentum = 0.9, weight decay =
   1.5×10⁻⁴). The training losses and the corresponding validation
   accuracy throughout the training epochs are shown in Fig. 1A and B.
   The five losses, Total Loss, Identification Loss, Gate Loss, Domain
   Adversarial Loss, and Reconstruction Loss, converged after 40 epochs.
   On the Validation Dataset, which includes 13 tissue types, CanCellCap
   achieves an average accuracy of 0.9777, and 12 of the 13 tissue types
   exceeded 0.9500 accuracy in cancer cell identification, demonstrating
   its reliability across various biological contexts. The relatively
   lower performance on bone tissue, with an accuracy of 0.8630, may be
   due to the high similarity between multiple myeloma cells and normal
   plasma cells, making them harder to distinguish [30].
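   The reported optimizer settings correspond to the standard SGD update
   with momentum and L2 weight decay, sketched here on a toy quadratic
   objective; the parameters and gradients are stand-ins, not the actual
   model.

```python
import numpy as np

# Hyperparameters reported for CanCellCap's optimizer.
lr, momentum, weight_decay = 1.3e-3, 0.9, 1.5e-4

w = np.array([1.0, -2.0])   # toy parameters
buf = np.zeros_like(w)      # momentum buffer

def sgd_step(w, buf, grad):
    # Standard SGD with L2 weight decay and momentum
    # (the update rule used by e.g. torch.optim.SGD).
    g = grad + weight_decay * w
    buf = momentum * buf + g
    return w - lr * buf, buf

for _ in range(50):          # a few toy steps
    grad = 2 * w             # gradient of ||w||^2
    w, buf = sgd_step(w, buf, grad)

print(np.linalg.norm(w))     # shrinks toward the minimum at 0
```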
Fig. 1.
   Performance evaluation of CanCellCap during the training stage. A Loss
   curves of the five losses during the training stage. B Accuracy for
   different tissues within the Validation Dataset during the training
   stage. C t-SNE visualization of the input expression features X from
   the Validation Dataset, colored by cell status. D t-SNE visualization
   of the final cell embeddings v_cat from the Validation Dataset,
   colored by cell status
   The t-distributed stochastic neighbor embedding (t-SNE) visualizations
   for both the input expression features X and the final cell embeddings
   v_cat extracted by CanCellCap from the Validation Dataset are shown in
   Fig. 1C and D, respectively, where the dots are colored by cancer or
   normal cell status. In the t-SNE computed directly from gene
   expression, cancer and normal cells are mixed. In contrast, in the
   t-SNE on v_cat, cancer and normal cells exhibit a clearer separation.
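   A t-SNE projection of cell embeddings of this kind can be produced
   with scikit-learn; the two Gaussian clusters below merely stand in for
   cancer and normal cell embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy embeddings standing in for v_cat: two clusters = cancer vs normal.
cancer = rng.normal(loc=3.0, size=(100, 64))
normal = rng.normal(loc=0.0, size=(100, 64))
v_cat = np.vstack([cancer, normal])

# Project the 64-dimensional embeddings to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(v_cat)
print(coords.shape)  # (200, 2)
```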
Experiments on performance across various projects
   CanCellCap was compared with five SOTA methods on the Testing Dataset,
   comprising 855,713 human cells from 33 datasets. As illustrated in the
   boxplots (Fig. 2A), CanCellCap consistently outperformed the other
   methods across most evaluation metrics, including accuracy, F1,
   precision, and AUROC. Notably, the interquartile ranges in the boxplots
   are smaller for CanCellCap across most metrics, indicating more stable
   performance across datasets. This observation is further supported by
   the CD diagrams shown in Fig. 2B. In the CD diagrams, methods
   connected by a line are considered statistically similar, while methods
   not connected show significant performance differences. The diagrams
   demonstrate that CanCellCap ranks significantly higher than other
   methods in accuracy, F1, and precision, with statistical significance.
   Recall is the only metric where CanCellCap performs comparably to
   Cancer-Finder, both achieving the highest performance. This is likely
   due to the MoE module, which enables tissue-specific discrimination. In
   tissues with limited training data, experts tend to adopt conservative
   decision boundaries, reducing recall in ambiguous cases while improving
   generalization.
Fig. 2.
   Performance comparison of CanCellCap and five SOTA methods across 33
   testing datasets. A Boxplots illustrate the distribution of five
   evaluation metrics across all datasets. B CD diagrams display the
   average rankings of all methods based on the five metrics. Note that
   since SCEVAN and CopyKAT do not provide probability scores for cancer
   cell classification, AUROC is not applicable to these methods and is
   thus marked as 'NA'
   In addition to the overall performance evaluation, we further assessed
   the robustness of CanCellCap across different tissue types within the
   Testing Dataset. Specifically, we compared CanCellCap with five SOTA
   methods across 15 human tissue types. As shown in Fig. 3,
   CanCellCap demonstrates superior performance across the majority of
   tissues. In contrast, methods such as PreCanCell, SCEVAN, CopyKAT, and
   ikarus perform well in certain tissues but exhibit substantial
   performance degradation in others, indicating limited robustness.
   Although Cancer-Finder shows greater cross-tissue stability than the
   aforementioned methods, its overall performance remains lower than that
   of CanCellCap.
Fig. 3.
   Performance comparison of CanCellCap with five SOTA methods across
   tissues. Five performance metrics comparison of CanCellCap with five
   SOTA methods in cancer cell identification across 15 human tissue types
   from the Testing Dataset. Recall, F1, precision, and AUROC are marked
   as 'NA' for Head and Neck, which contains only normal cells. Note that
   since SCEVAN and CopyKAT do not provide probability scores for cancer
   cell classification, AUROC is not applicable to these methods and is
   thus marked as 'NA'
   Overall, CanCellCap outperforms all other methods in both overall
   performance and robustness across diverse tissue types in identifying
   cancer cells.
Experiments on generalization across different sequencing platforms
   To assess the robustness and generalization performance across
   different sequencing platforms, CanCellCap was applied to scRNA-seq
   datasets from six different sequencing platforms.
   As shown in Fig. 4, CanCellCap outperforms all five SOTA methods
   across the six sequencing platforms, achieving the highest average
   accuracy of 0.9229. Even on unseen sequencing platforms such as C1
   Fluidigm, Drop-seq, and GEXSCOPE™, CanCellCap maintains high accuracy,
   reaching 0.956, 0.9121, and 0.9393, respectively. In contrast, Cancer-Finder shows
   more unstable performance, particularly struggling on platforms like
   10X Genomics, where it achieves a relatively low accuracy of 0.8539.
   Overall, CanCellCap exhibits superior adaptability on the datasets from
   the six sequencing platforms.
Fig. 4.
   Performance comparison of CanCellCap with five SOTA methods across
   platforms. Five performance metrics comparison of CanCellCap with five
   SOTA methods in cancer cell identification across six different
   sequencing platforms. Note that since SCEVAN and CopyKAT do not provide
   probability scores for cancer cell classification, AUROC is not
   applicable to these methods and is thus marked as 'NA'
Experiment for robustness on unseen cancer/tissue types and species
   CanCellCap was applied to datasets from four unseen cancer types:
   Cervical Cancer (CC), Hepatoblastoma (HB), Small Cell Lung Cancer
   (SCLC), and Cutaneous Melanoma (CMM); two unseen tissue types (soft
   tissue and prostate); and one unseen species (mouse), none of which
   are included in the Training Dataset.
   As shown in Fig. 5A, CanCellCap outperforms all SOTA methods across
   all metrics on the unseen cancer types. Specifically, CanCellCap
   achieves the highest accuracy of 0.9297, recall of 0.9682, F1 score of
   0.9287, precision of 0.9291, and AUROC of 0.9722 on the unseen cancer
   types. In contrast, the second-best model, Cancer-Finder, achieves an
   accuracy of 0.8705 with an AUROC of 0.8676. Figure 5B presents the
   performance on unseen tissue types, such as soft tissue and prostate.
   On the unseen tissue types, CanCellCap maintains the highest accuracy
   of 0.7877, F1 score of 0.7847, precision of 0.8333, and AUROC of
   0.8172. In comparison, Cancer-Finder and ikarus struggle with lower
   AUROC, while PreCanCell is more competitive but still posts overall
   lower metrics.
Fig. 5.
   Performance comparison of CanCellCap with five SOTA methods on datasets
   from unseen types. A Five metrics on datasets of unseen cancer type. B
   Five metrics on datasets of unseen tissue type. C Five metrics on
   datasets from mouse. Note that since SCEVAN and CopyKAT do not provide
   probability scores for cancer cell classification, AUROC is not
   applicable to these methods and is thus marked as 'NA'
   To further evaluate generalization across species, the model was
   applied to three mouse single-cell datasets. For compatibility with
   mouse data, CanCellCap was adapted by mapping gene homologs between
   human and mouse using the HomoloGene database [[96]31]. Since ikarus
   and PreCanCell lack pipelines for mouse datasets, and given the
   similarity in framework between Cancer-Finder and CanCellCap, a
   compatible pipeline for Cancer-Finder was developed in this study to
   facilitate direct comparison. Accordingly, evaluations focused on
   Cancer-Finder, SCEVAN, and CopyKAT. As shown in Fig. 5C, CanCellCap
   achieved the best overall performance with an accuracy of 0.9112,
   surpassing the second-best method, Cancer-Finder, which attained
   0.8278.
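   The homolog mapping step can be sketched with pandas; the three-row
   table below is illustrative and does not reproduce HomoloGene's actual
   schema (the gene pairs shown are real mouse/human homologs).

```python
import pandas as pd

# Hypothetical homolog table: one group id per homologous pair.
homologs = pd.DataFrame({
    "group":   [1, 1, 2, 2, 3, 3],
    "species": ["mouse", "human", "mouse", "human", "mouse", "human"],
    "symbol":  ["Cdkn2a", "CDKN2A", "Cxcr4", "CXCR4", "Ptprc", "PTPRC"],
})

# Build a mouse -> human symbol map by joining within each homology group.
wide = homologs.pivot(index="group", columns="species", values="symbol")
mouse_to_human = dict(zip(wide["mouse"], wide["human"]))

# Rename genes in a toy mouse expression matrix so a human-trained model
# can consume it; unmapped genes would simply be dropped.
mouse_expr = pd.DataFrame([[5, 0, 2]], columns=["Cdkn2a", "Cxcr4", "Ptprc"])
human_expr = mouse_expr.rename(columns=mouse_to_human)
print(list(human_expr.columns))  # ['CDKN2A', 'CXCR4', 'PTPRC']
```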
   These results underscore CanCellCap’s exceptional adaptability and its
   ability to reliably identify cancer cells across unseen cancer types,
   tissue types, and species.
Experiment to identify cancer cell origin
   To validate the tissue-specific features captured by CanCellCap, this
   experiment aims to identify cancer cell origins on the TISCH2 Testing
   Dataset comprising 122,269 cells from six tissue origins. The ability to
   classify cancer cell origins relies heavily on recognizing
   tissue-specific gene expression patterns, which are unique to each
   tissue type. This task involved classifying cells as normal or
   cancerous and, for cancer cells, identifying their origin.
   As shown in Fig. 6A, CanCellCap demonstrates strong overall
   performance in identifying the origin of cancer cells, achieving an
   average accuracy of 0.9521. Since the training set contains 14
   categories while the test set only includes 7 categories, some cells
   were misclassified into categories not present in the test set. Thus,
   the strong performance highlights CanCellCap’s ability to effectively
   capture and utilize tissue-specific gene expression patterns.
Fig. 6.
   Application of CanCellCap on cancer origin identification and cancer
   spot identification. A Confusion matrix of CanCellCap for cancer origin
   identification on the TISCH2 Testing Dataset. B Comparison of cancer
   spot identification in spatial transcriptomics, where each spot
   represents gene expression profiles mapped to spatial coordinates
   within the tissue: (i) pathologist annotations, (ii) CanCellCap
   predictions, and (iii) confusion matrix for CanCellCap’s cancer spot
   identification
   Overall, these results demonstrate that CanCellCap excels not only in
   distinguishing cancer cells from normal cells but also in leveraging
   the extracted tissue-specific features to accurately identify the
   cancer cell origin.
Spot-level cancer identification in spatial transcriptomics
   Spatial transcriptomics provides critical spatial context, revealing
   how cancer cells interact with their microenvironment—key information
   for advancing cancer research. To assess CanCellCap’s applicability to
   spatial transcriptomics data from cancer tissue sections, CanCellCap,
   which was trained on scRNA-seq data, was directly applied to cancer
   spot identification on a spatial transcriptomics dataset obtained from
   human prostate cancer samples. CanCellCap’s predictions were compared
   with pathologist annotations.
   Figure 6B (i) shows the pathologist’s annotations, indicating
   cancer and normal regions, while Fig. 6B (ii) presents
   CanCellCap’s predictions. The spots predicted by CanCellCap closely
   resemble those annotated by the pathologist. Figure 6B (iii)
   presents the confusion matrix, which further highlights CanCellCap’s
   performance in identifying cancer and normal spots. The matrix reveals
   a high true positive rate for cancer spots, demonstrating CanCellCap’s
   strong ability to detect cancer cells. With an accuracy of 0.7989 and a
   recall of 0.9058, CanCellCap shows excellent performance in
   distinguishing cancer cells from normal cells.
   Overall, CanCellCap’s ability to identify cancer cell distribution at
   spot-level granularity underscores its value in spatial
   transcriptomics, offering nuanced insights into tumor architecture that
   complement pathologist assessments. Notably, despite not being
   specifically trained on spatial transcriptomics data, CanCellCap
   achieves excellent results.
Model analysis
Evaluation of CanCellCap across diverse cancer and tissue types
   To further evaluate the capability of CanCellCap, its cancer cell
   identification performance was assessed across various cancer types and
   tissue types, based on 33 test datasets. This analysis included both
   rare cancer types and under-studied tissue types, highlighting its
   robustness in challenging biological contexts.
   As shown in Fig. 7, CanCellCap achieved over 0.9 accuracy across
   most cancer types. Notably, even among rare cancers such as
   adamantinomatous craniopharyngioma (ACP), synovial sarcoma (SS),
   pleuropulmonary blastoma (PPB), kidney chromophobe (KICH), SCLC, and
   gastroenteropancreatic neuroendocrine tumors (GEP-NETs), performance
   remained high. Figure 7 also demonstrates similarly strong
   performance across tissue types, with most exceeding 0.9 accuracy,
   including under-studied tissues like Eye, Kidney, and Soft Tissue.
   Despite the overall strong performance, certain cancer types such as
   prostate adenocarcinoma (PRAD), belonging to prostate tissue, exhibited
   relatively lower accuracies, falling below 0.8. This underperformance
   is likely attributed to the absence of corresponding samples in the
   training dataset. Nonetheless, as discussed in the “Experiment for
   robustness on unseen cancer/tissue types and species” section,
   CanCellCap still outperforms other models under these conditions.
Fig. 7.
   Performance of CanCellCap across cancer types and tissue types.
   Abbreviations used in the figure are listed in the Abbreviations
   section
   Importantly, the ACP dataset was derived from clinical cancer samples
   obtained through hospital collaboration. scRNA-seq data were generated
   following standard protocols, and cancer cells were annotated according
   to CTNNB1 mutation status. On this dataset, CanCellCap achieved an
   accuracy of 0.9402, demonstrating its potential for clinical diagnostic
   use.
   These findings collectively emphasize the generalizability of
   CanCellCap across diverse biological contexts, including rare
   scenarios.
Ablation study
   To further evaluate the contributions of various components of
   CanCellCap, an ablation study was conducted, with key modules
   systematically removed and their impact on model performance assessed
   using the Testing Dataset.
   As shown in Fig. 8A, the performance of the ablated models was
   analyzed on the Testing Dataset. The results indicate that compared to
   using both the adversarial module and the MoE module, using only the
   domain adversarial module achieves relatively high recall. However, it
   suffers from a significant increase in false positives due to the lack
   of specificity, and there is a notable decline in performance on other
   metrics. This highlights the necessity of integrating both common and
   specific gene expression patterns across different tissues to improve
   generalization. The complete CanCellCap also performs slightly better
   than either ablated variant.
Fig. 8.
   Ablation study of the components of CanCellCap. A Ablation study of the
   components of CanCellCap on the Testing Dataset. B–F Performance
   comparison of CanCellCap and CanCellCap without feature
   masking-reconstruction strategy across different dropout rates in
   simulated dropout datasets, evaluated using five performance metrics
   Overall, the complete CanCellCap achieves the highest scores across all
   metrics, confirming that each component plays a meaningful role in
   enhancing the overall performance of CanCellCap.
   To further investigate the impact of dropout rates in the dataset and
   evaluate the effectiveness of the feature masking-reconstruction
   strategy, CanCellCap was compared with a variant without this strategy
   across different dropout rates in simulated dropout datasets. These
   datasets are generated by randomly masking expression values with
   varying dropout probabilities (0–98%), based on the Testing Dataset.
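   Such simulated dropout amounts to Bernoulli masking of the expression
   matrix; the Poisson toy matrix and the masking helper below are
   illustrative assumptions, not the paper's exact simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_dropout(expr, rate):
    """Zero each expression value independently with probability `rate`,
    mimicking platform dropout events."""
    mask = rng.random(expr.shape) >= rate
    return expr * mask

expr = rng.poisson(2.0, size=(1000, 200)).astype(float)  # toy counts
for rate in (0.05, 0.5, 0.98):
    dropped = simulate_dropout(expr, rate)
    surviving = (dropped > 0).sum() / (expr > 0).sum()
    print(f"rate={rate:.2f}  surviving nonzeros ≈ {surviving:.2f}")
```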
   As shown in Fig. 8B–F, CanCellCap consistently outperforms the
   variant without feature masking-reconstruction at all dropout
   rates. For instance, at a low dropout rate (5%), CanCellCap achieves an
   accuracy of 0.9269, compared to 0.9146 for the variant without this
   strategy. As the dropout rate increases, the performance gap widens,
   with CanCellCap maintaining superior results in accuracy, recall, F1
   score, and precision. Notably, even when the dropout probability
   exceeds 50%, CanCellCap’s accuracy remains above 0.9. The observed
   increase in recall at higher dropout rates may be due to CanCellCap’s
   tendency to classify uncertain cases as cancer cells, indicating a
   cautious bias toward identifying potential cancer cells in sparse data.
Gating weights vector analysis
   To illustrate how CanCellCap effectively assigns cells from various
   tissues to the most suitable combination of experts, the gating weight
   vector g(X) is analyzed. Two visualizations were employed to examine
   the distribution of gating weight vectors for each cell in the
   validation set:
Heatmap of gating weight vectors
   A heatmap was created to visualize the gating weight vectors assigned
   to each cell. As shown in Fig. 9A, the gating weight vectors are
   similar within the same tissue type while differing across different
   tissues. This confirms that the gate network dynamically adapts to the
   specific expression patterns of each tissue, ensuring that cells are
   directed to the most appropriate experts.
Fig. 9.
   Gating weight vectors analysis. A Heatmap of gating weight vectors
   across tissue types. B t-SNE visualization of gating weight vectors
   across tissue types, with different colors representing different
   tissue types
t-SNE visualization of gating weight vectors
   t-SNE was used to visualize the gating weight vectors assigned to each
   cell. As shown in Fig. 9B, cells from the same tissue type tend to
   cluster together, indicating that CanCellCap consistently assigns
   similar weights to cells of the same tissue. This clustering behavior
   illustrates that the gate network effectively allocates cells to the
   most relevant tissue experts.
   Together, these visualizations provide compelling evidence that the
   gate network of the MoE module is capable of intelligently allocating
   cells to the most relevant experts, thereby optimizing performance
   across diverse tissue types.
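The similarity pattern described above can also be quantified directly on the gating vectors, for example via mean pairwise cosine similarity; the sketch below uses small synthetic gating vectors purely for illustration:

```python
import numpy as np

# Synthetic gating weight vectors (hypothetical data): 3 cells from each of
# 2 tissues, 4 experts. Within-tissue vectors are near-identical by design.
tissue_a = np.array([[0.70, 0.10, 0.10, 0.10],
                     [0.68, 0.12, 0.10, 0.10],
                     [0.72, 0.08, 0.10, 0.10]])
tissue_b = np.array([[0.10, 0.10, 0.70, 0.10],
                     [0.10, 0.12, 0.68, 0.10],
                     [0.10, 0.08, 0.72, 0.10]])

def mean_cosine(A, B):
    """Average pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).mean())

within = mean_cosine(tissue_a, tissue_a)    # high: same tissue, same experts
between = mean_cosine(tissue_a, tissue_b)   # low: different expert mixtures
```

A within-tissue similarity well above the between-tissue similarity is the numerical counterpart of the clustering seen in the heatmap and t-SNE plots.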
Key genes and pathways for cancer cell identification
   To enhance our understanding of CanCellCap's decision-making process
   and its biological relevance, an interpretability experiment was
   conducted using SHapley Additive exPlanations (SHAP) and GradientSHAP
   [32]. Based on these methods, the top 10 key genes for cancer cell
   identification were identified. Kyoto Encyclopedia of Genes and
   Genomes (KEGG) [33] pathway enrichment analysis was then performed on
   the genes shared between the top 300 ranked by SHAP and the top 300
   ranked by GradientSHAP, to further explore their biological relevance.
   As shown in Fig. 10A and B, SHAP and GradientSHAP were employed to
   rank the key genes affecting CanCellCap's predictions, with higher
   values indicating greater importance. Among them, several are
   well-established biomarkers, including IFI27 [34], CDKN2A [35], CXCR4
   [36], S100B [37], and PTPRC [38], which are widely recognized in
   cancer research. Additionally, other potentially relevant genes were
   identified, such as SPINT2 [39], SH3BGRL3 [40], RGS1 [41], and IGFBP7
   [42], suggesting potential avenues for cancer biomarker discovery.
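The overlap between the two attribution methods reduces to a set intersection of their top-ranked gene lists; the sketch below uses hypothetical attribution matrices and a top-3 cutoff in place of the top-300 lists:

```python
import numpy as np

def top_k_genes(attributions, gene_names, k=3):
    """Return the k genes with the largest mean |attribution| across cells."""
    scores = np.abs(attributions).mean(axis=0)
    order = np.argsort(scores)[::-1][:k]
    return [gene_names[i] for i in order]

# Hypothetical attribution matrices (cells x genes) standing in for the
# outputs of SHAP and GradientSHAP.
genes = ["IFI27", "CDKN2A", "CXCR4", "S100B", "PTPRC"]
shap_vals = np.array([[0.9, 0.5, 0.1, 0.05, 0.3],
                      [0.8, 0.6, 0.2, 0.05, 0.2]])
grad_vals = np.array([[0.7, 0.1, 0.6, 0.05, 0.4],
                      [0.9, 0.2, 0.5, 0.05, 0.3]])

overlap = set(top_k_genes(shap_vals, genes)) & set(top_k_genes(grad_vals, genes))
# genes ranked highly by BOTH methods are passed on to KEGG enrichment
```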
Fig. 10.
   Identification of key genes and pathway enrichment across tissues. A
   Top 10 important genes ranked by SHAP in pelvic cavity and skin
   tissues. B Top 10 important genes ranked by GradientSHAP in the same
   tissues. C KEGG pathway enrichment based on the overlapping genes
   between the top 300 ranked by SHAP and the top 300 ranked by
   GradientSHAP
   As shown in Fig. 10C, KEGG pathway enrichment analysis was conducted
   on the overlapping genes between the top 300 genes ranked by SHAP and
   those ranked by GradientSHAP, to further validate the biological
   relevance of CanCellCap's predictions. The overlap of key genes from
   both methods ensures that the identified pathways are consistently
   represented across different interpretability techniques, providing
   biologically meaningful and consistent insights. In both pelvic cavity
   and skin tissues, the enriched pathways were strongly associated with
   cancer-related processes. In pelvic cavity tissue, key enriched
   pathways included “Pathways in cancer” and “Regulation of actin
   cytoskeleton,” both of which play crucial roles in cancer development,
   metastasis, and oncogenic signaling [43, 44]. Similarly, in skin
   tissue, enrichment of pathways such as “Pathways in cancer” and the
   “MAPK signaling pathway” was observed, indicating the involvement of
   oncogenic mechanisms relevant to skin cancer [45, 46].
   This interpretability analysis highlights that CanCellCap not only
   accurately identifies cancer cells, but also uncovers critical genetic
   markers and biologically meaningful pathways that may inform future
   cancer research and therapeutic development. Furthermore, the released
   CanCellCap code supports additional interpretability methods,
   including DeepLIFT [47] and Integrated Gradients [48]. The
   interpretability results for other tissue types can be found in an
   additional file (see Additional file 3: Fig. S1–S5).
Computational efficiency
   To evaluate the computational efficiency of each method, a series of
   runtime experiments were conducted on a workstation equipped with an
   NVIDIA 2080Ti GPU (12 GB memory) and a 24-core 2.2 GHz CPU, using
   datasets of varying sizes.
   First, the runtime of CanCellCap was analyzed by comparing the total
   runtime (including data loading and inference) to the inference time
   alone. The experiments used scRNA-seq datasets stored in CSV format
   and loaded via pandas. As shown in Fig. 11A, the results indicate that
   the majority of time is spent on data loading rather than on model
   inference. To alleviate this bottleneck, two solutions were explored:
   (i) using the binary Parquet storage format, which significantly
   reduces data loading time; and (ii) leveraging Modin [49], a
   parallelized drop-in alternative to pandas that accelerates CSV file
   reading, especially for large datasets. Both solutions have been
   implemented in our public pipeline.
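The loading path can be sketched with pandas alone (an in-memory toy matrix stands in for the benchmark files; the Parquet variant uses `DataFrame.to_parquet` / `pandas.read_parquet` and requires a Parquet engine such as pyarrow):

```python
import io
import time
import numpy as np
import pandas as pd

# Toy expression matrix: 1,000 cells x 50 genes, written to CSV in memory.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 50)),
                  columns=[f"gene_{j}" for j in range(50)])

buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)

t0 = time.perf_counter()
loaded = pd.read_csv(buf)            # the pandas path used in the experiments
csv_load_time = time.perf_counter() - t0

# The Parquet path is analogous and typically much faster on large files:
#   df.to_parquet("X.parquet"); pd.read_parquet("X.parquet")  # needs pyarrow
```

On matrices of 100,000 cells the difference between these two paths dominates total runtime, which is why the pipeline prefers Parquet for storage.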
Fig. 11.
   Computational efficiency analysis. A Evaluation of computational
   efficiency and total time (including both data loading and model
   inference) across different data formats (Parquet, CSV) and data
   loading methods (pandas, Modin) on datasets of varying sizes. B
   Comparison of total time on datasets of varying sizes using the CSV
   format. The lighter-shaded portion of each bar represents the model
   inference time. “NA” denotes times that exceed one day
   As demonstrated in Fig. 11A, CanCellCap (Parquet), which uses the
   binary Parquet format, consistently reduced total runtime across
   datasets of varying sizes, making it the preferred option for both
   storage and inference. The benefit became increasingly pronounced with
   larger datasets, achieving up to a 100× speedup over CanCellCap (CSV
   with pandas) on datasets containing 100,000 cells. For pre-existing
   large-scale CSV files (e.g., 100,000 cells), enabling Modin provided
   additional improvements, with CanCellCap (CSV with Modin) achieving up
   to a 10× speedup over the pandas-based implementation. However, for
   smaller datasets, the initialization overhead of Modin may outweigh
   its benefits.
   Additionally, Fig. 11B compares the total runtime of models using
   CSV-formatted input. CanCellCap demonstrated superior computational
   efficiency on datasets exceeding 10,000 cells, with the use of Modin
   significantly reducing runtime. For the largest dataset tested
   (100,000 cells), CanCellCap completed inference in just 396.46 s,
   substantially faster than the second-best method, Cancer-Finder,
   which required 3850.88 s. These results highlight CanCellCap's
   computational efficiency, allowing it to outperform other models on
   large-scale inference workloads.
Discussion
   In this study, we introduce CanCellCap, a multi-domain learning
   framework for identifying cancer cells from scRNA-seq data across
   diverse tissue types. CanCellCap integrates domain adversarial learning
   with a Mixture of Experts module to effectively capture both
   tissue-common and tissue-specific features. This dual capability
   enhances its generalization across diverse tissue types. To mitigate
   the effects of various dropout rates from different sequencing
   platforms, CanCellCap employs a feature masking-reconstruction
   strategy, enabling robust performance across different sequencing
   platforms.
   Extensive experiments demonstrate that CanCellCap outperforms five
   state-of-the-art methods across 33 testing datasets, accurately
   identifying cancer cells across diverse cancer types and tissue types,
   including both seen and unseen cancer and tissue types, as well as
   sequencing platforms. CanCellCap also successfully identifies cancer
   spots in spatial transcriptomics data and cancer cells in mouse
   datasets, despite being trained exclusively on human scRNA-seq data.
   Ablation studies highlight the essential roles of its core components,
   while analysis of gating weight vectors further supports CanCellCap’s
   ability to effectively assign input data to the most relevant expert
   models. Moreover, CanCellCap accurately identified cancer cells in
   clinically derived samples, demonstrating its potential for clinical
   translation. Interpretability analysis revealed key genes and pathways,
   including both established cancer biomarkers and potential novel
   therapeutic targets. These findings offer valuable insights for cancer
   biology and biomarker discovery.
   Despite its overall strong performance, CanCellCap shows slightly lower
   recall, which may stem from the MoE module’s reliance on
   tissue-specific experts. In tissues with limited training data, these
   experts may adopt conservative decision boundaries. To address this, we
   plan to enrich underrepresented tissue types in the training set and
   explore adaptive training strategies to better balance recall and
   generalization.
   Additionally, the current training dataset is based solely on human
   data, thereby limiting CanCellCap’s generalization capacity across
   species. While early results show promise in identifying mouse cancer
   cells, its applicability to non-human species remains constrained.
   Expanding the training dataset by incorporating data from multiple
   species will be a key priority in future work. This will enhance
   CanCellCap’s ability to generalize across diverse biological contexts,
   increasing its utility across a broader range of applications.
   Another limitation lies in CanCellCap’s memory consumption. Although
   its usage is moderate compared to other methods, there is still room
   for optimization. Designing a lightweight version of the model would
   help reduce both memory usage and runtime, thereby enabling wider
   adoption in both research and clinical settings.
   Interpretability analysis identified key genes driving CanCellCap’s
   predictions, including both well-established cancer biomarkers and
   potential novel candidates for diagnosis or therapy. However, only a
   subset of these genes has been experimentally validated. To enhance the
   clinical relevance of these findings, we have initiated collaborations
   with hospitals to validate candidate biomarkers using real-world
   clinical datasets.
Conclusions
   CanCellCap is introduced as a robust and generalizable framework for
   cancer cell identification from single-cell RNA sequencing (scRNA-seq)
   data. Leveraging a multi-domain design and feature
   masking-reconstruction strategies, CanCellCap achieves high accuracy
   across a diverse array of tissues, cancer types, and sequencing
   technologies. Importantly, CanCellCap demonstrates strong
   generalization to entirely unseen cancer types, tissue types, and even
   species, showcasing its adaptability in real-world and cross-species
   scenarios. In addition, its biologically interpretable outputs enable
   the identification of potential cancer biomarkers, supporting both
   translational research and fundamental insights into cancer biology.
   Future efforts will be directed toward extending its applicability to
   broader biological and clinical contexts, including cross-species
   generalization and enhanced computational efficiency for clinical
   deployment. Collaborations with clinical researchers are ongoing to
   explore CanCellCap’s utility in early cancer detection and personalized
   therapeutic strategy development, ultimately contributing to the
   advancement of cancer research and precision medicine.
Methods
Model framework
   With a gene expression matrix as input, whose rows correspond to
   individual cells from diverse human tissues and whose columns
   correspond to genes, CanCellCap is composed of the following four
   modules, as shown in Fig. 12.
    1. A feature masking-reconstruction strategy: The expression matrix
       is randomly masked to simulate dropout during sequencing and
       reconstructed by a decoder. The reconstruction loss $L_{recon}$
       ensures the accurate recovery of dropped-out gene expressions.
    2. A domain adversarial learning module: A gradient reversal layer
       (GRL) is introduced to implement domain adversarial learning by
       building a tissue discriminator and a tissue confuser to extract
       tissue-common features. The tissue discriminator predicts the
       tissue source of cells, while the tissue confuser learns to
       extract common gene expression patterns that confuse the
       discriminator. The domain adversarial loss $L_{ad}$ measures the
       cross-entropy between the true and predicted tissue labels of
       cells to guide the adversarial training.
    3. A Mixture of Experts module: To capture tissue-specific features,
       CanCellCap employs a Mixture of Experts (MoE) module consisting
       of multiple expert networks and a gating network. The gating
       network, guided by the gate loss $L_{gate}$, dynamically selects
       the most appropriate experts for each tissue type, enabling
       CanCellCap to adapt to tissue-specific expression patterns.
    4. An MLP cancer-cell classifier: The tissue-common and
       tissue-specific features are concatenated and input into a
       multilayer perceptron (MLP) for the primary task of cancer cell
       identification. The loss $L_{ide}$ is the cross-entropy between
       the predicted and true labels.
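The data flow through the four modules can be sketched as a plain NumPy forward pass for a single cell (a minimal illustration with random, untrained weights and hypothetical layer sizes; the GRL is omitted because it only acts during backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_tissues, h = 8, 3, 4          # hypothetical sizes

x = rng.random(n_genes)                       # one cell's expression vector
x_masked = x * (rng.random(n_genes) >= 0.3)   # module 1: random masking

# Module 2: tissue confuser (stand-in for a two-layer MLP) -> v_com
W_f = rng.normal(size=(n_genes, h))
v_com = np.tanh(x_masked @ W_f)

# Module 3: MoE -> v_spe (softmax gating over per-tissue experts)
W_g = rng.normal(size=(n_genes, n_tissues))
gate = np.exp(x_masked @ W_g)
gate /= gate.sum()
experts = rng.normal(size=(n_tissues, n_genes, h))
v_spe = sum(gate[d] * (x_masked @ experts[d]) for d in range(n_tissues))

# Module 4: classifier on the concatenated features
v_cat = np.concatenate([v_com, v_spe])
w_c = rng.normal(size=v_cat.shape)
p_cancer = 1.0 / (1.0 + np.exp(-(v_cat @ w_c)))  # probability of "cancer"
```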
Fig. 12.
   The model structure of CanCellCap. It processes gene expression
   matrices of cells from various tissues and simulates dropout effects
   using random masking. CanCellCap employs domain adversarial learning
   with a gradient reversal layer (GRL) to extract tissue-common features
   $v_{com}$ and a MoE module to extract tissue-specific features
   $v_{spe}$. These feature embeddings are combined for feature
   reconstruction, to remove the effect of the various dropout rates of
   different sequencing platforms, and for identifying cancer cells.
   CanCellCap is trained using a comprehensive loss function that
   enhances generalization and accuracy across diverse tissue types
   The overall loss function is the integration of the four losses,
   $L = L_{ide} + \alpha_1 L_{ad} + \alpha_2 L_{gate} + \alpha_3 L_{recon}$,
   where $\alpha_1$, $\alpha_2$, and $\alpha_3$ weight the three
   auxiliary loss components.
Data preprocessing
   The gene expression is log-transformed and globally scale-normalized
   per cell with Seurat [50]. The gene intersection set across all
   tissues is extracted by Eq. (1), where $G_d$ represents the gene list
   of tissue $d$ and $N_t$ is the number of tissues. The expressions of
   the genes in $G_{selected}$ are extracted to construct the input
   expression matrix $X \in \mathbb{R}^{N_c \times N_g}$, where $N_c$ is
   the number of cells and $N_g$ is the number of selected genes.

   $$G_{selected} = \bigcap_{d=1}^{N_t} G_d. \qquad (1)$$
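Eq. (1) amounts to a set intersection over the per-tissue gene lists; a minimal sketch with hypothetical lists:

```python
# Hypothetical per-tissue gene lists; the real lists come from the
# preprocessed Seurat objects.
gene_lists = {
    "lung":  ["TP53", "EGFR", "PTPRC", "IFI27"],
    "skin":  ["TP53", "S100B", "PTPRC", "IFI27"],
    "liver": ["TP53", "ALB", "PTPRC", "IFI27"],
}

# Eq. (1): intersect the gene lists of all tissues.
g_selected = set.intersection(*(set(g) for g in gene_lists.values()))
print(sorted(g_selected))  # → ['IFI27', 'PTPRC', 'TP53']
```

Only these shared genes are kept as columns of the input matrix $X$, which is what makes the model applicable across tissues with differing gene panels.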
Feature masking-reconstruction
   A binary random masking matrix $M_{mask}$ is constructed according to
   Eq. (2), where $r(i,j)$ is a number randomly sampled from $[0,1]$ for
   the $i$-th cell and the $j$-th gene, and $p_{mask}$ is the masking
   probability, with a default value of 0.3. $M_{mask}(i,j)=0$ indicates
   that gene $j$ is masked (set to 0) in cell $i$, while
   $M_{mask}(i,j)=1$ means that the expression of gene $j$ is retained.

   $$M_{mask}(i,j) = \begin{cases} 1, & \text{if } r(i,j) \ge p_{mask} \\ 0, & \text{if } r(i,j) < p_{mask} \end{cases} \qquad (2)$$

   The masked input matrix $\tilde{X}$ is obtained by element-wise
   multiplication of the original matrix $X$ and the masking matrix
   $M_{mask}$, as shown in Eq. (3), where $\odot$ denotes element-wise
   multiplication. $\tilde{X}$ simulates the dropouts in the sequencing.

   $$\tilde{X} = X \odot M_{mask} \qquad (3)$$
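Eqs. (2)–(3) can be reproduced directly in NumPy (toy matrix sizes; the default $p_{mask} = 0.3$ from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, p_mask = 4, 6, 0.3

X = rng.random((n_cells, n_genes)) + 0.1   # toy expression matrix (all > 0)
r = rng.random((n_cells, n_genes))         # r(i, j) ~ Uniform[0, 1]
M_mask = (r >= p_mask).astype(float)       # Eq. (2): 1 = keep, 0 = mask
X_tilde = X * M_mask                       # Eq. (3): element-wise product
```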
   CanCellCap utilizes an encoder-decoder architecture to reconstruct
   $X$ from the masked $\tilde{X}$. $\tilde{X}_i$ is mapped to a latent
   embedding $v_i$ by an encoder, which is composed of the domain
   adversarial module and the MoE module. The decoder, implemented as a
   two-layer MLP, reconstructs the original gene expression matrix $X$
   from $v_i$. The reconstruction loss is defined in Eq. (4), where
   $X_i$ denotes the original gene expression vector of the $i$-th cell
   and $decoder(v_i)$ is the vector reconstructed from the latent
   embedding $v_i$.

   $$L_{recon} = \frac{1}{N_c} \sum_{i=1}^{N_c} \| X_i - decoder(v_i) \|^2. \qquad (4)$$
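Eq. (4) is an average of per-cell squared errors; a minimal NumPy sketch with a hypothetical decoder output:

```python
import numpy as np

def recon_loss(X, X_hat):
    """Eq. (4): mean squared L2 norm of per-cell reconstruction errors."""
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

# Toy check with a hypothetical "decoder" output X_hat.
X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_hat = np.array([[1.0, 1.0], [3.0, 2.0]])
print(recon_loss(X, X_hat))  # → 2.5  (per-cell errors 1.0 and 4.0, averaged)
```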
Domain adversarial learning for extracting tissue-common features
   The domain adversarial learning module consists of a tissue
   discriminator and a tissue confuser, which are trained in an
   adversarial manner. The tissue confuser aims to learn common features
   across tissues, while the tissue discriminator attempts to predict
   tissue labels from these features. The adversarial training encourages
   the tissue confuser to generate tissue-common features that make it
   difficult for the discriminator to correctly identify tissue types.
   The tissue confuser, denoted as $\phi_f$, is a two-layer MLP. It
   processes the expression vector $\tilde{X}_i$ of cell $i$ and extracts
   the tissue-common feature vector $v_{com} = \phi_f(\tilde{X}_i,
   \theta_f)$, where $\theta_f$ represents the parameters of the tissue
   confuser. The tissue discriminator, denoted as $\phi_{tc}$, is a
   single-layer MLP that takes the tissue-common features $v_{com}$ as
   input and predicts the tissue labels.
   For adversarial training, a GRL is incorporated between the tissue
   confuser and the tissue discriminator. The GRL enables adversarial
   training by reversing the gradient during backpropagation. During the
   forward pass, the GRL simply passes the input unchanged, as shown in
   Eq. (5).

   $$GRL(x) = x. \qquad (5)$$

   During the backward pass, the GRL reverses the gradient direction,
   that is, it replaces $\frac{\partial L}{\partial \theta_f}$ with
   $-\frac{\partial L}{\partial \theta_f}$.
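In an autograd framework the GRL is usually implemented as a custom function (e.g., a `torch.autograd.Function` subclass in PyTorch); the sketch below mimics the two passes with plain NumPy functions. The scaling factor `lambda_` is a common generalization found in domain adversarial training, not specified in the text:

```python
import numpy as np

def grl_forward(x):
    """Eq. (5): the GRL is the identity in the forward pass."""
    return x

def grl_backward(grad_output, lambda_=1.0):
    """Backward pass: flip the sign of the incoming gradient.

    Replaces dL/d(theta_f) with -lambda_ * dL/d(theta_f), so the confuser
    is updated to *increase* the discriminator's loss.
    """
    return -lambda_ * grad_output

x = np.array([0.5, -1.2, 3.0])
g = np.array([0.1, 0.2, -0.3])
forward_out = grl_forward(x)      # unchanged activations
backward_out = grl_backward(g)    # reversed gradient
```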
   The domain adversarial loss function $L_{ad}$ is defined in Eq. (6),
   where $t_{i,d}$ is the true tissue label in one-hot encoding for the
   $i$-th cell and $d$-th tissue type.

   $$L_{ad} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log(\phi_{tc}(GRL(\phi_f(\tilde{X}_i, \theta_f)), \theta_{tc})). \qquad (6)$$

   Utilizing the GRL, the tissue confuser aims to minimize this loss,
   while the tissue discriminator attempts to maximize it, thus creating
   the adversarial setup. Therefore, the parameters of the tissue
   confuser $\hat{\theta}_f$ and the tissue discriminator
   $\hat{\theta}_{tc}$ are updated through the following optimization
   steps:

   $$\hat{\theta}_f = \arg\min_{\theta_f} L_{ad}(\theta_f, \hat{\theta}_{tc}), \qquad (7)$$

   $$\hat{\theta}_{tc} = \arg\max_{\theta_{tc}} L_{ad}(\hat{\theta}_f, \theta_{tc}). \qquad (8)$$

   This adversarial training setup enables the module to capture
   tissue-common features, encouraging CanCellCap to generalize across
   various tissue types.
Mixture of experts for extracting tissue-specific features
   The MoE module consists of $N_t$ expert networks and a gating
   network, where $N_t$ corresponds to the number of tissue types. Each
   expert network $E_d$ is responsible for learning specific features
   for tissue $d$. The gating network $g(\tilde{X}_i)$ is a single-layer
   MLP that takes $\tilde{X}_i$ as input and outputs a set of weights
   over the expert networks. Specifically, the gating network assigns a
   weight distribution $g(\tilde{X}_i) = (g_1(\tilde{X}_i),
   g_2(\tilde{X}_i), \cdots, g_{N_t}(\tilde{X}_i))$, where each weight
   corresponds to the importance of a particular expert network for the
   $i$-th cell. The tissue-specific features $v_{spe}$ are a weighted
   sum of the expert network outputs:

   $$v_{spe} = \sum_{d=1}^{N_t} g_d(\tilde{X}_i) E_d(\tilde{X}_i). \qquad (9)$$

   To ensure that the gating network assigns cells from different
   tissues to their corresponding expert networks, a gating loss
   $L_{gate}$ (Eq. (10)) is minimized, where $t_{i,d}$ is the true
   tissue label in one-hot encoding for the $i$-th cell and $d$-th
   tissue type.

   $$L_{gate} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log g_d(\tilde{X}_i). \qquad (10)$$

   During training, tissue labels guide the gating network to assign
   cells to the appropriate experts, enabling CanCellCap to learn
   tissue-specific gene expression patterns. During testing, tissue
   labels are not required, and CanCellCap finds the proper experts to
   identify cancer cells on its own.
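Eqs. (9)–(10) can be illustrated for a single cell with random, untrained weights (hypothetical sizes; the true tissue index used for the gating loss is likewise hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_experts, hidden = 6, 3, 4

# Hypothetical single-layer gating network and per-tissue expert networks
# (random weights; the real model learns these).
W_gate = rng.normal(size=(n_genes, n_experts))
W_experts = rng.normal(size=(n_experts, n_genes, hidden))

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

x = rng.normal(size=n_genes)   # one cell's (masked) expression vector
g = softmax(x @ W_gate)        # gating weights, one per expert, sum to 1
expert_out = np.stack([x @ W_experts[d] for d in range(n_experts)])

# Eq. (9): weighted sum of expert outputs gives the tissue-specific features.
v_spe = (g[:, None] * expert_out).sum(axis=0)

# Eq. (10) for this single cell, assuming the true tissue index is d = 1.
l_gate = -np.log(g[1])
```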
An MLP for cancer cell identification
   The tissue-common features $v_{com}$ and the tissue-specific features
   $v_{spe}$ are concatenated as $v_{cat} = [v_{com} | v_{spe}]$.
   $v_{cat}$ is input to a single-layer MLP classifier $\phi_{ci}$ to
   distinguish cancer and normal cells. The cancer cell identification
   loss $L_{ide}$, defined in Eq. (11), quantifies the cross-entropy
   between the predicted output $\phi_{ci}(v_{cat}, \theta_{ci})$ and
   the true label $y_i$ of the $i$-th cell.

   $$L_{ide} = -\frac{1}{N_c} \sum_{i=1}^{N_c} \left( y_i \log \phi_{ci}(v_{cat}, \theta_{ci}) + (1 - y_i) \log(1 - \phi_{ci}(v_{cat}, \theta_{ci})) \right). \qquad (11)$$
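Eq. (11) is standard binary cross-entropy; a minimal NumPy sketch (the `eps` clipping is a numerical-stability addition, not part of the text):

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-12):
    """Eq. (11): binary cross-entropy over N_c cells (eps avoids log(0))."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1 - y_true) * np.log(1 - y_pred)))

y = np.array([1.0, 0.0, 1.0])    # true labels: cancer = 1, normal = 0
p = np.array([0.9, 0.2, 0.8])    # hypothetical classifier outputs
loss = bce_loss(y, p)            # low loss for confident, correct predictions
```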
Loss function
   CanCellCap is optimized to minimize the overall loss in Eq. (12),
   which effectively distinguishes cancer cells while maintaining
   robustness against the effects of different sequencing platforms and
   variability across tissue types. $\alpha_1$, $\alpha_2$, and
   $\alpha_3$ are the weights of the loss components, with default
   values of 0.2, 0.3, and 0.1.

   $$\begin{aligned} L = {} & -\frac{\alpha_1}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log(\phi_{tc}(GRL(\phi_f(\tilde{X}_i, \theta_f)), \theta_{tc})) - \frac{\alpha_2}{N_c} \sum_{i=1}^{N_c} \sum_{d=1}^{N_t} t_{i,d} \log g_d(\tilde{X}_i) \\ & + \frac{\alpha_3}{N_c} \sum_{i=1}^{N_c} \| X_i - decoder(v_i) \|^2 - \frac{1}{N_c} \sum_{i=1}^{N_c} \left( y_i \log \phi_{ci}(v_{cat}, \theta_{ci}) + (1 - y_i) \log(1 - \phi_{ci}(v_{cat}, \theta_{ci})) \right) \end{aligned} \qquad (12)$$
Supplementary Information
   12915_2025_2337_MOESM1_ESM.xlsx (30.2 KB, xlsx)
   Additional file 1: Table 1. Composition and characteristics of human
   single-cell RNA-seq datasets for model testing. Table 2. Composition
   and characteristics of mouse single-cell RNA-seq datasets used for
   cross-species model testing. Table 3. Composition and characteristics
   of human spatial transcriptomics datasets used for model testing. Table
   4. Summary of human single-cell RNA-seq training datasets. Table 5.
   Detailed metadata of human single-cell RNA-seq training datasets
   12915_2025_2337_MOESM2_ESM.xlsx (51.7 KB, xlsx)
   Additional file 2: Table 1. Performance comparison of CanCellCap and
   five SOTA methods across 33 testing datasets. Table 2. Performance
   comparison of CanCellCap with five SOTA methods across platforms. Table
   3. Performance comparison of CanCellCap with five SOTA methods on
   datasets from unseen types. Table 4. Ablation study of the components
   of CanCellCap on the Testing Datasets. Table 5. Performance comparison
   of CanCellCap and CanCellCap without feature masking-reconstruction
   strategy across different dropout rates in simulated dropout datasets.
   Table 6. Computational efficiency analysis. Table 7. Confusion matrix
   of CanCellCap for cancer origin identification on the TISCH2 Testing
   Dataset. Table 8. Accuracy for different tissue validation datasets of
   CanCellCap
   12915_2025_2337_MOESM3_ESM.docx (5.6 MB, docx)
   Additional file 3: Fig. S1. Top 10 key genes identified by SHAP across
   13 tissue types. Fig. S2. Top 10 key genes identified by GradientSHAP
   across 13 tissue types. Fig. S3. Top 10 key genes identified by
   Integrated Gradients across 13 tissue types. Fig. S4. Top 10 key genes
   identified by DeepLIFT across 13 tissue types. Fig. S5. KEGG pathway
   enrichment analysis of the overlapping genes between the top 300 genes
   ranked by SHAP and those by GradientSHAP across 13 tissue types. Table
   1. Total memory consumption of model inference using CSV format. Table
   2. Architectural components and layer configurations of the CanCellCap
   model. Table 3. Accuracy metrics of CanCellCap and five
   state-of-the-art methods on 10 cancer single-cell datasets, with
   rejection rates indicated in parentheses for models that abstain from
   certain predictions
Acknowledgements