Abstract

Background: The advent of single-cell RNA sequencing (scRNA-seq) has provided unprecedented insights into cancer cellular diversity, enabling a comprehensive understanding of cancer at the single-cell level. However, identifying cancer cells remains challenging due to gene expression variability caused by tumor or tissue heterogeneity, which negatively impacts generalization and robustness.

Results: We propose CanCellCap, a multi-domain learning framework that identifies cancer cells in scRNA-seq data from any tissue, cancer type, or sequencing platform. By integrating domain adversarial learning and a Mixture of Experts, CanCellCap simultaneously extracts common and specific patterns in the gene expression profiles of cancer and normal cells across different tissues. Moreover, a masking-reconstruction strategy enables CanCellCap to cope with scRNA-seq data from different sequencing platforms. CanCellCap achieves 0.977 average accuracy in cancer cell identification across 13 tissue types, 23 cancer types, and 7 sequencing platforms, outperforming five state-of-the-art methods on 33 benchmark datasets. Notably, CanCellCap maintains high performance on unseen cancer types, tissue types, and even across species, highlighting its effectiveness in challenging scenarios. It also excels in spatial transcriptomics by accurately identifying cancer spots. Furthermore, CanCellCap demonstrates strong computational efficiency, completing inference on 100,000 cells in a few minutes. In addition, interpretability analyses reveal critical biomarkers and pathways, offering valuable biological insights.

Conclusions: CanCellCap provides a robust and accurate framework for identifying cancer cells across diverse platforms, tissue types, and cancer types. Its strong generalization to unseen cancers, tissues, and even species, combined with its adaptability to spatial transcriptomics data, underscores its versatility for both research and clinical applications.
Supplementary Information

The online version contains supplementary material available at 10.1186/s12915-025-02337-1.

Keywords: scRNA-seq, Cancer cell identification, Multi-domain learning, Domain adversarial learning, Mixture of experts

Background

Single-cell RNA sequencing (scRNA-seq) provides critical insights into cancer biology at single-cell resolution [1]. Accurate identification of cancer cells advances the understanding of tumor heterogeneity and evolution, thereby enabling personalized therapies and more effective treatment strategies [2]. Traditional approaches to separating cancer cells from normal cells often rely on detecting copy number variations (CNVs) from single-cell gene expression data by comparing target cells to reference cells [3–5]. For instance, InferCNV [4] requires a reference set of normal cells, while CopyKAT [3] requires the presence of both cancer and normal cells [6]. Additionally, CNVs are not exclusive to cancer cells, as normal cells can also exhibit copy number alterations [7, 8], which decreases the reliability of these methods. Recent advances use machine learning to capture the gene expression patterns of cancer cells, removing the dependency on reference cells and CNV inference. ikarus [9] and PreCanCell [10] distinguish cancer cells using cancer- and normal-specific genes extracted from multiple training datasets. However, their accuracy depends heavily on the selected genes, making them prone to overfitting and poor generalization to unseen data. CaSee [6] and Cancer-Finder [11] therefore attempt to improve model generalization. CaSee employs transfer learning with bulk RNA sequencing (RNA-seq) data to pre-train capsule network classifiers, but its effectiveness is limited by the distributional differences between bulk and single-cell RNA-seq data.
In contrast, Cancer-Finder learns a universal representation of cancer cells but ignores tissue-specific expression patterns, which can lead to false positives. Despite these advancements, identifying cancer cells remains a significant challenge due to the high dimensionality and inherent heterogeneity of scRNA-seq data [12]. Cancer cell transcriptional profiles are highly influenced by their tissues of origin, with distinct expression patterns observed across different tissues [13–15]. Moreover, sequencing platform effects, primarily dropout events (where genes with low expression are missed), introduce noise and variability. These confounding factors complicate cellular state identification. The single-cell gene expression profile of a cancer cell can be viewed as the coupling of tissue-common cancer expression patterns (tissue-common features), tissue-specific expression patterns (tissue-specific features), and sequencing platform effects. Disentangling these three factors enables a model to generalize across tissues and sequencing platforms, enhancing the robustness of cancer cell identification. The model therefore needs to simultaneously learn tissue-common and tissue-specific features while eliminating sequencing platform effects. Because each tissue exhibits unique gene expression patterns, tissues can be treated as distinct domains. Accordingly, we designed a multi-domain learning framework that facilitates domain disentanglement, allowing the model to extract both domain-common and domain-specific features, that is, tissue-common and tissue-specific expression patterns in our case. In this study, we propose a novel multi-domain learning model, CanCellCap, to separate cancer cells from normal cells, free from the effects of tissue type, cancer type, and platform-dependent dropout rates.
Considering the gene expression profile from scRNA-seq as the coupling of tissue-common expression patterns, tissue-specific expression patterns, and sequencing platform effects, CanCellCap adopts a feature masking-reconstruction training strategy and integrates domain adversarial learning and a Mixture of Experts (MoE) into the model structure to improve generalization and accuracy in cancer cell identification. The domain adversarial learning module captures tissue-common features, while the MoE module dynamically selects the most relevant experts for each tissue to capture tissue-specific features. Additionally, dropout events, a major platform effect, are simulated by randomly masking expression values, which the model learns to reconstruct, thereby mitigating the impact of sequencing platforms. In comprehensive experiments, CanCellCap outperforms five state-of-the-art methods across 33 datasets, accurately identifying cancer cells across diverse cancer types, tissue types, and sequencing platforms, including unseen cancer types, tissue types, sequencing platforms, and even other species. It also exhibits excellent computational efficiency, completing inference on large-scale datasets of up to 100,000 cells in significantly less time than other models. Additionally, CanCellCap generalizes well to spatial transcriptomics data and offers biologically interpretable predictions by highlighting key genes and pathways with potential relevance to cancer diagnosis and therapy.

Results

Dataset descriptions and experimental design

We collected 74 human cancer single-cell datasets from the Tumor Immune Single-cell Hub (TISCH) [16], covering 14 different tissues, 23 cancer types, and 7 sequencing platforms. After preprocessing, a total of 328,230 cells from 13 tissue types were obtained. For each tissue, the samples were randomly split into a Training Dataset (80%) and a Validation Dataset (20%) for model training and performance validation, respectively.
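Domain adversarial learning of the kind described above is commonly implemented with a gradient reversal layer, which passes features through unchanged in the forward pass but flips the sign of the gradient flowing back from the domain (here, tissue) classifier. The following is a minimal NumPy sketch of that general idea, not CanCellCap’s actual implementation; the class name and the scaling factor `lam` are illustrative.

```python
import numpy as np

class GradReverse:
    """Gradient reversal: identity in the forward pass, sign-flipped (scaled)
    gradient in the backward pass.

    Placed between the feature extractor and the domain classifier, it makes
    the extractor *maximize* the domain loss, encouraging domain-common
    (tissue-common) features.
    """

    def __init__(self, lam: float = 1.0):
        self.lam = lam  # trade-off between the task loss and domain confusion

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features reach the domain classifier unchanged

    def backward(self, grad_out: np.ndarray) -> np.ndarray:
        return -self.lam * grad_out  # reversed gradient reaches the extractor
```

In an autograd framework this is typically written as a custom function whose backward pass multiplies the incoming gradient by −lam, so the domain classifier trains normally while the shared encoder is pushed toward tissue-invariant representations.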
The preprocessing steps follow the approach outlined in Cancer-Finder [11]. To evaluate CanCellCap comprehensively, experiments were designed to cover multiple performance aspects. First, CanCellCap was benchmarked against five state-of-the-art (SOTA) methods on 33 testing datasets spanning 15 tissue types to assess its overall performance. Second, its generalization ability was validated on four unseen cancer types, two unseen tissue types, and one unseen species, and its robustness was evaluated across six sequencing platforms. Third, we explored CanCellCap’s potential applications in the following scenarios: its use of tissue-specific features was tested through cancer origin identification, and its applicability to spatial transcriptomics was demonstrated by identifying cancer spots. Finally, we conducted model analysis, including cancer/tissue-specific evaluation, ablation studies, interpretability analyses, and computational complexity evaluation. The datasets used in the experiments are summarized below, with detailed sources provided in an additional file (see Additional file 1: Tables 1–5).

1. Training Dataset and Validation Dataset are sourced from TISCH1 and include 328,230 cells across 13 tissue types: Blood, Bone, Brain, Breast, Colorectum, Eye, Head and Neck, Liver, Lung, Nervous, Pancreas, Pelvic Cavity, and Skin. They do not overlap with any of the subsequent datasets.

2. Testing Dataset comprises 33 testing datasets from human samples, spanning 15 tissue types, 25 cancer types, and 6 sequencing platforms, totaling 834,788 cells, which were employed for external independent testing.

3. Unseen Cancer/Tissue Type and Species Dataset consists of 113,393 cells from four unseen cancer types, 89,922 cells from two unseen human tissue types, and 63,734 cells from an unseen species (mouse). “Unseen” means a tissue type, cancer type, or species that is absent from the Training Dataset.

4. Multiple-Sequencing Platforms Dataset includes 5600 cells from Smart-seq2, 146,387 cells from 10X Genomics, 10,502 cells from Microwell-seq, 3134 cells from C1 Fluidigm, 21,011 cells from Drop-seq, and 86,486 cells from GEXSCOPE™. This dataset was used to validate CanCellCap’s generalization across different sequencing platforms, especially considering that the C1 Fluidigm, Drop-seq, and GEXSCOPE™ platforms are not present in the training set.

5. TISCH2 Testing Dataset was constructed from TISCH2 [17] and was employed to further evaluate CanCellCap for identifying cancer cell origins. This dataset consists of 122,269 cells from seven tissues, including blood [18], brain [19], bone [20], lung [21], pancreas [22], and eye [23].

6. Simulated Dropout Rate Dataset was generated from the Testing Dataset by randomly masking data at different dropout rates (0–98%). It was used to assess CanCellCap’s robustness to the varying dropout rates of different sequencing platforms.

7. Spatial Transcriptomics Dataset was obtained from human prostate cancer samples [24]. It was used to evaluate CanCellCap’s applicability to spatial transcriptomics analysis.

These datasets and annotations were sourced from the testing sets of Cancer-Finder [11], TISCH2 [17], the Gene Expression Omnibus [25], and the Single Cell Portal [26]. Labels distinguishing cancer and normal cells were curated based on the provided annotations to ensure consistency. The performance of CanCellCap was compared with five SOTA methods: CopyKAT [3] and SCEVAN [5], which rely on CNV detection; ikarus [9] and PreCanCell [10], which leverage specific gene expression patterns; and Cancer-Finder [11], which focuses on universal representation learning. The evaluation metrics include accuracy, F1 score (F1), recall, precision, and area under the receiver operating characteristic curve (AUROC).
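These five metrics can all be computed with standard scikit-learn functions; the toy labels and scores below are made up purely for illustration, treating 1 as cancer and 0 as normal.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy example (illustrative only): 1 = cancer cell, 0 = normal cell
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1]             # hard labels for accuracy/recall/precision/F1
y_score = [0.9, 0.4, 0.2, 0.3, 0.8]  # predicted cancer probabilities for AUROC

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "auroc": roc_auc_score(y_true, y_score),
}
```

Note that AUROC requires probability scores rather than hard labels, which is why it cannot be reported for methods that output only binary calls.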
Note that since SCEVAN and CopyKAT do not provide probability scores for cancer cell classification, AUROC is not available for these methods. All evaluation metrics were calculated using standard functions from the scikit-learn library. Additionally, to ensure a statistically rigorous comparison, we employed the Bonferroni-Dunn test [27] and visualized the results using Critical Difference (CD) diagrams [28] at a 95% confidence level. The CD diagrams allow statistical verification of whether the observed performance differences between methods are significant, following the implementation outlined in [29]. Detailed performance scores for all datasets are provided in an additional file (see Additional file 2: Tables 1–8).

Model training process

The CanCellCap model was trained with a fixed random seed (0) for 50 epochs, using a batch size of 128 and the stochastic gradient descent optimizer (learning rate = 1.3 × 10⁻³, momentum = 0.9, weight decay = 1.5 × 10⁻⁴). The training losses and the corresponding validation accuracy throughout the training epochs are shown in Fig. 1A and B. The five losses (Total Loss, Identification Loss, Gate Loss, Domain Adversarial Loss, and Reconstruction Loss) converged after 40 epochs. On the Validation Dataset, which includes 13 tissue types, CanCellCap achieves an average accuracy of 0.9777, and 12 of the 13 tissue types exceeded 0.9500 accuracy in cancer cell identification, demonstrating its reliability across various biological contexts. The relatively lower performance on bone tissue, with an accuracy of 0.8630, may be due to the high similarity between multiple myeloma cells and normal plasma cells, which makes them harder to distinguish [30].

Fig. 1. Performance evaluation of CanCellCap during the training stage. A Loss curves of the five losses during the training stage.
B Accuracy for different tissues within the Validation Dataset during the training stage. C t-SNE visualization of the input expression features X from the Validation Dataset, colored by cell status. D t-SNE visualization of the final cell embeddings v_cat from the Validation Dataset, colored by cell status.

The t-distributed stochastic neighbor embedding (t-SNE) visualizations of both the input expression features X and the final cell embeddings v_cat extracted by CanCellCap from the Validation Dataset are shown in Fig. 1C and D, respectively, where the dots are colored by cancer or normal cell status. In the t-SNE computed directly from gene expression, cancer and normal cells are mixed. In contrast, in the t-SNE on v_cat, cancer and normal cells exhibit a clearer separation.

Experiments for performance from various projects

CanCellCap was compared with five SOTA methods on the Testing Dataset, comprising 855,713 human cells from 33 datasets. As illustrated in the boxplots (Fig. 2A), CanCellCap consistently outperformed the other methods across most evaluation metrics, including accuracy, F1, precision, and AUROC. Notably, the interquartile ranges in the boxplots are smaller for CanCellCap across most metrics, indicating more stable performance across datasets. This observation is further supported by the CD diagrams shown in Fig. 2B. In the CD diagrams, methods connected by a line are considered statistically similar, while methods not connected show significant performance differences. The diagrams demonstrate that CanCellCap ranks significantly higher than the other methods in accuracy, F1, and precision. Recall is the only metric on which CanCellCap performs comparably to Cancer-Finder, with both achieving the highest performance. This is likely due to the MoE module, which enables tissue-specific discrimination.
In tissues with limited training data, experts tend to adopt conservative decision boundaries, reducing recall in ambiguous cases while improving generalization.

Fig. 2. Performance comparison of CanCellCap and five SOTA methods across 33 testing datasets. A Boxplots illustrate the distribution of five evaluation metrics across all datasets. B CD diagrams display the average rankings of all methods based on the five metrics. Note that since SCEVAN and CopyKAT do not provide probability scores for cancer cell classification, AUROC is not applicable to these methods and is thus marked as 'NA'.

In addition to the overall performance evaluation, we further assessed the robustness of CanCellCap across the different tissue types within the Testing Dataset. Specifically, we compared CanCellCap with five SOTA methods across 15 human tissue types. As shown in Fig. 3, CanCellCap demonstrates superior performance across the majority of tissues. In contrast, methods such as PreCanCell, SCEVAN, CopyKAT, and ikarus perform well in certain tissues but exhibit substantial performance degradation in others, indicating limited robustness. Although Cancer-Finder shows greater cross-tissue stability than the aforementioned methods, its overall performance remains lower than that of CanCellCap.

Fig. 3. Performance comparison of CanCellCap with five SOTA methods across tissues. Five performance metrics comparing CanCellCap with five SOTA methods in cancer cell identification across 15 human tissue types from the Testing Dataset. Recall, F1, precision, and AUROC are marked as 'NA' for Head and Neck, which contains only normal cells.
Note that since SCEVAN and CopyKAT do not provide probability scores for cancer cell classification, AUROC is not applicable to these methods and is thus marked as 'NA'.

Overall, CanCellCap outperforms all other methods in both overall performance and robustness across diverse tissue types in identifying cancer cells.

Experiments for generalization across different sequencing platforms

To assess robustness and generalization across sequencing platforms, CanCellCap was applied to scRNA-seq datasets from six different platforms. As shown in Fig. 4, CanCellCap outperforms all five SOTA methods across the six sequencing platforms, achieving the highest average accuracy of 0.9229. Even on unseen sequencing platforms such as C1 Fluidigm, Drop-seq, and GEXSCOPE™, CanCellCap maintains high accuracy, reaching 0.9560, 0.9121, and 0.9393, respectively. In contrast, Cancer-Finder shows less stable performance, particularly struggling on platforms like 10X Genomics, where it achieves a relatively low accuracy of 0.8539. Overall, CanCellCap exhibits superior adaptability on the datasets from the six sequencing platforms.

Fig. 4. Performance comparison of CanCellCap with five SOTA methods across platforms. Five performance metrics comparing CanCellCap with five SOTA methods in cancer cell identification across six different sequencing platforms. Note that since SCEVAN and CopyKAT do not provide probability scores for cancer cell classification, AUROC is not applicable to these methods and is thus marked as 'NA'.

Experiment for robustness on unseen cancer/tissue types and species

CanCellCap was applied to datasets from four unseen cancer types (Cervical Cancer (CC), Hepatoblastoma (HB), Small Cell Lung Cancer (SCLC), and Cutaneous Melanoma (CMM)), two unseen tissue types (soft tissue and prostate), and one unseen species (mouse), none of which are included in the Training Dataset. As shown in Fig.
5A, CanCellCap outperforms all SOTA methods across all metrics on the unseen cancer types, achieving the highest accuracy of 0.9297, recall of 0.9682, F1 score of 0.9287, precision of 0.9291, and AUROC of 0.9722. In contrast, the second-best model, Cancer-Finder, achieves an accuracy of 0.8705 with an AUROC of 0.8676. Figure 5B presents the performance on the unseen tissue types, soft tissue and prostate. On these tissues, CanCellCap maintains the highest accuracy of 0.7877, F1 score of 0.7847, precision of 0.8333, and AUROC of 0.8172. In comparison, Cancer-Finder and ikarus struggle with lower AUROC, while PreCanCell shows competitive but overall lower metrics.

Fig. 5. Performance comparison of CanCellCap with five SOTA methods on datasets from unseen types. A Five metrics on datasets of unseen cancer types. B Five metrics on datasets of unseen tissue types. C Five metrics on datasets from mouse. Note that since SCEVAN and CopyKAT do not provide probability scores for cancer cell classification, AUROC is not applicable to these methods and is thus marked as 'NA'.

To further evaluate generalization across species, the model was applied to three mouse single-cell datasets. For compatibility with mouse data, CanCellCap was adapted by mapping gene homologs between human and mouse using the HomoloGene database [31]. Since ikarus and PreCanCell lack pipelines for mouse datasets, and given the similarity in framework between Cancer-Finder and CanCellCap, a compatible pipeline for Cancer-Finder was developed in this study to facilitate direct comparison. Accordingly, evaluations focused on Cancer-Finder, SCEVAN, and CopyKAT. As shown in Fig. 5C, CanCellCap achieved the best overall performance with an accuracy of 0.9112, surpassing the second-best method, Cancer-Finder, which attained 0.8278.
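The cross-species adaptation essentially renames mouse genes to their human homologs and re-indexes the expression matrix onto the model’s human gene space. A minimal pandas sketch of that idea, where the homolog table and function name are made-up stand-ins for a real HomoloGene-derived lookup:

```python
import pandas as pd

# Illustrative homolog pairs only; a real table would be built from HomoloGene.
MOUSE_TO_HUMAN = {"Trp53": "TP53", "Epcam": "EPCAM", "Ptprc": "PTPRC"}

def to_human_gene_space(expr: pd.DataFrame, human_genes: list) -> pd.DataFrame:
    """Rename mouse gene columns to human homologs, drop unmapped genes,
    and re-index onto the model's human gene list (missing genes become 0)."""
    mapped = expr.rename(columns=MOUSE_TO_HUMAN)
    mapped = mapped.loc[:, mapped.columns.isin(MOUSE_TO_HUMAN.values())]
    return mapped.reindex(columns=human_genes, fill_value=0)
```

Genes without a one-to-one homolog are simply dropped, and genes expected by the model but absent from the mouse data are zero-filled, mirroring how unmeasured genes are usually handled at inference time.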
These results underscore CanCellCap’s exceptional adaptability and its ability to reliably identify cancer cells across unseen cancer types, tissue types, and species.

Experiment to identify cancer cell origin

To validate the tissue-specific features captured by CanCellCap, this experiment aims to identify the origin of cancer cells on the TISCH2 Testing Dataset comprising 122,269 cells from six tissue origins. The ability to classify cancer cell origins relies heavily on recognizing tissue-specific gene expression patterns, which are unique to each tissue type. This task involved classifying cells as normal or cancerous and, for cancer cells, identifying their origin. As shown in Fig. 6A, CanCellCap demonstrates strong overall performance in identifying the origin of cancer cells, achieving an average accuracy of 0.9521. Since the training set contains 14 categories while the test set includes only 7, some cells were misclassified into categories not present in the test set. The strong performance thus highlights CanCellCap’s ability to effectively capture and utilize tissue-specific gene expression patterns.

Fig. 6. Application of CanCellCap to cancer origin identification and cancer spot identification. A Confusion matrix of CanCellCap for cancer origin identification on the TISCH2 Testing Dataset. B Comparison of cancer spot identification in spatial transcriptomics, where each spot represents gene expression profiles mapped to spatial coordinates within the tissue: (i) pathologist annotations, (ii) CanCellCap predictions, and (iii) confusion matrix for CanCellCap’s cancer spot identification.

Overall, these results demonstrate that CanCellCap excels not only at distinguishing cancer cells from normal cells but also at leveraging the extracted tissue-specific features to accurately identify the cancer cell origin.
Spot-level cancer identification in spatial transcriptomics

Spatial transcriptomics provides critical spatial context, revealing how cancer cells interact with their microenvironment, key information for advancing cancer research. To assess CanCellCap’s applicability to spatial transcriptomics data from cancer tissue sections, CanCellCap, trained only on scRNA-seq data, was directly applied to cancer spot identification on a spatial transcriptomics dataset obtained from human prostate cancer samples. CanCellCap’s predictions were compared with pathologist annotations. Figure 6B (i) shows the pathologist’s annotations, indicating cancer and normal regions, while Fig. 6B (ii) presents CanCellCap’s predictions. The spots predicted by CanCellCap closely resemble those annotated by the pathologist. Figure 6B (iii) presents the confusion matrix, which further highlights CanCellCap’s performance in identifying cancer and normal spots. The matrix reveals a high true positive rate for cancer spots, demonstrating CanCellCap’s strong ability to detect cancer cells. With an accuracy of 0.7989 and a recall of 0.9058, CanCellCap performs well in distinguishing cancer spots from normal spots. Overall, CanCellCap’s ability to identify cancer cell distribution at spot-level granularity underscores its value in spatial transcriptomics, offering nuanced insights into tumor architecture that complement pathologist assessments. Notably, CanCellCap achieves these results despite never being trained on spatial transcriptomics data.

Model analysis

Evaluation of CanCellCap across diverse cancer and tissue types

To further evaluate the capability of CanCellCap, its cancer cell identification performance was assessed across various cancer types and tissue types, based on the 33 test datasets. This analysis included both rare cancer types and under-studied tissue types, highlighting its robustness in challenging biological contexts.
As shown in Fig. 7, CanCellCap achieved over 0.9 accuracy on most cancer types. Notably, even for rare cancers such as adamantinomatous craniopharyngioma (ACP), synovial sarcoma (SS), pleuropulmonary blastoma (PPB), kidney chromophobe (KICH), SCLC, and gastroenteropancreatic neuroendocrine tumors (GEP-NETs), performance remained high. Figure 7 also demonstrates similarly strong performance across tissue types, with most exceeding 0.9 accuracy, including under-studied tissues like Eye, Kidney, and Soft Tissue. Despite the overall strong performance, certain cancer types, such as prostate adenocarcinoma (PRAD) in prostate tissue, exhibited relatively lower accuracies, falling below 0.8. This underperformance is likely attributable to the absence of corresponding samples in the training dataset. Nonetheless, as discussed in the “Experiment for robustness on unseen cancer/tissue types and species” section, CanCellCap still outperforms other models under these conditions.

Fig. 7. Performance of CanCellCap across cancer types and tissue types. Abbreviations used in the figure are listed in the Abbreviations section.

Importantly, the ACP dataset was derived from clinical cancer samples obtained through hospital collaboration. The scRNA-seq data were generated following standard protocols, and cancer cells were annotated according to CTNNB1 mutation status. On this dataset, CanCellCap achieved an accuracy of 0.9402, demonstrating its potential for clinical diagnostic use. These findings collectively emphasize the generalizability of CanCellCap across diverse biological contexts, including rare scenarios.

Ablation study

To further evaluate the contributions of the various components of CanCellCap, an ablation study was conducted, with key modules systematically removed and their impact on model performance assessed on the Testing Dataset. As shown in Fig.
8A, the results indicate that, compared to using both the adversarial module and the MoE module, using only the domain adversarial module achieves relatively high recall. However, it suffers from a significant increase in false positives due to the lack of specificity, and performance on the other metrics declines notably. This highlights the necessity of integrating both common and specific gene expression patterns across different tissues to improve generalization. The complete CanCellCap also performs slightly better than either ablated variant.

Fig. 8. Ablation study of the components of CanCellCap. A Ablation study of the components of CanCellCap on the Testing Dataset. B–F Performance comparison of CanCellCap and CanCellCap without the feature masking-reconstruction strategy across different dropout rates in simulated dropout datasets, evaluated using five performance metrics.

Overall, the complete CanCellCap achieves the highest scores across all metrics, confirming that each component plays a meaningful role in enhancing the overall performance of CanCellCap. To further investigate the impact of dropout rates and evaluate the effectiveness of the feature masking-reconstruction strategy, CanCellCap was compared with a variant lacking this strategy across different dropout rates in simulated dropout datasets. These datasets were generated by randomly masking expression values with varying dropout probabilities (0–98%), based on the Testing Dataset. As shown in Fig. 8B–F, CanCellCap consistently outperforms the variant without feature masking-reconstruction at all dropout rates. For instance, at a low dropout rate (5%), CanCellCap achieves an accuracy of 0.9269, compared to 0.9146 for the variant without this strategy.
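The simulated dropout procedure can be sketched as an independent Bernoulli mask over the expression matrix. This is a minimal illustration of the idea, with function and variable names chosen here rather than taken from the CanCellCap code:

```python
import numpy as np

def simulate_dropout(expr: np.ndarray, rate: float, seed: int = 0) -> np.ndarray:
    """Zero out each expression value independently with probability `rate`,
    mimicking platform-dependent dropout events."""
    rng = np.random.default_rng(seed)
    keep = rng.random(expr.shape) >= rate  # True where the value survives
    return expr * keep

# Example: mask roughly 50% of a toy expression matrix
X = np.ones((1000, 20))
X_drop = simulate_dropout(X, rate=0.5)
```

Sweeping `rate` from 0 to 0.98 over a test set reproduces the kind of simulated dropout grid described above.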
As the dropout rate increases, the performance gap widens, with CanCellCap maintaining superior results in accuracy, recall, F1 score, and precision. Notably, even when the dropout probability exceeds 50%, CanCellCap’s accuracy remains above 0.9. The observed increase in recall at higher dropout rates may be due to CanCellCap’s tendency to classify uncertain cases as cancer cells, indicating a cautious bias toward flagging potential cancer cells in sparse data.

Gating weight vector analysis

To illustrate how CanCellCap assigns cells from various tissues to the most suitable combination of experts, the gating weight vector g(X) was analyzed. Two visualizations were employed to examine the distribution of gating weight vectors for each cell in the validation set.

Heatmap of gating weight vectors. A heatmap was created to visualize the gating weight vectors assigned to each cell. As shown in Fig. 9A, the gating weight vectors are similar within the same tissue type while differing across tissues. This confirms that the gate network dynamically adapts to the specific expression patterns of each tissue, ensuring that cells are directed to the most appropriate experts.

Fig. 9. Gating weight vector analysis. A Heatmap of gating weight vectors across tissue types. B t-SNE visualization of gating weight vectors across tissue types, with different colors representing different tissue types.

t-SNE visualization of gating weight vectors. t-SNE was used to visualize the gating weight vectors assigned to each cell. As shown in Fig. 9B, cells from the same tissue type tend to cluster together, indicating that CanCellCap consistently assigns similar weights to cells of the same tissue. This clustering behavior illustrates that the gate network effectively allocates cells to the most relevant tissue experts.
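A gate of this kind is typically a softmax over per-expert scores, with the final embedding formed as the gate-weighted sum of the expert outputs. Below is a small NumPy sketch of the general MoE mechanism; the shapes, linear experts, and gate parameterization are illustrative assumptions, not CanCellCap’s exact architecture:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, gate_w, expert_ws):
    """x: (batch, genes); gate_w: (genes, n_experts); expert_ws: list of (genes, dim).
    Returns the gate-weighted mixture of expert outputs and the gating weights g(x)."""
    g = softmax(x @ gate_w)                              # (batch, n_experts), rows sum to 1
    outs = np.stack([x @ w for w in expert_ws], axis=1)  # (batch, n_experts, dim)
    mixed = (g[:, :, None] * outs).sum(axis=1)           # (batch, dim)
    return mixed, g

rng = np.random.default_rng(0)
x = rng.random((4, 8))                       # 4 cells, 8 genes
gate_w = rng.random((8, 3))                  # 3 experts
expert_ws = [rng.random((8, 5)) for _ in range(3)]
mixed, g = moe_forward(x, gate_w, expert_ws)
```

The rows of `g` are exactly the per-cell gating weight vectors visualized in the heatmap and t-SNE analyses: cells with similar expression receive similar expert mixtures.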
Together, these visualizations provide compelling evidence that the gate network of the MoE module allocates cells to the most relevant experts, thereby optimizing performance across diverse tissue types.

Key genes and pathways for cancer cell identification

To better understand CanCellCap’s decision-making process and its biological relevance, an interpretability experiment was conducted using SHapley Additive exPlanations (SHAP) and GradientSHAP [32]. Based on these methods, the top 10 key genes for cancer cell identification were identified, and Kyoto Encyclopedia of Genes and Genomes (KEGG) [33] pathway enrichment analysis was subsequently performed on the overlap between the top 300 genes ranked by SHAP and those ranked by GradientSHAP to further explore their biological relevance. As shown in Fig. 10A and B, SHAP and GradientSHAP were employed to rank the key genes affecting CanCellCap’s predictions, with higher values indicating greater importance. Several of these are well-established biomarkers, including IFI27 [34], CDKN2A [35], CXCR4 [36], S100B [37], and PTPRC [38], which are widely recognized in cancer research. Other potentially relevant genes were also identified, such as SPINT2 [39], SH3BGRL3 [40], RGS1 [41], and IGFBP7 [42], suggesting potential avenues for cancer biomarker discovery.

Fig. 10. Identification of key genes and pathway enrichment across tissues. A Top 10 important genes ranked by SHAP in pelvic cavity and skin tissues. B Top 10 important genes ranked by GradientSHAP in the same tissues. C KEGG pathway enrichment based on the overlapping genes between the top 300 ranked by SHAP and the top 300 ranked by GradientSHAP.

As shown in Fig.
10C, KEGG pathway enrichment analysis was conducted on the overlapping genes between the top 300 ranked by SHAP and the top 300 ranked by GradientSHAP, further validating the biological relevance of CanCellCap's predictions. Because the key genes overlap between both methods, the identified pathways are consistently represented across different interpretability techniques, providing biologically meaningful and consistent insights. In both pelvic cavity and skin tissues, the enriched pathways were strongly associated with cancer-related processes. In pelvic cavity tissue, key enriched pathways included "Pathways in cancer" and "Regulation of actin cytoskeleton," both of which play crucial roles in cancer development, metastasis, and oncogenic signaling [43, 44]. Similarly, in skin tissue, enrichment of "Pathways in cancer" and the "MAPK signaling pathway" was observed, indicating the involvement of oncogenic mechanisms relevant to skin cancer [45, 46]. This interpretability analysis highlights that CanCellCap not only accurately identifies cancer cells but also uncovers critical genetic markers and biologically meaningful pathways that may inform future cancer research and therapeutic development. The released CanCellCap code also supports additional interpretability methods, including DeepLIFT [47] and Integrated Gradients [48]. Interpretability results for the other tissue types are provided in an additional file (see Additional file 3: Fig. S1-S5).

Computational efficiency

To evaluate the computational efficiency of each method, a series of runtime experiments was conducted on a workstation equipped with an NVIDIA 2080Ti GPU (12 GB memory) and a 24-core 2.2 GHz CPU, using datasets of varying sizes. First, the runtime of CanCellCap was analyzed by comparing the total runtime (including data loading and inference) to the inference time.
The experiments used scRNA-seq datasets stored in CSV format and loaded via pandas. As shown in Fig. 11A, the majority of the time is spent on data loading rather than on model inference. To alleviate this bottleneck, two solutions were explored: (i) storing data in the binary Parquet format, which significantly reduces loading time; and (ii) leveraging Modin [49], a parallelized alternative to pandas that accelerates CSV reading, especially for large datasets. Both solutions have been implemented in our public pipeline.

Fig. 11. Computational efficiency analysis. A Evaluation of computational efficiency and total time (including both data loading and model inference) across different data formats (Parquet, CSV) and data loading methods (pandas, Modin) on datasets of varying sizes. B Comparison of total time on datasets of varying sizes using the CSV format. The lighter-shaded portion of each bar represents the model inference time. "NA" denotes times that exceed one day.

As demonstrated in Fig. 11A, CanCellCap (Parquet), which uses the binary Parquet format, consistently reduced total runtime across datasets of varying sizes, making it the preferred option for both storage and inference. The benefit became increasingly pronounced with larger datasets, reaching up to a 100× speedup over CanCellCap (CSV with pandas) on datasets containing 100,000 cells. For pre-existing large-scale CSV files (e.g., 100,000 cells), enabling Modin provided additional improvements, with CanCellCap (CSV with Modin) achieving up to a 10× speedup over the pandas-based implementation. For smaller datasets, however, Modin's initialization overhead may outweigh its benefits. Additionally, Fig. 11B compares the total runtime of models using CSV-formatted input.
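The storage trade-off described above can be sketched as follows; the matrix size, file names, and the engine fallback are illustrative assumptions rather than the published benchmark setup:

```python
# Sketch: round-tripping an expression matrix through CSV and, when a
# Parquet engine (pyarrow/fastparquet) is available, through Parquet,
# whose binary columnar layout is what makes loading much faster at scale.
import os
import tempfile
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((1000, 50)),
                  columns=[f"gene_{j}" for j in range(50)])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "expr.csv")
    df.to_csv(csv_path, index=False)
    csv_back = pd.read_csv(csv_path)       # text parsing dominates load time

    try:
        pq_path = os.path.join(tmp, "expr.parquet")
        df.to_parquet(pq_path)              # binary, column-oriented format
        pq_back = pd.read_parquet(pq_path)
    except ImportError:                     # no Parquet engine installed
        pq_back = None

print(csv_back.shape)  # (1000, 50)
```

For large pre-existing CSV files, Modin can be dropped in by replacing `import pandas as pd` with `import modin.pandas as pd`, at the cost of its initialization overhead on small inputs.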
CanCellCap demonstrated superior computational efficiency on datasets exceeding 10,000 cells, with Modin further reducing runtime. For the largest dataset tested (100,000 cells), CanCellCap completed inference in just 396.46 s, substantially faster than the second-best method, Cancer-Finder, which required 3850.88 s. These results highlight CanCellCap's computational efficiency, allowing it to outperform other models on large-scale inference tasks.

Discussion

In this study, we introduce CanCellCap, a multi-domain learning framework for identifying cancer cells from scRNA-seq data across diverse tissue types. CanCellCap integrates domain adversarial learning with a Mixture of Experts (MoE) module to capture both tissue-common and tissue-specific features, a dual capability that enhances its generalization across tissue types. To mitigate the varying dropout rates of different sequencing platforms, CanCellCap employs a feature masking-reconstruction strategy, enabling robust performance across platforms. Extensive experiments demonstrate that CanCellCap outperforms five state-of-the-art methods across 33 testing datasets, accurately identifying cancer cells across both seen and unseen cancer types, tissue types, and sequencing platforms. CanCellCap also successfully identifies cancer spots in spatial transcriptomics data and cancer cells in mouse datasets, despite being trained exclusively on human scRNA-seq data. Ablation studies highlight the essential roles of its core components, while analysis of gating weight vectors further supports CanCellCap's ability to assign input data to the most relevant expert models. Moreover, CanCellCap accurately identified cancer cells in clinically derived samples, demonstrating its potential for clinical translation.
Interpretability analysis revealed key genes and pathways, including both established cancer biomarkers and potential novel therapeutic targets, offering valuable insights for cancer biology and biomarker discovery. Despite its overall strong performance, CanCellCap shows slightly lower recall, which may stem from the MoE module's reliance on tissue-specific experts: in tissues with limited training data, these experts may adopt conservative decision boundaries. To address this, we plan to enrich underrepresented tissue types in the training set and explore adaptive training strategies that better balance recall and generalization. Additionally, the current training dataset is based solely on human data, limiting CanCellCap's generalization across species. While early results show promise in identifying mouse cancer cells, its applicability to non-human species remains constrained. Expanding the training dataset to incorporate data from multiple species will therefore be a key priority in future work, enhancing CanCellCap's ability to generalize across diverse biological contexts and increasing its utility in a broader range of applications. Another limitation lies in CanCellCap's memory consumption. Although its usage is moderate compared to other methods, there is still room for optimization; a lightweight version of the model would reduce both memory usage and runtime, enabling wider adoption in research and clinical settings. Finally, although the interpretability analysis identified both well-established cancer biomarkers and potential novel candidates for diagnosis or therapy, only a subset of these genes has been experimentally validated. To enhance the clinical relevance of these findings, we have initiated collaborations with hospitals to validate candidate biomarkers using real-world clinical datasets.
Conclusions

CanCellCap is introduced as a robust and generalizable framework for cancer cell identification from single-cell RNA sequencing (scRNA-seq) data. Leveraging a multi-domain design and a feature masking-reconstruction strategy, CanCellCap achieves high accuracy across a diverse array of tissues, cancer types, and sequencing technologies. Importantly, CanCellCap demonstrates strong generalization to entirely unseen cancer types, tissue types, and even species, showcasing its adaptability in real-world and cross-species scenarios. In addition, its biologically interpretable outputs enable the identification of potential cancer biomarkers, supporting both translational research and fundamental insights into cancer biology. Future efforts will be directed toward extending its applicability to broader biological and clinical contexts, including cross-species generalization and enhanced computational efficiency for clinical deployment. Collaborations with clinical researchers are ongoing to explore CanCellCap's utility in early cancer detection and personalized therapeutic strategy development, ultimately contributing to the advancement of cancer research and precision medicine.

Methods

Model framework

With a gene expression matrix as input, whose rows represent gene expression levels and whose columns represent individual cells from diverse human tissues, CanCellCap is composed of the following four modules, as shown in Fig. 12.

1. A feature masking-reconstruction strategy: the expression matrix is randomly masked to simulate dropout during sequencing and reconstructed by a decoder. The reconstruction loss $L_{recon}$ ensures accurate recovery of dropped-out gene expressions.

2. A domain adversarial learning module: a gradient reversal layer (GRL) implements domain adversarial learning by pitting a tissue discriminator against a tissue confuser to extract tissue-common features.
The tissue discriminator predicts the tissue source of cells, while the tissue confuser learns to extract common gene expression patterns that confuse the discriminator. The domain adversarial loss $L_{ad}$ measures the cross-entropy between the true and predicted tissue labels to guide the adversarial training.

3. A Mixture of Experts module: to capture tissue-specific features, CanCellCap employs a Mixture of Experts (MoE) module consisting of multiple expert networks and a gating network. The gating network, guided by the gate loss $L_{gate}$, dynamically selects the most appropriate experts for each tissue type, enabling CanCellCap to adapt to tissue-specific expression patterns.

4. An MLP cancer-cell classifier: the tissue-common and tissue-specific features are concatenated and fed into a multilayer perceptron (MLP) for the primary task of cancer cell identification. The identification loss $L_{ide}$ is the cross-entropy between the predicted and true labels.

Fig. 12. The model structure of CanCellCap. It processes expression matrices over cells from various tissues and simulates dropout effects using random masking. CanCellCap employs domain adversarial learning with a gradient reversal layer (GRL) to extract tissue-common features $v_{com}$ and a MoE module to extract tissue-specific features $v_{spe}$. These feature embeddings are combined for feature reconstruction, to remove the effect of varying dropout rates across sequencing platforms, and for cancer cell identification. CanCellCap is trained using a comprehensive loss function that enhances generalization and accuracy across diverse tissue types.

The overall loss function integrates the four losses,

$$L = L_{ide} + \alpha_1 L_{ad} + \alpha_2 L_{gate} + \alpha_3 L_{recon},$$

where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weights of the corresponding loss components.
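How the four losses fit together can be seen in a minimal PyTorch sketch. The layer sizes, toy data, and single-layer experts below are illustrative assumptions, not the published configuration; the weighted sum uses the default $\alpha$ values given later in the Methods.

```python
# Sketch of the four training components: random masking, gradient-reversal
# adversary, MoE gating, and the MLP classifier, combined as
# L = L_ide + a1*L_ad + a2*L_gate + a3*L_recon.
import torch
import torch.nn.functional as F

class GRL(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)   # identity on the forward pass (Eq. 5)
    @staticmethod
    def backward(ctx, grad):
        return -grad          # reversed gradient for the adversarial setup

torch.manual_seed(0)
n_cells, n_genes, n_tissues, hid = 64, 100, 5, 32

x = torch.rand(n_cells, n_genes)                 # toy expression matrix
tissue = torch.randint(n_tissues, (n_cells,))    # tissue labels
label = torch.randint(2, (n_cells,)).float()     # cancer / normal labels

confuser = torch.nn.Sequential(torch.nn.Linear(n_genes, hid), torch.nn.ReLU(),
                               torch.nn.Linear(hid, hid))
discrim = torch.nn.Linear(hid, n_tissues)
experts = torch.nn.ModuleList(
    [torch.nn.Linear(n_genes, hid) for _ in range(n_tissues)])
gate = torch.nn.Linear(n_genes, n_tissues)
classifier = torch.nn.Linear(2 * hid, 1)
decoder = torch.nn.Sequential(torch.nn.Linear(2 * hid, hid), torch.nn.ReLU(),
                              torch.nn.Linear(hid, n_genes))

# 1) Random masking simulates dropout (p_mask = 0.3).
mask = (torch.rand_like(x) >= 0.3).float()
x_tilde = x * mask

# 2) Tissue-common features, with the GRL feeding the discriminator.
v_com = confuser(x_tilde)
loss_ad = F.cross_entropy(discrim(GRL.apply(v_com)), tissue)

# 3) Tissue-specific features as a gate-weighted sum of expert outputs.
g = torch.softmax(gate(x_tilde), dim=1)
v_spe = sum(g[:, d:d + 1] * experts[d](x_tilde) for d in range(n_tissues))
loss_gate = F.nll_loss(torch.log(g + 1e-8), tissue)

# 4) Classification and reconstruction on the concatenated embedding.
v_cat = torch.cat([v_com, v_spe], dim=1)
loss_ide = F.binary_cross_entropy_with_logits(
    classifier(v_cat).squeeze(1), label)
loss_recon = ((x - decoder(v_cat)) ** 2).sum(dim=1).mean()

loss = loss_ide + 0.2 * loss_ad + 0.3 * loss_gate + 0.1 * loss_recon
loss.backward()
print(v_cat.shape)  # torch.Size([64, 64])
```

Because the GRL flips gradients only on the backward pass, a single `loss.backward()` drives the discriminator toward predicting tissues while pushing the confuser toward tissue-invariant features.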
Data preprocessing

Gene expression is log-transformed and globally scale-normalized per cell with Seurat [50]. The gene intersection set across all tissues is extracted by Eq. (1), where $G_d$ represents the gene list of tissue $d$ and $N_t$ is the number of tissues. The expressions of the genes in $G_{selected}$ are extracted to construct the input expression matrix $X \in \mathbb{R}^{N_c \times N_g}$, where $N_c$ is the number of cells and $N_g$ is the number of selected genes.

$$G_{selected} = \bigcap_{d=1}^{N_t} G_d. \tag{1}$$

Feature masking-reconstruction

A binary random masking matrix $M_{mask}$ is constructed according to Eq. (2), where $r(i,j)$ is a number randomly sampled from $[0,1]$ for the $i$-th cell and the $j$-th gene, and $p_{mask}$ is the masking probability with a default value of 0.3. $M_{mask}(i,j) = 0$ indicates that gene $j$ is masked (set to 0) in cell $i$, while $M_{mask}(i,j) = 1$ means that the expression of gene $j$ is retained.

$$M_{mask}(i,j) = \begin{cases} 1, & \text{if } r(i,j) \ge p_{mask} \\ 0, & \text{if } r(i,j) < p_{mask} \end{cases} \tag{2}$$

The masked input matrix $\tilde{X}$ is obtained by element-wise multiplication of the original matrix $X$ and the masking matrix $M_{mask}$, as shown in Eq. (3), where $\odot$ denotes element-wise multiplication. $\tilde{X}$ simulates the dropouts in sequencing.

$$\tilde{X} = X \odot M_{mask} \tag{3}$$

CanCellCap utilizes an encoder-decoder architecture to reconstruct $X$ from the masked $\tilde{X}$. $\tilde{X}$ is mapped to a latent embedding $v_i$ by an encoder, which is composed of the domain adversarial module and the MoE module. The decoder, implemented as a two-layer MLP, reconstructs the original gene expression matrix $X$ from $v_i$. The reconstruction loss is defined as Eq.
(4), where $X_i$ denotes the original gene expression vector of the $i$-th cell and $decoder(v_i)$ is the vector reconstructed from the latent embedding $v_i$.

$$L_{recon} = \frac{1}{N_c}\sum_{i=1}^{N_c} \left\| X_i - decoder(v_i) \right\|^2. \tag{4}$$

Domain adversarial learning for extracting tissue-common features

The domain adversarial learning module consists of a tissue discriminator and a tissue confuser, which are trained in an adversarial manner. The tissue confuser aims to learn common features across tissues, while the tissue discriminator attempts to predict tissue labels from these features. The adversarial training encourages the tissue confuser to generate tissue-common features that make it difficult for the discriminator to correctly identify tissue types. The tissue confuser, denoted $\phi_f$, is a two-layer MLP. It processes the expression vector $\tilde{X}_i$ of cell $i$ and extracts the tissue-common feature vector $v_{com} = \phi_f(\tilde{X}_i, \theta_f)$, where $\theta_f$ represents the parameters of the tissue confuser. The tissue discriminator, denoted $\phi_{tc}$, is a single-layer MLP that takes the tissue-common features $v_{com}$ as input and predicts the tissue labels. For adversarial training, a GRL is inserted between the tissue confuser and the tissue discriminator. The GRL enables adversarial training by reversing the gradient during backpropagation. During the forward pass, the GRL simply passes the input unchanged, as shown in Eq. (5).

$$GRL(x) = x. \tag{5}$$

During the backward pass, the GRL reverses the gradient direction, that is, it replaces $\frac{\partial L}{\partial \theta_f}$ with $-\frac{\partial L}{\partial \theta_f}$. The domain adversarial loss function $L_{ad}$ is defined in Eq. (6), where $t_{i,d}$ is the true tissue label in one-hot encoding for the $i$-th cell and the $d$-th tissue type.
$$L_{ad} = -\frac{1}{N_c}\sum_{i=1}^{N_c}\sum_{d=1}^{N_t} t_{i,d}\,\log\!\left(\phi_{tc}\!\left(GRL\!\left(\phi_f(\tilde{X}_i, \theta_f)\right), \theta_{tc}\right)\right). \tag{6}$$

Through the GRL, the tissue confuser aims to minimize this loss while the tissue discriminator attempts to maximize it, creating the adversarial setup. The parameters of the tissue confuser $\hat{\theta}_f$ and of the tissue discriminator $\hat{\theta}_{tc}$ are therefore updated through the following optimization steps:

$$\hat{\theta}_f = \arg\min_{\theta_f} L_{ad}\!\left(\theta_f, \hat{\theta}_{tc}\right), \tag{7}$$

$$\hat{\theta}_{tc} = \arg\max_{\theta_{tc}} L_{ad}\!\left(\hat{\theta}_f, \theta_{tc}\right). \tag{8}$$

This adversarial training enables the module to capture tissue-common features, encouraging CanCellCap to generalize across tissue types.

Mixture of experts for extracting tissue-specific features

The MoE module consists of $N_t$ expert networks and a gating network, where $N_t$ corresponds to the number of tissue types. Each expert network $E_d$ is responsible for learning specific features of tissue $d$. The gating network $g(\tilde{X}_i)$ is a single-layer MLP that takes $\tilde{X}_i$ as input and outputs a weight for each expert network. Specifically, the gating network assigns a weight distribution $g(\tilde{X}_i) = \left(g_1(\tilde{X}_i), g_2(\tilde{X}_i), \ldots, g_{N_t}(\tilde{X}_i)\right)$, where each weight corresponds to the importance of a particular expert network for the $i$-th cell. The tissue-specific features $v_{spe}$ are a weighted sum of the expert network outputs:

$$v_{spe} = \sum_{d=1}^{N_t} g_d(\tilde{X}_i)\, E_d(\tilde{X}_i). \tag{9}$$

To ensure that the gating network assigns cells from different tissues to their corresponding expert networks, the gating loss $L_{gate}$ (Eq. (10)) is minimized, where $t_{i,d}$ is the true tissue label in one-hot encoding for the $i$-th cell and the $d$-th tissue type.

$$L_{gate} = -\frac{1}{N_c}\sum_{i=1}^{N_c}\sum_{d=1}^{N_t} t_{i,d}\, \log g_d(\tilde{X}_i). \tag{10}$$
During training, tissue labels guide the gating network to assign cells to the appropriate experts, enabling CanCellCap to learn tissue-specific gene expression patterns. During testing, tissue labels are not required, and CanCellCap finds the proper experts to identify cancer cells.

An MLP for cancer cell identification

The tissue-common features $v_{com}$ and the tissue-specific features $v_{spe}$ are concatenated as $v_{cat} = [v_{com} \,|\, v_{spe}]$. $v_{cat}$ is input to a single-layer MLP classifier $\phi_{ci}$ to distinguish cancer from normal cells. The cancer cell identification loss $L_{ide}$, defined in Eq. (11), quantifies the cross-entropy between the predicted output $\phi_{ci}(v_{cat}, \theta_{ci})$ and the true label $y_i$ of the $i$-th cell.

$$L_{ide} = -\frac{1}{N_c}\sum_{i=1}^{N_c}\left(y_i \log \phi_{ci}(v_{cat}, \theta_{ci}) + (1 - y_i)\log\!\left(1 - \phi_{ci}(v_{cat}, \theta_{ci})\right)\right). \tag{11}$$

Loss function

CanCellCap is optimized to minimize the overall loss in Eq. (12), which effectively distinguishes cancer cells while maintaining robustness against the effects of different sequencing platforms and variability across tissue types. $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weights of the loss components, with default values of 0.2, 0.3, and 0.1.

$$L = -\frac{\alpha_1}{N_c}\sum_{i=1}^{N_c}\sum_{d=1}^{N_t} t_{i,d}\log\!\left(\phi_{tc}\!\left(GRL\!\left(\phi_f(\tilde{X}_i, \theta_f)\right), \theta_{tc}\right)\right) - \frac{\alpha_2}{N_c}\sum_{i=1}^{N_c}\sum_{d=1}^{N_t} t_{i,d}\log g_d(\tilde{X}_i) + \frac{\alpha_3}{N_c}\sum_{i=1}^{N_c}\left\| X_i - decoder(v_{cat}) \right\|^2 - \frac{1}{N_c}\sum_{i=1}^{N_c}\left(y_i \log \phi_{ci}(v_{cat}, \theta_{ci}) + (1 - y_i)\log\!\left(1 - \phi_{ci}(v_{cat}, \theta_{ci})\right)\right). \tag{12}$$

Supplementary Information

Additional file 1 (12915_2025_2337_MOESM1_ESM.xlsx, 30.2 KB): Table 1. Composition and characteristics of human single-cell RNA-seq datasets for model testing. Table 2. Composition and characteristics of mouse single-cell RNA-seq datasets used for cross-species model testing. Table 3. Composition and characteristics of human spatial transcriptomics datasets used for model testing. Table 4.
Summary of human single-cell RNA-seq training datasets. Table 5. Detailed metadata of human single-cell RNA-seq training datasets.

Additional file 2 (12915_2025_2337_MOESM2_ESM.xlsx, 51.7 KB): Table 1. Performance comparison of CanCellCap and five SOTA methods across 33 testing datasets. Table 2. Performance comparison of CanCellCap with five SOTA methods across platforms. Table 3. Performance comparison of CanCellCap with five SOTA methods on datasets from unseen types. Table 4. Ablation study of the components of CanCellCap on the testing datasets. Table 5. Performance comparison of CanCellCap and CanCellCap without the feature masking-reconstruction strategy across different dropout rates in simulated dropout datasets. Table 6. Computational efficiency analysis. Table 7. Confusion matrix of CanCellCap for cancer origin identification on the TISCH2 testing dataset. Table 8. Accuracy of CanCellCap on different tissue validation datasets.

Additional file 3 (12915_2025_2337_MOESM3_ESM.docx, 5.6 MB): Fig. S1. Top 10 key genes identified by SHAP across 13 tissue types. Fig. S2. Top 10 key genes identified by GradientSHAP across 13 tissue types. Fig. S3. Top 10 key genes identified by Integrated Gradients across 13 tissue types. Fig. S4. Top 10 key genes identified by DeepLIFT across 13 tissue types. Fig. S5. KEGG pathway enrichment analysis of the overlapping genes between the top 300 genes ranked by SHAP and those by GradientSHAP across 13 tissue types. Table 1. Total memory consumption of model inference using the CSV format. Table 2. Architectural components and layer configurations of the CanCellCap model. Table 3. Accuracy metrics of CanCellCap and five state-of-the-art methods on 10 cancer single-cell datasets, with rejection rates indicated in parentheses for models that abstain from certain predictions.

Acknowledgements