Abstract

   Single-cell RNA sequencing (scRNA-seq) is a powerful tool for
   characterizing tumor heterogeneity, yet accurately identifying
   malignant cells remains challenging. Here, we propose
   scMalignantFinder, a machine learning tool specifically designed to
   distinguish malignant cells from their normal counterparts using a
   data- and knowledge-driven strategy. To develop the tool, multiple
   cancer datasets were collected, and the initially annotated malignant
   cells were calibrated using nine carefully curated pan-cancer gene
   signatures, resulting in over 400,000 single-cell transcriptomes for
   training. The union of differentially expressed genes across datasets
   was taken as the features for model construction to comprehensively
   capture tumor transcriptional diversity. scMalignantFinder outperformed
   existing automated methods across two gold-standard and eleven
   patient-derived scRNA-seq datasets. The capability to predict
   malignancy probability empowers scMalignantFinder to capture dynamic
   characteristics during tumor progression. Furthermore,
   scMalignantFinder holds the potential to annotate malignant regions in
   tumor spatial transcriptomics. Overall, we provide an efficient tool
   for detecting heterogeneous malignant cell populations.

   Subject terms: Computational models, Tumour heterogeneity, Machine
   learning
     __________________________________________________________________

   Editorial Summary: scMalignantFinder accurately identifies malignant
   cells in single-cell RNA-seq and spatial transcriptomics by leveraging
   pan-cancer gene signatures.

Introduction

   Tumors are highly heterogeneous diseases, driven by molecular
   disturbances at the genetic, epigenetic, and gene expression levels
   within tumor cells^[28]1. Accurately characterizing tumor heterogeneity
   is crucial for understanding the mechanisms of cancer development and
   devising effective and durable therapeutic strategies^[29]2.
   Single-cell RNA sequencing (scRNA-seq) technology has revolutionized
   the field of cancer biology^[30]3 by enabling the measurement of the
   complete transcriptome of each cell within a tumor tissue, thus
   facilitating the identification of distinct cell types and
   states^[31]4.

   Identifying malignant cells and distinguishing them from other
   non-malignant cells is a crucial step in dissecting tumor heterogeneity
   from cancer scRNA-seq experiments^[32]3,[33]5,[34]6. This process aids
   in gaining a deep understanding of transcriptional reprogramming
   underlying the malignant transformation of cells during cancer
   development. Copy number variation (CNV) inference methods are commonly
   employed to identify malignant cells^[35]7,[36]8, and their reliability
   strongly depends on how well the observed gene expression deviations
   correlate with underlying copy number changes rather than other
   biological and technical factors^[37]1. Furthermore, CNV inference
   often requires the specification of appropriate normal reference cells,
   hindering the complete automation of the analysis process. Therefore,
   relying solely on inferred CNVs has limitations in annotating malignant
   cells, especially in cancers with minimal genomic structural variation
   or other unknown origins^[38]5,[39]9. Regarding the former, a potential
   alternative is to identify malignant cells through smaller-scale
   genetic alterations, such as single nucleotide variants (SNVs).
   However, detecting SNVs from scRNA-seq data requires sufficient read
   coverage over the expressed exons, making it primarily applicable to
   the scRNA-seq protocols that capture full-length rather than 3’ or 5’
   transcripts^[40]1.

   Given these limitations, recent advancements have introduced automated
   approaches for identifying malignant cells through supervised learning,
   utilizing training data with cell identities provided by the original
   studies^[41]6,[42]10,[43]11. However, the performance of these
   supervised learning methods would be greatly compromised by the lack of
   optimal reference datasets that accurately annotate the malignancy
   status of each cell^[44]1,[45]12,[46]13. Additionally, current
   approaches^[47]6,[48]10 often select genes that consistently exhibit
   differential expression across datasets as model features, potentially
   overlooking substantial inter-tumor heterogeneity. Consequently, there
   is an urgent need for specialized methods that are guided by both
   high-quality data and well-established knowledge, incorporated with
   accurately labelled datasets and robust feature selection strategy to
   effectively identify heterogeneous malignant cells from cancer
   scRNA-seq data.

   This study introduces scMalignantFinder, a machine learning-based
   automated classifier, specially designed to distinguish malignant cell
   from their originating normal cells, rather than from all other
   non-malignant cells as existing methods usually do. We systemically
   reviewed multiple scRNA-seq datasets with cell type annotations and
   calibrated the initially annotated malignant cells using nine carefully
   curated cancer gene signatures that exhibited consistent
   transcriptional patterns across diverse cancer types to construct the
   gold standard training set. The union set of differentially expressed
   genes (DEGs) between the calibrated malignant cells and normal cells
   across datasets was adopted to build a classification model.
   scMalignantFinder demonstrated superior performance compared to current
   automated methods on independent test sets ranging cancer cell lines,
   non-cancer tissues, and eleven cancer single-cell datasets covering
   nine cancer types. The capability of scMalignantFinder to predict
   malignancy probability empowers it to capture dynamic characteristics
   during tumor progression. Moreover, we extended the classifier to
   discover malignant spots in spatial transcriptomics (ST) data without
   retraining, achieving high concordance with pathologists’ annotations
   across multiple cancer ST slices. These findings underscore
   scMalignantFinder as a versatile and generalizable tool for
   investigating malignant cell biology in cancer research.

Results

An overview of scMalignantFinder

   We developed scMalignantFinder to distinguishing malignant cells from
   their originating normal cells. Our focus was on carcinomas, which
   originate from epithelial tissue and accounts for 80 ~ 90% of all
   cancer cases. To achieve this, we collected five publicly available
   single-cell RNA-seq (scRNA-seq) datasets with cell type
   annotations^[49]14–[50]18, encompassing four cancer types: lung,
   colorectal, liver, and gastric cancer. These datasets, containing
   expression profiles over 400,000 malignant and normal epithelial cells,
   served as the basis for constructing our training set (Supplementary
   Data [51]1). scMalignantFinder was designed through two major steps:
   (1) defining the training dataset by calibrating malignant cells using
   cancer hallmark-based gene signatures, and (2) selecting model features
   by taking the union of significantly differentially expressed genes
   (DEGs) across datasets (Fig. [52]1).

Fig. 1. Workflow of scMalignantFinder.

   [53]Fig. 1
   [54]Open in a new tab

   scMalignantFinder identifies malignant cells through two steps: (1)
   constructing a training set by calibrating malignant cells using nine
   curated cancer gene signatures across over 400,000 epithelial cells
   from four cancer types, and (2) performing feature selection by taking
   the union of differentially expressed genes (DEGs) across datasets,
   which captures both common DEGs and those unique to specific datasets.
   These refined steps provide the foundation for building a logistic
   regression model that classifies cells as malignant or normal based on
   single-cell RNA-seq (scRNA-seq) data.

   A key challenge in constructing a training set for supervised learning
   in the context of identifying malignant cells is the lack of a
   gold-standard annotation for them. To address this, we adopted a gene
   signature-based approach to uniformly calibrate the malignant cells
   across collected datasets (Fig. [55]1). Cancer hallmarks represent a
   set of common functional capabilities that human cells acquire during
   the transition from normal to tumorigenic states^[56]19, making them
   potential common markers of malignant cells. Based on this rationale,
   we curated 29 gene signatures associated with cancer hallmarks from
   both knowledge-based and data-driven sources^[57]20–[58]22.

   Gene set activity analysis was applied to the five scRNA-seq datasets
   and bulk RNA-seq datasets of 16 cancer types from The Cancer Genome
   Atlas (TCGA) (Supplementary Data [59]1). Differential activity analysis
   was then performed for each gene signature between malignant and normal
   epithelial cells, as well as tumor and adjacent normal samples. Nine
   out of the 29 gene signatures exhibited consistent trends across
   different datasets at both single-cell and bulk levels, thus
   representing pan-cancer transcriptional characteristics (Supplementary
   Fig. [60]1; Supplementary Data [61]2). Among these signatures, eight
   were upregulated in both malignant cells and tumor samples,
   representing cellular processes such as cell cycle, DNA damage, and DNA
   repair; while one signature, known to be enriched in normal
   specimens^[62]21, was consistently downregulated. Using the activity of
   these nine cancer gene signatures in normal epithelial cells as a
   baseline, we filtered out 2.5% of the original malignant cells
   (Supplementary Fig. [63]2a) and obtained the final training set
   consisting of 416,774 cells, including 134,053 malignant cells and
   282,721 normal epithelial cells.

   Next, we conducted differential gene expression analysis between
   malignant cells and normal epithelial cells within each dataset in the
   training set, and identified a range of 327 to 2226 DEGs per dataset
   (Fig. [64]1). Compared to downregulated DEGs, upregulated DEGs in
   malignant cells tended to be more common across different cancer types
   (Supplementary Fig. [65]2b). Notably, 68.3% of upregulated DEGs and
   87.4% of downregulated DEGs were found to be specific to individual
   datasets. Considering that tumors exhibit both shared functional
   characteristics^[66]19 and high heterogeneity^[67]23, we incorporated
   features that capture both shared and distinct characteristics of
   malignant cells. Specifically, we retained DEGs that are common across
   datasets, representing universal features of malignant cells, as well
   as dataset-specific DEGs to account for unique characteristics of
   individual tumors. Finally, a set of 2707 DEGs, comprising 1656
   upregulated genes and 1051 downregulated genes in malignant cells
   (Supplementary Data [68]3), were used to construct the input expression
   matrix for the logistic regression classifier- scMalignantFinder
   (Fig. [69]1).

Performance validation by diverse scRNA-seq datasets

   To evaluate the performance of scMalignantFinder, we collected multiple
   independent scRNA-seq datasets from various sample sources, including
   cancer cell lines, tumor biopsies, pre-cancerous tissues,
   tumor-adjacent tissues, and deceased non-cancer donor specimens. The
   collected test set used for validation consisted of 607,109 cells,
   including 260,734 malignant cells and 346,375 normal tissue cells, from
   633 samples spanning 22 cancer types. We compared scMalignantFinder
   with four other tools: PreCanCell^[70]6, ikarus^[71]10,
   Cancer-Finder^[72]11, and CopyKAT^[73]8. and evaluated the performance
   using seven metrics, including the area under the receiver operating
   characteristic curve (AUROC), accuracy, balanced accuracy (the average
   of sensitivity and specificity), sensitivity, specificity, positive
   predictive value (PPV), and negative predictive value (NPV).

   We first evaluated scMalignantFinder using two expression datasets
   including 53,513 malignant cells from 198 cancer cell lines^[74]24 and
   104,148 normal epithelial cells from 18 tissues from non-cancer
   donors^[75]25, which served as gold standard datasets due to their
   evident sample sources (Supplementary Data [76]5 and [77]9). When
   pooling all cells from each dataset together, scMalignantFinder
   achieved a sensitivity of 1.000 in identifying malignant cells and a
   specificity of 0.786 in identifying normal epithelial cells,
   outperforming other methods (PreCanCell: 0.996 and 0.503; ikarus: 0.834
   and 0.642; Cancer-Finder: 1.000 and 0.022; CopyKAT: 0.594 and 0.397)
   (Fig. [78]2a). When evaluating individual cancer cell line,
   scMalignantFinder demonstrated high and consistent sensitivity in
   detecting malignant cells, with a minimum sensitivity of 0.98
   (Fig. [79]2b), indicating its robustness across diverse cell lines and
   cancer types.

Fig. 2. Performance evaluation of scMalignantFinder.

   [80]Fig. 2
   [81]Open in a new tab

   a Barplot showing the sensitivity of each method applied to the cancer
   cell line scRNA-seq dataset (left) and the specificity of each method
   applied to the non-cancer tissue scRNA-seq dataset (right). Each point
   represented a cell line (left) or a tissue (right). Statistical
   significance was determined by a two-sided Wilcoxon rank-sum test.
   n = 198 cell lines (left) or 18 tissues (right); bar height represents
   the mean value; whiskers extend to the mean ± 95% confidence intervals.
   b Scatterplot showing the mean sensitivity (calculated from the cancer
   cell line scRNA-seq dataset) and mean specificity (calculated from the
   non-cancer tissue scRNA-seq dataset) for each method presented in a. c
   Boxplot showing seven metrics of each method on 13 scRNA-seq datasets
   (Supplementary Data [82]1). “Retrained” indicates the models were
   re-trained using the calibrated training set from this study, while
   “DEGs in this study as the source of signatures” refers to ikarus being
   retrained with gene signatures derived from the DEGs identified in this
   study. Statistical significance was determined by a two-sided Wilcoxon
   rank-sum test. n = 13 datasets; center line represents the median
   value; box limits indicate the upper and lower quartiles; whiskers
   extend to 1.5× the interquartile range. Source data are provided as a
   Source Data file (Supplementary Data [83]5 and [84]9).

   Furthermore, we applied scMalignantFinder to 11 additional scRNA-seq
   datasets from cancer patients^[85]16,[86]17,[87]23,[88]26–[89]28 (see
   Supplementary Data [90]1), involving nine cancer types, three of which
   were included in the training set (lung, colorectal, and liver cancer),
   and six not included (head and neck, kidney, prostate, breast,
   pancreatic, and ovarian cancer). The original cell annotations provided
   in the respective studies were used as true labels for validation.
   Across all 13 datasets, scMalignantFinder achieved an average accuracy
   of 0.824 and an average balanced accuracy of 0.732, surpassing other
   methods (PreCanCell: 0.713 and 0.613; ikarus: 0.446 and 0.533;
   Cancer-Finder: 0.654 and 0.531; CopyKAT: 0.427 and 0.571) (Fig. [91]2c;
   Supplementary Fig. [92]3a). Notably, scMalignantFinder exhibited the
   highest accuracy in 6 of 13 datasets and the highest balanced accuracy
   in 7 out of 11 datasets (Fig. [93]2a; Supplementary Fig. [94]3b).

   While PreCanCell and Cancer-Finder demonstrated comparable or superior
   sensitivity compared to scMalignantFinder, they displayed lower
   specificity, and higher false positive rate (Fig. [95]2b, c).
   Conversely, ikarus and CopyKAT exhibited the highest specificity but
   the lowest sensitivity (Fig. [96]2b, c). The combined high sensitivity
   and PPV provided by scMalignantFinder highlight its ability to
   accurately identify malignant cells while maintaining a low false
   positive rate. This is a fundamental prerequisite for gaining an
   in-depth understanding of tumor heterogeneity through single-cell data
   analysis. Overall, scMalignantFinder outperformed the other four
   methods in distinguishing malignant cells from their normal
   counterparts.

   Next, we retrained Cancer-Finder and ikarus using our calibrated
   training set to explore the effects of training set design strategy on
   model performance (Fig. [97]2c; Supplementary Fig. [98]3a, b). After
   retraining, Cancer-Finder exhibited a noticeable improvement, with its
   average accuracy increasing from 0.654 to 0.828 and average balanced
   accuracy rising from 0.531 to 0.669. For ikarus, we also replaced its
   gene signatures with the DEGs identified from our training set.
   Retraining enhanced ikarus’s performance, with its average accuracy
   increasing from 0.446 to 0.488 and average balanced accuracy rising
   from 0.533 to 0.612. These performance improvements were primarily
   reflected in enhanced specificity for identifying normal cells,
   underscoring the importance of training data calibration and rational
   feature selection.

   To assess whether the performance of scMalignantFinder is influenced by
   sequencing depth, we reanalyzed its prediction on samples with
   different median gene counts across multiple test sets. We measured
   Spearman correlation between median gene count and balanced accuracy
   within each dataset, and did not find any significant correlation in
   all the datasets (Supplementary Fig. [99]4), to some extent,
   demonstrating the robustness of scMalignantFinder.

Transcriptional characteristics of misclassified cell populations

   Given that the prediction by scMalignantFinder was not entirely
   consistent with the original label, we investigated the transcriptional
   characteristics of the misclassified cells. The cells in the test set
   were classified into four categories: (1) true malignant, labeled and
   correctly predicted as malignant, (2) predicted malignant, labeled as
   normal but wrongly predicted as malignant, (3) predicted normal,
   labeled as malignant but wrongly predicted as normal, and (4) true
   normal, labeled and correctly predicted as normal (Supplementary
   Data [100]6 and [101]10).

   We first examined the expression profiles of the curated cancer gene
   signatures for these four cell categories (Supplementary Data [102]2).
   As expected, in the liver dataset^[103]28, a substantial increase in
   the activity of multiple upregulated signatures was observed in
   predicted malignant cells compared to true normal cells; conversely,
   predicted normal cells exhibited an enrichment of downregulated
   signature, surpassing the activity of true malignant cells
   (Fig. [104]3a). This trend was consistent across several other datasets
   (Supplementary Fig. [105]5a). Statistical analysis across the test sets
   indicated that predicted malignant cells ranked second in activity
   levels among seven out of nine cancer gene signatures, closely
   following true malignant cells. Similarly, predicted normal cells
   demonstrated activity levels in seven signatures that closely resembled
   those of true normal cells (Fig. [106]3b). Furthermore, Copy Number
   Variation (CNV) has been closely associated with the development and
   progression of various cancers due to its impact on gene
   expression^[107]29. By performing single-cell CNV inference^[108]7, we
   discovered that predicted malignant cells exhibited CNV profiles
   similar to true malignant cells, with significantly higher CNV levels
   compared to true normal cells (Supplementary Fig. [109]5b, c).

Fig. 3. Characterization of transcriptional features in misclassified cells.

   [110]Fig. 3
   [111]Open in a new tab

   a Boxplot showing activity of nine curated cancer gene signatures in
   four groups of cells (true malignant, predicted malignant, predicted
   normal, true normal) in the liver dataset. Statistical significance was
   determined by a two-sided Student’s t test (****P < 0.0001). Center
   line represents the median value; box limits indicate the upper and
   lower quartiles; whiskers extend to 1.5× the interquartile range. b
   Barplot showing reverse ranking of activity of nine curated cancer gene
   signatures in four groups of cells across five test sets including
   lung, liver, and colorectal cancer. Reverse ranking is based on the
   mean activity among the four groups. For a given dataset and a group of
   cells, if the mean of their activities is the highest among the four
   groups, their reverse ranking is 4; if the mean activity is the second
   highest, the reverse ranking is 3, and so on. n = 5 datasets; bar
   height represents the mean value; whiskers extend to the mean ± 95%
   confidence intervals. c Boxplot showing proportion concordance of
   differentially expressed genes (DEGs) in true malignant and predicted
   malignant cells (vs true normal cells). d, e Pathway (d) and
   transcription factor (e) enrichment of DEGs upregulated in both true
   malignant and predicted malignant cells. Source data are provided as a
   Source Data file (Supplementary Data [112]6, [113]10, and [114]11).

   To further characterize the transcriptional features of predicted
   malignant cells, we conducted differential gene expression analysis
   between the two types of malignant cells and true normal cells. Our
   analysis revealed that among the identified DEGs, 20% of the genes
   upregulated in true malignant cells also exhibited elevated expression
   in predicted malignant cells. Similarly, 42% of the downregulated genes
   in true malignant cells showed consistent downregulation in predicted
   malignant cells (Fig. [115]3c). Pathway enrichment analysis revealed
   that the commonly upregulated DEGs were mainly involved in angiogenesis
   and metabolic reprogramming (Fig. [116]3d), which are typical hallmarks
   of cancer^[117]19. Furthermore, we predicted the upstream regulators of
   the upregulated DEGs and identified the highly ranked transcription
   factors (TFs), such as HIF1A^[118]30 and TWIST1^[119]31 (Fig. [120]3e).
   These TFs have been reported to be upregulated in various cancers and
   play roles in downstream cancer-promoting processes through
   transcriptional regulation such as altered substrate metabolism,
   angiogenesis, tumor invasion, and metastasis.

   Overall, these results indicate that predicted malignant cells exhibit
   transcriptional characteristics of malignant cells. The presence of
   false positives (predicted malignant cells) and false negatives
   (predicted normal cells) are probably due to the continuously
   transitional states of cells between normal and malignancy at
   transcriptional profiles.

Application of scMalignantFinder in elucidating cell states during
carcinogenesis

   Since scMalignantFinder can also report the malignancy probability of
   each cell, we set out to analyze the malignant state during cancer
   progression. To this end, we applied scMalignantFinder to colorectal
   cancer datasets^[121]16,[122]17, which comprises epithelial cell
   expression profiles from 36 normal mucosal tissues, 23 colorectal
   polyps (precursors of colorectal carcinoma), and 6 tumors, and
   predicted the malignancy probability and classification for each
   epithelial cell (Supplementary Fig. [123]6a; Supplementary
   Data [124]11). It was shown that the percentage of predicted malignant
   cells was nearly 0% in normal mucosa, significantly increased in
   colorectal polyps, and peaked in tumors. Similarly, the malignancy
   probability exhibited a clear increasing trend from normal mucosa to
   colorectal polyps, and to tumors.

   In addition, we applied scMalignantFinder to an scRNA-seq dataset
   representing different stages of gastric cancer progression^[125]32
   (Supplementary Fig. [126]6b), which included samples from three
   non-atrophic gastritis (NAG) biopsies as normal controls, three chronic
   atrophic gastritis (CAG) biopsies and six intestinal metaplasia (IM)
   biopsies both representing precancerous lesions, and one early gastric
   cancer (EGC) biopsy. The analysis revealed that the percentage of
   predicted malignant cells was 0% in NAG, increased in CAG and IM, and
   reached its highest level in EGC. Notably, both the percentage of
   malignant cells and the malignancy probability in IM were significantly
   higher than in CAG, consistent with the established fact that IM
   carries a higher risk of gastric cancer compared to CAG^[127]33.

   The above results suggested that scMalignantFinder can shed lights on
   the dynamics underlying the malignant transformation of epithelial
   cells.

The roles of different types of DEGs in identifying malignant cells

   We conducted feature importance assessment to quantify the contribution
   of each feature used in scMalignantFinder (Supplementary Fig. [128]7a;
   Supplementary Data [129]4, [130]7, and [131]12). The 2707 DEGs were
   categorized into common DEGs and unique DEGs based on whether they were
   differentially expressed across at least two cancer types in the
   training set (Supplementary Data [132]3). It was found that the
   commonly upregulated or downregulated DEGs in malignant cells exhibited
   significantly higher feature importance than the unique DEGs
   (Fig. [133]4a; Supplementary Fig. [134]7b), supporting the notion that
   there are conserved mechanisms underlying carcinogenesis across cancer
   types^[135]19.

Fig. 4. Contribution of each DEG to model training and prediction.

   [136]Fig. 4
   [137]Open in a new tab

   a Boxplot showing feature importance of two types of DEGs (common and
   unique). Unique DEGs are exclusive to a specific cancer type, while
   common DEGs are found in at least two cancer types (Supplementary
   Data [138]3). The importance of each DEG is determined by the absolute
   value of the model coefficient (Supplementary Data [139]4). Statistical
   significance was determined by a two-sided Student’s t test (****:
   P  <  0.0001). Center line represents the median value; box limits
   indicate the upper and lower quartiles; whiskers extend to 1.5× the
   interquartile range. b Comparison of the original model and a model
   using the top 1500 important DEGs on seven prediction metrics. n = 13
   datasets; center line represents the median value; box limits indicate
   the upper and lower quartiles; whiskers extend to 1.5× the
   interquartile range. c Pie chart illustrating the statistics of how
   many cancer types in the training set the top 1500 important DEGs
   (left) and the remaining DEGs (right) were found in. d Comparison among
   the original model and models using either common DEGs or unique DEGs
   as features. Statistical significance was determined by a two-sided
   Wilcoxon rank-sum test (*: P  <  0.05, **: P  <  0.01, ***:
   P  <  0.001). n = 13 datasets; center line represents the median value;
   box limits indicate the upper and lower quartiles; whiskers extend to
   1.5× the interquartile range. Source data are provided as a Source Data
   file (Supplementary Data [140]7 and [141]12).

   Subsequently, we investigated whether the classification performance of
   the original model could be replicated using a subset of DEGs with
   top-ranked importance. As expected, the classification performance
   improved as the number of important genes increased (Supplementary
   Fig. [142]7c); A model with approximately 1500 input genes achieved
   performance comparable to the original 2707-feature model across all
   metrics (Fig. [143]4b; Supplementary Fig. [144]7c, d). These top 1500
   DEGs included a larger proportion of common DEGs (28%) compared to the
   remaining DEGs (20%) (Fig. [145]4c).

   To further explore the role of common and unique DEGs in identifying
   malignant cells, we constructed separate models using each type of
   DEGs. The model based on the common DEGs performed comparably to the
   original model, showing similar AUROC and balanced accuracy
   (Fig. [146]4d). However, the model based on the unique DEGs exhibited
   significantly poorer performance (Fig. [147]4d). These findings suggest
   that the common transcriptional signatures among malignant cells of
   different cancer types are more crucial for the performance of
   scMalignantFinder compared to those that are specific to individual
   cancer types.

   Interestingly, models constructed solely using common DEGs demonstrated
   better specificity but sacrifice some sensitivity, while models
   constructed using cancer-specific DEGs exhibited better sensitivity
   (Fig. [148]4d). Unlike previous methods that exclusively retain common
   DEGs^[149]6,[150]10, scMalignantFinder integrates both types of DEGs,
   resulting in a more comprehensive discriminatory performance, as
   evidenced by its simultaneous reduction in both false positive and
   false negative rates.

   We also explored the association between upregulated DEGs and clinical
   features. Following a previous study^[151]23, we computed the activity
   of DEGs in bulk TCGA samples and discovered their associations with
   worse prognosis across multiple cancer types (Supplementary
   Fig. [152]7e), suggesting that the transcriptional signatures
   upregulated in malignant cells may involve in cancer progression.
   Furthermore, we analyzed the overlap of DEGs with 704 targets of
   approved drugs with known mechanism in the PHAROS database^[153]34.
   Compared to unique DEGs (OR = 1.0), common DEGs (OR = 1.9) and the DEGs
   that contributed most to the model predictions (OR = 1.5) exhibited
   significant enrichment among approved drug targets (Supplementary
   Fig. [154]7e).

Application of scMalignantFinder in tumor spatial transcriptomics

   The presence of spatial heterogeneity within tumors, resulting from
   clonal expansion and the intricate interplay between cancer cells and
   the surrounding microenvironment, underscores the importance of
   analyzing the transcriptome in a spatial context to understand key
   oncological processes such as tumor progression and metastasis^[155]35.
   In recent years, the rapidly advancing technology of spatial
   transcriptomics (ST) sequencing has emerged as a powerful tool for
   unraveling the complex spatial architecture of the tumor
   microenvironment^[156]36. Identifying malignant regions within this
   context is crucial yet challenging, as ST sequencing techniques capture
   barcode spots with diameters ranging from 55–100 μm, potentially
   measuring a mixture of signals from multiple cells belonging to
   different lineages^[157]37.

   We explored the potential of scMalignantFinder to identify malignant
   spots in ST data (Fig. [158]5a). First, the pretrained model was used
   to calculate two key features for each spot in the ST dataset:
   malignancy probability and malignant signature activity, the latter
   representing the overall expression of upregulated DEGs (Supplementary
   Data [159]3). Additionally, an image score was derived for each spot
   based on the corresponding histological image. These three features -
   malignancy probability, malignant signature activity, and image score -
   formed a feature matrix. hierarchical clustering was then conducted on
   the feature matrix to group the spots into three clusters, considering
   that carcinomas are typically composed of malignant, normal epithelial,
   and non-epithelial regions. To determine the assignment of a region
   cluster, we employed a rank aggregation method. First, we calculated
   the average score for each feature within each cluster. Next, we ranked
   the clusters for each feature based on these average scores. The ranks
   of the three features were then averaged for each cluster. The cluster
   with the top average rank was preliminarily identified as the malignant
   region, while the remaining two clusters were provisionally designated
   as normal regions. Finally, spatial neighborhood relationships were
   incorporated to refine these classifications, yielding the final region
   predictions (Fig. [160]5a).

Fig. 5. Workflow for identifying malignant regions in tumor spatial
transcriptomics (ST) data.

   [161]Fig. 5
   [162]Open in a new tab

   a The pretrained scMalignantFinder model calculates the malignant
   probability and malignant signature activity for each spot.
   Additionally, image scores derived from histological images are
   included as the third feature. The three features (malignant
   probability, malignant signature activity, and image score) are
   combined into a feature matrix for hierarchical clustering, grouping
   spots into three clusters. A rank aggregation method was employed to
   rank clusters based on their average ranks across three features,
   enabling their preliminary classification as malignant or normal.
   Spatial neighborhood relationships are then incorporated to refine
   region annotations, resulting in the final classification of malignant
   and normal spots. b–d Annotations of ST spots by different methods,
   including annotation by pathologists (b), determination of malignancy
   probability and signature activity (c), and final classification (d). e
   Boxplot showing four metrics of scMalignantFinder on eight ST datasets
   (Supplementary Data [163]1). n = 8 datasets; center line represents the
   median value; box limits indicate the upper and lower quartiles;
   whiskers extend to 1.5× the interquartile range. f Spatial
   visualization of copy number variation (CNV) scores in ST slices of
   breast cancer (left) and oral squamous cell carcinoma (right). g
   Boxplot comparison of CNV scores between predicted malignant and normal
   regions in eight ST datasets. n = 6 datasets; center line represents
   the median value; box limits indicate the upper and lower quartiles;
   whiskers extend to 1.5× the interquartile range. h Tumor signature
   activity (left) and boxplot comparison between predicted malignant and
   normal regions (right) in an ST slice of breast cancer. Statistical
   significance was determined by a two-sided Student’s t test (****:
   P  <  0.0001). Center line represents the median value; box limits
   indicate the upper and lower quartiles; whiskers extend to 1.5× the
   interquartile range. Source data are provided as a Source Data file
   (Supplementary Data [164]8, [165]13–[166]15).

   We applied scMalignantFinder to predict malignant regions in eight
   tumor ST datasets^[167]38–[168]41, encompassing five cancer types:
   breast cancer, oral squamous cell carcinoma, renal cell carcinoma,
   prostate cancer, and squamous cell carcinoma. In these datasets, the
   malignant regions had been annotated by pathologists in prior
   studies^[169]11,[170]37,[171]41 (Supplementary Data [172]8 and
   [173]13). Our findings revealed that the two malignant features
   predicted by scMalignantFinder - malignancy probability and signature
   activity - were highly enriched in the annotated malignant spots in
   seven out of the eight ST slices (Fig. [174]5b, c; Supplementary
   Fig. [175]8a, b). Furthermore, the predicted results showed strong
   concordance with expert annotations, achieving an average accuracy of
   0.783 and an average balanced accuracy of 0.800 (Fig. [176]5b, e;
   Supplementary Fig. [177]8). Compared to a prior logistic regression
   model relying solely on malignancy probability to identify malignant
   cells, the current approach, which integrates multi-feature clustering
   and incorporates spatial neighborhood relationships, significantly
   improved the performance (Supplementary Fig. [178]9a; Supplementary
   Data [179]14).

   Next, we benchmarked scMalignantFinder against two recently developed
   methods for tumor ST analysis. The first one is Cancer-Finder^[180]11,
   specifically designed to identify malignant regions in ST data
   (Supplementary Data [181]14). scMalignantFinder demonstrated comparable
   performance across the eight ST slices, with average accuracy and
   balanced accuracy exceeding Cancer-Finder by 0.067 and 0.062,
   respectively (Supplementary Fig. [182]9b, c). The second method,
   Cottrazm^[183]42, is known for its capability to delineate tumor
   boundaries. When applied to three ST slices (Supplementary
   Fig. [184]9d), 87% ~ 99% of the Cottrazm inferred tumor boundary spots
   were adjacent to a mix of scMalignantFinder predicted malignant and
   normal spots (Supplementary Fig. [185]9e, f), aligning with the
   definition of tumor boundaries connecting malignant and normal regions.

   To further validate the reliability of scMalignantFinder identified
   malignant regions, we performed multidimensional functional analyses,
   beginning with CNV profiling (Supplementary Data [186]8 and [187]15).
   Malignant regions are expected to exhibit elevated CNV levels. By using
   inferCNV^[188]7 to calculate the CNV scores for each spot in the ST
   datasets, it was shown that spots classified as malignant exhibited
   significantly higher CNV scores than those as normal (Fig. [189]5f, g).
   For instance, in a breast cancer ST slice, the predicted malignant
   regions exhibited gains in chromosomes 1q and 8q, as well as a loss in
   chromosome 1p, consistent with the notion that breast cancer are
   commonly associated with these CNV alterations^[190]8,[191]43
   (Supplementary Fig. [192]10a). Similarly, deconvolution analysis of
   cell type composition^[193]37 revealed a significant enrichment of
   malignant cells in regions identified as malignant, whereas stromal and
   immune cells were predominantly localized in areas classified as normal
   (Supplementary Fig. [194]10b, c; Supplementary Data [195]15).
   Furthermore, we assessed tumor signature activity in ST slices of
   breast cancer and renal cell carcinoma using previously published
   signature genes for these cancer types^[196]42, and found that the
   tumor signature activity was notably higher in the predicted malignant
   regions compared to the normal regions (Fig. [197]5h; Supplementary
   Fig. [198]10d, e; Supplementary Data [199]15).

   Overall, these results demonstrate the potential of scMalignantFinder
   in annotating spatial transcriptomic data. Further improvements can be
   achieved by including a wider range of cancer types and utilizing more
   tailored training data.

Discussion

   Over the past decade, scRNA-seq technology has been widely applied in
   oncology research^[200]3,[201]23,[202]24,[203]44. However, the lack of
   reliable methods for identifying malignant cells has been a major
   obstacle to accurately studying tumor heterogeneity. Here, we have
   developed scMalignantFinder to annotate malignant cells effectively.
   Built upon meticulously calibrated training datasets and robust feature
   selection strategies, scMalignantFinder outperforms existing automated
   methods, including ikarus, PreCanCell, Cancer-Finder, and CopyKAT,
   across multiple datasets from cancer cell lines, non-cancer tissues,
   and various patient-derived cancer types. Additionally,
   scMalignantFinder shows promising potential in identifying malignant
   regions in ST data and tracking carcinogenic processes, aiding
   researchers in characterizing the spatiotemporal dynamics of tumor
   evolution^[204]45.

   Current automated tools^[205]6,[206]10,[207]11 such as ikarus,
   PreCanCell, and Cancer-Finder were designed to classify malignant cells
   and all other cells in the tumor microenvironment. In contrast,
   scMalignantFinder specifically distinguishes malignant cells from their
   originating normal counterparts within the cancer lineage. This
   strategy aligns with the logic employed in cell subtype or sublineage
   analysis within hierarchical cell classification approaches, which have
   recently proven effective for cell type annotation in the tumor
   microenvironment^[208]5,[209]37,[210]46.

   scMalignantFinder has significantly advanced both the scale and quality
   of training data for malignant cell identification, guided by
   well-established knowledge. It was built on over 400,000 single-cell
   transcriptomes, representing the largest reference dataset for this
   purpose to date. The initially annotated malignant cells, derived from
   various methodologies in the original studies, were meticulously
   refined using curated cancer gene signatures, resulting in a cancer
   signature-calibrated training set where the malignant cells exhibit
   pan-cancer characteristics. As expected, scMalignantFinder assigned
   higher malignancy probability to the calibrated malignant cells in the
   training set compared to the 2.5% of cells that were excluded
   (Supplementary Fig. [211]2a). Consistently, the excluded cells showed a
   loss of either upregulated or downregulated cancer signatures.
   Retraining existing models, such as Cancer-Finder and ikarus, on this
   well-calibrated dataset led to modest performance improvements
   (Fig. [212]2c; Supplementary Fig. [213]3a), highlighting the importance
   of large-scale and high-quality training data in enhancing the accuracy
   of malignant cell identification.

   Based on the cancer signature-calibrated malignant cells, we adopted
   the DEGs between malignant cells and normal epithelial cells as
   features for our model. Due to the extensive molecular heterogeneity
   among tumors, we utilized the union of the DEG set in each dataset,
   differing from previous methods that used intersections^[214]6,[215]10.
   This difference at least partly accounts for the lower rates of false
   positives and false negatives observed in our model. Attempts to
   directly adopt curated cancer gene signatures as model features were
   less effective (Fig. [216]2c; Supplementary Fig. [217]3a), probably
   because DEGs derived from malignant cells are more closely associated
   with malignancy at the transcriptional level. Notably, 75% of the
   identified DEGs are not encompassed in the curated cancer gene
   signatures, which hold the promise to offer additional insights into
   carcinogenesis. We found that the overexpressed genes across different
   cancer types, rather than specific to a particular type, primarily
   contributed to the model and were significantly enriched among approved
   drug targets, supporting the notion that pan-cancer hallmarks may be
   promising therapeutic targets due to their association with fundamental
   principles of carcinogenesis^[218]19.

   Tracking the malignant transformation of cells and unraveling the
   biological process underlying carcinogenesis are crucial for
   identifying molecular targets at key stages and enabling early tumor
   detection and precise diagnosis^[219]47. In this context, we displayed
   the potential of scMalignantFinder to investigate cellular and
   transitional dynamics during tumor progression by distinguishing
   malignant cells from their normal epithelial counterparts. Moving
   forward, we aim to further our initial findings in gastric cancer and
   colorectal cancer by delving into the molecular mechanisms underlying
   these intermediate states.

   Within the recent advancements in ST technology in cancer
   biology^[220]48, we have adapted the pre-trained model to identify
   malignant regions, demonstrating excellent scalability in this field.
   Through a multi-feature clustering framework, incorporating malignant
   features, image-based characteristics, and spatial neighborhood
   networks, scMalignantFinder demonstrated notable improvements in
   accurately identifying malignant regions. Future enhancements could
   focus on the continuous accumulation of expert-curated ST data and the
   incorporation of additional spatial features, such as the spatial
   distribution probability of malignant cells^[221]37, to enhance the
   identification of malignant regions.

   Despite these advantages, scMalignantFinder still has limitations and
   room for improvement. It is designed to identify malignant cells from
   cancerous lineages, relying on the prerequisite identification of basic
   cell types. To address this, we have integrated scATOMIC^[222]5, a
   recently developed automated cell type annotator for pan-cancer
   analysis, into the current version. This integration enables users to
   rapidly identify malignant cells, normal cells, and other cell types
   within the tumor microenvironment using scRNA-seq data. Additionally,
   the capabilities of scMalignantFinder require further testing,
   particularly for cancers with unknown or controversial origins, such as
   synovial sarcoma^[223]49 and certain brain tumors^[224]50.

   In summary, with a machine learning framework driven by both data and
   knowledge, scMalignantFinder achieves a significant advancement in
   malignant cell identification, offering a scalable and versatile tool
   for both single-cell and spatial transcriptomic analyses.

Methods

scMalignantFinder design

Data collection and processing

   We collected five single-cell RNA-seq datasets from various cancer
   types (lung, colorectal, gastric, and liver cancers)^[225]14–[226]18 as
   detailed in Supplementary Data [227]1. Cell type annotations were
   obtained from the original studies, and we retained gene expression
   matrices for malignant and normal epithelial cells. Additionally, bulk
   RNA-seq data for 18 cancer types were sourced from TCGA, with 16
   datasets used after excluding cancer types with fewer than five samples
   (Supplementary Data [228]1).

Selection of cancer gene signatures

   We compiled 29 gene signatures related to cancer hallmarks from the
   following sources:

   (1) 14 cancer functional states (Stemness, Invasion, Metastasis,
   Proliferation, EMT, Angiogenesis, Apoptosis, Cell cycle,
   Differentiation, DNA damage, DNA repair, Hypoxia, Inflammation, and
   Quiescence) from the cancerSEA database^[229]20.

   (2) 10 cancer hallmarks (Sustaining Proliferative Signaling, Evading
   Growth Suppressors, Avoiding Immune Destruction, Enabling Replicative
   Immortality, Tumour-Promoting Inflammation, Activating Invasion and
   Metastasis, Inducing Angiogenesis, Genome Instability and Mutation,
   Resisting Cell Death, Deregulating Cellular Energetics) from manually
   curated pathway gene sets^[230]22.

   (3) 5 epithelial cell states (Basal-like, Normal-enriched,
   Pro-angiogenic, Pro-inflammatory, Metabolic) from the Ecotyper
   framework^[231]21.

   Each cell in the scRNA-seq datasets and each sample in the bulk RNA-seq
   datasets were scored using these gene signatures. For scRNA-seq data,
   normalization was performed using Scanpy’s “pp.normalize_total”
   function (version 1.9.3)
   ([232]https://scanpy.readthedocs.io/en/stable/), followed by scoring
   with the AUCell function in the pySCENIC package (0.12.1)
   ([233]https://github.com/aertslab/pySCENIC). For bulk RNA-seq data, the
   ssGSEA function from the GSVA R package (1.46.0) was used.

   Log2 fold changes in gene activities between malignant and normal cells
   were calculated, and significance was assessed using the Wilcoxon
   rank-sum test. Gene signatures with a P-value ≤ 1×10^−^10 in scRNA-seq
   and ≤1 × 10^−5 in bulk RNA-seq were considered significant.

   Upregulated and downregulated gene signatures were filtered based on
   the following criteria, resulting in nine consistent cancer gene
   signatures (eight upregulated and one downregulated) identified in both
   datasets (Supplementary Data [234]2):

   (1) Gene signatures that were significantly upregulated in ≥ 4
   scRNA-seq datasets and significantly downregulated in ≤ 1 scRNA-seq
   dataset were considered upregulated in malignant cells;

   (2) Gene signatures that were significantly downregulated in ≥4
   scRNA-seq datasets and significantly upregulated in ≤ 1 scRNA-seq
   dataset were considered downregulated in malignant cells;

   (3) Gene signatures that were significantly upregulated in ≥12 bulk
   RNA-seq datasets and significantly downregulated in ≤ 2 bulk RNA-seq
   datasets (or significantly upregulated in ≥6 bulk RNA-seq datasets and
   significantly downregulated in ≤1 bulk RNA-seq dataset) were considered
   upregulated in tumor samples;

   (4) Gene signatures that were significantly downregulated in ≥12 bulk
   RNA-seq datasets and significantly upregulated in ≤ 2 bulk RNA-seq
   datasets (or significantly downregulated in ≥6 bulk RNA-seq datasets
   and significantly upregulated in ≤ 1 bulk RNA-seq dataset) were
   considered downregulated in tumor samples.

Training set calibration with curated cancer gene signatures

   We used a gene signature scoring-based approach to filter originally
   labeled malignant cells. Cells were retained if they had higher
   upregulated signature activity or lower downregulated signature
   activity compared to normal cells:

   (1) malignant cells with upregulated signature activity higher than the
   average activity of normal cells.

   (2) malignant cells with downregulated signature activity lower than
   the average activity of normal cells.

   Raw gene count matrices were filtered to exclude cells with fewer than
   200 or more than 8000 gene counts or with more than 50% mitochondrial
   reads. This resulted in 416,774 cells (134,053 malignant and 282,721
   normal epithelial cells) in the training set.

Feature selection and logistic model training

   Differentially expressed genes (DEGs) were identified using Seurat’s
   FindAllMarkers function (v4.3.0)^[235]51. Genes with a |Log2FC | ≥ 0.5,
   P-value < 0.01, and an expression proportion (pct.1) ≥ 0.2 were
   selected as DEGs. DEGs were determined separately for each dataset, and
   the union of upregulated and downregulated genes across datasets was
   used as features, resulting in 2707 DEGs (1656 upregulated and 1051
   DEGs as features). A logistic regression model was constructed using
   these DEGs as input features and trained with Python’s
   ‘sklearn.linear_model.LogisticRegression’ (v1.2.2). The expression
   profiles from the calibrated training dataset were used for model
   training.

Performance evaluation of scMalignantFinder

   We validated the performance of scMalignantFinder using test sets from
   multiple independent datasets^[236]16,[237]17,[238]23–[239]28
   (Supplementary Data [240]1). The dataset including cells originating
   only from cancer cell lines was used as the positive gold standard test
   set, while the dataset including cells from non-cancer donor tissues
   served as the negative gold standard test set. The cell type
   annotations for the datasets were sourced from the original studies.
   (Supplementary Data [241]1). Following the same normalization method as
   the training set, all datasets were used as inputs for
   scMalignantFinder. The predictive performance was evaluated using seven
   metrics: area under receiver operating characteristic (AUROC),
   accuracy, balanced accuracy, sensitivity, specificity, positive
   predictive value (PPV), and negative predictive value (NPV), calculated
   as follows:
   [MATH:
   <msub><mrow><mi>a</mi><mi>c</mi><mi>c</mi><mi>u</mi><mi>r</mi><mi>a</mi
   ><mi>c</mi><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><mf
   rac><mrow><msub><mrow><mi>T</mi><mi>P</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><mo>+</mo><msub><mrow><mi>T</mi><mi>N</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mrow><msub><mrow><mi>T</mi><mi>P</mi></mrow><mr
   ow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow><mi>T</mi><mi>N</mi></
   mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow><mi>F</mi><mi>
   P</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow><mi>F<
   /mi><mi>N</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mfrac>
   :MATH]
   1
   [MATH:
   <msub><mrow><mi>s</mi><mi>e</mi><mi>n</mi><mi>s</mi><mi>i</mi><mi>t</mi
   ><mi>i</mi><mi>v</mi><mi>i</mi><mi>t</mi><mi>y</mi></mrow><mrow><mi>i</
   mi></mrow></msub><mo>=</mo><mfrac><mrow><msub><mrow><mi>T</mi><mi>P</mi
   ></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><mi>T</m
   i><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow>
   <mi>F</mi><mi>N</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mfrac
   > :MATH]
   2
   [MATH:
   <msub><mrow><mi>s</mi><mi>p</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>f</mi
   ><mi>i</mi><mi>c</mi><mi>i</mi><mi>t</mi><mi>y</mi></mrow><mrow><mi>i</
   mi></mrow></msub><mo>=</mo><mfrac><mrow><msub><mrow><mi>T</mi><mi>N</mi
   ></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><mi>T</m
   i><mi>N</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow>
   <mi>F</mi><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mfrac
   > :MATH]
   3
   [MATH:
   <msub><mrow><mi>b</mi><mi>a</mi><mi>l</mi><mi>a</mi><mi>n</mi><mi>c</mi
   ><mi>e</mi><mi>d</mi><mspace
   width="0.16em"></mspace><mi>a</mi><mi>c</mi><mi>c</mi><mi>u</mi><mi>r</
   mi><mi>a</mi><mi>c</mi><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msub><
   mo>=</mo><mfrac><mrow><msub><mrow><mi>s</mi><mi>e</mi><mi>n</mi><mi>s</
   mi><mi>i</mi><mi>t</mi><mi>i</mi><mi>v</mi><mi>i</mi><mi>t</mi><mi>y</m
   i></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow><mi>s</mi>
   <mi>p</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>f</mi><mi>i</mi><mi>c</mi><
   mi>i</mi><mi>t</mi><mi>y</mi></mrow><mrow><mi>i</mi></mrow></msub></mro
   w><mrow><mn>2</mn></mrow></mfrac> :MATH]
   4
   [MATH:
   <msub><mrow><mi>P</mi><mi>P</mi><mi>V</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><mo>=</mo><mfrac><mrow><msub><mrow><mi>T</mi><mi>P</mi></mrow><
   mrow><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><mi>T</mi><mi>P</
   mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow><mi>F</mi
   ><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mfrac> :MATH]
   5
   [MATH:
   <msub><mrow><mi>N</mi><mi>P</mi><mi>V</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><mo>=</mo><mfrac><mrow><msub><mrow><mi>T</mi><mi>N</mi></mrow><
   mrow><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><mi>T</mi><mi>N</
   mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><msub><mrow><mi>F</mi
   ><mi>N</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></mfrac> :MATH]
   6

   where
   [MATH:
   <msub><mrow><mi>T</mi><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   ,
   [MATH:
   <msub><mrow><mi>T</mi><mi>N</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   ,
   [MATH:
   <msub><mrow><mi>F</mi><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   ,
   [MATH:
   <msub><mrow><mi>F</mi><mi>N</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   represent the true positives, true negatives, false positives, and
   false negatives on dataset
   [MATH: <mi>i</mi> :MATH]
   .

   We compared scMalignantFinder with four automated tools: ikarus^[242]10
   (0.0.3) ([243]https://github.com/BIMSBbioinfo/ikarus),
   PreCanCell^[244]6 (1.1.0)
   ([245]https://github.com/WangX-Lab/PreCanCell), Cancer-Finder^[246]11
   ([247]https://github.com/Patchouli-M/SequencingCancerFinder), and
   CopyKAT^[248]8 (1.1.0) ([249]https://github.com/navinlabcode/copykat).
   Each tool was used with its respective preprocessing methods and
   default parameters, except for ikarus, where ‘adapt_signatures’ was set
   to True, and CopyKAT, where ‘ngene.chr’ was set to 1.

   We also retrained ikarus and Cancer-Finder using the training set and
   DEGs constructed in this study, as detailed under the sections
   “Training set calibration” and “Feature selection and logistic model
   training.” For ikarus, its gene signatures were replaced with DEGs
   identified from the calibrated training dataset. The retrained models
   were subsequently assessed for their predictive performance using the
   same independent test sets and evaluation metrics.

Characterization of “misclassified” cells in the test sets

   Cells were categorized into four groups based on initial annotations
   and scMalignantFinder predictions: (1) true malignant, labeled as
   malignant and correctly predicted as malignant, (2) predicted
   malignant, predicted as malignant but labeled as normal, (3) predicted
   normal, predicted as normal but labeled as malignant, and (4) true
   normal, labeled as normal and correctly predicted as normal. Cells
   whose prediction did not match the original labels were considered
   “misclassified”. These cells were characterized using functional
   annotation and CNV inference. Functional annotation of selected genes
   was analyzed using Enrichr
   ([250]https://maayanlab.cloud/Enrichr/)^[251]52 against the
   WikiPathways and TRRUST datasets. Pathways and transcription factors
   displaying a P-value < 0.05 were deemed significantly enriched. The CNV
   profiles were inferred using the ‘infercnpy’ package (0.4.2) available
   at [252]https://github.com/icbi-lab/infercnvpy. During the inference
   process, we utilized the following parameters: a window_size of 250 and
   excluded two chromosomes (chrX and chrY). The normal reference cells
   were defined as cells annotated as normal based on the original study.
   The calculation of the CNV score for each cell was performed using the
   ‘infercnvpy.tl.cnv_score’ function.

Application of scMalignantFinder to single-cell data spanning multiple stages
of carcinogenesis

   We analyzed scRNA-seq datasets representing different stages of
   carcinogenesis in colorectal^[253]16,[254]17 and gastric
   cancer^[255]32. For colorectal cancer, epithelial cell profiles from
   normal mucosal tissues, colorectal polyps, and tumors were processed.
   For gastric cancer, samples included non-atrophic gastritis (NAG),
   chronic atrophic gastritis (CAG), intestinal metaplasia (IM), and early
   gastric cancer (EGC). Using scMalignantFinder, malignancy probability
   and classification were predicted for each cell. Percentages of
   predicted malignant cells and malignant probabilities were computed and
   compared across the progression stages of both cancers.

Feature importance analysis

   The ‘coef_’ attribute of the scMalignantFinder model represents the
   coefficients assigned to each feature within the model. These
   coefficients serve as indicators of feature importance and are
   determined by their absolute values. Subsequently, all features are
   sorted in descending order based on their importance, and the
   highest-ranked features are selected to construct a classification
   model. The significance of each performance metric, when compared to
   the original model, is evaluated using a paired-sample t-test.
   Furthermore, the features of the model were categorized as unique or
   common based on their presence across cancer types (Supplementary
   Data [256]3). Unique DEGs refer to genes that are present in only one
   specific cancer type, while common DEGs are genes that are found in two
   or more cancer types.

Association of DEGs with clinical features and drug targets

   To analyze the association between DEG gene sets and overall survival,
   we followed the methodology of a previous study^[257]23. In this
   approach, the calculated effect size was defined as the hazard ratio,
   computed through Cox regression, with the P value obtained via an
   overall likelihood test^[258]23. To test the association between DEG
   gene sets and drug targets, we collected 704 targets with approved
   drugs from the PHAROS database (labeled as Tclin)^[259]34 and computed
   the odds ratio following the methods of a recent study^[260]53.

scMalignantFinder workflow for identifying malignant regions

   For the ST data obtained from previous studies^[261]38–[262]41
   (Supplementary Data [263]1), we applied scMalignantFinder to identify
   malignant regions through a multi-step process that integrates
   malignancy probability, gene signature activity and image-based
   features. Each spot in the ST data was assigned two attributes:
   malignancy probability, calculated using the logistic regression
   classifier trained on single-cell transcriptomics data, and malignant
   signature activity, which quantifies the expression of
   malignant-upregulated DEGs using the AUCell function from the pySCENIC
   package. For datasets with paired histological images, such as
   hematoxylin and eosin (H&E) staining or immunofluorescence, an
   additional image score was computed using ‘calculate_image_features’
   function in Squidpy^[264]54 (1.3.1)
   ([265]https://squidpy.readthedocs.io/en/stable/index.html). These three
   features—malignancy probability, malignant signature activity, and
   image score—were combined into a feature matrix.

   Next, hierarchical clustering was performed on this feature matrix
   using the Ward method to classify spots into three clusters,
   considering that solid tumors typically consist of malignant regions,
   normal epithelial regions, and non-epithelial regions. To determine the
   identity of each cluster, a rank-based vote aggregation method was
   used. The average score for each feature was first calculated within
   each cluster. Clusters were then ranked for each feature based on these
   average scores. The ranks across the three features were averaged for
   each cluster. The cluster with the top average rank was preliminarily
   identified as the malignant region, while the other two clusters were
   provisionally designated as normal regions. If two or more clusters
   shared the highest average rank, the cluster with the highest
   malignancy probability was assigned as the malignant region.

   To refine the preliminary classifications, a spatial neighbor graph was
   constructed using Squidpy’s ‘spatial_neighbors’ function. This graph
   enabled the reassignment of spot classifications based on spatial
   neighborhood relationships. Specifically, if more than half of a spot’s
   neighbors within the closest distance belonged to a particular cluster,
   the spot was reassigned to that cluster. If not, the original
   classification was retained. The images with pathological annotations
   were extracted from previous studies^[266]11,[267]37,[268]41.

Benchmarking against tumor spatial transcriptomics analysis methods

   We benchmarked scMalignantFinder against two recently developed methods
   for tumor ST analysis. The first method, Cancer-Finder^[269]11,
   utilizes a domain generalization-based deep learning approach
   specifically designed for identifying malignant regions in ST data. We
   employed the pretrained Cancer-Finder model along with its associated
   checkpoints optimized for ST datasets
   ([270]https://github.com/Patchouli-M/SequencingCancerFinder). Both
   scMalignantFinder and Cancer-Finder were applied to the eight ST
   datasets, and performance metrics, including accuracy and balanced
   accuracy, were calculated for comparison. The second method,
   Cottrazm^[271]42 (0.1.1) ([272]https://github.com/Yelab2020/Cottrazm),
   was used to delineate tumor boundaries in three ST slices. Cottrazm
   classifies spots into malignant, normal, or boundary regions. To
   evaluate the consistency between scMalignantFinder and Cottrazm, we
   compared the predictions of malignant and normal spots between the two
   methods. Additionally, we analyzed the spatial relationships between
   tumor boundary spots identified by Cottrazm and neighboring malignant
   or normal spots predicted by scMalignantFinder, assessing proximity
   within one or two spot-size distances.

Functional analysis of malignant regions

   To validate the identified malignant regions, we performed functional
   analyses, including CNV profiling, cell type deconvolution, and tumor
   signature activity assessment. CNV scores for each spot in the ST
   datasets were calculated using the STCNV and STCNVScore function
   integrated within the Cottrazm pipeline, which calls inferCNV
   ([273]https://github.com/broadinstitute/inferCNV). Cell type
   deconvolution was conducted using SpaCET^[274]37 (1.2.0)
   ([275]https://github.com/data2intelligence/SpaCET) to estimate the
   proportion of malignant cells, stromal cells, and immune cells within
   the predicted malignant and normal regions. Additionally, tumor
   signature activity was assessed using gene sets previously published
   for breast cancer and renal cell carcinoma^[276]42. The AUCell function
   from the pySCENIC package was used to calculate signature activity for
   each spot, quantifying the expression of tumor-associated genes in
   malignant and normal regions.

Statistics and reproducibility

   All statistical analyses were performed using Python (3.10) with
   appropriate libraries for data processing, statistical testing, and
   visualization. Statistical significance was assessed using two-sided
   Wilcoxon rank-sum tests, two-side Student’s t tests, or Mann-Whitney U
   tests where applicable, as indicated in figure legends. For survival
   analyses, Cox regression and likelihood ratio tests were employed to
   determine statistical significance. Significance levels are denoted as
   *P  <  0.05, **P  <  0.01, ***P  <  0.001, and ****P  <  0.0001.

Reporting summary

   Further information on research design is available in the [277]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [278]Supplementary Information^ (3.3MB, pdf)
   [279]Supplementary Data 1^ (12.4KB, xlsx)
   [280]Supplementary Data 2^ (20.5KB, xlsx)
   [281]Supplementary Data 3^ (55.9KB, xlsx)
   [282]Supplementary Data 4^ (103.6KB, xlsx)
   [283]Supplementary Data 5^ (75.7MB, xlsx)
   [284]Supplementary Data 6^ (42.5MB, xlsx)
   [285]Supplementary Data 7^ (86.3KB, xlsx)
   [286]Supplementary Data 8^ (1.5MB, xlsx)
   [287]Supplementary Data 9^ (75.8MB, xlsx)
   [288]Supplementary Data 10^ (42.6MB, xlsx)
   [289]Supplementary Data 11^ (3.1MB, xlsx)
   [290]Supplementary Data 12^ (104.3KB, xlsx)
   [291]Supplementary Data 13^ (1.1MB, xlsx)
   [292]Supplementary Data 14^ (1.2MB, xlsx)
   [293]Supplementary Data 15^ (2.9MB, xlsx)
   [294]42003_2025_7942_MOESM17_ESM.docx^ (17.2KB, docx)

   Description of Additional Supplementary Files
   [295]Reporting Summary^ (72KB, pdf)
   [296]Transparent Peer Review file^ (4.1MB, pdf)

Acknowledgements