Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for
characterizing tumor heterogeneity, yet accurately identifying
malignant cells remains challenging. Here, we propose
scMalignantFinder, a machine learning tool specifically designed to
distinguish malignant cells from their normal counterparts using a
data- and knowledge-driven strategy. To develop the tool, multiple
cancer datasets were collected, and the initially annotated malignant
cells were calibrated using nine carefully curated pan-cancer gene
signatures, resulting in over 400,000 single-cell transcriptomes for
training. The union of differentially expressed genes across datasets
was taken as the features for model construction to comprehensively
capture tumor transcriptional diversity. scMalignantFinder outperformed
existing automated methods across two gold-standard and eleven
patient-derived scRNA-seq datasets. The capability to predict
malignancy probability empowers scMalignantFinder to capture dynamic
characteristics during tumor progression. Furthermore,
scMalignantFinder holds the potential to annotate malignant regions in
tumor spatial transcriptomics. Overall, we provide an efficient tool
for detecting heterogeneous malignant cell populations.
Subject terms: Computational models, Tumour heterogeneity, Machine
learning
__________________________________________________________________
Editorial Summary: scMalignantFinder accurately identifies malignant
cells in single-cell RNA-seq and spatial transcriptomics by leveraging
pan-cancer gene signatures.
Introduction
Tumors are highly heterogeneous diseases, driven by molecular
disturbances at the genetic, epigenetic, and gene expression levels
within tumor cells^[28]1. Accurately characterizing tumor heterogeneity
is crucial for understanding the mechanisms of cancer development and
devising effective and durable therapeutic strategies^[29]2.
Single-cell RNA sequencing (scRNA-seq) technology has revolutionized
the field of cancer biology^[30]3 by enabling the measurement of the
complete transcriptome of each cell within a tumor tissue, thus
facilitating the identification of distinct cell types and
states^[31]4.
Identifying malignant cells and distinguishing them from other
non-malignant cells is a crucial step in dissecting tumor heterogeneity
from cancer scRNA-seq experiments^[32]3,[33]5,[34]6. This process aids
in gaining a deep understanding of transcriptional reprogramming
underlying the malignant transformation of cells during cancer
development. Copy number variation (CNV) inference methods are commonly
employed to identify malignant cells^[35]7,[36]8, and their reliability
strongly depends on how well the observed gene expression deviations
correlate with underlying copy number changes rather than other
biological and technical factors^[37]1. Furthermore, CNV inference
often requires the specification of appropriate normal reference cells,
hindering the complete automation of the analysis process. Therefore,
relying solely on inferred CNVs has limitations in annotating malignant
cells, especially in cancers with minimal genomic structural variation
or other unknown origins^[38]5,[39]9. Regarding the former, a potential
alternative is to identify malignant cells through smaller-scale
genetic alterations, such as single nucleotide variants (SNVs).
However, detecting SNVs from scRNA-seq data requires sufficient read
coverage over the expressed exons, making it primarily applicable to
the scRNA-seq protocols that capture full-length rather than 3’ or 5’
transcripts^[40]1.
Given these limitations, recent advancements have introduced automated
approaches for identifying malignant cells through supervised learning,
utilizing training data with cell identities provided by the original
studies^[41]6,[42]10,[43]11. However, the performance of these
supervised learning methods would be greatly compromised by the lack of
optimal reference datasets that accurately annotate the malignancy
status of each cell^[44]1,[45]12,[46]13. Additionally, current
approaches^[47]6,[48]10 often select genes that consistently exhibit
differential expression across datasets as model features, potentially
overlooking substantial inter-tumor heterogeneity. Consequently, there
is an urgent need for specialized methods that are guided by both
high-quality data and well-established knowledge, incorporated with
accurately labelled datasets and robust feature selection strategy to
effectively identify heterogeneous malignant cells from cancer
scRNA-seq data.
This study introduces scMalignantFinder, a machine learning-based
automated classifier, specially designed to distinguish malignant cell
from their originating normal cells, rather than from all other
non-malignant cells as existing methods usually do. We systemically
reviewed multiple scRNA-seq datasets with cell type annotations and
calibrated the initially annotated malignant cells using nine carefully
curated cancer gene signatures that exhibited consistent
transcriptional patterns across diverse cancer types to construct the
gold standard training set. The union set of differentially expressed
genes (DEGs) between the calibrated malignant cells and normal cells
across datasets was adopted to build a classification model.
scMalignantFinder demonstrated superior performance compared to current
automated methods on independent test sets ranging cancer cell lines,
non-cancer tissues, and eleven cancer single-cell datasets covering
nine cancer types. The capability of scMalignantFinder to predict
malignancy probability empowers it to capture dynamic characteristics
during tumor progression. Moreover, we extended the classifier to
discover malignant spots in spatial transcriptomics (ST) data without
retraining, achieving high concordance with pathologists’ annotations
across multiple cancer ST slices. These findings underscore
scMalignantFinder as a versatile and generalizable tool for
investigating malignant cell biology in cancer research.
Results
An overview of scMalignantFinder
We developed scMalignantFinder to distinguishing malignant cells from
their originating normal cells. Our focus was on carcinomas, which
originate from epithelial tissue and accounts for 80 ~ 90% of all
cancer cases. To achieve this, we collected five publicly available
single-cell RNA-seq (scRNA-seq) datasets with cell type
annotations^[49]14–[50]18, encompassing four cancer types: lung,
colorectal, liver, and gastric cancer. These datasets, containing
expression profiles over 400,000 malignant and normal epithelial cells,
served as the basis for constructing our training set (Supplementary
Data [51]1). scMalignantFinder was designed through two major steps:
(1) defining the training dataset by calibrating malignant cells using
cancer hallmark-based gene signatures, and (2) selecting model features
by taking the union of significantly differentially expressed genes
(DEGs) across datasets (Fig. [52]1).
Fig. 1. Workflow of scMalignantFinder.
[53]Fig. 1
[54]Open in a new tab
scMalignantFinder identifies malignant cells through two steps: (1)
constructing a training set by calibrating malignant cells using nine
curated cancer gene signatures across over 400,000 epithelial cells
from four cancer types, and (2) performing feature selection by taking
the union of differentially expressed genes (DEGs) across datasets,
which captures both common DEGs and those unique to specific datasets.
These refined steps provide the foundation for building a logistic
regression model that classifies cells as malignant or normal based on
single-cell RNA-seq (scRNA-seq) data.
A key challenge in constructing a training set for supervised learning
in the context of identifying malignant cells is the lack of a
gold-standard annotation for them. To address this, we adopted a gene
signature-based approach to uniformly calibrate the malignant cells
across collected datasets (Fig. [55]1). Cancer hallmarks represent a
set of common functional capabilities that human cells acquire during
the transition from normal to tumorigenic states^[56]19, making them
potential common markers of malignant cells. Based on this rationale,
we curated 29 gene signatures associated with cancer hallmarks from
both knowledge-based and data-driven sources^[57]20–[58]22.
Gene set activity analysis was applied to the five scRNA-seq datasets
and bulk RNA-seq datasets of 16 cancer types from The Cancer Genome
Atlas (TCGA) (Supplementary Data [59]1). Differential activity analysis
was then performed for each gene signature between malignant and normal
epithelial cells, as well as tumor and adjacent normal samples. Nine
out of the 29 gene signatures exhibited consistent trends across
different datasets at both single-cell and bulk levels, thus
representing pan-cancer transcriptional characteristics (Supplementary
Fig. [60]1; Supplementary Data [61]2). Among these signatures, eight
were upregulated in both malignant cells and tumor samples,
representing cellular processes such as cell cycle, DNA damage, and DNA
repair; while one signature, known to be enriched in normal
specimens^[62]21, was consistently downregulated. Using the activity of
these nine cancer gene signatures in normal epithelial cells as a
baseline, we filtered out 2.5% of the original malignant cells
(Supplementary Fig. [63]2a) and obtained the final training set
consisting of 416,774 cells, including 134,053 malignant cells and
282,721 normal epithelial cells.
Next, we conducted differential gene expression analysis between
malignant cells and normal epithelial cells within each dataset in the
training set, and identified a range of 327 to 2226 DEGs per dataset
(Fig. [64]1). Compared to downregulated DEGs, upregulated DEGs in
malignant cells tended to be more common across different cancer types
(Supplementary Fig. [65]2b). Notably, 68.3% of upregulated DEGs and
87.4% of downregulated DEGs were found to be specific to individual
datasets. Considering that tumors exhibit both shared functional
characteristics^[66]19 and high heterogeneity^[67]23, we incorporated
features that capture both shared and distinct characteristics of
malignant cells. Specifically, we retained DEGs that are common across
datasets, representing universal features of malignant cells, as well
as dataset-specific DEGs to account for unique characteristics of
individual tumors. Finally, a set of 2707 DEGs, comprising 1656
upregulated genes and 1051 downregulated genes in malignant cells
(Supplementary Data [68]3), were used to construct the input expression
matrix for the logistic regression classifier- scMalignantFinder
(Fig. [69]1).
Performance validation by diverse scRNA-seq datasets
To evaluate the performance of scMalignantFinder, we collected multiple
independent scRNA-seq datasets from various sample sources, including
cancer cell lines, tumor biopsies, pre-cancerous tissues,
tumor-adjacent tissues, and deceased non-cancer donor specimens. The
collected test set used for validation consisted of 607,109 cells,
including 260,734 malignant cells and 346,375 normal tissue cells, from
633 samples spanning 22 cancer types. We compared scMalignantFinder
with four other tools: PreCanCell^[70]6, ikarus^[71]10,
Cancer-Finder^[72]11, and CopyKAT^[73]8. and evaluated the performance
using seven metrics, including the area under the receiver operating
characteristic curve (AUROC), accuracy, balanced accuracy (the average
of sensitivity and specificity), sensitivity, specificity, positive
predictive value (PPV), and negative predictive value (NPV).
We first evaluated scMalignantFinder using two expression datasets
including 53,513 malignant cells from 198 cancer cell lines^[74]24 and
104,148 normal epithelial cells from 18 tissues from non-cancer
donors^[75]25, which served as gold standard datasets due to their
evident sample sources (Supplementary Data [76]5 and [77]9). When
pooling all cells from each dataset together, scMalignantFinder
achieved a sensitivity of 1.000 in identifying malignant cells and a
specificity of 0.786 in identifying normal epithelial cells,
outperforming other methods (PreCanCell: 0.996 and 0.503; ikarus: 0.834
and 0.642; Cancer-Finder: 1.000 and 0.022; CopyKAT: 0.594 and 0.397)
(Fig. [78]2a). When evaluating individual cancer cell line,
scMalignantFinder demonstrated high and consistent sensitivity in
detecting malignant cells, with a minimum sensitivity of 0.98
(Fig. [79]2b), indicating its robustness across diverse cell lines and
cancer types.
Fig. 2. Performance evaluation of scMalignantFinder.
[80]Fig. 2
[81]Open in a new tab
a Barplot showing the sensitivity of each method applied to the cancer
cell line scRNA-seq dataset (left) and the specificity of each method
applied to the non-cancer tissue scRNA-seq dataset (right). Each point
represented a cell line (left) or a tissue (right). Statistical
significance was determined by a two-sided Wilcoxon rank-sum test.
n = 198 cell lines (left) or 18 tissues (right); bar height represents
the mean value; whiskers extend to the mean ± 95% confidence intervals.
b Scatterplot showing the mean sensitivity (calculated from the cancer
cell line scRNA-seq dataset) and mean specificity (calculated from the
non-cancer tissue scRNA-seq dataset) for each method presented in a. c
Boxplot showing seven metrics of each method on 13 scRNA-seq datasets
(Supplementary Data [82]1). “Retrained” indicates the models were
re-trained using the calibrated training set from this study, while
“DEGs in this study as the source of signatures” refers to ikarus being
retrained with gene signatures derived from the DEGs identified in this
study. Statistical significance was determined by a two-sided Wilcoxon
rank-sum test. n = 13 datasets; center line represents the median
value; box limits indicate the upper and lower quartiles; whiskers
extend to 1.5× the interquartile range. Source data are provided as a
Source Data file (Supplementary Data [83]5 and [84]9).
Furthermore, we applied scMalignantFinder to 11 additional scRNA-seq
datasets from cancer patients^[85]16,[86]17,[87]23,[88]26–[89]28 (see
Supplementary Data [90]1), involving nine cancer types, three of which
were included in the training set (lung, colorectal, and liver cancer),
and six not included (head and neck, kidney, prostate, breast,
pancreatic, and ovarian cancer). The original cell annotations provided
in the respective studies were used as true labels for validation.
Across all 13 datasets, scMalignantFinder achieved an average accuracy
of 0.824 and an average balanced accuracy of 0.732, surpassing other
methods (PreCanCell: 0.713 and 0.613; ikarus: 0.446 and 0.533;
Cancer-Finder: 0.654 and 0.531; CopyKAT: 0.427 and 0.571) (Fig. [91]2c;
Supplementary Fig. [92]3a). Notably, scMalignantFinder exhibited the
highest accuracy in 6 of 13 datasets and the highest balanced accuracy
in 7 out of 11 datasets (Fig. [93]2a; Supplementary Fig. [94]3b).
While PreCanCell and Cancer-Finder demonstrated comparable or superior
sensitivity compared to scMalignantFinder, they displayed lower
specificity, and higher false positive rate (Fig. [95]2b, c).
Conversely, ikarus and CopyKAT exhibited the highest specificity but
the lowest sensitivity (Fig. [96]2b, c). The combined high sensitivity
and PPV provided by scMalignantFinder highlight its ability to
accurately identify malignant cells while maintaining a low false
positive rate. This is a fundamental prerequisite for gaining an
in-depth understanding of tumor heterogeneity through single-cell data
analysis. Overall, scMalignantFinder outperformed the other four
methods in distinguishing malignant cells from their normal
counterparts.
Next, we retrained Cancer-Finder and ikarus using our calibrated
training set to explore the effects of training set design strategy on
model performance (Fig. [97]2c; Supplementary Fig. [98]3a, b). After
retraining, Cancer-Finder exhibited a noticeable improvement, with its
average accuracy increasing from 0.654 to 0.828 and average balanced
accuracy rising from 0.531 to 0.669. For ikarus, we also replaced its
gene signatures with the DEGs identified from our training set.
Retraining enhanced ikarus’s performance, with its average accuracy
increasing from 0.446 to 0.488 and average balanced accuracy rising
from 0.533 to 0.612. These performance improvements were primarily
reflected in enhanced specificity for identifying normal cells,
underscoring the importance of training data calibration and rational
feature selection.
To assess whether the performance of scMalignantFinder is influenced by
sequencing depth, we reanalyzed its prediction on samples with
different median gene counts across multiple test sets. We measured
Spearman correlation between median gene count and balanced accuracy
within each dataset, and did not find any significant correlation in
all the datasets (Supplementary Fig. [99]4), to some extent,
demonstrating the robustness of scMalignantFinder.
Transcriptional characteristics of misclassified cell populations
Given that the prediction by scMalignantFinder was not entirely
consistent with the original label, we investigated the transcriptional
characteristics of the misclassified cells. The cells in the test set
were classified into four categories: (1) true malignant, labeled and
correctly predicted as malignant, (2) predicted malignant, labeled as
normal but wrongly predicted as malignant, (3) predicted normal,
labeled as malignant but wrongly predicted as normal, and (4) true
normal, labeled and correctly predicted as normal (Supplementary
Data [100]6 and [101]10).
We first examined the expression profiles of the curated cancer gene
signatures for these four cell categories (Supplementary Data [102]2).
As expected, in the liver dataset^[103]28, a substantial increase in
the activity of multiple upregulated signatures was observed in
predicted malignant cells compared to true normal cells; conversely,
predicted normal cells exhibited an enrichment of downregulated
signature, surpassing the activity of true malignant cells
(Fig. [104]3a). This trend was consistent across several other datasets
(Supplementary Fig. [105]5a). Statistical analysis across the test sets
indicated that predicted malignant cells ranked second in activity
levels among seven out of nine cancer gene signatures, closely
following true malignant cells. Similarly, predicted normal cells
demonstrated activity levels in seven signatures that closely resembled
those of true normal cells (Fig. [106]3b). Furthermore, Copy Number
Variation (CNV) has been closely associated with the development and
progression of various cancers due to its impact on gene
expression^[107]29. By performing single-cell CNV inference^[108]7, we
discovered that predicted malignant cells exhibited CNV profiles
similar to true malignant cells, with significantly higher CNV levels
compared to true normal cells (Supplementary Fig. [109]5b, c).
Fig. 3. Characterization of transcriptional features in misclassified cells.
[110]Fig. 3
[111]Open in a new tab
a Boxplot showing activity of nine curated cancer gene signatures in
four groups of cells (true malignant, predicted malignant, predicted
normal, true normal) in the liver dataset. Statistical significance was
determined by a two-sided Student’s t test (****P < 0.0001). Center
line represents the median value; box limits indicate the upper and
lower quartiles; whiskers extend to 1.5× the interquartile range. b
Barplot showing reverse ranking of activity of nine curated cancer gene
signatures in four groups of cells across five test sets including
lung, liver, and colorectal cancer. Reverse ranking is based on the
mean activity among the four groups. For a given dataset and a group of
cells, if the mean of their activities is the highest among the four
groups, their reverse ranking is 4; if the mean activity is the second
highest, the reverse ranking is 3, and so on. n = 5 datasets; bar
height represents the mean value; whiskers extend to the mean ± 95%
confidence intervals. c Boxplot showing proportion concordance of
differentially expressed genes (DEGs) in true malignant and predicted
malignant cells (vs true normal cells). d, e Pathway (d) and
transcription factor (e) enrichment of DEGs upregulated in both true
malignant and predicted malignant cells. Source data are provided as a
Source Data file (Supplementary Data [112]6, [113]10, and [114]11).
To further characterize the transcriptional features of predicted
malignant cells, we conducted differential gene expression analysis
between the two types of malignant cells and true normal cells. Our
analysis revealed that among the identified DEGs, 20% of the genes
upregulated in true malignant cells also exhibited elevated expression
in predicted malignant cells. Similarly, 42% of the downregulated genes
in true malignant cells showed consistent downregulation in predicted
malignant cells (Fig. [115]3c). Pathway enrichment analysis revealed
that the commonly upregulated DEGs were mainly involved in angiogenesis
and metabolic reprogramming (Fig. [116]3d), which are typical hallmarks
of cancer^[117]19. Furthermore, we predicted the upstream regulators of
the upregulated DEGs and identified the highly ranked transcription
factors (TFs), such as HIF1A^[118]30 and TWIST1^[119]31 (Fig. [120]3e).
These TFs have been reported to be upregulated in various cancers and
play roles in downstream cancer-promoting processes through
transcriptional regulation such as altered substrate metabolism,
angiogenesis, tumor invasion, and metastasis.
Overall, these results indicate that predicted malignant cells exhibit
transcriptional characteristics of malignant cells. The presence of
false positives (predicted malignant cells) and false negatives
(predicted normal cells) are probably due to the continuously
transitional states of cells between normal and malignancy at
transcriptional profiles.
Application of scMalignantFinder in elucidating cell states during
carcinogenesis
Since scMalignantFinder can also report the malignancy probability of
each cell, we set out to analyze the malignant state during cancer
progression. To this end, we applied scMalignantFinder to colorectal
cancer datasets^[121]16,[122]17, which comprises epithelial cell
expression profiles from 36 normal mucosal tissues, 23 colorectal
polyps (precursors of colorectal carcinoma), and 6 tumors, and
predicted the malignancy probability and classification for each
epithelial cell (Supplementary Fig. [123]6a; Supplementary
Data [124]11). It was shown that the percentage of predicted malignant
cells was nearly 0% in normal mucosa, significantly increased in
colorectal polyps, and peaked in tumors. Similarly, the malignancy
probability exhibited a clear increasing trend from normal mucosa to
colorectal polyps, and to tumors.
In addition, we applied scMalignantFinder to an scRNA-seq dataset
representing different stages of gastric cancer progression^[125]32
(Supplementary Fig. [126]6b), which included samples from three
non-atrophic gastritis (NAG) biopsies as normal controls, three chronic
atrophic gastritis (CAG) biopsies and six intestinal metaplasia (IM)
biopsies both representing precancerous lesions, and one early gastric
cancer (EGC) biopsy. The analysis revealed that the percentage of
predicted malignant cells was 0% in NAG, increased in CAG and IM, and
reached its highest level in EGC. Notably, both the percentage of
malignant cells and the malignancy probability in IM were significantly
higher than in CAG, consistent with the established fact that IM
carries a higher risk of gastric cancer compared to CAG^[127]33.
The above results suggested that scMalignantFinder can shed lights on
the dynamics underlying the malignant transformation of epithelial
cells.
The roles of different types of DEGs in identifying malignant cells
We conducted feature importance assessment to quantify the contribution
of each feature used in scMalignantFinder (Supplementary Fig. [128]7a;
Supplementary Data [129]4, [130]7, and [131]12). The 2707 DEGs were
categorized into common DEGs and unique DEGs based on whether they were
differentially expressed across at least two cancer types in the
training set (Supplementary Data [132]3). It was found that the
commonly upregulated or downregulated DEGs in malignant cells exhibited
significantly higher feature importance than the unique DEGs
(Fig. [133]4a; Supplementary Fig. [134]7b), supporting the notion that
there are conserved mechanisms underlying carcinogenesis across cancer
types^[135]19.
Fig. 4. Contribution of each DEG to model training and prediction.
[136]Fig. 4
[137]Open in a new tab
a Boxplot showing feature importance of two types of DEGs (common and
unique). Unique DEGs are exclusive to a specific cancer type, while
common DEGs are found in at least two cancer types (Supplementary
Data [138]3). The importance of each DEG is determined by the absolute
value of the model coefficient (Supplementary Data [139]4). Statistical
significance was determined by a two-sided Student’s t test (****:
P < 0.0001). Center line represents the median value; box limits
indicate the upper and lower quartiles; whiskers extend to 1.5× the
interquartile range. b Comparison of the original model and a model
using the top 1500 important DEGs on seven prediction metrics. n = 13
datasets; center line represents the median value; box limits indicate
the upper and lower quartiles; whiskers extend to 1.5× the
interquartile range. c Pie chart illustrating the statistics of how
many cancer types in the training set the top 1500 important DEGs
(left) and the remaining DEGs (right) were found in. d Comparison among
the original model and models using either common DEGs or unique DEGs
as features. Statistical significance was determined by a two-sided
Wilcoxon rank-sum test (*: P < 0.05, **: P < 0.01, ***:
P < 0.001). n = 13 datasets; center line represents the median value;
box limits indicate the upper and lower quartiles; whiskers extend to
1.5× the interquartile range. Source data are provided as a Source Data
file (Supplementary Data [140]7 and [141]12).
Subsequently, we investigated whether the classification performance of
the original model could be replicated using a subset of DEGs with
top-ranked importance. As expected, the classification performance
improved as the number of important genes increased (Supplementary
Fig. [142]7c); A model with approximately 1500 input genes achieved
performance comparable to the original 2707-feature model across all
metrics (Fig. [143]4b; Supplementary Fig. [144]7c, d). These top 1500
DEGs included a larger proportion of common DEGs (28%) compared to the
remaining DEGs (20%) (Fig. [145]4c).
To further explore the role of common and unique DEGs in identifying
malignant cells, we constructed separate models using each type of
DEGs. The model based on the common DEGs performed comparably to the
original model, showing similar AUROC and balanced accuracy
(Fig. [146]4d). However, the model based on the unique DEGs exhibited
significantly poorer performance (Fig. [147]4d). These findings suggest
that the common transcriptional signatures among malignant cells of
different cancer types are more crucial for the performance of
scMalignantFinder compared to those that are specific to individual
cancer types.
Interestingly, models constructed solely using common DEGs demonstrated
better specificity but sacrifice some sensitivity, while models
constructed using cancer-specific DEGs exhibited better sensitivity
(Fig. [148]4d). Unlike previous methods that exclusively retain common
DEGs^[149]6,[150]10, scMalignantFinder integrates both types of DEGs,
resulting in a more comprehensive discriminatory performance, as
evidenced by its simultaneous reduction in both false positive and
false negative rates.
We also explored the association between upregulated DEGs and clinical
features. Following a previous study^[151]23, we computed the activity
of DEGs in bulk TCGA samples and discovered their associations with
worse prognosis across multiple cancer types (Supplementary
Fig. [152]7e), suggesting that the transcriptional signatures
upregulated in malignant cells may involve in cancer progression.
Furthermore, we analyzed the overlap of DEGs with 704 targets of
approved drugs with known mechanism in the PHAROS database^[153]34.
Compared to unique DEGs (OR = 1.0), common DEGs (OR = 1.9) and the DEGs
that contributed most to the model predictions (OR = 1.5) exhibited
significant enrichment among approved drug targets (Supplementary
Fig. [154]7e).
Application of scMalignantFinder in tumor spatial transcriptomics
The presence of spatial heterogeneity within tumors, resulting from
clonal expansion and the intricate interplay between cancer cells and
the surrounding microenvironment, underscores the importance of
analyzing the transcriptome in a spatial context to understand key
oncological processes such as tumor progression and metastasis^[155]35.
In recent years, the rapidly advancing technology of spatial
transcriptomics (ST) sequencing has emerged as a powerful tool for
unraveling the complex spatial architecture of the tumor
microenvironment^[156]36. Identifying malignant regions within this
context is crucial yet challenging, as ST sequencing techniques capture
barcode spots with diameters ranging from 55–100 μm, potentially
measuring a mixture of signals from multiple cells belonging to
different lineages^[157]37.
We explored the potential of scMalignantFinder to identify malignant
spots in ST data (Fig. [158]5a). First, the pretrained model was used
to calculate two key features for each spot in the ST dataset:
malignancy probability and malignant signature activity, the latter
representing the overall expression of upregulated DEGs (Supplementary
Data [159]3). Additionally, an image score was derived for each spot
based on the corresponding histological image. These three features -
malignancy probability, malignant signature activity, and image score -
formed a feature matrix. hierarchical clustering was then conducted on
the feature matrix to group the spots into three clusters, considering
that carcinomas are typically composed of malignant, normal epithelial,
and non-epithelial regions. To determine the assignment of a region
cluster, we employed a rank aggregation method. First, we calculated
the average score for each feature within each cluster. Next, we ranked
the clusters for each feature based on these average scores. The ranks
of the three features were then averaged for each cluster. The cluster
with the top average rank was preliminarily identified as the malignant
region, while the remaining two clusters were provisionally designated
as normal regions. Finally, spatial neighborhood relationships were
incorporated to refine these classifications, yielding the final region
predictions (Fig. [160]5a).
Fig. 5. Workflow for identifying malignant regions in tumor spatial
transcriptomics (ST) data.
[161]Fig. 5
[162]Open in a new tab
a The pretrained scMalignantFinder model calculates the malignant
probability and malignant signature activity for each spot.
Additionally, image scores derived from histological images are
included as the third feature. The three features (malignant
probability, malignant signature activity, and image score) are
combined into a feature matrix for hierarchical clustering, grouping
spots into three clusters. A rank aggregation method was employed to
rank clusters based on their average ranks across three features,
enabling their preliminary classification as malignant or normal.
Spatial neighborhood relationships are then incorporated to refine
region annotations, resulting in the final classification of malignant
and normal spots. b–d Annotations of ST spots by different methods,
including annotation by pathologists (b), determination of malignancy
probability and signature activity (c), and final classification (d). e
Boxplot showing four metrics of scMalignantFinder on eight ST datasets
(Supplementary Data [163]1). n = 8 datasets; center line represents the
median value; box limits indicate the upper and lower quartiles;
whiskers extend to 1.5× the interquartile range. f Spatial
visualization of copy number variation (CNV) scores in ST slices of
breast cancer (left) and oral squamous cell carcinoma (right). g
Boxplot comparison of CNV scores between predicted malignant and normal
regions in eight ST datasets. n = 6 datasets; center line represents
the median value; box limits indicate the upper and lower quartiles;
whiskers extend to 1.5× the interquartile range. h Tumor signature
activity (left) and boxplot comparison between predicted malignant and
normal regions (right) in an ST slice of breast cancer. Statistical
significance was determined by a two-sided Student’s t test (****:
P < 0.0001). Center line represents the median value; box limits
indicate the upper and lower quartiles; whiskers extend to 1.5× the
interquartile range. Source data are provided as a Source Data file
(Supplementary Data [164]8, [165]13–[166]15).
We applied scMalignantFinder to predict malignant regions in eight
tumor ST datasets^[167]38–[168]41, encompassing five cancer types:
breast cancer, oral squamous cell carcinoma, renal cell carcinoma,
prostate cancer, and squamous cell carcinoma. In these datasets, the
malignant regions had been annotated by pathologists in prior
studies^[169]11,[170]37,[171]41 (Supplementary Data [172]8 and
[173]13). Our findings revealed that the two malignant features
predicted by scMalignantFinder - malignancy probability and signature
activity - were highly enriched in the annotated malignant spots in
seven out of the eight ST slices (Fig. [174]5b, c; Supplementary
Fig. [175]8a, b). Furthermore, the predicted results showed strong
concordance with expert annotations, achieving an average accuracy of
0.783 and an average balanced accuracy of 0.800 (Fig. [176]5b, e;
Supplementary Fig. [177]8). Compared to a prior logistic regression
model relying solely on malignancy probability to identify malignant
cells, the current approach, which integrates multi-feature clustering
and incorporates spatial neighborhood relationships, significantly
improved the performance (Supplementary Fig. [178]9a; Supplementary
Data [179]14).
Next, we benchmarked scMalignantFinder against two recently developed
methods for tumor ST analysis. The first one is Cancer-Finder^[180]11,
specifically designed to identify malignant regions in ST data
(Supplementary Data [181]14). scMalignantFinder demonstrated comparable
performance across the eight ST slices, with average accuracy and
balanced accuracy exceeding Cancer-Finder by 0.067 and 0.062,
respectively (Supplementary Fig. [182]9b, c). The second method,
Cottrazm^[183]42, is known for its capability to delineate tumor
boundaries. When applied to three ST slices (Supplementary
Fig. [184]9d), 87% ~ 99% of the Cottrazm inferred tumor boundary spots
were adjacent to a mix of scMalignantFinder predicted malignant and
normal spots (Supplementary Fig. [185]9e, f), aligning with the
definition of tumor boundaries connecting malignant and normal regions.
To further validate the reliability of scMalignantFinder identified
malignant regions, we performed multidimensional functional analyses,
beginning with CNV profiling (Supplementary Data [186]8 and [187]15).
Malignant regions are expected to exhibit elevated CNV levels. By using
inferCNV^[188]7 to calculate the CNV scores for each spot in the ST
datasets, it was shown that spots classified as malignant exhibited
significantly higher CNV scores than those as normal (Fig. [189]5f, g).
For instance, in a breast cancer ST slice, the predicted malignant
regions exhibited gains in chromosomes 1q and 8q, as well as a loss in
chromosome 1p, consistent with the notion that breast cancer are
commonly associated with these CNV alterations^[190]8,[191]43
(Supplementary Fig. [192]10a). Similarly, deconvolution analysis of
cell type composition^[193]37 revealed a significant enrichment of
malignant cells in regions identified as malignant, whereas stromal and
immune cells were predominantly localized in areas classified as normal
(Supplementary Fig. [194]10b, c; Supplementary Data [195]15).
Furthermore, we assessed tumor signature activity in ST slices of
breast cancer and renal cell carcinoma using previously published
signature genes for these cancer types^[196]42, and found that the
tumor signature activity was notably higher in the predicted malignant
regions compared to the normal regions (Fig. [197]5h; Supplementary
Fig. [198]10d, e; Supplementary Data [199]15).
Overall, these results demonstrate the potential of scMalignantFinder
in annotating spatial transcriptomic data. Further improvements can be
achieved by including a wider range of cancer types and utilizing more
tailored training data.
Discussion
Over the past decade, scRNA-seq technology has been widely applied in
oncology research^[200]3,[201]23,[202]24,[203]44. However, the lack of
reliable methods for identifying malignant cells has been a major
obstacle to accurately studying tumor heterogeneity. Here, we have
developed scMalignantFinder to annotate malignant cells effectively.
Built upon meticulously calibrated training datasets and robust feature
selection strategies, scMalignantFinder outperforms existing automated
methods, including ikarus, PreCanCell, Cancer-Finder, and CopyKAT,
across multiple datasets from cancer cell lines, non-cancer tissues,
and various patient-derived cancer types. Additionally,
scMalignantFinder shows promising potential in identifying malignant
regions in ST data and tracking carcinogenic processes, aiding
researchers in characterizing the spatiotemporal dynamics of tumor
evolution^[204]45.
Current automated tools^[205]6,[206]10,[207]11 such as ikarus,
PreCanCell, and Cancer-Finder were designed to classify malignant cells
and all other cells in the tumor microenvironment. In contrast,
scMalignantFinder specifically distinguishes malignant cells from their
originating normal counterparts within the cancer lineage. This
strategy aligns with the logic employed in cell subtype or sublineage
analysis within hierarchical cell classification approaches, which have
recently proven effective for cell type annotation in the tumor
microenvironment^[208]5,[209]37,[210]46.
scMalignantFinder has significantly advanced both the scale and quality
of training data for malignant cell identification, guided by
well-established knowledge. It was built on over 400,000 single-cell
transcriptomes, representing the largest reference dataset for this
purpose to date. The initially annotated malignant cells, derived from
various methodologies in the original studies, were meticulously
refined using curated cancer gene signatures, resulting in a cancer
signature-calibrated training set where the malignant cells exhibit
pan-cancer characteristics. As expected, scMalignantFinder assigned
higher malignancy probability to the calibrated malignant cells in the
training set compared to the 2.5% of cells that were excluded
(Supplementary Fig. [211]2a). Consistently, the excluded cells showed a
loss of either upregulated or downregulated cancer signatures.
Retraining existing models, such as Cancer-Finder and ikarus, on this
well-calibrated dataset led to modest performance improvements
(Fig. [212]2c; Supplementary Fig. [213]3a), highlighting the importance
of large-scale and high-quality training data in enhancing the accuracy
of malignant cell identification.
Based on the cancer signature-calibrated malignant cells, we adopted
the DEGs between malignant cells and normal epithelial cells as
features for our model. Due to the extensive molecular heterogeneity
among tumors, we utilized the union of the DEG set in each dataset,
differing from previous methods that used intersections^[214]6,[215]10.
This difference at least partly accounts for the lower rates of false
positives and false negatives observed in our model. Attempts to
directly adopt curated cancer gene signatures as model features were
less effective (Fig. [216]2c; Supplementary Fig. [217]3a), probably
because DEGs derived from malignant cells are more closely associated
with malignancy at the transcriptional level. Notably, 75% of the
identified DEGs are not encompassed in the curated cancer gene
signatures, which hold the promise to offer additional insights into
carcinogenesis. We found that the overexpressed genes across different
cancer types, rather than specific to a particular type, primarily
contributed to the model and were significantly enriched among approved
drug targets, supporting the notion that pan-cancer hallmarks may be
promising therapeutic targets due to their association with fundamental
principles of carcinogenesis^[218]19.
Tracking the malignant transformation of cells and unraveling the
biological process underlying carcinogenesis are crucial for
identifying molecular targets at key stages and enabling early tumor
detection and precise diagnosis^[219]47. In this context, we displayed
the potential of scMalignantFinder to investigate cellular and
transitional dynamics during tumor progression by distinguishing
malignant cells from their normal epithelial counterparts. Moving
forward, we aim to further our initial findings in gastric cancer and
colorectal cancer by delving into the molecular mechanisms underlying
these intermediate states.
Within the recent advancements in ST technology in cancer
biology^[220]48, we have adapted the pre-trained model to identify
malignant regions, demonstrating excellent scalability in this field.
Through a multi-feature clustering framework, incorporating malignant
features, image-based characteristics, and spatial neighborhood
networks, scMalignantFinder demonstrated notable improvements in
accurately identifying malignant regions. Future enhancements could
focus on the continuous accumulation of expert-curated ST data and the
incorporation of additional spatial features, such as the spatial
distribution probability of malignant cells^[221]37, to enhance the
identification of malignant regions.
Despite these advantages, scMalignantFinder still has limitations and
room for improvement. It is designed to identify malignant cells from
cancerous lineages, relying on the prerequisite identification of basic
cell types. To address this, we have integrated scATOMIC^[222]5, a
recently developed automated cell type annotator for pan-cancer
analysis, into the current version. This integration enables users to
rapidly identify malignant cells, normal cells, and other cell types
within the tumor microenvironment using scRNA-seq data. Additionally,
the capabilities of scMalignantFinder require further testing,
particularly for cancers with unknown or controversial origins, such as
synovial sarcoma^[223]49 and certain brain tumors^[224]50.
In summary, with a machine learning framework driven by both data and
knowledge, scMalignantFinder achieves a significant advancement in
malignant cell identification, offering a scalable and versatile tool
for both single-cell and spatial transcriptomic analyses.
Methods
scMalignantFinder design
Data collection and processing
We collected five single-cell RNA-seq datasets from various cancer
types (lung, colorectal, gastric, and liver cancers)^[225]14–[226]18 as
detailed in Supplementary Data [227]1. Cell type annotations were
obtained from the original studies, and we retained gene expression
matrices for malignant and normal epithelial cells. Additionally, bulk
RNA-seq data for 18 cancer types were sourced from TCGA, with 16
datasets used after excluding cancer types with fewer than five samples
(Supplementary Data [228]1).
Selection of cancer gene signatures
We compiled 29 gene signatures related to cancer hallmarks from the
following sources:
(1) 14 cancer functional states (Stemness, Invasion, Metastasis,
Proliferation, EMT, Angiogenesis, Apoptosis, Cell cycle,
Differentiation, DNA damage, DNA repair, Hypoxia, Inflammation, and
Quiescence) from the cancerSEA database^[229]20.
(2) 10 cancer hallmarks (Sustaining Proliferative Signaling, Evading
Growth Suppressors, Avoiding Immune Destruction, Enabling Replicative
Immortality, Tumour-Promoting Inflammation, Activating Invasion and
Metastasis, Inducing Angiogenesis, Genome Instability and Mutation,
Resisting Cell Death, Deregulating Cellular Energetics) from manually
curated pathway gene sets^[230]22.
(3) 5 epithelial cell states (Basal-like, Normal-enriched,
Pro-angiogenic, Pro-inflammatory, Metabolic) from the Ecotyper
framework^[231]21.
Each cell in the scRNA-seq datasets and each sample in the bulk RNA-seq
datasets were scored using these gene signatures. For scRNA-seq data,
normalization was performed using Scanpy’s “pp.normalize_total”
function (version 1.9.3)
([232]https://scanpy.readthedocs.io/en/stable/), followed by scoring
with the AUCell function in the pySCENIC package (0.12.1)
([233]https://github.com/aertslab/pySCENIC). For bulk RNA-seq data, the
ssGSEA function from the GSVA R package (1.46.0) was used.
Log2 fold changes in gene activities between malignant and normal cells
were calculated, and significance was assessed using the Wilcoxon
rank-sum test. Gene signatures with a P-value ≤ 1×10^−^10 in scRNA-seq
and ≤1 × 10^−5 in bulk RNA-seq were considered significant.
Upregulated and downregulated gene signatures were filtered based on
the following criteria, resulting in nine consistent cancer gene
signatures (eight upregulated and one downregulated) identified in both
datasets (Supplementary Data [234]2):
(1) Gene signatures that were significantly upregulated in ≥ 4
scRNA-seq datasets and significantly downregulated in ≤ 1 scRNA-seq
dataset were considered upregulated in malignant cells;
(2) Gene signatures that were significantly downregulated in ≥4
scRNA-seq datasets and significantly upregulated in ≤ 1 scRNA-seq
dataset were considered downregulated in malignant cells;
(3) Gene signatures that were significantly upregulated in ≥12 bulk
RNA-seq datasets and significantly downregulated in ≤ 2 bulk RNA-seq
datasets (or significantly upregulated in ≥6 bulk RNA-seq datasets and
significantly downregulated in ≤1 bulk RNA-seq dataset) were considered
upregulated in tumor samples;
(4) Gene signatures that were significantly downregulated in ≥12 bulk
RNA-seq datasets and significantly upregulated in ≤ 2 bulk RNA-seq
datasets (or significantly downregulated in ≥6 bulk RNA-seq datasets
and significantly upregulated in ≤ 1 bulk RNA-seq dataset) were
considered downregulated in tumor samples.
Training set calibration with curated cancer gene signatures
We used a gene signature scoring-based approach to filter originally
labeled malignant cells. Cells were retained if they had higher
upregulated signature activity or lower downregulated signature
activity compared to normal cells:
(1) malignant cells with upregulated signature activity higher than the
average activity of normal cells.
(2) malignant cells with downregulated signature activity lower than
the average activity of normal cells.
Raw gene count matrices were filtered to exclude cells with fewer than
200 or more than 8000 gene counts or with more than 50% mitochondrial
reads. This resulted in 416,774 cells (134,053 malignant and 282,721
normal epithelial cells) in the training set.
Feature selection and logistic model training
Differentially expressed genes (DEGs) were identified using Seurat’s
FindAllMarkers function (v4.3.0)^[235]51. Genes with a |Log2FC | ≥ 0.5,
P-value < 0.01, and an expression proportion (pct.1) ≥ 0.2 were
selected as DEGs. DEGs were determined separately for each dataset, and
the union of upregulated and downregulated genes across datasets was
used as features, resulting in 2707 DEGs (1656 upregulated and 1051
DEGs as features). A logistic regression model was constructed using
these DEGs as input features and trained with Python’s
‘sklearn.linear_model.LogisticRegression’ (v1.2.2). The expression
profiles from the calibrated training dataset were used for model
training.
Performance evaluation of scMalignantFinder
We validated the performance of scMalignantFinder using test sets from
multiple independent datasets^[236]16,[237]17,[238]23–[239]28
(Supplementary Data [240]1). The dataset including cells originating
only from cancer cell lines was used as the positive gold standard test
set, while the dataset including cells from non-cancer donor tissues
served as the negative gold standard test set. The cell type
annotations for the datasets were sourced from the original studies.
(Supplementary Data [241]1). Following the same normalization method as
the training set, all datasets were used as inputs for
scMalignantFinder. The predictive performance was evaluated using seven
metrics: area under receiver operating characteristic (AUROC),
accuracy, balanced accuracy, sensitivity, specificity, positive
predictive value (PPV), and negative predictive value (NPV), calculated
as follows:
[MATH:
accuracyi=TPi+TNiTPi+TN
mrow>i+F
Pi+F<
/mi>Ni
:MATH]
1
[MATH:
sensitivityi
mi>=TPiTPi+
FNi :MATH]
2
[MATH:
specificityi
mi>=TNiTNi+
FPi :MATH]
3
[MATH:
balancedaccur
mi>acyi<
mo>=sens
mi>itivityi+s
pecific<
mi>ityi2 :MATH]
4
[MATH:
PPVi=TP<
mrow>iTP
mi>i+FPi :MATH]
5
[MATH:
NPVi=TN<
mrow>iTN
mi>i+FNi :MATH]
6
where
[MATH:
TPi
:MATH]
,
[MATH:
TNi
:MATH]
,
[MATH:
FPi
:MATH]
,
[MATH:
FNi
:MATH]
represent the true positives, true negatives, false positives, and
false negatives on dataset
[MATH: i :MATH]
.
We compared scMalignantFinder with four automated tools: ikarus^[242]10
(0.0.3) ([243]https://github.com/BIMSBbioinfo/ikarus),
PreCanCell^[244]6 (1.1.0)
([245]https://github.com/WangX-Lab/PreCanCell), Cancer-Finder^[246]11
([247]https://github.com/Patchouli-M/SequencingCancerFinder), and
CopyKAT^[248]8 (1.1.0) ([249]https://github.com/navinlabcode/copykat).
Each tool was used with its respective preprocessing methods and
default parameters, except for ikarus, where ‘adapt_signatures’ was set
to True, and CopyKAT, where ‘ngene.chr’ was set to 1.
We also retrained ikarus and Cancer-Finder using the training set and
DEGs constructed in this study, as detailed under the sections
“Training set calibration” and “Feature selection and logistic model
training.” For ikarus, its gene signatures were replaced with DEGs
identified from the calibrated training dataset. The retrained models
were subsequently assessed for their predictive performance using the
same independent test sets and evaluation metrics.
Characterization of “misclassified” cells in the test sets
Cells were categorized into four groups based on initial annotations
and scMalignantFinder predictions: (1) true malignant, labeled as
malignant and correctly predicted as malignant, (2) predicted
malignant, predicted as malignant but labeled as normal, (3) predicted
normal, predicted as normal but labeled as malignant, and (4) true
normal, labeled as normal and correctly predicted as normal. Cells
whose prediction did not match the original labels were considered
“misclassified”. These cells were characterized using functional
annotation and CNV inference. Functional annotation of selected genes
was analyzed using Enrichr
([250]https://maayanlab.cloud/Enrichr/)^[251]52 against the
WikiPathways and TRRUST datasets. Pathways and transcription factors
displaying a P-value < 0.05 were deemed significantly enriched. The CNV
profiles were inferred using the ‘infercnpy’ package (0.4.2) available
at [252]https://github.com/icbi-lab/infercnvpy. During the inference
process, we utilized the following parameters: a window_size of 250 and
excluded two chromosomes (chrX and chrY). The normal reference cells
were defined as cells annotated as normal based on the original study.
The calculation of the CNV score for each cell was performed using the
‘infercnvpy.tl.cnv_score’ function.
Application of scMalignantFinder to single-cell data spanning multiple stages
of carcinogenesis
We analyzed scRNA-seq datasets representing different stages of
carcinogenesis in colorectal^[253]16,[254]17 and gastric
cancer^[255]32. For colorectal cancer, epithelial cell profiles from
normal mucosal tissues, colorectal polyps, and tumors were processed.
For gastric cancer, samples included non-atrophic gastritis (NAG),
chronic atrophic gastritis (CAG), intestinal metaplasia (IM), and early
gastric cancer (EGC). Using scMalignantFinder, malignancy probability
and classification were predicted for each cell. Percentages of
predicted malignant cells and malignant probabilities were computed and
compared across the progression stages of both cancers.
Feature importance analysis
The ‘coef_’ attribute of the scMalignantFinder model represents the
coefficients assigned to each feature within the model. These
coefficients serve as indicators of feature importance and are
determined by their absolute values. Subsequently, all features are
sorted in descending order based on their importance, and the
highest-ranked features are selected to construct a classification
model. The significance of each performance metric, when compared to
the original model, is evaluated using a paired-sample t-test.
Furthermore, the features of the model were categorized as unique or
common based on their presence across cancer types (Supplementary
Data [256]3). Unique DEGs refer to genes that are present in only one
specific cancer type, while common DEGs are genes that are found in two
or more cancer types.
Association of DEGs with clinical features and drug targets
To analyze the association between DEG gene sets and overall survival,
we followed the methodology of a previous study^[257]23. In this
approach, the calculated effect size was defined as the hazard ratio,
computed through Cox regression, with the P value obtained via an
overall likelihood test^[258]23. To test the association between DEG
gene sets and drug targets, we collected 704 targets with approved
drugs from the PHAROS database (labeled as Tclin)^[259]34 and computed
the odds ratio following the methods of a recent study^[260]53.
scMalignantFinder workflow for identifying malignant regions
For the ST data obtained from previous studies^[261]38–[262]41
(Supplementary Data [263]1), we applied scMalignantFinder to identify
malignant regions through a multi-step process that integrates
malignancy probability, gene signature activity and image-based
features. Each spot in the ST data was assigned two attributes:
malignancy probability, calculated using the logistic regression
classifier trained on single-cell transcriptomics data, and malignant
signature activity, which quantifies the expression of
malignant-upregulated DEGs using the AUCell function from the pySCENIC
package. For datasets with paired histological images, such as
hematoxylin and eosin (H&E) staining or immunofluorescence, an
additional image score was computed using ‘calculate_image_features’
function in Squidpy^[264]54 (1.3.1)
([265]https://squidpy.readthedocs.io/en/stable/index.html). These three
features—malignancy probability, malignant signature activity, and
image score—were combined into a feature matrix.
Next, hierarchical clustering was performed on this feature matrix
using the Ward method to classify spots into three clusters,
considering that solid tumors typically consist of malignant regions,
normal epithelial regions, and non-epithelial regions. To determine the
identity of each cluster, a rank-based vote aggregation method was
used. The average score for each feature was first calculated within
each cluster. Clusters were then ranked for each feature based on these
average scores. The ranks across the three features were averaged for
each cluster. The cluster with the top average rank was preliminarily
identified as the malignant region, while the other two clusters were
provisionally designated as normal regions. If two or more clusters
shared the highest average rank, the cluster with the highest
malignancy probability was assigned as the malignant region.
To refine the preliminary classifications, a spatial neighbor graph was
constructed using Squidpy’s ‘spatial_neighbors’ function. This graph
enabled the reassignment of spot classifications based on spatial
neighborhood relationships. Specifically, if more than half of a spot’s
neighbors within the closest distance belonged to a particular cluster,
the spot was reassigned to that cluster. If not, the original
classification was retained. The images with pathological annotations
were extracted from previous studies^[266]11,[267]37,[268]41.
Benchmarking against tumor spatial transcriptomics analysis methods
We benchmarked scMalignantFinder against two recently developed methods
for tumor ST analysis. The first method, Cancer-Finder^[269]11,
utilizes a domain generalization-based deep learning approach
specifically designed for identifying malignant regions in ST data. We
employed the pretrained Cancer-Finder model along with its associated
checkpoints optimized for ST datasets
([270]https://github.com/Patchouli-M/SequencingCancerFinder). Both
scMalignantFinder and Cancer-Finder were applied to the eight ST
datasets, and performance metrics, including accuracy and balanced
accuracy, were calculated for comparison. The second method,
Cottrazm^[271]42 (0.1.1) ([272]https://github.com/Yelab2020/Cottrazm),
was used to delineate tumor boundaries in three ST slices. Cottrazm
classifies spots into malignant, normal, or boundary regions. To
evaluate the consistency between scMalignantFinder and Cottrazm, we
compared the predictions of malignant and normal spots between the two
methods. Additionally, we analyzed the spatial relationships between
tumor boundary spots identified by Cottrazm and neighboring malignant
or normal spots predicted by scMalignantFinder, assessing proximity
within one or two spot-size distances.
Functional analysis of malignant regions
To validate the identified malignant regions, we performed functional
analyses, including CNV profiling, cell type deconvolution, and tumor
signature activity assessment. CNV scores for each spot in the ST
datasets were calculated using the STCNV and STCNVScore function
integrated within the Cottrazm pipeline, which calls inferCNV
([273]https://github.com/broadinstitute/inferCNV). Cell type
deconvolution was conducted using SpaCET^[274]37 (1.2.0)
([275]https://github.com/data2intelligence/SpaCET) to estimate the
proportion of malignant cells, stromal cells, and immune cells within
the predicted malignant and normal regions. Additionally, tumor
signature activity was assessed using gene sets previously published
for breast cancer and renal cell carcinoma^[276]42. The AUCell function
from the pySCENIC package was used to calculate signature activity for
each spot, quantifying the expression of tumor-associated genes in
malignant and normal regions.
Statistics and reproducibility
All statistical analyses were performed using Python (3.10) with
appropriate libraries for data processing, statistical testing, and
visualization. Statistical significance was assessed using two-sided
Wilcoxon rank-sum tests, two-side Student’s t tests, or Mann-Whitney U
tests where applicable, as indicated in figure legends. For survival
analyses, Cox regression and likelihood ratio tests were employed to
determine statistical significance. Significance levels are denoted as
*P < 0.05, **P < 0.01, ***P < 0.001, and ****P < 0.0001.
Reporting summary
Further information on research design is available in the [277]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[278]Supplementary Information^ (3.3MB, pdf)
[279]Supplementary Data 1^ (12.4KB, xlsx)
[280]Supplementary Data 2^ (20.5KB, xlsx)
[281]Supplementary Data 3^ (55.9KB, xlsx)
[282]Supplementary Data 4^ (103.6KB, xlsx)
[283]Supplementary Data 5^ (75.7MB, xlsx)
[284]Supplementary Data 6^ (42.5MB, xlsx)
[285]Supplementary Data 7^ (86.3KB, xlsx)
[286]Supplementary Data 8^ (1.5MB, xlsx)
[287]Supplementary Data 9^ (75.8MB, xlsx)
[288]Supplementary Data 10^ (42.6MB, xlsx)
[289]Supplementary Data 11^ (3.1MB, xlsx)
[290]Supplementary Data 12^ (104.3KB, xlsx)
[291]Supplementary Data 13^ (1.1MB, xlsx)
[292]Supplementary Data 14^ (1.2MB, xlsx)
[293]Supplementary Data 15^ (2.9MB, xlsx)
[294]42003_2025_7942_MOESM17_ESM.docx^ (17.2KB, docx)
Description of Additional Supplementary Files
[295]Reporting Summary^ (72KB, pdf)
[296]Transparent Peer Review file^ (4.1MB, pdf)
Acknowledgements