Abstract
Cancer is one of the most concerning public health issues and breast
cancer is one of the most common cancers in the world. The immune cells
within the tumor microenvironment regulate cancer development. In this
study, single immune cell data sets were used to identify marker gene
sets for exhausted CD8 + T cells (CD8Tex) in breast cancer. Machine
learning methods were used to cluster subtypes and establish the
prognostic models with breast cancer bulk data using the gene sets to
evaluate the impacts of CD8Tex. We analyzed breast cancer
overexpressing and survival-associated marker genes and identified
CD8Tex hub genes in the protein–protein-interaction network. The
relevance of the hub genes for CD8 + T-cells in breast cancer was
evaluated. The clinical associations of the hub genes were analyzed
using bulk sequencing data and spatial sequencing data. The pan-cancer
expression, survival, and immune association of the hub genes were
analyzed. We identified biomarker gene sets for CD8Tex in breast
cancer. CD8Tex-based subtyping systems and prognostic models performed
well in the separation of patients with different immune relevance and
survival. CRTAM, CLEC2D, and KLRB1 were identified as CD8Tex hub genes
and were demonstrated to have potential clinical relevance and immune
therapy impact. This study provides a unique view of the critical
CD8Tex hub genes for cancer immune therapy.
Keywords: Immune cell, Breast cancer, Machine learning, Single-cell
sequencing, Spatial sequencing
Subject terms: Computational biology and bioinformatics, Cancer, Breast
cancer, Oncogenes, Tumour biomarkers, Tumour immunology
Introduction
Cancer is one of the most concerning public health issues in the
world^[32]1. It is estimated that in 2024, there will be approximately
2,001,140 new cancer cases and 611,720 cancer-related deaths in the
United States^[33]2. In China, it was estimated that approximately
4,800,000 new cancer cases occurred, causing about 3,200,000
cancer-related deaths^[34]3. Breast cancer is one of the most common
cancers in the world ^[35]3. Much as the prevention and tumorigenesis
of breast cancer have been studied intensively in the past
decades^[36]4, the incidence of breast cancer increased by 0.5% each
year from 2014 to 2018^[37]5. The development of breast cancer was
impacted by both genetic risk factors and environmental risk factors.
Clinical breast cancer was subtyped by the expression level of certain
breast cancer critical receptors: the estrogen receptor (ER), the
progesterone receptor (PR), and human epidermal growth factor receptor
2 (HER2). Breast cancer was divided into the following 4 molecular
subtypes: Luminal A, Luminal B, Triple negative (all called
basal-like), and HER2-enriched.
It has been widely accepted that the immune cells within the tumor
microenvironment regulate cancer development^[38]6. Tumor-infiltrating
immune cells have emerged as clinically relevant and highly associated
with prognosis and response to treatment for breast cancer as well as
other cancer types ^[39]7. Checkpoint blockade therapies have
demonstrated notable advancements in treating various human cancer
types^[40]8–[41]10. Breast cancer, previously considered poorly
immunogenic, has been an exception^[42]11. Even though breast cancer
isn't typically considered highly immunogenic due to its relatively low
tumor mutational burden, the abundance of tumor-infiltrating
lymphocytes in breast cancer correlates with markedly improved
prognoses, both with and without PD-1 targeted
immunotherapy^[43]12,[44]13. Two checkpoint inhibitors targeting the
PD-1/PD-L1 pathway, atezolizumab and pembrolizumab, have gained
approval for treating triple-negative breast cancer
patients^[45]14,[46]15. Our comprehension of the mechanisms underlying
resistance or response to immunotherapy remains incomplete, as does our
understanding of the intricate cellular interactions within the tumor
immune microenvironment. To develop new immunotherapies and utilize
existing ones effectively for breast cancer patients, it is imperative
to grasp the tumor immune microenvironment comprehensively. Although
breast cancer tumor-infiltrating lymphocytes are mainly composed of
CD3 + T cells^[47]16,[48]17, recent research has identified a subset of
CD8 + T cells that play a crucial role in breast cancer^[49]18. This
finding is supported by a single-cell RNA sequencing study of CD3 + T
cells isolated from human primary breast cancer^[50]18. This subset of
CD8 + T cells exhibited elevated expression levels of immune checkpoint
molecules like PDCD1 (PD-1), CTLA4, HAVCR2 (TIM-3), and LAG3. The
transcriptional signature originating from these cells was linked to
improved prognoses, irrespective of the total quantity of CD8 + T cells
present and the administered treatment^[51]18,[52]19. For immune cells
and other cells in cancer, the unique biomarkers of cells can be used
to evaluate the abundance of the cells in tumors. The discovery and
investigation of these cell biomarkers in cancer facilitate the
understanding of the role and function of the corresponding cells in
tumors.
The inherent heterogeneity of breast cancer poses significant
challenges for conventional diagnostic and therapeutic methods^[53]20.
Typically, these approaches rely on analyzing bulk tumor tissue
samples, which may obscure underlying heterogeneity due to their focus
on average expression levels^[54]20. However, emerging technologies
like single-cell analysis offer promising alternatives, already widely
used in oncology research^[55]21,[56]22. By examining gene expression,
phenotypes, protein levels, and other cellular properties at an
individual cell level, these techniques are well-suited to tackle tumor
heterogeneity, particularly in highly diverse cancers like breast
cancer^[57]23–[58]25. Single-cell analysis can aid in predicting
cellular evolution during tumor progression and enhance the precision
of predicting treatment outcomes and patient prognosis^[59]23–[60]25.
Moreover, these techniques play a vital role in devising novel
therapeutic strategies by enabling the detailed examination of genetic
variations and phenotypic characteristics of tumor cells, leading to
the identification of new therapeutic targets^[61]20. This, in turn,
facilitates the development of highly targeted treatment strategies,
improving our ability to predict treatment efficacy and potential drug
resistance. Single-cell gene sequencing, utilizing next-generation
sequencing (NGS), has become indispensable for studying breast cancer
heterogeneity^[62]26. Unlike traditional Sanger sequencing, NGS systems
employ massive parallel sequencing to generate billions of DNA reads,
allowing for the detection of various genetic variations, including
single-nucleotide mutations, small insertions/deletions, and copy
number variations^[63]27. This comprehensive view aids in streamlining
the development of targeted treatment strategies. For example, in
HER2-positive breast cancer, single-cell sequencing identifies
diversity in HER2 gene amplification across different cells,
facilitating personalized treatment plans^[64]28. NGS versatility
extends to RNA sequencing (RNA-seq), enabling quantitative and sequence
analyses of diverse RNA types and their expression levels, enriching
our understanding of breast cancer molecular mechanisms^[65]29. RNA
sequencing has been wildly used in cancer research ^[66]30–[67]39.
For transcriptomic data, generally, bulk sequencing data provides
averaged expression levels of all cells in a tissue sample, while
single-cell sequencing data has been used to decipher the cellular and
molecular landscape at a single-cell resolution ^[68]40. The advantage
of bulk sequencing data is that it often comes with the clinical
information of the patients. Such patient levels data facilitate the
analysis of diagnosis and prognosis as well as other clinical factors
associated with cancers. In addition, spatial sequencing provides gene
expression profiles of a sample with positional information, which is
useful for studying heterogeneity within a tumor sample ^[69]41.
Single-cell RNA sequencing (scRNA-seq) presents significant new
prospects for systematically delineating the cellular landscape of
tumors and uncovering fresh insights into cell biology, disease
etiology, and drug response^[70]42,[71]43. Numerous studies have
effectively employed scRNA-seq to examine selected populations within
human breast tumors, unveiling a spectrum of differentiation states
within tumor-infiltrating lymphocytes^[72]44, highlighting the role of
tissue-resident CD8 cells in breast cancer^[73]18, and shedding light
on chemoresistance mechanisms in breast cancer neoplastic cells^[74]45.
Recent endeavors have utilized mass cytometry with antibody marker
panels to scrutinize breast cancer cell types and ecosystems across
hundreds of patients^[75]46,[76]47. Consequently, there is a pressing
need for a more comprehensive transcriptional atlas of breast tumors at
high molecular resolution, encompassing all subtypes and cell types.
Such an atlas would aid in refining the disease's taxonomy, delineating
heterotypic cellular interactions, and elucidating cellular
differentiation processes. Equally crucial are data systematically
mapping the spatial transcriptomic architecture of breast tumors, as
this can unveil how cells in the tumor microenvironment (TME) are
organized into functional units. A recent paper integrated
single-nucleus RNA sequencing with microarray-based spatial
transcriptomics to delineate cell populations and their spatial
distribution within breast cancer tissues ^[77]48.
Our study focused on exhaust CD8 + T cells (CD8 + Tex). The recent
surge in cancer immunotherapy, primarily based on checkpoint blockade,
has been a breakthrough in treating various cancer types. However,
certain factors are hindering the progress of these treatments, such as
varying genetic make-up of individuals, resistant tumor sub-types, and
immune-related adverse events. While the focus of immunotherapies has
been on improving CD8 + T cells, the relationship between CD4 + T cells
and CD8 + T cells is also gaining attention. The tumor-infiltrating T
regulatory (Treg) cells are a major obstacle in the cross-talk between
CD4 + T cells and CD8 + T cells since they are capable of inhibiting
anti-tumour immunity^[78]49. CD8 + Tex, which is often seen in chronic
infections and cancer, is a progressive process characterized by
decreased effector function and upregulation of inhibitory receptors
such as PD-1 and Tim-3. Although immunological checkpoint inhibitors
have allowed for the eradication of tumors, a better understanding of
the mechanisms by which T cell–exhaustion pathways work in tumors and
the factors that drive them is needed. In this regard, the role of
CD8 + Tex in immunosuppression is key to the resistance of cancer in
immune therapy.
The study hypothesized that certain genes can be biomarkers of CD8 + T
cells in breast cancer as well as other cancer types. In this study, we
aimed to identify these genes and demonstrate their association with
cancers. We believe this study provides a unique view of the critical T
cell hub genes for cancer immune therapy.
Methods
Overview of the study
Single immune cell data sets were used to identify marker gene sets for
CD8 + Tex cells in breast cancer. A machine learning method, consensus
clustering, was used to cluster TCGA BRCA patients using the identified
marker genes, hence, we constructed CD8 + Tex-based genetic subtypes
based on the abound of CD8 + Tex in breast cancer samples. We compared
the immune cell infiltration levels and predicted immune checkpoint
blockade response rate among subtypes to demonstrate the potential
clinical value of the subtype systems for immune therapy of breast
cancer. The identified marker gene sets were also used to construct the
prognostic models for breast cancer patients using a machine learning
algorithm lasso (least absolute shrinkage and selection operator) with
TCGA BRCA cohort, which further identified the critical genes for the
subsequent study. We further analyzed the protein–protein interaction
of these molecules and identified hub genes in the
protein–protein-interaction network. The correlation between the hub
genes and CD8 + T-cell infiltration levels of breast cancer was
evaluated using different immune cell calculation algorithms. The
clinical associations of these hub genes were analyzed using the
clinical information of breast cancer patients and their expression
differences in invasive ductal cancer and ductal cancer in situ were
analyzed using spatial sequencing data. The pan-cancer
cancer-non-cancer expression and survival association of these hub
genes were analyzed. The correlation between the hub genes and
CD8 + infiltration levels and the immune therapy predictive values of
these hub genes were analyzed using immune checkpoint blockade
sub-cohorts.
Data collection
Single-cell cohorts were accessed and analyzed from the TISCH2
platform^[79]50([80]http://tisch.comp-genomics.org/). In this study, 4
single-cell sequence data sets were included as shown in Table [81]1.
This dataset comprises expression data from immune cells obtained
through fluorescence-activated cell sorting (FACS), focusing on an
enriched fraction of immune cells. The MAESTRO v1.1.0 workflow^[82]51
([83]https://github.com/liulab-dfci/MAESTRO/blob/master/README.md)
employed PCA for dimension reduction and KNN and Louvain
algorithms^[84]52,[85]53 for clustering to identify 2000 variable
features for each dataset; the number of principal components and the
resolution for graph-based clustering were adjusted according to the
cell number. Previous studies revealed that MAESTRO demonstrated
superior consistency across nearly all cell types, regardless of
whether the correlation was computed using all genes or solely the top
2000 variable genes^[86]51, hence in this study, only the top 2000
variable genes were used. UMAP was used to reduce the dimension and
visualize the clustering results^[87]54, and the Wilcoxon test was used
to identify differentially expressed (DE) genes of each cluster of cell
type (|logFC|> = 0.25, FDR < 1e-05). Data from The Cancer Genome Atlas
(TCGA, [88]https://www.cancer.gov/ccg/research/genome-sequencing/tcga)
and GTEx ([89]https://gtexportal.org/home/) were obtained, which
included gene expression profiling data and clinical information on
cancer tissues. This data was obtained in accordance with the
guidelines and policies.
Table 1.
Information for single-cell data sets.
Dataset Name Species Treatment Patients number Cells CD8Tex cells
CD8Tex cell (%) Platform Primary or metastasis PublicationS
[90]GSE110686 Human None 2 6,035 622 10.3 10 × Genomics Primary and
metastasis ^[91]18
GSE114727_10X Human None 3 28,678 1389 4.8 10 × Genomics Primary
^[92]44
[93]GSE176078 Human None 26 89,471 13,500 15.1 10 × Genomics Primary
^[94]55
EMTAB8107 Human None 14 33,043 5193 15.7 10 × Genomics Primary ^[95]56
[96]Open in a new tab
Single-cell data quality control
A standardized analysis pipeline utilizing MAESTRO v1.1.0^[97]51 was
employed to process all gathered datasets. This workflow encompassed
quality control, batch effect mitigation, cell clustering, differential
expression analysis, cell-type annotation, and malignant cell
classification. The raw count, TPM, or FPKM table served as input for
this standardized workflow. Cell quality was assessed using two
metrics: total counts (UMI) per cell (library size) and the number of
detected genes per cell. Cells with low quality were excluded if the
library size was < 1000 or the number of detected genes was < 500.
Single-cell data batch effect correction
To systematically assess batch effects across each dataset, an
entropy-based metric ^[98]44,[99]57 was utilized to quantify data
mixing among batches. Typically, samples from different patients in
most datasets are susceptible to batch effects. A k-NN graph (k = 30)
was constructed based on the Euclidean distance between cells in UMAP
coordinates for datasets with more than one patient. For each cell j,
the distribution of patients in its nearest neighbors was computed. The
measure of mixing between patients Hj is defined as:
[MATH:
Hj=-
∑t=1
Tpj
tlog2p
mi>jt :MATH]
here,
[MATH: pjt
:MATH]
represents the proportion of cells from patient t among the 30 nearest
neighbors of cell j, while T denotes the total number of patients. High
entropy, indicating that the most similar cells in a cell's
neighborhood come from different patients, suggests potential batch
effects. Conversely, low entropy suggests that the most similar cells
originate from the same patient, indicating a potential batch effect.
It is noteworthy that datasets primarily composed of malignant cells
(malignant % > 75%) may exhibit low entropy due to the heterogeneity of
malignant cell expression among different tumors ^[100]42.
Consequently, the collected datasets were classified into three groups.
Firstly, for datasets primarily containing malignant cells, there was
no need to eliminate batch effects between different patients, as they
reflect differences between distinct tumors. Secondly, datasets with a
median entropy lower than 0.7 underwent batch effect correction using
Seurat v3.1.2^[101]58. Median entropies shifted towards higher values
post-batch effect removal, indicating significant correction of
potential batch effects. Thirdly, datasets with a median entropy higher
than 0.7 were considered less affected by batch effects.
Single-cell clustering and differential gene analysis
For each dataset, the MAESTRO workflow identified the top 2000 variable
features and conducted PCA for dimension reduction, followed by
employing the KNN and Louvain algorithm for cluster
identification^[102]52,[103]53. To better capture cellular differences
and variabilities across datasets with varying cell numbers,
adjustments were made to the number of principal components and the
resolution for graph-based clustering. Both parameters were increased
with increasing cell numbers. The uniform manifold approximation and
projection (UMAP) were utilized for further dimension reduction and
visualization of clustering results^[104]59. The Wilcoxon test was
applied to identify differentially expressed (DE) genes for each
cluster compared to all other cells, based on criteria such as
log-transformed fold change (|logFC|≥ 0.25) and false discovery rate
(FDR < 1e−05). Clusters of cells were identified using a combination of
three approaches. Firstly, cell-type annotations provided by the
original studies were utilized. Secondly, the expression distribution
of malignant cell markers from the initial research was assessed,
including epithelial markers and EMT genes where available. Thirdly, we
applied InferCNV ([105]https://github.com/broadinstitute/infercnv) to
predict cell malignancy based on predicted copy number variations was
employed, segregating cells into malignant and non-malignant clusters.
For the remaining normal clusters, an automated marker-based annotation
method within MAESTRO was applied using the differentially expressed
genes between clusters. Subsequently, the cell-type annotations based
on the annotations provided by the original studies were manually
verified and corrected.
Bulk data differential expression analysis
Differentially expressed genes (DEGs) between subtypes were identified
using the Limma package with a cut-off fold change of 1.3 and a P-value
of 0.05. DEGs are genes whose expression levels vary significantly
between different groups. In this study, the goal is to identify genes
that are differentially expressed between different subtypes. Limma
(Linear Models for Microarray Data,
[106]https://bioconductor.org/packages/release/bioc/html/limma.html) is
a statistical package in R used for analyzing gene expression data. It
employs linear models and empirical Bayes methods to identify DEGs with
high sensitivity and specificity.
Bulk data enrichment analysis
For the GO biological and KEGG pathway enrichment analyses, the
ClusterProfiler package (version: 3.18.0,
[107]https://bioconductor.org/packages/release/bioc/html/clusterProfile
r.html) in R was employed. The False discovery rate (FDR) and p.adjust
were set at 0.25 and 0.05, respectively. Gene Ontology (GO,
[108]https://geneontology.org/) is a standardized system for annotating
genes and their products with terms representing biological processes,
molecular functions, and cellular components. GO biological pathway
enrichment analysis involves taking a list of genes, often derived from
experimental data such as gene expression studies or genome-wide
association studies, and determining whether any particular biological
processes represented by GO terms are significantly enriched in that
gene list compared to what would be expected by chance. The Kyoto
Encyclopedia of Genes and Genomes (KEGG,
[109]https://www.genome.jp/kegg/) is a collection of databases that
contain information about biological pathways, diseases, drugs, and
other biological entities. KEGG pathway enrichment analysis involves
mapping a list of genes to known biological pathways in the KEGG
database and determining whether any pathways are significantly
enriched in the gene list.
Analysis of the survival
The univariate Cox regression analysis, Kaplan–Meier (KM) plot, and
log-rank analysis were used to assess the survival association and
display the survival curves of genes. Univariate Cox regression
analysis examines the association between a single predictor variable
(such as a gene expression level or a clinical characteristic) and
survival time. It calculates the hazard ratio, which represents the
relative risk of experiencing the event of interest (such as death)
between two groups defined by the predictor variable. The Cox
proportional hazards model is commonly used for this analysis, allowing
for the estimation of hazard ratios while accounting for censoring and
other covariates. The Kaplan–Meier plot is a graphical method used to
estimate the survival function (probability of survival) over time. It
is particularly useful for visualizing survival differences between
groups defined by categorical variables (such as sub-groups or
biomarker expression levels). The plot displays the proportion of
individuals surviving at each time point, along with confidence
intervals. The log-rank test is a statistical test used to compare the
survival curves of two or more groups. It assesses whether there are
significant differences in survival times between the groups, taking
into account censoring. The test evaluates whether the observed
differences in survival are greater than would be expected by chance.
If the p-value from the log-rank test is below a predetermined
significance level (0.05), it indicates that there is a statistically
significant difference in survival between the groups. All analyses
were carried out using R (foundation for statistical computing 2020)
version 4.0.3
([110]https://cran.r-project.org/bin/windows/base/old/4.0.3/).
Statistical significance was defined as p < 0.05.
Consensus clustering analysis
Subtyping of the samples was carried out using the ConsensusClusterPlus
package([111]https://bioconductor.org/packages/release/bioc/html/Consen
susClusterPlus.html). The number of clusters was set at 1–6 for
consistency analysis to optimize the best clustering number.
Unsupervised class discovery involves identifying potential groups
within a dataset based solely on inherent features without external
guidance. Researchers using this technique typically aim to address two
key questions: the number of groups present in the data and the
confidence level associated with both the group quantity and their
memberships. Consensus clustering^[112]60 serves as a valuable method
for tackling these inquiries, particularly prominent in fields such as
cancer research^[113]36,[114]61. Consensus clustering offers both
quantitative and visual indicators of “stability”, obtained through
iterative subsampling and clustering. By synthesizing outcomes from
multiple repetitions, consensus clustering generates a consensus that
demonstrates resilience against sampling variations. While initially
available within the GenePattern software^[115]62, the consensus
clustering technique has been further developed into
ConsensusClusterPlus, a package in the R language offering enhanced
functionalities and visualizations. In this study, this method is
suitable for distinguishing subsets of samples of breast cancer
patients based on certain genes and comparing the overall effect of
these genes on clinical phenotypes.
Immune analysis
The immune cell infiltration level was calculated using different
algorithms, including TIMER^[116]63, XCELL^[117]64, CIBERSORT^[118]65,
MCPCOUNTER^[119]66, QUANTISEQ^[120]67, and EPIC^[121]68. TIMER is a web
server for the comprehensive analysis of tumor-infiltrating immune
cells. It provides immune cell infiltration levels in various cancer
types and their associations with clinical outcomes. TIMER
([122]http://timer.cistrome.org/) utilizes gene expression data to
estimate the abundance of immune cell subtypes within tumor samples.
XCELL ([123]https://github.com/dviraran/xCell) is another computational
tool used for cell type enrichment analysis in gene expression data. It
estimates the relative abundance of different cell types within a
heterogeneous sample, including immune cells. XCELL utilizes gene
expression signatures specific to various cell types to infer their
proportions in the sample. CIBERSORT (Cell-type Identification By
Estimating Relative Subsets Of RNA Transcripts,
[124]https://cibersortx.stanford.edu/) is a computational method used
to characterize cell composition in complex tissues based on gene
expression data. It deconvolutes bulk tissue gene expression profiles
to estimate the relative proportions of different cell types present.
CIBERSORT is particularly useful for studying immune cell populations
in tumor microenvironments. MCPCOUNTER (Microenvironment Cell
Populations-counter, [125]https://github.com/ebecht/MCPcounter) is a
gene expression-based method for quantifying the abundance of specific
immune and stromal cell populations in tumor samples. It uses
predefined gene signatures associated with different cell types to
estimate their relative proportions in the sample. QUANTISEQ
([126]https://icbi.i-med.ac.at/software/quantiseq/doc/) is a
computational tool for the deconvolution of gene expression profiles to
estimate the proportions of immune cell subsets within a sample. It
utilizes gene expression signatures specific to different immune cell
types to infer their relative abundances. EPIC (Evaluating the Presence
of Immune Cells, [127]https://github.com/GfellerLab/EPIC) is a
computational tool for quantifying immune cell infiltration in tumor
samples based on DNA methylation data. It uses DNA methylation profiles
to estimate the proportions of immune cell subsets present in the tumor
microenvironment. Immune checkpoint blockade (ICB) responses of
subtypes were predicted using the Tumor Immune Dysfunction and
Exclusion (TIDE, [128]http://tide.dfci.harvard.edu/) algorithm^[129]69
using the TIDE online analysis platform. TIDE represents a
computational framework designed to assess the likelihood of immune
evasion by tumors. It achieves this by analyzing the gene expression
patterns present in cancer samples. The immune therapy sub-cohorts were
accessed and analyzed with TIDE tool^[130]69
([131]http://tide.dfci.harvard.edu/setquery/). The TIDE score was
designed to predict response to immune checkpoint blockade, including
anti-PD1 and anti-CTLA4, for melanoma and NSCLC. The use of TIDE in
this study is based on the assumption that breast cancer has a similar
immune system as melanoma and NSCLC.
Prognostic model for identification of critical genes
The model was constructed using the glmnet
([132]https://glmnet.stanford.edu/articles/glmnet.html) R package,
which implemented the least absolute shrinkage and selection operator
(LASSO) regression algorithm^[133]70 with tenfold cross-validation for
gene signature selection. LASSO is a regression analysis method used
for variable selection and regularization. It aims to find the subset
of predictor variables that are most relevant for predicting the
response variable while simultaneously performing variable selection
and regularization to prevent overfitting. In LASSO regression, the
ordinary least squares (OLS) objective function is augmented with a
penalty term that is the sum of the absolute values of the coefficients
multiplied by a regularization parameter (lambda). This penalty term
encourages the coefficients of less important variables to be exactly
zero, effectively performing variable selection by shrinking some
coefficients to zero. In this study, LASSO was used to construct the
prognostic model. A validation cohort from Xena-hubs Breast Cancer
(Caldas)^[134]71 was used for validation of the model.
Protein–protein interaction network and hub gene
The protein–protein interaction network was constructed with STRING
([135]https://string-db.org/), where interactions with a score greater
than 0.4 were considered. The top 10 hub nodes in the network were
identified using the Hubba^[136]72
([137]https://apps.cytoscape.org/apps/cytohubba) in Cytoscape^[138]73
([139]https://cytoscape.org/). The algorithm used included Maximum
Clique Cardinality (MCC), Density of Maximum Neighborhood Component
(DMNC), Maximum Neighborhood Component (MNC), and Degree Centrality
(Degree). The combination of these algorithms offers a comprehensive
approach to unsupervised class discovery. Each algorithm provides a
unique perspective on the dataset. MCC focuses on identifying densely
connected subgraphs (cliques), DMNC evaluates the density of the
neighborhood around each node, MNC finds the largest connected
component, and Degree Centrality measures the importance of nodes based
on their connections. By incorporating these diverse perspectives, the
combined approach can capture different aspects of the underlying
structure of the data. The algorithms complement each other's strengths
and weaknesses. For example, while MCC is effective at identifying
densely connected subgroups, Degree Centrality can highlight nodes that
serve as central hubs within the network. This complementary nature
ensures that a wider range of structural features within the data are
considered, leading to a more comprehensive analysis.
Spatial sequencing analysis
The expression differences in invasive ductal cancer and ductal cancer
in situ were analyzed using spatial sequencing data with SpatialDB
([140]http://www.spatialomics.org/SpatialDB/index.php). Spatial
sequencing, also known as spatial transcriptomics or spatially resolved
transcriptomics, is a technology that allows researchers to study gene
expression patterns within tissues while preserving spatial
information. Spatial sequencing methods integrate high-throughput
sequencing techniques with spatially resolved imaging approaches to
generate spatially resolved gene expression profiles. These methods
enable us to analyze gene expression patterns within intact breast
cancer tissue sections, providing insights into the spatial
organization of cells and tissues and their functional implications.
The spatial sequencing data used in this study were published in a
previous paper^[141]74. The invasive ductal cancer and ductal cancer in
situ were labeled by a licensed clinical pathologist Dr Jielin Weng in
the Department of Pathology, The Second Affiliated Hospital of
Guangzhou Medical University. The data were visualized using the
SpatialDB tools^[142]75.
Results
CD8Tex marker gene identification based on single immune cell sequencing
The UMAP plots were conducted for each single-cell sequence data set
respectively (Fig. [143]1A). These cells were annotated and sorted into
17 cell types as shown in Table [144]2. Marker genes of CD8Tex cells
were identified for each data set respectively, and the marker genes
were cross-validated by intersection analysis of the marker gene sets
obtained from different data sets (Fig. [145]1B left panel). The
Jaccard index was calculated as shown in different colors in
Fig. [146]1B right panel. Eventually, we obtained 145 marker genes for
CD8Tex (Fig. [147]1B).
Figure 1.
[148]Figure 1
[149]Open in a new tab
CD8Tex marker genes identification based on single immune cell
sequencing. (A) UMAP plot of breast cancer single immune cell
sequencing data sets. (B) Intersection of marker genes identified by
single immune cell sequencing data sets. Left panel: the Venn diagram.
Right panel: pairwise intersections analysis.
Table 2.
Cell type abbreviation in single-cell data.
Cell type Abbreviation Full name
Immune cells B B Cells
CD4T CD4 T Cells
CD4Tconv Conventional CD4 T Cells
CD8T CD8 T Cells
CD8Tex Exhausted CD8 T Cells
DC Dendritic Cells
Mono/Macro Monocytes or Macrophages
Mast Mast Cells
Neutrophils Neutrophils
NK Natural Killer Cells
Tprolif Proliferating T Cells
Treg Regulatory T Cells
Stromal cells Endothelial Endothelial CELLS
Fibroblasts Fibroblasts
Myofibroblasts Myofibroblasts
Cancer cells Malignant Malignant cells
Other cells Oligodendrocyte Oligodendrocytes
[150]Open in a new tab
Clustering of CD8Tex-based subtypes
To further study the CD8Tex cells in breast cancer, we collected TCGA
cohort to join-analyze the CD8Tex marker genes. The basic clinical
information was provided in the [151]Supplementary Table. All of these
marker genes indeed have potential value for clinical application,
however, the survival association of these genes revealed their
practical value as a molecule that remarkably affects the progress of
the tumor, hence, we believe that those genes that significantly impact
patient survival are more likely to make a clinical difference when
used as biomarkers or therapeutic targets. First, we excluded genes
that do not affect the survival of breast cancer by conducting an
overall survival comparison between high and low-gene expression groups
(divided by expression medium). The univariate analysis was performed
and the hazard ratio of significant genes was shown in Fig. [152]2A.
Only four genes were risk factors for breast cancer patients, while the
others were protective factors.
Figure 2.
[153]Figure 2
[154]Open in a new tab
CD8Tex-based subtype clustering. (A) Survival screening of the CD8Tex
marker genes. The forest plot shows the hazard ratio (HR). (B1)
Consensus Cumulative Distribution Function (CDF) plot of subtype
numbers (k = 2–6). (B2) Delta area plot of the consensus CDF plot. (B3)
Consensus matrix and cluster trees of subtypes. (B4) Gene expression
heatmap of the subtypes. (B5) Principal Component Analysis (PCA) plot
of the consensus clustering. (C1) Overall survival Kaplan–Meier plot
(KM plot) of the subtypes. (C2) Progression-free interval KM plot of
the subtypes.
We admitted that not all markers are highly specific for exhausted
T-cells, the criteria to collect them is because their expression is
significantly different from other cell types, thus, the expression of
these marker genes can help define the distinctive expression patterns
of cell infiltration features in bulk sequencing data. We aimed to
investigate breast cancer with different CD8Tex infiltration levels,
thus, consensus clustering was used to cluster TCGA BRCA patients into
CD8Tex-based subtypes using the identified marker genes. Based on the
consensus cumulative distribution function (CDF) plotting, the number
of clusters (K) = 4 was the optimum cluster number for all of these
clustering (Fig. [155]2B1, 2). The subtyping approach has been applied
to help understand the role of a gene set^[156]76,[157]77. By the NMF
method, which is an effective dimension reduction method for cancer
subtype identification, patients were clustered into four distinct
subtypes (Fig. [158]2B3). The heatmap in Fig. [159]2B4 illustrates the
differential expression patterns of marker genes across the breast
cancer subtypes (C1, C2, C3, and C4), while the PCA plot in
Fig. [160]2B5 demonstrates the clustering of patients based on their
gene expression profiles within each subtype. (C1, C2, C3, and C4)
(Fig. [161]2B4, 5). The heatmap and PCA plots in Fig. [162]2B4, B5
visually depict the distinct expression patterns among the identified
breast cancer subtypes. The heatmap color scale ranges from low
expression (blue) to high expression (red), facilitating the
interpretation of differential expression patterns across subtypes. In
Fig. [163]2B4, the heatmap visually represents the relative expression
levels of marker genes across the identified breast cancer subtypes
(C1, C2, C3, and C4). Each row in the heatmap corresponds to a gene,
and each column represents a patient sample, with color intensity
indicating the expression level of each gene. This visualization allows
us to observe patterns of differential expression among the subtypes,
highlighting potential molecular distinctions. In contrast,
Fig. [164]2B5 utilizes PCA (Principal Component Analysis) to explore
the overall variation in gene expression profiles among patients within
each subtype. PCA is a dimensionality reduction technique that
identifies patterns in data and visualizes these patterns by projecting
patients onto a reduced-dimensional space defined by principal
components. The PCA plot helps to visualize how patients cluster based
on their gene expression profiles, providing insights into the
similarities and differences among subtypes beyond individual gene
expression levels. We analyzed the overall survival and progress-free
interval of the subtypes and found that the subtyping failed to
distinguish different survival patients except for the progress-free
interval of C3 versus C2 (Fig. [165]2C1,2).
Immune difference of CD8Tex-based subtypes
In addition, we also display the association between molecular subtypes
and CD8Tex-based subtypes with a Sankey diagram (Fig. [166]3A).
Generally, CD8Tex-based subtypes were not associated with the molecular
subtypes. In addition, we also calculated the immune cell infiltration
levels of the CD8Tex-based subtypes using multiple algorithms as
provided in the [167]supplementary materials. The XCELL algorithms
suggested that the CD8Tex-based subtypes separated samples with
different immune profiles. The immune score and microenvironment score
were remarkably different among CD8Tex-based subtypes. The stroma
scores were also significantly different among CD8Tex-based subtypes.
The C2 subtypes had a very high immune score and microenvironment
score. (Fig. [168]3B).
Figure 3.
[169]Figure 3
[170]Open in a new tab
Immune association of the CD8Tex-based subtypes. (A) Sankey diagram
showing the association between breast cancer molecular subtypes and
CD8Tex-based subtypes. (B) The XCELL scores in CD8Tex-based subtypes.
(C) Tumor Immune Dysfunction and Exclusion (TIDE) scores in
CD8Tex-based subtypes. (D) Predicted immune checkpoint blockade (ICB)
response of the subtypes based on the TIDE score.
To demonstrate the potential clinical value of these subtyping systems,
we calculated the TIDE score to predict the response of the samples for
ICB therapy. Results revealed that the C2 subtype had a significantly
higher TIDE score compared to the other subtypes (Fig. [171]3C). This
indicates that C2 subtypes had low T cell response. Based on the TIDE
score, we predicted the response of each sample for ICB and calculated
the response ratio for each subtype. Results showed that C1 had a 70%
response rate, C3 had a 63% response rate, and C4 had a 65% response
rate, yet, C2 had an 84% response rate (Fig. [172]3D). It is not
surprising that C2 with the highest immune score and microenvironment
score also has highest response rate. We believe that this is because,
although the results are derived from two distinct algorithms, there
are common parameter genes utilized in their calculation. These results
suggested that the subtyping systems performed well in the separation
of patients with different immune relevance, especially identified C2
subtypes as the responder subtype for ICB, indicating that these
marker-gene sets potentially provided clinical value for breast cancer
immune therapies.
Expression difference of CD8Tex-based subtypes
To further investigate the features of patients in C2 subtypes compared
to the other patients, we conducted a differential expression gene
analysis comparing C2 and the other subtypes. The analysis revealed
1552 up-regulated genes and 980 down-regulated genes in the C2 subtype
as shown in the volcano plot (Fig. [173]4A) and heatmap with clustering
(Fig. [174]4B). Figure [175]4B depicts a heatmap with hierarchical
clustering illustrating the differential expression patterns of genes
between the C2 subtype and other subtypes. The rows represent genes,
while the columns represent patient samples. The color scale in the
legend indicates gene expression levels, ranging from low (blue) to
high (red), facilitating the interpretation of the heatmap. This
visualization highlights distinct gene expression clusters associated
with the C2 subtype, suggesting potential biomarkers or pathways
specific to this subgroup.
Figure 4.
[176]Figure 4
[177]Open in a new tab
Differential expression gene enrichment. (A) Volcano plot of the
differential genes in CD8Tex-based subtypes C2. TCGA BRCA cohorts were
analyzed. C2 samples were compared with the other samples to identify
C2 differential genes. (B) Heat map and clustering of the differential
genes in CD8Tex-based subtypes C2. (C) GO and KEGG enrichment
analysis^[178]78–[179]80 of the differential genes in CD8Tex-based
subtypes C2.
Subsequently, we enriched these genes in the GO database and KEGG
pathways database. Results showed that the up-regulated genes were
enriched in multiple terms that are related to the T cell activity. The
KEGG pathway enrichment analysis revealed that the up-regulated genes
were associated with cytokine interaction and Th1/Th2 differentiation.
On the other hand, the down-regulated genes were associated with
protein secretion and hormone secretion as well as the PI3K-Akt
signaling pathway. (Fig. [180]4C) Although these DEGs and enrichment
might related to the previously identified 145 biomarkers, the
objective of this enrichment analysis is to broadly investigate the
disparities within the C2 cluster and discern which pathways might
underlie the diverse clinical phenotypes observed. Rather than
pinpointing specific markers, this analysis aims to offer general
insights into the biological mechanisms driving these clinical
differences.
Construction of a CD8Tex-based survival model
To further explore the prognostic value of the CD8Tex marker gene set
and obtain critical genes for breast cancer patients, we trained a
machine-learning prognostic model using the LASSO algorithm and TCGA
BRCA cohort. The model suggested that the minimum lambda was 0.0048.
When lambda was 0.0048, the model achieved the best performance. The
risk score formula and the included genes are presented in
Fig. [181]5A–C. The model includes risky genes (CLEC2D, CRTAM, EZR,
HLA-DRB1, NKG7, SLA2, SLC7A5, and SYTL3) and protective genes (CTSW,
GBP2, IFNG, KLRB1, MT-ND1, PSME2, SH2D2A, and TOX) which are discussed
later. The KM plot suggested that the high-risk group had a
significantly worse survival than the low-risk group (Fig. [182]5D).
This model helped us narrow down the critical genes for CD8Tex in
breast cancer. The time-dependent ROC analysis also revealed that the
AUC of the ROC was over 7 for different years of prediction, indicating
a good accuracy of the model for survival prediction (Fig. [183]5E,
[184]F). The model is validated with another independent cohort
(Fig. [185]5G). The validation cohort results are consistent with those
of the training cohort, demonstrating that the model can distinguish
between patients with long survival and those with short survival.
Figure 5.
[186]Figure 5
[187]Open in a new tab
Machine learning prognostic models based on CD8Tex marker genes and hub
genes identification. (A-C) The λ and coefficients of the model. (D)
Overall survival KM plots of high- and low-risk groups in TCGA-BRAC.
(E) Representative time-dependent receiver operating characteristic
(ROC) curve of the risk model. (F) Area Under Curve (AUC) of the
time-dependent ROC curve of the risk model. (G) Overall survival KM
plots of high- and low-risk groups in a validation cohort from
Xena-hubs Breast Cancer (Caldas 2007). (H) The protein–protein
interaction network with hub genes identified.
Identification of the hub genes for the CD8Tex regulation network
The genes in the LASSO model were critical CD8Tex marker genes for
breast cancer patients. We constructed a protein–protein interaction
network using these genes and calculated the top three hub genes using
MCC, DMNC, MNC, and Degree algorithms. The protein–protein interaction
network presented 31 edges and the average node degree was 3.88
(Fig. [188]5H). The MCC, DMNC, MNC, and Degree of each gene were
calculated as presented in [189]Supplementary Table. We then ranked the
scores and obtained the average rank of each gene. Eventually, CRTAM,
CLEC2D, and KLRB1 stood out as the top three hub genes in critical
CD8Tex marker genes for breast cancer patients.
CD8 T cell infiltration relevant to CD8Tex hub genes
To evaluate the clinical value of the hub genes identified, CRTAM,
CLEC2D, and KLRB1, for clinical evaluation of CD8 + T-cell infiltration
levels in breast cancer, we calculated the different CD8 + T-cell
infiltration levels and analyzed their correlation with CRTAM, CLEC2D,
and KLRB1 in breast cancer and molecular subtypes. To avoid bias in
some immune cell scores, we have employed multiple algorithms. Results
showed that CRTAM, CLEC2D, and KLRB1 were positively correlated with T
cell CD8 + in MCPCOUNTER, CIBERSORT, CIBERSORT-ABS, EPIC, and
QUANTISEQ, but not in TIMER. (Fig. [190]6A).
Figure 6.
[191]Figure 6
[192]Open in a new tab
CD8Tex hub marker genes for breast cancer. (A) Correlation analysis
between CD8Tex hub marker genes and CD8 + T cell infiltration scores.
(B) Expression of CD8Tex hub marker genes in breast cancer and normal
breast tissues. TCGA and GTEx data were analyzed. (C) Expression of
CD8Tex hub marker genes in breast cancer and normal breast tissues.
TCGA-paired samples were analyzed. (D) KM plot of the CD8Tex hub marker
genes in breast cancer. Rolls 1–3 display overall survival,
disease-free survival, and progress-free interval respectively. Columns
1–3 display CLEC2D, CRTAM, and KLRB1 respectively.
CD8Tex hub genes expression and survival analysis
Data also suggested that CLEC2D expression slightly decreased in tumors
compared to normal breast tissue, while CRTAM expression increased in
tumor tissue and KLRB1 expression had no difference in tumors compared
to normal tissues (Fig. [193]6B). However, in the cancer-noncancer
paired samples, the comparison suggested that CRTAM expression
increased in tumor tissue while KLRB1 expression decreased in the tumor
(Fig. [194]6C). Given the expression analysis results for CRTAM and
KLRB1 are not consistent, we cannot determine the cancer-non-cancer
expression pattern of these two genes definitively. However, based on
consistent findings, we can conclude that tumors exhibit elevated
expression of CRTAM. This suggests that tumors may possess a distinct
CD8 + T-cell infiltration feature compared to normal tissues. Moreover,
the expression levels of CRTAM and KLRB1 in normal breast tissue are
comparable to those in breast cancer tissue. This suggests a
potentially similar infiltration pattern of exhausted CD8 + T-cells in
both cancer and non-cancer tissues. This similarity in pattern could
hinder the development of cancer-specific targets for treatment.
KM survival analysis revealed that CRTAM, CLEC2D, and KLRB1 were all
significantly associated with better overall survival of breast cancer
patients. For disease-free survival, CRTAM was not associated, while
CLEC2D and KLRB1 were associated with better disease-free survival. As
for the progress-free interval, all three genes were associated.
(Fig. [195]6D) Survival association of CD8Tex hub marker genes for
breast cancer subtypes was also analyzed. Results showed that CRTAM was
a good prognostic biomarker for Luminal B, HER-enriched, and basal-like
breast cancer, but not for Luminal A breast cancer. CLEC2D was
associated with overall survival and progress-free interval in all
subtypes of breast cancer. Similar to CRTAM, KLRB1 was a good
prognostic biomarker for Luminal B, HER-enriched, and basal-like breast
cancer, but not for Luminal A breast cancer. However, KLRB1 was not
significant in the overall survival analysis of basal-like breast
cancer. (Fig. [196]7).
Figure 7.
[197]Figure 7
[198]Open in a new tab
Survival association of CD8Tex hub marker genes for breast cancer
subtypes.
Clinical association of the CD8Tex hub genes in breast cancers
The expression of CD8Tex hub genes has been demonstrated to be critical
markers for CD8 + T cells in breast cancer. While CD8 + T cells are
critical for breast cancer development, we proposed that CLEC2D, CRTAM,
and KLRB1 are associated with clinical characteristics. The comparison
of clinical information between high and low-expression groups
suggested that CLEC2D was associated with T stage, age, and radiation
therapy of breast cancer patients. CRTAM was associated with the N
stage, race, age, PAM50 (molecular subtype), and radiation therapy of
breast cancer patients. KLRB1 was associated with the M stage, race,
age, PAM50, menopause status, and radiation therapy of breast cancer
patients ([199]Supplementary Table). In the high-expression group of
CLEC2D, CRTAM, and KLRB1, more patients underwent radiation therapy. In
general, radiation therapy is often recommended for patients who are in
relatively good condition and able to tolerate treatment. However, it
can also be administered to patients with more advanced disease or
compromised health status, particularly if the potential benefits of
treatment outweigh the risks. Therefore, the association inferred from
our data suggests that high expression of CLEC2D, CRTAM, and KLRB1 may
be associated with less severe cases where radiation therapy is
recommended, possibly because patients are resistant to other types of
treatment such as chemotherapy. CLEC2D was associated with the T stage
indicating that it results in smaller tumors, which might reflect the
impact of T cells on tumor growth. Tumor-infiltrating T cells have been
associated with smaller tumor size^[200]81. CRTAM was associated with a
slightly higher lymph node metastasis (N stage), which might reflect
the impact of T cells on lymph node metastasis. Tumor-infiltrating
lymphocytes, including cytotoxic CD8 + T cells, can recognize and
eliminate cancer cells within lymph nodes, thereby inhibiting the
spread of metastatic disease^[201]82. The association of KLRB1 with
decreased tumor metastasis (M stage) suggests a potential role for
CD8Tex cells in limiting breast cancer metastasis. Taken together,
these results suggest that patients with CD8Tex may have smaller tumors
with locoregional metastases, a clinical scenario commonly addressed
with post-surgical radiotherapy. The association of CD8Tex hub genes
with clinical characteristics, particularly the link with radiotherapy,
raises intriguing questions about the immune activity status in breast
cancer patients. The observed associations between CLEC2D, CRTAM, and
KLRB1 expression levels and various clinical parameters, including
tumor stage, lymph node involvement, molecular subtype, menopause
status, and receipt of radiotherapy, suggest potential implications for
patient management and treatment outcomes. However, it is important to
note that the observed differences were slight and may not have
significant clinical implications.
These findings underscore the importance of considering the immune
context of tumors in clinical decision-making and the potential
implications for personalized treatment approaches. However, further
investigation, including validation in independent cohorts and
functional studies, is needed to fully elucidate the role of exhausted
T-cells and their associated biomarkers in breast cancer progression
and response to therapy.
CD8Tex hub genes and breast cancer heterogeneity
To investigate the CD8Tex hub genes and breast cancer extra-tumour
heterogeneity, we plotted the expression level of CLEC2D, CRTAM, and
KLRB1 in PAM50 breast cancer subtypes. Results revealed that CLEC2D and
CRTAM do not have differences among PAM50 breast cancer subtypes, KLRB1
expression in Luminal B was significantly lower than the other
subtypes. (Fig. [202]8A) To explore the CD8Tex hub genes and breast
cancer intro-tumour heterogeneity, we also compared the levels of
CLEC2D, CRTAM, and KLRB1 in invasive ductal cancer and ductal cancer in
situ using spatial sequencing data. The invasive ductal cancer and
ductal cancer in situ of breast cancer tissue were delineated by an
experienced clinical pathologist and the expression of CLEC2D, CRTAM,
and KLRB1 in invasive ductal cancer and ductal cancer in situ of breast
cancer tissue were measured at 4 separate layers. We summed the results
of 4 layers for analysis and results showed that CRTAM and KLRB1
expression was not significantly different between invasive ductal
cancer and ductal cancer in situ, while CLEC2D expression was
significantly higher in invasive ductal cancer over ductal cancer in
situ(Fig. [203]8B). Images of the spatial sequencing slide were shown
in Fig. [204]8C. These data suggested that KLRB1 could mark the
difference between cancer types while CLEC2D could mark the difference
between breast cancer tissue types within a sample. As CLEC2D and KLRB1
are two critical T cell genes, these data also reflect the potential
role of T cell infiltration in breast cancer.
Figure 8.
[205]Figure 8
[206]Open in a new tab
Tumour heterogeneity association of CD8Tex hub marker genes. (A)
Expression of CD8Tex hub marker genes in breast cancer subtypes. (B)
Expression comparison of CD8Tex hub marker genes in invasive ductal
cancer and ductal cancer in situ by spatial sequencing. C. Immages of
the spatial expression of CD8Tex hub marker genes in invasive ductal
cancer and ductal cancer in situ.
Pan-cancer expression and survival analysis of CD8Tex hub genes
To further expand the application of the CD8Tex hub genes for
pan-cancer use, we systematically explored the expression of CLEC2D,
CRTAM, and KLRB1 across cancer types, which is provided in the
[207]Supplementary results.
Discussion
Tumor-associated immune cells may promote or inhibit cancer cells
depending on their function and role in immunity. Many previous studies
have suggested the impact of the immune environment on
cancers^[208]30,[209]32,[210]34,[211]36–[212]38,[213]61,[214]83–[215]86
. The novelty of this study lies in our approach to identifying marker
genes for CD8Tex using multiple immune single-cell sequencing datasets.
By conducting intersection analysis across multiple independent
single-cell datasets, we aimed to ensure the stability and
reproducibility of the identified biomarker gene sets. Furthermore, we
identified hub genes for these biomarkers, namely CLEC2D, CRTAM, and
KLRB1, which have been less explored in the context of breast cancer.
Of significant importance is our discovery of the association between
these hub genes and various aspects of breast cancer immunity, clinical
features, and tumor heterogeneity. This comprehensive analysis sheds
light on previously overlooked molecular players in breast cancer and
highlights their potential relevance for understanding disease
progression and treatment response. Overall, our study offers valuable
insights into the immune landscape of breast cancer and uncovers
potential targets for further investigation and therapeutic
intervention.
In the CD8Tex-based survival model we constructed, the risky genes
include CLEC2D, CRTAM, EZR, HLA-DRB1, NKG7, SLA2, SLC7A5, and SYTL3,
while the protective genes include CTSW, GBP2, IFNG, KLRB1, MT-ND1,
PSME2, SH2D2A, and TOX. Most of these genes are reported for the first
time as prognostic biomarkers in breast cancer, although some have been
previously associated with breast cancer treatment or the breast cancer
microenvironment. CRTAM enhances the immune response against tumors in
triple-negative breast cancer by increasing the infiltration of CD8 + T
cells^[216]87. An integrative multi-omics analysis identifies a gene
signature related to metastasis, highlighting the potential oncogenic
role of EZR in breast cancer^[217]88. Breast cancer is associated with
increased allele frequencies of HLA-DRB1*11:01 and HLA-DRB1*10:01 in a
population of patients from Central Italy^[218]89; however, this
pertains to genetic mutation rather than gene expression. Targeting the
glutamine metabolic reprogramming mediated by SLC7A5 enhances the
effectiveness of anti-PD-1 therapy in triple-negative breast
cancer^[219]90. GBP2 serves as a prognostic biomarker and is linked to
immunotherapeutic responses in gastric cancer^[220]91; however, its
association with breast cancer has not been reported. As a single-cell
signature, low expression of KLRB1 predicts poor survival outcomes and
is associated with immune infiltration in breast cancer^[221]92. It is
also reported that inhibiting KLRB1 expression is associated with
impaired cancer immunity, leading to cancer progression and poor
prognosis in patients with breast invasive carcinoma^[222]93. Data
suggested that PSME2 was associated with immune-hot tumors in breast
cancer and with a favorable therapeutic response to
immunotherapy^[223]94. Overall, our results align with previous studies
regarding some of these genes. However, we emphasize their prognostic
value and have included additional novel biomarker genes to enhance the
overall prognostic accuracy of our model.
The cell types identified with biomarker gene sets all play important
roles in the immunity of cancers. Tumor-infiltrating B cells are an
essential part of the antitumor action of T cells^[224]95. CD8 + T
cells (CD8T) and CD4 + T cells have a similar differentiation process
but once differentiated, the CD4 + cells or the CD8 + cells are fixed
and function differently^[225]96. Conventional CD4 + T cells (CD4Tconv)
lack phenotypic markers that distinguish these cells from FoxP3 + T
regulatory cells (Treg)^[226]95. CD8T includes cytotoxic T cells, which
directly target cancer cells and induce apoptosis of cancer
cells^[227]97. During cancer progression, cancer-associated
fibroblasts, M2 macrophage, and Treg might negatively regulate
anti-cancer immune responses mediated by CD8 + T cells^[228]98.
Macrophages are monocytes that have migrated from the bloodstream into
tumor tissues, which serve as scavengers, maintaining homeostasis in
tumors ^[229]99. The effect of myofibroblast on cancer immunity has
been less reported. A study has shown that myofibroblast regulated type
I collagen thereby impacting cancer immunity^[230]100. It was reported
that the tumor-infiltrating T cells directly in contact with the cancer
cells proliferate more frequently compared with T cells in the
stroma^[231]101, hence, proliferating T cells (Tprolif) were thought to
be a subline of T cells. Exhausted CD8 T Cells (CD8Tex) is a set of
sub-cell lines of CD8 T Cells that persist in cancer but are
unsuccessful in killing cancer cells^[232]102. The presence and the
type of CD8Tex have been thought to be critical for the response of
cancer to some immune checkpoint blockades^[233]103. Therefore, the
abundance of these immune-related cells in tumors is a critical factor
for tumorigenesis as well as cancer immune therapy.
The analysis of the immune association of subtypes based on marker
genes for different cell types suggested that the subtyping systems
performed well in the separation of patients with different immune
relevance. The subtyping identified C2 as standing out from the other
patients. Interestingly, the subtypes also performed well in the
separation between ICB responders and ICB non-responders. These ICB
predictions were based on TIDE scores. However, the TIDE score was
designed to predict response to immune checkpoint blockade, including
anti-PD1 and anti-CTLA4, for melanoma and NSCLC. The use of TIDE in
this study is based on the assumption that breast cancer has a similar
immune system as melanoma and NSCLC. Hence, one should be cautious in
interpreting these results. Further study to define a prediction model
specific to breast cancer is required when more breast cancer immune
therapy cohort data are available for model training. In the comparison
of survival models, we concluded that the Tprolif model was the best
overall. However, the other model also provided a rather comparative
prognostic power. Nevertheless, the main purpose of this analysis is to
narrow down the critical genes for the subsequent study. Another
critical limitation is the observations in special sequencing data
might be subjected to bias. Firstly, it's essential to note that the
resolution of the spatial sequencing results is relatively low, and
some identified blocks may encompass both invasive ductal carcinoma and
ductal carcinoma in situ within breast cancer tissue. Secondly, the
classification of invasive ductal carcinoma and ductal carcinoma in
situ is determined by a licensed clinical pathologist using her
clinical expertise and knowledge, rather than relying on a standard
computational pipeline for detection. Additionally, drawing conclusions
solely based on these biomarkers for T cell infiltration levels may
carry significant risk. Hence, our data offer only initial insights
into the biomarkers, and further investigation into the association of
CD8Tex heterogeneity is warranted.
In this study, three critical hub genes were identified, including
CLEC2D, CRTAM, and KLRB1. NK cells play a central role in the immune
system, particularly in the recognition and elimination of malignant
and infected cells. 2B4 (CD244, SLAMF4) and CS1 (CD319, SLAMF7) are NK
cell receptors that control their cytotoxic function. Lectin-like
transcript 1 (LLT1), a member of the C-type lectin-like domain family 2
(CLEC2D), induces IFN-g production but does not directly regulate
cytolysis. LLT1, which is expressed in other cells, acts as a ligand
for the NK cell inhibitory receptor NKRP1A (CD161) and suppresses NK
cytolytic activity. Research has been conducted on novel therapies that
target these receptors to enhance NK cell effector functions^[234]104.
Hence, CLEC2D is associated with the NK-CD8 + cell regulation. On the
other hand, CD4( +)T cells possessing CRTAM can differentiate into
CD4( +)CTLs. A study found that after activation, CRTAM( +) CD4( +) T
cells secrete IFN-γ and express CTL-related genes such as eomesodermin,
Granzyme B, and perforin, which suggest that CRTAM( +) T cells are the
precursor of CD4( +)CTLs^[235]105. Additionally, ectopic expression of
CRTAM in T cells induces IFN-γ production, expression of CTL-related
genes, and cytotoxic activity. This is further supported by the fact
that CRTAM-mediated intracellular signaling is required for the
induction of both CD4( +)CTLs and IFN-γ production^[236]105.
Furthermore, these CRTAM( +) T cells traffic to mucosal tissues and
inflammatory sites and develop into CD4( +)CTLs, which can either
mediate protection against infection or induce inflammatory response
depending on the situation. These results demonstrate that CRTAM is a
key factor in the differentiation of CD4( +)CTLs through the induction
of Eomes and CTL-related genes^[237]105. As for KLRB1, this is a gene
that encodes CD161 protein. CD161 has been proposed as a pan-cancer
immune checkpoint^[238]106. Therefore, it is not surprising that, in
this study, we demonstrated that these two genes were closely
associated with the cancer immune environment and can be used to
predict the immune therapy response.
The mining and analysis of multi-omic profiling data enable
bioinformatic study with a rather comprehensive understanding of genes
in cancers. In this study, combining single-cell RNA sequencing
(scRNA-seq), bulk RNA sequencing (bulk RNA-seq), and spatial
transcriptomics data would indeed constitute a multi-omic approach,
provided that all of these datasets are derived from RNA sequencing
(RNA-seq) technologies. While each type of RNA-seq data captures gene
expression information at different scales and resolutions, integrating
them allows for a more comprehensive understanding of biological
systems. Integrating scRNA-seq, bulk RNA-seq, and spatial
transcriptomics data allows researchers to analyze gene expression
dynamics at multiple scales, from individual cells to tissue-level
spatial organization. This multi-omic approach enables the
identification of cell types, subpopulations, and spatially defined
gene expression patterns, facilitating a deeper understanding of
complex biological processes and disease mechanisms.
The perspective of this work is to deepen our understanding of the role
of CD8Tex in breast cancer. While immune therapy is not yet widely
utilized in breast cancer treatment, there is promise in using T
cell-based immune therapy for some refractory breast cancer cases.
CD8 + Tex cells might also be developed as a drug target, as numerous
studies have reported interactions involving genes, drugs^[239]107,
diseases, cells, and even the microbiome^[240]108–[241]112. Whether
CD8 + Tex cells play a role in these interactions can be explored in
future research. A recent study found an association between breast
cancer and menopause^[242]113. Further investigation is needed to
determine whether menopause affects CD8 + T cells in breast cancer. One
of the biggest issues in breast cancer is drug resistance^[243]114.
Addressing the challenges in this field necessitates exploration. Our
identification of hub genes for these biomarkers, such as CLEC2D,
CRTAM, and KLRB1, and their potential association with known biomarkers
of breast cancer cells, such as CHEK2^[244]115 and TP53^[245]116, lays
the groundwork for potential clinical applications in breast cancer
management. Further research is needed to explore how these biomarkers
can be effectively incorporated into clinical practice to improve
patient outcomes. Additionally, non-invasive early detection of breast
cancer has recently shown promising advancements^[246]117. The
biomarkers discovered may have the potential to contribute to the
non-invasive detection of this cancer type.
Supplementary Information
[247]Supplementary Information.^ (3.1MB, docx)
Acknowledgements