Abstract

   Cancer is one of the most concerning public health issues and breast
   cancer is one of the most common cancers in the world. The immune cells
   within the tumor microenvironment regulate cancer development. In this
   study, single immune cell data sets were used to identify marker gene
   sets for exhausted CD8 + T cells (CD8Tex) in breast cancer. Machine
   learning methods were used to cluster subtypes and establish the
   prognostic models with breast cancer bulk data using the gene sets to
   evaluate the impacts of CD8Tex. We analyzed breast cancer
   overexpressing and survival-associated marker genes and identified
   CD8Tex hub genes in the protein–protein-interaction network. The
   relevance of the hub genes for CD8 + T-cells in breast cancer was
   evaluated. The clinical associations of the hub genes were analyzed
   using bulk sequencing data and spatial sequencing data. The pan-cancer
   expression, survival, and immune association of the hub genes were
   analyzed. We identified biomarker gene sets for CD8Tex in breast
   cancer. CD8Tex-based subtyping systems and prognostic models performed
   well in the separation of patients with different immune relevance and
   survival. CRTAM, CLEC2D, and KLRB1 were identified as CD8Tex hub genes
   and were demonstrated to have potential clinical relevance and immune
   therapy impact. This study provides a unique view of the critical
   CD8Tex hub genes for cancer immune therapy.

   Keywords: Immune cell, Breast cancer, Machine learning, Single-cell
   sequencing, Spatial sequencing

   Subject terms: Computational biology and bioinformatics, Cancer, Breast
   cancer, Oncogenes, Tumour biomarkers, Tumour immunology

Introduction

   Cancer is one of the most concerning public health issues in the
   world^[32]1. It is estimated that in 2024, there will be approximately
   2,001,140 new cancer cases and 611,720 cancer-related deaths in the
   United States^[33]2. In China, it was estimated that approximately
   4,800,000 new cancer cases occurred, causing about 3,200,000
   cancer-related deaths^[34]3. Breast cancer is one of the most common
   cancers in the world ^[35]3. Much as the prevention and tumorigenesis
   of breast cancer have been studied intensively in the past
   decades^[36]4, the incidence of breast cancer increased by 0.5% each
   year from 2014 to 2018^[37]5. The development of breast cancer was
   impacted by both genetic risk factors and environmental risk factors.
   Clinical breast cancer was subtyped by the expression level of certain
   breast cancer critical receptors: the estrogen receptor (ER), the
   progesterone receptor (PR), and human epidermal growth factor receptor
   2 (HER2). Breast cancer was divided into the following 4 molecular
   subtypes: Luminal A, Luminal B, Triple negative (all called
   basal-like), and HER2-enriched.

   It has been widely accepted that the immune cells within the tumor
   microenvironment regulate cancer development^[38]6. Tumor-infiltrating
   immune cells have emerged as clinically relevant and highly associated
   with prognosis and response to treatment for breast cancer as well as
   other cancer types ^[39]7. Checkpoint blockade therapies have
   demonstrated notable advancements in treating various human cancer
   types^[40]8–[41]10. Breast cancer, previously considered poorly
   immunogenic, has been an exception^[42]11. Even though breast cancer
   isn't typically considered highly immunogenic due to its relatively low
   tumor mutational burden, the abundance of tumor-infiltrating
   lymphocytes in breast cancer correlates with markedly improved
   prognoses, both with and without PD-1 targeted
   immunotherapy^[43]12,[44]13. Two checkpoint inhibitors targeting the
   PD-1/PD-L1 pathway, atezolizumab and pembrolizumab, have gained
   approval for treating triple-negative breast cancer
   patients^[45]14,[46]15. Our comprehension of the mechanisms underlying
   resistance or response to immunotherapy remains incomplete, as does our
   understanding of the intricate cellular interactions within the tumor
   immune microenvironment. To develop new immunotherapies and utilize
   existing ones effectively for breast cancer patients, it is imperative
   to grasp the tumor immune microenvironment comprehensively. Although
   breast cancer tumor-infiltrating lymphocytes are mainly composed of
   CD3 + T cells^[47]16,[48]17, recent research has identified a subset of
   CD8 + T cells that play a crucial role in breast cancer^[49]18. This
   finding is supported by a single-cell RNA sequencing study of CD3 + T
   cells isolated from human primary breast cancer^[50]18. This subset of
   CD8 + T cells exhibited elevated expression levels of immune checkpoint
   molecules like PDCD1 (PD-1), CTLA4, HAVCR2 (TIM-3), and LAG3. The
   transcriptional signature originating from these cells was linked to
   improved prognoses, irrespective of the total quantity of CD8 + T cells
   present and the administered treatment^[51]18,[52]19. For immune cells
   and other cells in cancer, the unique biomarkers of cells can be used
   to evaluate the abundance of the cells in tumors. The discovery and
   investigation of these cell biomarkers in cancer facilitate the
   understanding of the role and function of the corresponding cells in
   tumors.

   The inherent heterogeneity of breast cancer poses significant
   challenges for conventional diagnostic and therapeutic methods^[53]20.
   Typically, these approaches rely on analyzing bulk tumor tissue
   samples, which may obscure underlying heterogeneity due to their focus
   on average expression levels^[54]20. However, emerging technologies
   like single-cell analysis offer promising alternatives, already widely
   used in oncology research^[55]21,[56]22. By examining gene expression,
   phenotypes, protein levels, and other cellular properties at an
   individual cell level, these techniques are well-suited to tackle tumor
   heterogeneity, particularly in highly diverse cancers like breast
   cancer^[57]23–[58]25. Single-cell analysis can aid in predicting
   cellular evolution during tumor progression and enhance the precision
   of predicting treatment outcomes and patient prognosis^[59]23–[60]25.
   Moreover, these techniques play a vital role in devising novel
   therapeutic strategies by enabling the detailed examination of genetic
   variations and phenotypic characteristics of tumor cells, leading to
   the identification of new therapeutic targets^[61]20. This, in turn,
   facilitates the development of highly targeted treatment strategies,
   improving our ability to predict treatment efficacy and potential drug
   resistance. Single-cell gene sequencing, utilizing next-generation
   sequencing (NGS), has become indispensable for studying breast cancer
   heterogeneity^[62]26. Unlike traditional Sanger sequencing, NGS systems
   employ massive parallel sequencing to generate billions of DNA reads,
   allowing for the detection of various genetic variations, including
   single-nucleotide mutations, small insertions/deletions, and copy
   number variations^[63]27. This comprehensive view aids in streamlining
   the development of targeted treatment strategies. For example, in
   HER2-positive breast cancer, single-cell sequencing identifies
   diversity in HER2 gene amplification across different cells,
   facilitating personalized treatment plans^[64]28. NGS versatility
   extends to RNA sequencing (RNA-seq), enabling quantitative and sequence
   analyses of diverse RNA types and their expression levels, enriching
   our understanding of breast cancer molecular mechanisms^[65]29. RNA
   sequencing has been wildly used in cancer research ^[66]30–[67]39.

   For transcriptomic data, generally, bulk sequencing data provides
   averaged expression levels of all cells in a tissue sample, while
   single-cell sequencing data has been used to decipher the cellular and
   molecular landscape at a single-cell resolution ^[68]40. The advantage
   of bulk sequencing data is that it often comes with the clinical
   information of the patients. Such patient levels data facilitate the
   analysis of diagnosis and prognosis as well as other clinical factors
   associated with cancers. In addition, spatial sequencing provides gene
   expression profiles of a sample with positional information, which is
   useful for studying heterogeneity within a tumor sample ^[69]41.
   Single-cell RNA sequencing (scRNA-seq) presents significant new
   prospects for systematically delineating the cellular landscape of
   tumors and uncovering fresh insights into cell biology, disease
   etiology, and drug response^[70]42,[71]43. Numerous studies have
   effectively employed scRNA-seq to examine selected populations within
   human breast tumors, unveiling a spectrum of differentiation states
   within tumor-infiltrating lymphocytes^[72]44, highlighting the role of
   tissue-resident CD8 cells in breast cancer^[73]18, and shedding light
   on chemoresistance mechanisms in breast cancer neoplastic cells^[74]45.
   Recent endeavors have utilized mass cytometry with antibody marker
   panels to scrutinize breast cancer cell types and ecosystems across
   hundreds of patients^[75]46,[76]47. Consequently, there is a pressing
   need for a more comprehensive transcriptional atlas of breast tumors at
   high molecular resolution, encompassing all subtypes and cell types.
   Such an atlas would aid in refining the disease's taxonomy, delineating
   heterotypic cellular interactions, and elucidating cellular
   differentiation processes. Equally crucial are data systematically
   mapping the spatial transcriptomic architecture of breast tumors, as
   this can unveil how cells in the tumor microenvironment (TME) are
   organized into functional units. A recent paper integrated
   single-nucleus RNA sequencing with microarray-based spatial
   transcriptomics to delineate cell populations and their spatial
   distribution within breast cancer tissues ^[77]48.

   Our study focused on exhaust CD8 + T cells (CD8 + Tex). The recent
   surge in cancer immunotherapy, primarily based on checkpoint blockade,
   has been a breakthrough in treating various cancer types. However,
   certain factors are hindering the progress of these treatments, such as
   varying genetic make-up of individuals, resistant tumor sub-types, and
   immune-related adverse events. While the focus of immunotherapies has
   been on improving CD8 + T cells, the relationship between CD4 + T cells
   and CD8 + T cells is also gaining attention. The tumor-infiltrating T
   regulatory (Treg) cells are a major obstacle in the cross-talk between
   CD4 + T cells and CD8 + T cells since they are capable of inhibiting
   anti-tumour immunity^[78]49. CD8 + Tex, which is often seen in chronic
   infections and cancer, is a progressive process characterized by
   decreased effector function and upregulation of inhibitory receptors
   such as PD-1 and Tim-3. Although immunological checkpoint inhibitors
   have allowed for the eradication of tumors, a better understanding of
   the mechanisms by which T cell–exhaustion pathways work in tumors and
   the factors that drive them is needed. In this regard, the role of
   CD8 + Tex in immunosuppression is key to the resistance of cancer in
   immune therapy.

   The study hypothesized that certain genes can be biomarkers of CD8 + T
   cells in breast cancer as well as other cancer types. In this study, we
   aimed to identify these genes and demonstrate their association with
   cancers. We believe this study provides a unique view of the critical T
   cell hub genes for cancer immune therapy.

Methods

Overview of the study

   Single immune cell data sets were used to identify marker gene sets for
   CD8 + Tex cells in breast cancer. A machine learning method, consensus
   clustering, was used to cluster TCGA BRCA patients using the identified
   marker genes, hence, we constructed CD8 + Tex-based genetic subtypes
   based on the abound of CD8 + Tex in breast cancer samples. We compared
   the immune cell infiltration levels and predicted immune checkpoint
   blockade response rate among subtypes to demonstrate the potential
   clinical value of the subtype systems for immune therapy of breast
   cancer. The identified marker gene sets were also used to construct the
   prognostic models for breast cancer patients using a machine learning
   algorithm lasso (least absolute shrinkage and selection operator) with
   TCGA BRCA cohort, which further identified the critical genes for the
   subsequent study. We further analyzed the protein–protein interaction
   of these molecules and identified hub genes in the
   protein–protein-interaction network. The correlation between the hub
   genes and CD8 + T-cell infiltration levels of breast cancer was
   evaluated using different immune cell calculation algorithms. The
   clinical associations of these hub genes were analyzed using the
   clinical information of breast cancer patients and their expression
   differences in invasive ductal cancer and ductal cancer in situ were
   analyzed using spatial sequencing data. The pan-cancer
   cancer-non-cancer expression and survival association of these hub
   genes were analyzed. The correlation between the hub genes and
   CD8 + infiltration levels and the immune therapy predictive values of
   these hub genes were analyzed using immune checkpoint blockade
   sub-cohorts.

Data collection

   Single-cell cohorts were accessed and analyzed from the TISCH2
   platform^[79]50([80]http://tisch.comp-genomics.org/). In this study, 4
   single-cell sequence data sets were included as shown in Table [81]1.
   This dataset comprises expression data from immune cells obtained
   through fluorescence-activated cell sorting (FACS), focusing on an
   enriched fraction of immune cells. The MAESTRO v1.1.0 workflow^[82]51
   ([83]https://github.com/liulab-dfci/MAESTRO/blob/master/README.md)
   employed PCA for dimension reduction and KNN and Louvain
   algorithms^[84]52,[85]53 for clustering to identify 2000 variable
   features for each dataset; the number of principal components and the
   resolution for graph-based clustering were adjusted according to the
   cell number. Previous studies revealed that MAESTRO demonstrated
   superior consistency across nearly all cell types, regardless of
   whether the correlation was computed using all genes or solely the top
   2000 variable genes^[86]51, hence in this study, only the top 2000
   variable genes were used. UMAP was used to reduce the dimension and
   visualize the clustering results^[87]54, and the Wilcoxon test was used
   to identify differentially expressed (DE) genes of each cluster of cell
   type (|logFC|> = 0.25, FDR < 1e-05). Data from The Cancer Genome Atlas
   (TCGA, [88]https://www.cancer.gov/ccg/research/genome-sequencing/tcga)
   and GTEx ([89]https://gtexportal.org/home/) were obtained, which
   included gene expression profiling data and clinical information on
   cancer tissues. This data was obtained in accordance with the
   guidelines and policies.

Table 1.

   Information for single-cell data sets.
   Dataset Name Species Treatment Patients number Cells CD8Tex cells
   CD8Tex cell (%) Platform Primary or metastasis PublicationS
   [90]GSE110686 Human None 2 6,035 622 10.3 10 × Genomics Primary and
   metastasis ^[91]18
   GSE114727_10X Human None 3 28,678 1389 4.8 10 × Genomics Primary
   ^[92]44
   [93]GSE176078 Human None 26 89,471 13,500 15.1 10 × Genomics Primary
   ^[94]55
   EMTAB8107 Human None 14 33,043 5193 15.7 10 × Genomics Primary ^[95]56
   [96]Open in a new tab

Single-cell data quality control

   A standardized analysis pipeline utilizing MAESTRO v1.1.0^[97]51 was
   employed to process all gathered datasets. This workflow encompassed
   quality control, batch effect mitigation, cell clustering, differential
   expression analysis, cell-type annotation, and malignant cell
   classification. The raw count, TPM, or FPKM table served as input for
   this standardized workflow. Cell quality was assessed using two
   metrics: total counts (UMI) per cell (library size) and the number of
   detected genes per cell. Cells with low quality were excluded if the
   library size was < 1000 or the number of detected genes was < 500.

Single-cell data batch effect correction

   To systematically assess batch effects across each dataset, an
   entropy-based metric ^[98]44,[99]57 was utilized to quantify data
   mixing among batches. Typically, samples from different patients in
   most datasets are susceptible to batch effects. A k-NN graph (k = 30)
   was constructed based on the Euclidean distance between cells in UMAP
   coordinates for datasets with more than one patient. For each cell j,
   the distribution of patients in its nearest neighbors was computed. The
   measure of mixing between patients Hj is defined as:
   [MATH:
   <mrow><msub><mi>H</mi><mi>j</mi></msub><mo>=</mo><mo>-</mo><munderover>
   <mo
   movablelimits="false">∑</mo><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>T</mi></munderover><mrow><msubsup><mi>p</mi><mrow><mi>j</mi></mrow>
   <mi>t</mi></msubsup><msub><mo>log</mo><mn>2</mn></msub><msubsup><mi>p</
   mi><mrow><mi>j</mi></mrow><mi>t</mi></msubsup></mrow></mrow> :MATH]

   here,
   [MATH: <msubsup><mi>p</mi><mrow><mi>j</mi></mrow><mi>t</mi></msubsup>
   :MATH]
   represents the proportion of cells from patient t among the 30 nearest
   neighbors of cell j, while T denotes the total number of patients. High
   entropy, indicating that the most similar cells in a cell's
   neighborhood come from different patients, suggests potential batch
   effects. Conversely, low entropy suggests that the most similar cells
   originate from the same patient, indicating a potential batch effect.
   It is noteworthy that datasets primarily composed of malignant cells
   (malignant % > 75%) may exhibit low entropy due to the heterogeneity of
   malignant cell expression among different tumors ^[100]42.
   Consequently, the collected datasets were classified into three groups.
   Firstly, for datasets primarily containing malignant cells, there was
   no need to eliminate batch effects between different patients, as they
   reflect differences between distinct tumors. Secondly, datasets with a
   median entropy lower than 0.7 underwent batch effect correction using
   Seurat v3.1.2^[101]58. Median entropies shifted towards higher values
   post-batch effect removal, indicating significant correction of
   potential batch effects. Thirdly, datasets with a median entropy higher
   than 0.7 were considered less affected by batch effects.

Single-cell clustering and differential gene analysis

   For each dataset, the MAESTRO workflow identified the top 2000 variable
   features and conducted PCA for dimension reduction, followed by
   employing the KNN and Louvain algorithm for cluster
   identification^[102]52,[103]53. To better capture cellular differences
   and variabilities across datasets with varying cell numbers,
   adjustments were made to the number of principal components and the
   resolution for graph-based clustering. Both parameters were increased
   with increasing cell numbers. The uniform manifold approximation and
   projection (UMAP) were utilized for further dimension reduction and
   visualization of clustering results^[104]59. The Wilcoxon test was
   applied to identify differentially expressed (DE) genes for each
   cluster compared to all other cells, based on criteria such as
   log-transformed fold change (|logFC|≥ 0.25) and false discovery rate
   (FDR < 1e−05). Clusters of cells were identified using a combination of
   three approaches. Firstly, cell-type annotations provided by the
   original studies were utilized. Secondly, the expression distribution
   of malignant cell markers from the initial research was assessed,
   including epithelial markers and EMT genes where available. Thirdly, we
   applied InferCNV ([105]https://github.com/broadinstitute/infercnv) to
   predict cell malignancy based on predicted copy number variations was
   employed, segregating cells into malignant and non-malignant clusters.
   For the remaining normal clusters, an automated marker-based annotation
   method within MAESTRO was applied using the differentially expressed
   genes between clusters. Subsequently, the cell-type annotations based
   on the annotations provided by the original studies were manually
   verified and corrected.

Bulk data differential expression analysis

   Differentially expressed genes (DEGs) between subtypes were identified
   using the Limma package with a cut-off fold change of 1.3 and a P-value
   of 0.05. DEGs are genes whose expression levels vary significantly
   between different groups. In this study, the goal is to identify genes
   that are differentially expressed between different subtypes. Limma
   (Linear Models for Microarray Data,
   [106]https://bioconductor.org/packages/release/bioc/html/limma.html) is
   a statistical package in R used for analyzing gene expression data. It
   employs linear models and empirical Bayes methods to identify DEGs with
   high sensitivity and specificity.

Bulk data enrichment analysis

   For the GO biological and KEGG pathway enrichment analyses, the
   ClusterProfiler package (version: 3.18.0,
   [107]https://bioconductor.org/packages/release/bioc/html/clusterProfile
   r.html) in R was employed. The False discovery rate (FDR) and p.adjust
   were set at 0.25 and 0.05, respectively. Gene Ontology (GO,
   [108]https://geneontology.org/) is a standardized system for annotating
   genes and their products with terms representing biological processes,
   molecular functions, and cellular components. GO biological pathway
   enrichment analysis involves taking a list of genes, often derived from
   experimental data such as gene expression studies or genome-wide
   association studies, and determining whether any particular biological
   processes represented by GO terms are significantly enriched in that
   gene list compared to what would be expected by chance. The Kyoto
   Encyclopedia of Genes and Genomes (KEGG,
   [109]https://www.genome.jp/kegg/) is a collection of databases that
   contain information about biological pathways, diseases, drugs, and
   other biological entities. KEGG pathway enrichment analysis involves
   mapping a list of genes to known biological pathways in the KEGG
   database and determining whether any pathways are significantly
   enriched in the gene list.

Analysis of the survival

   The univariate Cox regression analysis, Kaplan–Meier (KM) plot, and
   log-rank analysis were used to assess the survival association and
   display the survival curves of genes. Univariate Cox regression
   analysis examines the association between a single predictor variable
   (such as a gene expression level or a clinical characteristic) and
   survival time. It calculates the hazard ratio, which represents the
   relative risk of experiencing the event of interest (such as death)
   between two groups defined by the predictor variable. The Cox
   proportional hazards model is commonly used for this analysis, allowing
   for the estimation of hazard ratios while accounting for censoring and
   other covariates. The Kaplan–Meier plot is a graphical method used to
   estimate the survival function (probability of survival) over time. It
   is particularly useful for visualizing survival differences between
   groups defined by categorical variables (such as sub-groups or
   biomarker expression levels). The plot displays the proportion of
   individuals surviving at each time point, along with confidence
   intervals. The log-rank test is a statistical test used to compare the
   survival curves of two or more groups. It assesses whether there are
   significant differences in survival times between the groups, taking
   into account censoring. The test evaluates whether the observed
   differences in survival are greater than would be expected by chance.
   If the p-value from the log-rank test is below a predetermined
   significance level (0.05), it indicates that there is a statistically
   significant difference in survival between the groups. All analyses
   were carried out using R (foundation for statistical computing 2020)
   version 4.0.3
   ([110]https://cran.r-project.org/bin/windows/base/old/4.0.3/).
   Statistical significance was defined as p < 0.05.

Consensus clustering analysis

   Subtyping of the samples was carried out using the ConsensusClusterPlus
   package([111]https://bioconductor.org/packages/release/bioc/html/Consen
   susClusterPlus.html). The number of clusters was set at 1–6 for
   consistency analysis to optimize the best clustering number.
   Unsupervised class discovery involves identifying potential groups
   within a dataset based solely on inherent features without external
   guidance. Researchers using this technique typically aim to address two
   key questions: the number of groups present in the data and the
   confidence level associated with both the group quantity and their
   memberships. Consensus clustering^[112]60 serves as a valuable method
   for tackling these inquiries, particularly prominent in fields such as
   cancer research^[113]36,[114]61. Consensus clustering offers both
   quantitative and visual indicators of “stability”, obtained through
   iterative subsampling and clustering. By synthesizing outcomes from
   multiple repetitions, consensus clustering generates a consensus that
   demonstrates resilience against sampling variations. While initially
   available within the GenePattern software^[115]62, the consensus
   clustering technique has been further developed into
   ConsensusClusterPlus, a package in the R language offering enhanced
   functionalities and visualizations. In this study, this method is
   suitable for distinguishing subsets of samples of breast cancer
   patients based on certain genes and comparing the overall effect of
   these genes on clinical phenotypes.

Immune analysis

   The immune cell infiltration level was calculated using different
   algorithms, including TIMER^[116]63, XCELL^[117]64, CIBERSORT^[118]65,
   MCPCOUNTER^[119]66, QUANTISEQ^[120]67, and EPIC^[121]68. TIMER is a web
   server for the comprehensive analysis of tumor-infiltrating immune
   cells. It provides immune cell infiltration levels in various cancer
   types and their associations with clinical outcomes. TIMER
   ([122]http://timer.cistrome.org/) utilizes gene expression data to
   estimate the abundance of immune cell subtypes within tumor samples.
   XCELL ([123]https://github.com/dviraran/xCell) is another computational
   tool used for cell type enrichment analysis in gene expression data. It
   estimates the relative abundance of different cell types within a
   heterogeneous sample, including immune cells. XCELL utilizes gene
   expression signatures specific to various cell types to infer their
   proportions in the sample. CIBERSORT (Cell-type Identification By
   Estimating Relative Subsets Of RNA Transcripts,
   [124]https://cibersortx.stanford.edu/) is a computational method used
   to characterize cell composition in complex tissues based on gene
   expression data. It deconvolutes bulk tissue gene expression profiles
   to estimate the relative proportions of different cell types present.
   CIBERSORT is particularly useful for studying immune cell populations
   in tumor microenvironments. MCPCOUNTER (Microenvironment Cell
   Populations-counter, [125]https://github.com/ebecht/MCPcounter) is a
   gene expression-based method for quantifying the abundance of specific
   immune and stromal cell populations in tumor samples. It uses
   predefined gene signatures associated with different cell types to
   estimate their relative proportions in the sample. QUANTISEQ
   ([126]https://icbi.i-med.ac.at/software/quantiseq/doc/) is a
   computational tool for the deconvolution of gene expression profiles to
   estimate the proportions of immune cell subsets within a sample. It
   utilizes gene expression signatures specific to different immune cell
   types to infer their relative abundances. EPIC (Evaluating the Presence
   of Immune Cells, [127]https://github.com/GfellerLab/EPIC) is a
   computational tool for quantifying immune cell infiltration in tumor
   samples based on DNA methylation data. It uses DNA methylation profiles
   to estimate the proportions of immune cell subsets present in the tumor
   microenvironment. Immune checkpoint blockade (ICB) responses of
   subtypes were predicted using the Tumor Immune Dysfunction and
   Exclusion (TIDE, [128]http://tide.dfci.harvard.edu/) algorithm^[129]69
   using the TIDE online analysis platform. TIDE represents a
   computational framework designed to assess the likelihood of immune
   evasion by tumors. It achieves this by analyzing the gene expression
   patterns present in cancer samples. The immune therapy sub-cohorts were
   accessed and analyzed with TIDE tool^[130]69
   ([131]http://tide.dfci.harvard.edu/setquery/). The TIDE score was
   designed to predict response to immune checkpoint blockade, including
   anti-PD1 and anti-CTLA4, for melanoma and NSCLC. The use of TIDE in
   this study is based on the assumption that breast cancer has a similar
   immune system as melanoma and NSCLC.

Prognostic model for identification of critical genes

   The model was constructed using the glmnet
   ([132]https://glmnet.stanford.edu/articles/glmnet.html) R package,
   which implemented the least absolute shrinkage and selection operator
   (LASSO) regression algorithm^[133]70 with tenfold cross-validation for
   gene signature selection. LASSO is a regression analysis method used
   for variable selection and regularization. It aims to find the subset
   of predictor variables that are most relevant for predicting the
   response variable while simultaneously performing variable selection
   and regularization to prevent overfitting. In LASSO regression, the
   ordinary least squares (OLS) objective function is augmented with a
   penalty term that is the sum of the absolute values of the coefficients
   multiplied by a regularization parameter (lambda). This penalty term
   encourages the coefficients of less important variables to be exactly
   zero, effectively performing variable selection by shrinking some
   coefficients to zero. In this study, LASSO was used to construct the
   prognostic model. A validation cohort from Xena-hubs Breast Cancer
   (Caldas)^[134]71 was used for validation of the model.

Protein–protein interaction network and hub gene

   The protein–protein interaction network was constructed with STRING
   ([135]https://string-db.org/), where interactions with a score greater
   than 0.4 were considered. The top 10 hub nodes in the network were
   identified using the Hubba^[136]72
   ([137]https://apps.cytoscape.org/apps/cytohubba) in Cytoscape^[138]73
   ([139]https://cytoscape.org/). The algorithm used included Maximum
   Clique Cardinality (MCC), Density of Maximum Neighborhood Component
   (DMNC), Maximum Neighborhood Component (MNC), and Degree Centrality
   (Degree). The combination of these algorithms offers a comprehensive
   approach to unsupervised class discovery. Each algorithm provides a
   unique perspective on the dataset. MCC focuses on identifying densely
   connected subgraphs (cliques), DMNC evaluates the density of the
   neighborhood around each node, MNC finds the largest connected
   component, and Degree Centrality measures the importance of nodes based
   on their connections. By incorporating these diverse perspectives, the
   combined approach can capture different aspects of the underlying
   structure of the data. The algorithms complement each other's strengths
   and weaknesses. For example, while MCC is effective at identifying
   densely connected subgroups, Degree Centrality can highlight nodes that
   serve as central hubs within the network. This complementary nature
   ensures that a wider range of structural features within the data are
   considered, leading to a more comprehensive analysis.

Spatial sequencing analysis

   The expression differences in invasive ductal cancer and ductal cancer
   in situ were analyzed using spatial sequencing data with SpatialDB
   ([140]http://www.spatialomics.org/SpatialDB/index.php). Spatial
   sequencing, also known as spatial transcriptomics or spatially resolved
   transcriptomics, is a technology that allows researchers to study gene
   expression patterns within tissues while preserving spatial
   information. Spatial sequencing methods integrate high-throughput
   sequencing techniques with spatially resolved imaging approaches to
   generate spatially resolved gene expression profiles. These methods
   enable us to analyze gene expression patterns within intact breast
   cancer tissue sections, providing insights into the spatial
   organization of cells and tissues and their functional implications.
   The spatial sequencing data used in this study were published in a
   previous paper^[141]74. The invasive ductal cancer and ductal cancer in
   situ were labeled by a licensed clinical pathologist Dr Jielin Weng in
   the Department of Pathology, The Second Affiliated Hospital of
   Guangzhou Medical University. The data were visualized using the
   SpatialDB tools^[142]75.

Results

CD8Tex marker gene identification based on single immune cell sequencing

   The UMAP plots were conducted for each single-cell sequence data set
   respectively (Fig. [143]1A). These cells were annotated and sorted into
   17 cell types as shown in Table [144]2. Marker genes of CD8Tex cells
   were identified for each data set respectively, and the marker genes
   were cross-validated by intersection analysis of the marker gene sets
   obtained from different data sets (Fig. [145]1B left panel). The
   Jaccard index was calculated as shown in different colors in
   Fig. [146]1B right panel. Eventually, we obtained 145 marker genes for
   CD8Tex (Fig. [147]1B).

Figure 1.

   [148]Figure 1
   [149]Open in a new tab

   CD8Tex marker genes identification based on single immune cell
   sequencing. (A) UMAP plot of breast cancer single immune cell
   sequencing data sets. (B) Intersection of marker genes identified by
   single immune cell sequencing data sets. Left panel: the Venn diagram.
   Right panel: pairwise intersections analysis.

Table 2.

   Cell type abbreviation in single-cell data.
   Cell type     Abbreviation    Full name
   Immune cells  B               B Cells
                 CD4T            CD4 T Cells
                 CD4Tconv        Conventional CD4 T Cells
                 CD8T            CD8 T Cells
                 CD8Tex          Exhausted CD8 T Cells
                 DC              Dendritic Cells
                 Mono/Macro      Monocytes or Macrophages
                 Mast            Mast Cells
                 Neutrophils     Neutrophils
                 NK              Natural Killer Cells
                 Tprolif         Proliferating T Cells
                 Treg            Regulatory T Cells
   Stromal cells Endothelial     Endothelial CELLS
                 Fibroblasts     Fibroblasts
                 Myofibroblasts  Myofibroblasts
   Cancer cells  Malignant       Malignant cells
   Other cells   Oligodendrocyte Oligodendrocytes
   [150]Open in a new tab

Clustering of CD8Tex-based subtypes

   To further study the CD8Tex cells in breast cancer, we collected TCGA
   cohort to join-analyze the CD8Tex marker genes. The basic clinical
   information was provided in the [151]Supplementary Table. All of these
   marker genes indeed have potential value for clinical application,
   however, the survival association of these genes revealed their
   practical value as a molecule that remarkably affects the progress of
   the tumor, hence, we believe that those genes that significantly impact
   patient survival are more likely to make a clinical difference when
   used as biomarkers or therapeutic targets. First, we excluded genes
   that do not affect the survival of breast cancer by conducting an
   overall survival comparison between high and low-gene expression groups
   (divided by expression medium). The univariate analysis was performed
   and the hazard ratio of significant genes was shown in Fig. [152]2A.
   Only four genes were risk factors for breast cancer patients, while the
   others were protective factors.

Figure 2.

   [153]Figure 2
   [154]Open in a new tab

   CD8Tex-based subtype clustering. (A) Survival screening of the CD8Tex
   marker genes. The forest plot shows the hazard ratio (HR). (B1)
   Consensus Cumulative Distribution Function (CDF) plot of subtype
   numbers (k = 2–6). (B2) Delta area plot of the consensus CDF plot. (B3)
   Consensus matrix and cluster trees of subtypes. (B4) Gene expression
   heatmap of the subtypes. (B5) Principal Component Analysis (PCA) plot
   of the consensus clustering. (C1) Overall survival Kaplan–Meier plot
   (KM plot) of the subtypes. (C2) Progression-free interval KM plot of
   the subtypes.

   We admitted that not all markers are highly specific for exhausted
   T-cells, the criteria to collect them is because their expression is
   significantly different from other cell types, thus, the expression of
   these marker genes can help define the distinctive expression patterns
   of cell infiltration features in bulk sequencing data. We aimed to
   investigate breast cancer with different CD8Tex infiltration levels,
   thus, consensus clustering was used to cluster TCGA BRCA patients into
   CD8Tex-based subtypes using the identified marker genes. Based on the
   consensus cumulative distribution function (CDF) plotting, the number
   of clusters (K) = 4 was the optimum cluster number for all of these
   clustering (Fig. [155]2B1, 2). The subtyping approach has been applied
   to help understand the role of a gene set^[156]76,[157]77. By the NMF
   method, which is an effective dimension reduction method for cancer
   subtype identification, patients were clustered into four distinct
   subtypes (Fig. [158]2B3). The heatmap in Fig. [159]2B4 illustrates the
   differential expression patterns of marker genes across the breast
   cancer subtypes (C1, C2, C3, and C4), while the PCA plot in
   Fig. [160]2B5 demonstrates the clustering of patients based on their
   gene expression profiles within each subtype. (C1, C2, C3, and C4)
   (Fig. [161]2B4, 5). The heatmap and PCA plots in Fig. [162]2B4, B5
   visually depict the distinct expression patterns among the identified
   breast cancer subtypes. The heatmap color scale ranges from low
   expression (blue) to high expression (red), facilitating the
   interpretation of differential expression patterns across subtypes. In
   Fig. [163]2B4, the heatmap visually represents the relative expression
   levels of marker genes across the identified breast cancer subtypes
   (C1, C2, C3, and C4). Each row in the heatmap corresponds to a gene,
   and each column represents a patient sample, with color intensity
   indicating the expression level of each gene. This visualization allows
   us to observe patterns of differential expression among the subtypes,
   highlighting potential molecular distinctions. In contrast,
   Fig. [164]2B5 utilizes PCA (Principal Component Analysis) to explore
   the overall variation in gene expression profiles among patients within
   each subtype. PCA is a dimensionality reduction technique that
   identifies patterns in data and visualizes these patterns by projecting
   patients onto a reduced-dimensional space defined by principal
   components. The PCA plot helps to visualize how patients cluster based
   on their gene expression profiles, providing insights into the
   similarities and differences among subtypes beyond individual gene
   expression levels. We analyzed the overall survival and progress-free
   interval of the subtypes and found that the subtyping failed to
   distinguish different survival patients except for the progress-free
   interval of C3 versus C2 (Fig. [165]2C1,2).

Immune difference of CD8Tex-based subtypes

   In addition, we also display the association between molecular subtypes
   and CD8Tex-based subtypes with a Sankey diagram (Fig. [166]3A).
   Generally, CD8Tex-based subtypes were not associated with the molecular
   subtypes. In addition, we also calculated the immune cell infiltration
   levels of the CD8Tex-based subtypes using multiple algorithms as
   provided in the [167]supplementary materials. The XCELL algorithms
   suggested that the CD8Tex-based subtypes separated samples with
   different immune profiles. The immune score and microenvironment score
   were remarkably different among CD8Tex-based subtypes. The stroma
   scores were also significantly different among CD8Tex-based subtypes.
   The C2 subtypes had a very high immune score and microenvironment
   score. (Fig. [168]3B).

Figure 3.

   [169]Figure 3
   [170]Open in a new tab

   Immune association of the CD8Tex-based subtypes. (A) Sankey diagram
   showing the association between breast cancer molecular subtypes and
   CD8Tex-based subtypes. (B) The XCELL scores in CD8Tex-based subtypes.
   (C) Tumor Immune Dysfunction and Exclusion (TIDE) scores in
   CD8Tex-based subtypes. (D) Predicted immune checkpoint blockade (ICB)
   response of the subtypes based on the TIDE score.

   To demonstrate the potential clinical value of these subtyping systems,
   we calculated the TIDE score to predict the response of the samples for
   ICB therapy. Results revealed that the C2 subtype had a significantly
   higher TIDE score compared to the other subtypes (Fig. [171]3C). This
   indicates that C2 subtypes had low T cell response. Based on the TIDE
   score, we predicted the response of each sample for ICB and calculated
   the response ratio for each subtype. Results showed that C1 had a 70%
   response rate, C3 had a 63% response rate, and C4 had a 65% response
   rate, yet, C2 had an 84% response rate (Fig. [172]3D). It is not
   surprising that C2 with the highest immune score and microenvironment
   score also has highest response rate. We believe that this is because,
   although the results are derived from two distinct algorithms, there
   are common parameter genes utilized in their calculation. These results
   suggested that the subtyping systems performed well in the separation
   of patients with different immune relevance, especially identified C2
   subtypes as the responder subtype for ICB, indicating that these
   marker-gene sets potentially provided clinical value for breast cancer
   immune therapies.

Expression difference of CD8Tex-based subtypes

   To further investigate the features of patients in C2 subtypes compared
   to the other patients, we conducted a differential expression gene
   analysis comparing C2 and the other subtypes. The analysis revealed
   1552 up-regulated genes and 980 down-regulated genes in the C2 subtype
   as shown in the volcano plot (Fig. [173]4A) and heatmap with clustering
   (Fig. [174]4B). Figure [175]4B depicts a heatmap with hierarchical
   clustering illustrating the differential expression patterns of genes
   between the C2 subtype and other subtypes. The rows represent genes,
   while the columns represent patient samples. The color scale in the
   legend indicates gene expression levels, ranging from low (blue) to
   high (red), facilitating the interpretation of the heatmap. This
   visualization highlights distinct gene expression clusters associated
   with the C2 subtype, suggesting potential biomarkers or pathways
   specific to this subgroup.

Figure 4.

   [176]Figure 4
   [177]Open in a new tab

   Differential expression gene enrichment. (A) Volcano plot of the
   differential genes in CD8Tex-based subtypes C2. TCGA BRCA cohorts were
   analyzed. C2 samples were compared with the other samples to identify
   C2 differential genes. (B) Heat map and clustering of the differential
   genes in CD8Tex-based subtypes C2. (C) GO and KEGG enrichment
   analysis^[178]78–[179]80 of the differential genes in CD8Tex-based
   subtypes C2.

   Subsequently, we enriched these genes in the GO database and KEGG
   pathways database. Results showed that the up-regulated genes were
   enriched in multiple terms that are related to the T cell activity. The
   KEGG pathway enrichment analysis revealed that the up-regulated genes
   were associated with cytokine interaction and Th1/Th2 differentiation.
   On the other hand, the down-regulated genes were associated with
   protein secretion and hormone secretion as well as the PI3K-Akt
   signaling pathway. (Fig. [180]4C) Although these DEGs and enrichment
   might related to the previously identified 145 biomarkers, the
   objective of this enrichment analysis is to broadly investigate the
   disparities within the C2 cluster and discern which pathways might
   underlie the diverse clinical phenotypes observed. Rather than
   pinpointing specific markers, this analysis aims to offer general
   insights into the biological mechanisms driving these clinical
   differences.

Construction of a CD8Tex-based survival model

   To further explore the prognostic value of the CD8Tex marker gene set
   and obtain critical genes for breast cancer patients, we trained a
   machine-learning prognostic model using the LASSO algorithm and TCGA
   BRCA cohort. The model suggested that the minimum lambda was 0.0048.
   When lambda was 0.0048, the model achieved the best performance. The
   risk score formula and the included genes are presented in
   Fig. [181]5A–C. The model includes risky genes (CLEC2D, CRTAM, EZR,
   HLA-DRB1, NKG7, SLA2, SLC7A5, and SYTL3) and protective genes (CTSW,
   GBP2, IFNG, KLRB1, MT-ND1, PSME2, SH2D2A, and TOX) which are discussed
   later. The KM plot suggested that the high-risk group had a
   significantly worse survival than the low-risk group (Fig. [182]5D).
   This model helped us narrow down the critical genes for CD8Tex in
   breast cancer. The time-dependent ROC analysis also revealed that the
   AUC of the ROC was over 7 for different years of prediction, indicating
   a good accuracy of the model for survival prediction (Fig. [183]5E,
   [184]F). The model is validated with another independent cohort
   (Fig. [185]5G). The validation cohort results are consistent with those
   of the training cohort, demonstrating that the model can distinguish
   between patients with long survival and those with short survival.

Figure 5.

   [186]Figure 5
   [187]Open in a new tab

   Machine learning prognostic models based on CD8Tex marker genes and hub
   genes identification. (A-C) The λ and coefficients of the model. (D)
   Overall survival KM plots of high- and low-risk groups in TCGA-BRAC.
   (E) Representative time-dependent receiver operating characteristic
   (ROC) curve of the risk model. (F) Area Under Curve (AUC) of the
   time-dependent ROC curve of the risk model. (G) Overall survival KM
   plots of high- and low-risk groups in a validation cohort from
   Xena-hubs Breast Cancer (Caldas 2007). (H) The protein–protein
   interaction network with hub genes identified.

Identification of the hub genes for the CD8Tex regulation network

   The genes in the LASSO model were critical CD8Tex marker genes for
   breast cancer patients. We constructed a protein–protein interaction
   network using these genes and calculated the top three hub genes using
   MCC, DMNC, MNC, and Degree algorithms. The protein–protein interaction
   network presented 31 edges and the average node degree was 3.88
   (Fig. [188]5H). The MCC, DMNC, MNC, and Degree of each gene were
   calculated as presented in [189]Supplementary Table. We then ranked the
   scores and obtained the average rank of each gene. Eventually, CRTAM,
   CLEC2D, and KLRB1 stood out as the top three hub genes in critical
   CD8Tex marker genes for breast cancer patients.

CD8 T cell infiltration relevant to CD8Tex hub genes

   To evaluate the clinical value of the hub genes identified, CRTAM,
   CLEC2D, and KLRB1, for clinical evaluation of CD8 + T-cell infiltration
   levels in breast cancer, we calculated the different CD8 + T-cell
   infiltration levels and analyzed their correlation with CRTAM, CLEC2D,
   and KLRB1 in breast cancer and molecular subtypes. To avoid bias in
   some immune cell scores, we have employed multiple algorithms. Results
   showed that CRTAM, CLEC2D, and KLRB1 were positively correlated with T
   cell CD8 + in MCPCOUNTER, CIBERSORT, CIBERSORT-ABS, EPIC, and
   QUANTISEQ, but not in TIMER. (Fig. [190]6A).

Figure 6.

   [191]Figure 6
   [192]Open in a new tab

   CD8Tex hub marker genes for breast cancer. (A) Correlation analysis
   between CD8Tex hub marker genes and CD8 + T cell infiltration scores.
   (B) Expression of CD8Tex hub marker genes in breast cancer and normal
   breast tissues. TCGA and GTEx data were analyzed. (C) Expression of
   CD8Tex hub marker genes in breast cancer and normal breast tissues.
   TCGA-paired samples were analyzed. (D) KM plot of the CD8Tex hub marker
   genes in breast cancer. Rolls 1–3 display overall survival,
   disease-free survival, and progress-free interval respectively. Columns
   1–3 display CLEC2D, CRTAM, and KLRB1 respectively.

CD8Tex hub genes expression and survival analysis

   Data also suggested that CLEC2D expression slightly decreased in tumors
   compared to normal breast tissue, while CRTAM expression increased in
   tumor tissue and KLRB1 expression had no difference in tumors compared
   to normal tissues (Fig. [193]6B). However, in the cancer-noncancer
   paired samples, the comparison suggested that CRTAM expression
   increased in tumor tissue while KLRB1 expression decreased in the tumor
   (Fig. [194]6C). Given the expression analysis results for CRTAM and
   KLRB1 are not consistent, we cannot determine the cancer-non-cancer
   expression pattern of these two genes definitively. However, based on
   consistent findings, we can conclude that tumors exhibit elevated
   expression of CRTAM. This suggests that tumors may possess a distinct
   CD8 + T-cell infiltration feature compared to normal tissues. Moreover,
   the expression levels of CRTAM and KLRB1 in normal breast tissue are
   comparable to those in breast cancer tissue. This suggests a
   potentially similar infiltration pattern of exhausted CD8 + T-cells in
   both cancer and non-cancer tissues. This similarity in pattern could
   hinder the development of cancer-specific targets for treatment.

   KM survival analysis revealed that CRTAM, CLEC2D, and KLRB1 were all
   significantly associated with better overall survival of breast cancer
   patients. For disease-free survival, CRTAM was not associated, while
   CLEC2D and KLRB1 were associated with better disease-free survival. As
   for the progress-free interval, all three genes were associated.
   (Fig. [195]6D) Survival association of CD8Tex hub marker genes for
   breast cancer subtypes was also analyzed. Results showed that CRTAM was
   a good prognostic biomarker for Luminal B, HER-enriched, and basal-like
   breast cancer, but not for Luminal A breast cancer. CLEC2D was
   associated with overall survival and progress-free interval in all
   subtypes of breast cancer. Similar to CRTAM, KLRB1 was a good
   prognostic biomarker for Luminal B, HER-enriched, and basal-like breast
   cancer, but not for Luminal A breast cancer. However, KLRB1 was not
   significant in the overall survival analysis of basal-like breast
   cancer. (Fig. [196]7).

Figure 7.

   [197]Figure 7
   [198]Open in a new tab

   Survival association of CD8Tex hub marker genes for breast cancer
   subtypes.

Clinical association of the CD8Tex hub genes in breast cancers

   The expression of CD8Tex hub genes has been demonstrated to be critical
   markers for CD8 + T cells in breast cancer. While CD8 + T cells are
   critical for breast cancer development, we proposed that CLEC2D, CRTAM,
   and KLRB1 are associated with clinical characteristics. The comparison
   of clinical information between high and low-expression groups
   suggested that CLEC2D was associated with T stage, age, and radiation
   therapy of breast cancer patients. CRTAM was associated with the N
   stage, race, age, PAM50 (molecular subtype), and radiation therapy of
   breast cancer patients. KLRB1 was associated with the M stage, race,
   age, PAM50, menopause status, and radiation therapy of breast cancer
   patients ([199]Supplementary Table). In the high-expression group of
   CLEC2D, CRTAM, and KLRB1, more patients underwent radiation therapy. In
   general, radiation therapy is often recommended for patients who are in
   relatively good condition and able to tolerate treatment. However, it
   can also be administered to patients with more advanced disease or
   compromised health status, particularly if the potential benefits of
   treatment outweigh the risks. Therefore, the association inferred from
   our data suggests that high expression of CLEC2D, CRTAM, and KLRB1 may
   be associated with less severe cases where radiation therapy is
   recommended, possibly because patients are resistant to other types of
   treatment such as chemotherapy. CLEC2D was associated with the T stage
   indicating that it results in smaller tumors, which might reflect the
   impact of T cells on tumor growth. Tumor-infiltrating T cells have been
   associated with smaller tumor size^[200]81. CRTAM was associated with a
   slightly higher lymph node metastasis (N stage), which might reflect
   the impact of T cells on lymph node metastasis. Tumor-infiltrating
   lymphocytes, including cytotoxic CD8 + T cells, can recognize and
   eliminate cancer cells within lymph nodes, thereby inhibiting the
   spread of metastatic disease^[201]82. The association of KLRB1 with
   decreased tumor metastasis (M stage) suggests a potential role for
   CD8Tex cells in limiting breast cancer metastasis. Taken together,
   these results suggest that patients with CD8Tex may have smaller tumors
   with locoregional metastases, a clinical scenario commonly addressed
   with post-surgical radiotherapy. The association of CD8Tex hub genes
   with clinical characteristics, particularly the link with radiotherapy,
   raises intriguing questions about the immune activity status in breast
   cancer patients. The observed associations between CLEC2D, CRTAM, and
   KLRB1 expression levels and various clinical parameters, including
   tumor stage, lymph node involvement, molecular subtype, menopause
   status, and receipt of radiotherapy, suggest potential implications for
   patient management and treatment outcomes. However, it is important to
   note that the observed differences were slight and may not have
   significant clinical implications.

   These findings underscore the importance of considering the immune
   context of tumors in clinical decision-making and the potential
   implications for personalized treatment approaches. However, further
   investigation, including validation in independent cohorts and
   functional studies, is needed to fully elucidate the role of exhausted
   T-cells and their associated biomarkers in breast cancer progression
   and response to therapy.

CD8Tex hub genes and breast cancer heterogeneity

   To investigate the CD8Tex hub genes and breast cancer extra-tumour
   heterogeneity, we plotted the expression level of CLEC2D, CRTAM, and
   KLRB1 in PAM50 breast cancer subtypes. Results revealed that CLEC2D and
   CRTAM do not have differences among PAM50 breast cancer subtypes, KLRB1
   expression in Luminal B was significantly lower than the other
   subtypes. (Fig. [202]8A) To explore the CD8Tex hub genes and breast
   cancer intro-tumour heterogeneity, we also compared the levels of
   CLEC2D, CRTAM, and KLRB1 in invasive ductal cancer and ductal cancer in
   situ using spatial sequencing data. The invasive ductal cancer and
   ductal cancer in situ of breast cancer tissue were delineated by an
   experienced clinical pathologist and the expression of CLEC2D, CRTAM,
   and KLRB1 in invasive ductal cancer and ductal cancer in situ of breast
   cancer tissue were measured at 4 separate layers. We summed the results
   of 4 layers for analysis and results showed that CRTAM and KLRB1
   expression was not significantly different between invasive ductal
   cancer and ductal cancer in situ, while CLEC2D expression was
   significantly higher in invasive ductal cancer over ductal cancer in
   situ(Fig. [203]8B). Images of the spatial sequencing slide were shown
   in Fig. [204]8C. These data suggested that KLRB1 could mark the
   difference between cancer types while CLEC2D could mark the difference
   between breast cancer tissue types within a sample. As CLEC2D and KLRB1
   are two critical T cell genes, these data also reflect the potential
   role of T cell infiltration in breast cancer.

Figure 8.

   [205]Figure 8
   [206]Open in a new tab

   Tumour heterogeneity association of CD8Tex hub marker genes. (A)
   Expression of CD8Tex hub marker genes in breast cancer subtypes. (B)
   Expression comparison of CD8Tex hub marker genes in invasive ductal
   cancer and ductal cancer in situ by spatial sequencing. C. Immages of
   the spatial expression of CD8Tex hub marker genes in invasive ductal
   cancer and ductal cancer in situ.

Pan-cancer expression and survival analysis of CD8Tex hub genes

   To further expand the application of the CD8Tex hub genes for
   pan-cancer use, we systematically explored the expression of CLEC2D,
   CRTAM, and KLRB1 across cancer types, which is provided in the
   [207]Supplementary results.

Discussion

   Tumor-associated immune cells may promote or inhibit cancer cells
   depending on their function and role in immunity. Many previous studies
   have suggested the impact of the immune environment on
   cancers^[208]30,[209]32,[210]34,[211]36–[212]38,[213]61,[214]83–[215]86
   . The novelty of this study lies in our approach to identifying marker
   genes for CD8Tex using multiple immune single-cell sequencing datasets.
   By conducting intersection analysis across multiple independent
   single-cell datasets, we aimed to ensure the stability and
   reproducibility of the identified biomarker gene sets. Furthermore, we
   identified hub genes for these biomarkers, namely CLEC2D, CRTAM, and
   KLRB1, which have been less explored in the context of breast cancer.
   Of significant importance is our discovery of the association between
   these hub genes and various aspects of breast cancer immunity, clinical
   features, and tumor heterogeneity. This comprehensive analysis sheds
   light on previously overlooked molecular players in breast cancer and
   highlights their potential relevance for understanding disease
   progression and treatment response. Overall, our study offers valuable
   insights into the immune landscape of breast cancer and uncovers
   potential targets for further investigation and therapeutic
   intervention.

   In the CD8Tex-based survival model we constructed, the risky genes
   include CLEC2D, CRTAM, EZR, HLA-DRB1, NKG7, SLA2, SLC7A5, and SYTL3,
   while the protective genes include CTSW, GBP2, IFNG, KLRB1, MT-ND1,
   PSME2, SH2D2A, and TOX. Most of these genes are reported for the first
   time as prognostic biomarkers in breast cancer, although some have been
   previously associated with breast cancer treatment or the breast cancer
   microenvironment. CRTAM enhances the immune response against tumors in
   triple-negative breast cancer by increasing the infiltration of CD8 + T
   cells^[216]87. An integrative multi-omics analysis identifies a gene
   signature related to metastasis, highlighting the potential oncogenic
   role of EZR in breast cancer^[217]88. Breast cancer is associated with
   increased allele frequencies of HLA-DRB1*11:01 and HLA-DRB1*10:01 in a
   population of patients from Central Italy^[218]89; however, this
   pertains to genetic mutation rather than gene expression. Targeting the
   glutamine metabolic reprogramming mediated by SLC7A5 enhances the
   effectiveness of anti-PD-1 therapy in triple-negative breast
   cancer^[219]90. GBP2 serves as a prognostic biomarker and is linked to
   immunotherapeutic responses in gastric cancer^[220]91; however, its
   association with breast cancer has not been reported. As a single-cell
   signature, low expression of KLRB1 predicts poor survival outcomes and
   is associated with immune infiltration in breast cancer^[221]92. It is
   also reported that inhibiting KLRB1 expression is associated with
   impaired cancer immunity, leading to cancer progression and poor
   prognosis in patients with breast invasive carcinoma^[222]93. Data
   suggested that PSME2 was associated with immune-hot tumors in breast
   cancer and with a favorable therapeutic response to
   immunotherapy^[223]94. Overall, our results align with previous studies
   regarding some of these genes. However, we emphasize their prognostic
   value and have included additional novel biomarker genes to enhance the
   overall prognostic accuracy of our model.

   The cell types identified with biomarker gene sets all play important
   roles in the immunity of cancers. Tumor-infiltrating B cells are an
   essential part of the antitumor action of T cells^[224]95. CD8 + T
   cells (CD8T) and CD4 + T cells have a similar differentiation process
   but once differentiated, the CD4 + cells or the CD8 + cells are fixed
   and function differently^[225]96. Conventional CD4 + T cells (CD4Tconv)
   lack phenotypic markers that distinguish these cells from FoxP3 + T
   regulatory cells (Treg)^[226]95. CD8T includes cytotoxic T cells, which
   directly target cancer cells and induce apoptosis of cancer
   cells^[227]97. During cancer progression, cancer-associated
   fibroblasts, M2 macrophage, and Treg might negatively regulate
   anti-cancer immune responses mediated by CD8 + T cells^[228]98.
   Macrophages are monocytes that have migrated from the bloodstream into
   tumor tissues, which serve as scavengers, maintaining homeostasis in
   tumors ^[229]99. The effect of myofibroblast on cancer immunity has
   been less reported. A study has shown that myofibroblast regulated type
   I collagen thereby impacting cancer immunity^[230]100. It was reported
   that the tumor-infiltrating T cells directly in contact with the cancer
   cells proliferate more frequently compared with T cells in the
   stroma^[231]101, hence, proliferating T cells (Tprolif) were thought to
   be a subline of T cells. Exhausted CD8 T Cells (CD8Tex) is a set of
   sub-cell lines of CD8 T Cells that persist in cancer but are
   unsuccessful in killing cancer cells^[232]102. The presence and the
   type of CD8Tex have been thought to be critical for the response of
   cancer to some immune checkpoint blockades^[233]103. Therefore, the
   abundance of these immune-related cells in tumors is a critical factor
   for tumorigenesis as well as cancer immune therapy.

   The analysis of the immune association of subtypes based on marker
   genes for different cell types suggested that the subtyping systems
   performed well in the separation of patients with different immune
   relevance. The subtyping identified C2 as standing out from the other
   patients. Interestingly, the subtypes also performed well in the
   separation between ICB responders and ICB non-responders. These ICB
   predictions were based on TIDE scores. However, the TIDE score was
   designed to predict response to immune checkpoint blockade, including
   anti-PD1 and anti-CTLA4, for melanoma and NSCLC. The use of TIDE in
   this study is based on the assumption that breast cancer has a similar
   immune system as melanoma and NSCLC. Hence, one should be cautious in
   interpreting these results. Further study to define a prediction model
   specific to breast cancer is required when more breast cancer immune
   therapy cohort data are available for model training. In the comparison
   of survival models, we concluded that the Tprolif model was the best
   overall. However, the other model also provided a rather comparative
   prognostic power. Nevertheless, the main purpose of this analysis is to
   narrow down the critical genes for the subsequent study. Another
   critical limitation is the observations in special sequencing data
   might be subjected to bias. Firstly, it's essential to note that the
   resolution of the spatial sequencing results is relatively low, and
   some identified blocks may encompass both invasive ductal carcinoma and
   ductal carcinoma in situ within breast cancer tissue. Secondly, the
   classification of invasive ductal carcinoma and ductal carcinoma in
   situ is determined by a licensed clinical pathologist using her
   clinical expertise and knowledge, rather than relying on a standard
   computational pipeline for detection. Additionally, drawing conclusions
   solely based on these biomarkers for T cell infiltration levels may
   carry significant risk. Hence, our data offer only initial insights
   into the biomarkers, and further investigation into the association of
   CD8Tex heterogeneity is warranted.

   In this study, three critical hub genes were identified, including
   CLEC2D, CRTAM, and KLRB1. NK cells play a central role in the immune
   system, particularly in the recognition and elimination of malignant
   and infected cells. 2B4 (CD244, SLAMF4) and CS1 (CD319, SLAMF7) are NK
   cell receptors that control their cytotoxic function. Lectin-like
   transcript 1 (LLT1), a member of the C-type lectin-like domain family 2
   (CLEC2D), induces IFN-g production but does not directly regulate
   cytolysis. LLT1, which is expressed in other cells, acts as a ligand
   for the NK cell inhibitory receptor NKRP1A (CD161) and suppresses NK
   cytolytic activity. Research has been conducted on novel therapies that
   target these receptors to enhance NK cell effector functions^[234]104.
   Hence, CLEC2D is associated with the NK-CD8 + cell regulation. On the
   other hand, CD4( +)T cells possessing CRTAM can differentiate into
   CD4( +)CTLs. A study found that after activation, CRTAM( +) CD4( +) T
   cells secrete IFN-γ and express CTL-related genes such as eomesodermin,
   Granzyme B, and perforin, which suggest that CRTAM( +) T cells are the
   precursor of CD4( +)CTLs^[235]105. Additionally, ectopic expression of
   CRTAM in T cells induces IFN-γ production, expression of CTL-related
   genes, and cytotoxic activity. This is further supported by the fact
   that CRTAM-mediated intracellular signaling is required for the
   induction of both CD4( +)CTLs and IFN-γ production^[236]105.
   Furthermore, these CRTAM( +) T cells traffic to mucosal tissues and
   inflammatory sites and develop into CD4( +)CTLs, which can either
   mediate protection against infection or induce inflammatory response
   depending on the situation. These results demonstrate that CRTAM is a
   key factor in the differentiation of CD4( +)CTLs through the induction
   of Eomes and CTL-related genes^[237]105. As for KLRB1, this is a gene
   that encodes CD161 protein. CD161 has been proposed as a pan-cancer
   immune checkpoint^[238]106. Therefore, it is not surprising that, in
   this study, we demonstrated that these two genes were closely
   associated with the cancer immune environment and can be used to
   predict the immune therapy response.

   The mining and analysis of multi-omic profiling data enable
   bioinformatic study with a rather comprehensive understanding of genes
   in cancers. In this study, combining single-cell RNA sequencing
   (scRNA-seq), bulk RNA sequencing (bulk RNA-seq), and spatial
   transcriptomics data would indeed constitute a multi-omic approach,
   provided that all of these datasets are derived from RNA sequencing
   (RNA-seq) technologies. While each type of RNA-seq data captures gene
   expression information at different scales and resolutions, integrating
   them allows for a more comprehensive understanding of biological
   systems. Integrating scRNA-seq, bulk RNA-seq, and spatial
   transcriptomics data allows researchers to analyze gene expression
   dynamics at multiple scales, from individual cells to tissue-level
   spatial organization. This multi-omic approach enables the
   identification of cell types, subpopulations, and spatially defined
   gene expression patterns, facilitating a deeper understanding of
   complex biological processes and disease mechanisms.

   The perspective of this work is to deepen our understanding of the role
   of CD8Tex in breast cancer. While immune therapy is not yet widely
   utilized in breast cancer treatment, there is promise in using T
   cell-based immune therapy for some refractory breast cancer cases.
   CD8 + Tex cells might also be developed as a drug target, as numerous
   studies have reported interactions involving genes, drugs^[239]107,
   diseases, cells, and even the microbiome^[240]108–[241]112. Whether
   CD8 + Tex cells play a role in these interactions can be explored in
   future research. A recent study found an association between breast
   cancer and menopause^[242]113. Further investigation is needed to
   determine whether menopause affects CD8 + T cells in breast cancer. One
   of the biggest issues in breast cancer is drug resistance^[243]114.
   Addressing the challenges in this field necessitates exploration. Our
   identification of hub genes for these biomarkers, such as CLEC2D,
   CRTAM, and KLRB1, and their potential association with known biomarkers
   of breast cancer cells, such as CHEK2^[244]115 and TP53^[245]116, lays
   the groundwork for potential clinical applications in breast cancer
   management. Further research is needed to explore how these biomarkers
   can be effectively incorporated into clinical practice to improve
   patient outcomes. Additionally, non-invasive early detection of breast
   cancer has recently shown promising advancements^[246]117. The
   biomarkers discovered may have the potential to contribute to the
   non-invasive detection of this cancer type.

Supplementary Information

   [247]Supplementary Information.^ (3.1MB, docx)

Acknowledgements