Abstract Cancer is one of the most concerning public health issues and breast cancer is one of the most common cancers in the world. The immune cells within the tumor microenvironment regulate cancer development. In this study, single immune cell data sets were used to identify marker gene sets for exhausted CD8 + T cells (CD8Tex) in breast cancer. Machine learning methods were used to cluster subtypes and establish the prognostic models with breast cancer bulk data using the gene sets to evaluate the impacts of CD8Tex. We analyzed breast cancer overexpressing and survival-associated marker genes and identified CD8Tex hub genes in the protein–protein-interaction network. The relevance of the hub genes for CD8 + T-cells in breast cancer was evaluated. The clinical associations of the hub genes were analyzed using bulk sequencing data and spatial sequencing data. The pan-cancer expression, survival, and immune association of the hub genes were analyzed. We identified biomarker gene sets for CD8Tex in breast cancer. CD8Tex-based subtyping systems and prognostic models performed well in the separation of patients with different immune relevance and survival. CRTAM, CLEC2D, and KLRB1 were identified as CD8Tex hub genes and were demonstrated to have potential clinical relevance and immune therapy impact. This study provides a unique view of the critical CD8Tex hub genes for cancer immune therapy. Keywords: Immune cell, Breast cancer, Machine learning, Single-cell sequencing, Spatial sequencing Subject terms: Computational biology and bioinformatics, Cancer, Breast cancer, Oncogenes, Tumour biomarkers, Tumour immunology Introduction Cancer is one of the most concerning public health issues in the world^[32]1. It is estimated that in 2024, there will be approximately 2,001,140 new cancer cases and 611,720 cancer-related deaths in the United States^[33]2. In China, it was estimated that approximately 4,800,000 new cancer cases occurred, causing about 3,200,000 cancer-related deaths^[34]3. Breast cancer is one of the most common cancers in the world ^[35]3. Much as the prevention and tumorigenesis of breast cancer have been studied intensively in the past decades^[36]4, the incidence of breast cancer increased by 0.5% each year from 2014 to 2018^[37]5. The development of breast cancer was impacted by both genetic risk factors and environmental risk factors. Clinical breast cancer was subtyped by the expression level of certain breast cancer critical receptors: the estrogen receptor (ER), the progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2). Breast cancer was divided into the following 4 molecular subtypes: Luminal A, Luminal B, Triple negative (all called basal-like), and HER2-enriched. It has been widely accepted that the immune cells within the tumor microenvironment regulate cancer development^[38]6. Tumor-infiltrating immune cells have emerged as clinically relevant and highly associated with prognosis and response to treatment for breast cancer as well as other cancer types ^[39]7. Checkpoint blockade therapies have demonstrated notable advancements in treating various human cancer types^[40]8–[41]10. Breast cancer, previously considered poorly immunogenic, has been an exception^[42]11. Even though breast cancer isn't typically considered highly immunogenic due to its relatively low tumor mutational burden, the abundance of tumor-infiltrating lymphocytes in breast cancer correlates with markedly improved prognoses, both with and without PD-1 targeted immunotherapy^[43]12,[44]13. Two checkpoint inhibitors targeting the PD-1/PD-L1 pathway, atezolizumab and pembrolizumab, have gained approval for treating triple-negative breast cancer patients^[45]14,[46]15. Our comprehension of the mechanisms underlying resistance or response to immunotherapy remains incomplete, as does our understanding of the intricate cellular interactions within the tumor immune microenvironment. To develop new immunotherapies and utilize existing ones effectively for breast cancer patients, it is imperative to grasp the tumor immune microenvironment comprehensively. Although breast cancer tumor-infiltrating lymphocytes are mainly composed of CD3 + T cells^[47]16,[48]17, recent research has identified a subset of CD8 + T cells that play a crucial role in breast cancer^[49]18. This finding is supported by a single-cell RNA sequencing study of CD3 + T cells isolated from human primary breast cancer^[50]18. This subset of CD8 + T cells exhibited elevated expression levels of immune checkpoint molecules like PDCD1 (PD-1), CTLA4, HAVCR2 (TIM-3), and LAG3. The transcriptional signature originating from these cells was linked to improved prognoses, irrespective of the total quantity of CD8 + T cells present and the administered treatment^[51]18,[52]19. For immune cells and other cells in cancer, the unique biomarkers of cells can be used to evaluate the abundance of the cells in tumors. The discovery and investigation of these cell biomarkers in cancer facilitate the understanding of the role and function of the corresponding cells in tumors. The inherent heterogeneity of breast cancer poses significant challenges for conventional diagnostic and therapeutic methods^[53]20. Typically, these approaches rely on analyzing bulk tumor tissue samples, which may obscure underlying heterogeneity due to their focus on average expression levels^[54]20. However, emerging technologies like single-cell analysis offer promising alternatives, already widely used in oncology research^[55]21,[56]22. By examining gene expression, phenotypes, protein levels, and other cellular properties at an individual cell level, these techniques are well-suited to tackle tumor heterogeneity, particularly in highly diverse cancers like breast cancer^[57]23–[58]25. Single-cell analysis can aid in predicting cellular evolution during tumor progression and enhance the precision of predicting treatment outcomes and patient prognosis^[59]23–[60]25. Moreover, these techniques play a vital role in devising novel therapeutic strategies by enabling the detailed examination of genetic variations and phenotypic characteristics of tumor cells, leading to the identification of new therapeutic targets^[61]20. This, in turn, facilitates the development of highly targeted treatment strategies, improving our ability to predict treatment efficacy and potential drug resistance. Single-cell gene sequencing, utilizing next-generation sequencing (NGS), has become indispensable for studying breast cancer heterogeneity^[62]26. Unlike traditional Sanger sequencing, NGS systems employ massive parallel sequencing to generate billions of DNA reads, allowing for the detection of various genetic variations, including single-nucleotide mutations, small insertions/deletions, and copy number variations^[63]27. This comprehensive view aids in streamlining the development of targeted treatment strategies. For example, in HER2-positive breast cancer, single-cell sequencing identifies diversity in HER2 gene amplification across different cells, facilitating personalized treatment plans^[64]28. NGS versatility extends to RNA sequencing (RNA-seq), enabling quantitative and sequence analyses of diverse RNA types and their expression levels, enriching our understanding of breast cancer molecular mechanisms^[65]29. RNA sequencing has been wildly used in cancer research ^[66]30–[67]39. For transcriptomic data, generally, bulk sequencing data provides averaged expression levels of all cells in a tissue sample, while single-cell sequencing data has been used to decipher the cellular and molecular landscape at a single-cell resolution ^[68]40. The advantage of bulk sequencing data is that it often comes with the clinical information of the patients. Such patient levels data facilitate the analysis of diagnosis and prognosis as well as other clinical factors associated with cancers. In addition, spatial sequencing provides gene expression profiles of a sample with positional information, which is useful for studying heterogeneity within a tumor sample ^[69]41. Single-cell RNA sequencing (scRNA-seq) presents significant new prospects for systematically delineating the cellular landscape of tumors and uncovering fresh insights into cell biology, disease etiology, and drug response^[70]42,[71]43. Numerous studies have effectively employed scRNA-seq to examine selected populations within human breast tumors, unveiling a spectrum of differentiation states within tumor-infiltrating lymphocytes^[72]44, highlighting the role of tissue-resident CD8 cells in breast cancer^[73]18, and shedding light on chemoresistance mechanisms in breast cancer neoplastic cells^[74]45. Recent endeavors have utilized mass cytometry with antibody marker panels to scrutinize breast cancer cell types and ecosystems across hundreds of patients^[75]46,[76]47. Consequently, there is a pressing need for a more comprehensive transcriptional atlas of breast tumors at high molecular resolution, encompassing all subtypes and cell types. Such an atlas would aid in refining the disease's taxonomy, delineating heterotypic cellular interactions, and elucidating cellular differentiation processes. Equally crucial are data systematically mapping the spatial transcriptomic architecture of breast tumors, as this can unveil how cells in the tumor microenvironment (TME) are organized into functional units. A recent paper integrated single-nucleus RNA sequencing with microarray-based spatial transcriptomics to delineate cell populations and their spatial distribution within breast cancer tissues ^[77]48. Our study focused on exhaust CD8 + T cells (CD8 + Tex). The recent surge in cancer immunotherapy, primarily based on checkpoint blockade, has been a breakthrough in treating various cancer types. However, certain factors are hindering the progress of these treatments, such as varying genetic make-up of individuals, resistant tumor sub-types, and immune-related adverse events. While the focus of immunotherapies has been on improving CD8 + T cells, the relationship between CD4 + T cells and CD8 + T cells is also gaining attention. The tumor-infiltrating T regulatory (Treg) cells are a major obstacle in the cross-talk between CD4 + T cells and CD8 + T cells since they are capable of inhibiting anti-tumour immunity^[78]49. CD8 + Tex, which is often seen in chronic infections and cancer, is a progressive process characterized by decreased effector function and upregulation of inhibitory receptors such as PD-1 and Tim-3. Although immunological checkpoint inhibitors have allowed for the eradication of tumors, a better understanding of the mechanisms by which T cell–exhaustion pathways work in tumors and the factors that drive them is needed. In this regard, the role of CD8 + Tex in immunosuppression is key to the resistance of cancer in immune therapy. The study hypothesized that certain genes can be biomarkers of CD8 + T cells in breast cancer as well as other cancer types. In this study, we aimed to identify these genes and demonstrate their association with cancers. We believe this study provides a unique view of the critical T cell hub genes for cancer immune therapy. Methods Overview of the study Single immune cell data sets were used to identify marker gene sets for CD8 + Tex cells in breast cancer. A machine learning method, consensus clustering, was used to cluster TCGA BRCA patients using the identified marker genes, hence, we constructed CD8 + Tex-based genetic subtypes based on the abound of CD8 + Tex in breast cancer samples. We compared the immune cell infiltration levels and predicted immune checkpoint blockade response rate among subtypes to demonstrate the potential clinical value of the subtype systems for immune therapy of breast cancer. The identified marker gene sets were also used to construct the prognostic models for breast cancer patients using a machine learning algorithm lasso (least absolute shrinkage and selection operator) with TCGA BRCA cohort, which further identified the critical genes for the subsequent study. We further analyzed the protein–protein interaction of these molecules and identified hub genes in the protein–protein-interaction network. The correlation between the hub genes and CD8 + T-cell infiltration levels of breast cancer was evaluated using different immune cell calculation algorithms. The clinical associations of these hub genes were analyzed using the clinical information of breast cancer patients and their expression differences in invasive ductal cancer and ductal cancer in situ were analyzed using spatial sequencing data. The pan-cancer cancer-non-cancer expression and survival association of these hub genes were analyzed. The correlation between the hub genes and CD8 + infiltration levels and the immune therapy predictive values of these hub genes were analyzed using immune checkpoint blockade sub-cohorts. Data collection Single-cell cohorts were accessed and analyzed from the TISCH2 platform^[79]50([80]http://tisch.comp-genomics.org/). In this study, 4 single-cell sequence data sets were included as shown in Table [81]1. This dataset comprises expression data from immune cells obtained through fluorescence-activated cell sorting (FACS), focusing on an enriched fraction of immune cells. The MAESTRO v1.1.0 workflow^[82]51 ([83]https://github.com/liulab-dfci/MAESTRO/blob/master/README.md) employed PCA for dimension reduction and KNN and Louvain algorithms^[84]52,[85]53 for clustering to identify 2000 variable features for each dataset; the number of principal components and the resolution for graph-based clustering were adjusted according to the cell number. Previous studies revealed that MAESTRO demonstrated superior consistency across nearly all cell types, regardless of whether the correlation was computed using all genes or solely the top 2000 variable genes^[86]51, hence in this study, only the top 2000 variable genes were used. UMAP was used to reduce the dimension and visualize the clustering results^[87]54, and the Wilcoxon test was used to identify differentially expressed (DE) genes of each cluster of cell type (|logFC|> = 0.25, FDR < 1e-05). Data from The Cancer Genome Atlas (TCGA, [88]https://www.cancer.gov/ccg/research/genome-sequencing/tcga) and GTEx ([89]https://gtexportal.org/home/) were obtained, which included gene expression profiling data and clinical information on cancer tissues. This data was obtained in accordance with the guidelines and policies. Table 1. Information for single-cell data sets. Dataset Name Species Treatment Patients number Cells CD8Tex cells CD8Tex cell (%) Platform Primary or metastasis PublicationS [90]GSE110686 Human None 2 6,035 622 10.3 10 × Genomics Primary and metastasis ^[91]18 GSE114727_10X Human None 3 28,678 1389 4.8 10 × Genomics Primary ^[92]44 [93]GSE176078 Human None 26 89,471 13,500 15.1 10 × Genomics Primary ^[94]55 EMTAB8107 Human None 14 33,043 5193 15.7 10 × Genomics Primary ^[95]56 [96]Open in a new tab Single-cell data quality control A standardized analysis pipeline utilizing MAESTRO v1.1.0^[97]51 was employed to process all gathered datasets. This workflow encompassed quality control, batch effect mitigation, cell clustering, differential expression analysis, cell-type annotation, and malignant cell classification. The raw count, TPM, or FPKM table served as input for this standardized workflow. Cell quality was assessed using two metrics: total counts (UMI) per cell (library size) and the number of detected genes per cell. Cells with low quality were excluded if the library size was < 1000 or the number of detected genes was < 500. Single-cell data batch effect correction To systematically assess batch effects across each dataset, an entropy-based metric ^[98]44,[99]57 was utilized to quantify data mixing among batches. Typically, samples from different patients in most datasets are susceptible to batch effects. A k-NN graph (k = 30) was constructed based on the Euclidean distance between cells in UMAP coordinates for datasets with more than one patient. For each cell j, the distribution of patients in its nearest neighbors was computed. The measure of mixing between patients Hj is defined as: [MATH: Hj=- t=1 Tpj tlog2pjt :MATH] here, [MATH: pjt :MATH] represents the proportion of cells from patient t among the 30 nearest neighbors of cell j, while T denotes the total number of patients. High entropy, indicating that the most similar cells in a cell's neighborhood come from different patients, suggests potential batch effects. Conversely, low entropy suggests that the most similar cells originate from the same patient, indicating a potential batch effect. It is noteworthy that datasets primarily composed of malignant cells (malignant % > 75%) may exhibit low entropy due to the heterogeneity of malignant cell expression among different tumors ^[100]42. Consequently, the collected datasets were classified into three groups. Firstly, for datasets primarily containing malignant cells, there was no need to eliminate batch effects between different patients, as they reflect differences between distinct tumors. Secondly, datasets with a median entropy lower than 0.7 underwent batch effect correction using Seurat v3.1.2^[101]58. Median entropies shifted towards higher values post-batch effect removal, indicating significant correction of potential batch effects. Thirdly, datasets with a median entropy higher than 0.7 were considered less affected by batch effects. Single-cell clustering and differential gene analysis For each dataset, the MAESTRO workflow identified the top 2000 variable features and conducted PCA for dimension reduction, followed by employing the KNN and Louvain algorithm for cluster identification^[102]52,[103]53. To better capture cellular differences and variabilities across datasets with varying cell numbers, adjustments were made to the number of principal components and the resolution for graph-based clustering. Both parameters were increased with increasing cell numbers. The uniform manifold approximation and projection (UMAP) were utilized for further dimension reduction and visualization of clustering results^[104]59. The Wilcoxon test was applied to identify differentially expressed (DE) genes for each cluster compared to all other cells, based on criteria such as log-transformed fold change (|logFC|≥ 0.25) and false discovery rate (FDR < 1e−05). Clusters of cells were identified using a combination of three approaches. Firstly, cell-type annotations provided by the original studies were utilized. Secondly, the expression distribution of malignant cell markers from the initial research was assessed, including epithelial markers and EMT genes where available. Thirdly, we applied InferCNV ([105]https://github.com/broadinstitute/infercnv) to predict cell malignancy based on predicted copy number variations was employed, segregating cells into malignant and non-malignant clusters. For the remaining normal clusters, an automated marker-based annotation method within MAESTRO was applied using the differentially expressed genes between clusters. Subsequently, the cell-type annotations based on the annotations provided by the original studies were manually verified and corrected. Bulk data differential expression analysis Differentially expressed genes (DEGs) between subtypes were identified using the Limma package with a cut-off fold change of 1.3 and a P-value of 0.05. DEGs are genes whose expression levels vary significantly between different groups. In this study, the goal is to identify genes that are differentially expressed between different subtypes. Limma (Linear Models for Microarray Data, [106]https://bioconductor.org/packages/release/bioc/html/limma.html) is a statistical package in R used for analyzing gene expression data. It employs linear models and empirical Bayes methods to identify DEGs with high sensitivity and specificity. Bulk data enrichment analysis For the GO biological and KEGG pathway enrichment analyses, the ClusterProfiler package (version: 3.18.0, [107]https://bioconductor.org/packages/release/bioc/html/clusterProfile r.html) in R was employed. The False discovery rate (FDR) and p.adjust were set at 0.25 and 0.05, respectively. Gene Ontology (GO, [108]https://geneontology.org/) is a standardized system for annotating genes and their products with terms representing biological processes, molecular functions, and cellular components. GO biological pathway enrichment analysis involves taking a list of genes, often derived from experimental data such as gene expression studies or genome-wide association studies, and determining whether any particular biological processes represented by GO terms are significantly enriched in that gene list compared to what would be expected by chance. The Kyoto Encyclopedia of Genes and Genomes (KEGG, [109]https://www.genome.jp/kegg/) is a collection of databases that contain information about biological pathways, diseases, drugs, and other biological entities. KEGG pathway enrichment analysis involves mapping a list of genes to known biological pathways in the KEGG database and determining whether any pathways are significantly enriched in the gene list. Analysis of the survival The univariate Cox regression analysis, Kaplan–Meier (KM) plot, and log-rank analysis were used to assess the survival association and display the survival curves of genes. Univariate Cox regression analysis examines the association between a single predictor variable (such as a gene expression level or a clinical characteristic) and survival time. It calculates the hazard ratio, which represents the relative risk of experiencing the event of interest (such as death) between two groups defined by the predictor variable. The Cox proportional hazards model is commonly used for this analysis, allowing for the estimation of hazard ratios while accounting for censoring and other covariates. The Kaplan–Meier plot is a graphical method used to estimate the survival function (probability of survival) over time. It is particularly useful for visualizing survival differences between groups defined by categorical variables (such as sub-groups or biomarker expression levels). The plot displays the proportion of individuals surviving at each time point, along with confidence intervals. The log-rank test is a statistical test used to compare the survival curves of two or more groups. It assesses whether there are significant differences in survival times between the groups, taking into account censoring. The test evaluates whether the observed differences in survival are greater than would be expected by chance. If the p-value from the log-rank test is below a predetermined significance level (0.05), it indicates that there is a statistically significant difference in survival between the groups. All analyses were carried out using R (foundation for statistical computing 2020) version 4.0.3 ([110]https://cran.r-project.org/bin/windows/base/old/4.0.3/). Statistical significance was defined as p < 0.05. Consensus clustering analysis Subtyping of the samples was carried out using the ConsensusClusterPlus package([111]https://bioconductor.org/packages/release/bioc/html/Consen susClusterPlus.html). The number of clusters was set at 1–6 for consistency analysis to optimize the best clustering number. Unsupervised class discovery involves identifying potential groups within a dataset based solely on inherent features without external guidance. Researchers using this technique typically aim to address two key questions: the number of groups present in the data and the confidence level associated with both the group quantity and their memberships. Consensus clustering^[112]60 serves as a valuable method for tackling these inquiries, particularly prominent in fields such as cancer research^[113]36,[114]61. Consensus clustering offers both quantitative and visual indicators of “stability”, obtained through iterative subsampling and clustering. By synthesizing outcomes from multiple repetitions, consensus clustering generates a consensus that demonstrates resilience against sampling variations. While initially available within the GenePattern software^[115]62, the consensus clustering technique has been further developed into ConsensusClusterPlus, a package in the R language offering enhanced functionalities and visualizations. In this study, this method is suitable for distinguishing subsets of samples of breast cancer patients based on certain genes and comparing the overall effect of these genes on clinical phenotypes. Immune analysis The immune cell infiltration level was calculated using different algorithms, including TIMER^[116]63, XCELL^[117]64, CIBERSORT^[118]65, MCPCOUNTER^[119]66, QUANTISEQ^[120]67, and EPIC^[121]68. TIMER is a web server for the comprehensive analysis of tumor-infiltrating immune cells. It provides immune cell infiltration levels in various cancer types and their associations with clinical outcomes. TIMER ([122]http://timer.cistrome.org/) utilizes gene expression data to estimate the abundance of immune cell subtypes within tumor samples. XCELL ([123]https://github.com/dviraran/xCell) is another computational tool used for cell type enrichment analysis in gene expression data. It estimates the relative abundance of different cell types within a heterogeneous sample, including immune cells. XCELL utilizes gene expression signatures specific to various cell types to infer their proportions in the sample. CIBERSORT (Cell-type Identification By Estimating Relative Subsets Of RNA Transcripts, [124]https://cibersortx.stanford.edu/) is a computational method used to characterize cell composition in complex tissues based on gene expression data. It deconvolutes bulk tissue gene expression profiles to estimate the relative proportions of different cell types present. CIBERSORT is particularly useful for studying immune cell populations in tumor microenvironments. MCPCOUNTER (Microenvironment Cell Populations-counter, [125]https://github.com/ebecht/MCPcounter) is a gene expression-based method for quantifying the abundance of specific immune and stromal cell populations in tumor samples. It uses predefined gene signatures associated with different cell types to estimate their relative proportions in the sample. QUANTISEQ ([126]https://icbi.i-med.ac.at/software/quantiseq/doc/) is a computational tool for the deconvolution of gene expression profiles to estimate the proportions of immune cell subsets within a sample. It utilizes gene expression signatures specific to different immune cell types to infer their relative abundances. EPIC (Evaluating the Presence of Immune Cells, [127]https://github.com/GfellerLab/EPIC) is a computational tool for quantifying immune cell infiltration in tumor samples based on DNA methylation data. It uses DNA methylation profiles to estimate the proportions of immune cell subsets present in the tumor microenvironment. Immune checkpoint blockade (ICB) responses of subtypes were predicted using the Tumor Immune Dysfunction and Exclusion (TIDE, [128]http://tide.dfci.harvard.edu/) algorithm^[129]69 using the TIDE online analysis platform. TIDE represents a computational framework designed to assess the likelihood of immune evasion by tumors. It achieves this by analyzing the gene expression patterns present in cancer samples. The immune therapy sub-cohorts were accessed and analyzed with TIDE tool^[130]69 ([131]http://tide.dfci.harvard.edu/setquery/). The TIDE score was designed to predict response to immune checkpoint blockade, including anti-PD1 and anti-CTLA4, for melanoma and NSCLC. The use of TIDE in this study is based on the assumption that breast cancer has a similar immune system as melanoma and NSCLC. Prognostic model for identification of critical genes The model was constructed using the glmnet ([132]https://glmnet.stanford.edu/articles/glmnet.html) R package, which implemented the least absolute shrinkage and selection operator (LASSO) regression algorithm^[133]70 with tenfold cross-validation for gene signature selection. LASSO is a regression analysis method used for variable selection and regularization. It aims to find the subset of predictor variables that are most relevant for predicting the response variable while simultaneously performing variable selection and regularization to prevent overfitting. In LASSO regression, the ordinary least squares (OLS) objective function is augmented with a penalty term that is the sum of the absolute values of the coefficients multiplied by a regularization parameter (lambda). This penalty term encourages the coefficients of less important variables to be exactly zero, effectively performing variable selection by shrinking some coefficients to zero. In this study, LASSO was used to construct the prognostic model. A validation cohort from Xena-hubs Breast Cancer (Caldas)^[134]71 was used for validation of the model. Protein–protein interaction network and hub gene The protein–protein interaction network was constructed with STRING ([135]https://string-db.org/), where interactions with a score greater than 0.4 were considered. The top 10 hub nodes in the network were identified using the Hubba^[136]72 ([137]https://apps.cytoscape.org/apps/cytohubba) in Cytoscape^[138]73 ([139]https://cytoscape.org/). The algorithm used included Maximum Clique Cardinality (MCC), Density of Maximum Neighborhood Component (DMNC), Maximum Neighborhood Component (MNC), and Degree Centrality (Degree). The combination of these algorithms offers a comprehensive approach to unsupervised class discovery. Each algorithm provides a unique perspective on the dataset. MCC focuses on identifying densely connected subgraphs (cliques), DMNC evaluates the density of the neighborhood around each node, MNC finds the largest connected component, and Degree Centrality measures the importance of nodes based on their connections. By incorporating these diverse perspectives, the combined approach can capture different aspects of the underlying structure of the data. The algorithms complement each other's strengths and weaknesses. For example, while MCC is effective at identifying densely connected subgroups, Degree Centrality can highlight nodes that serve as central hubs within the network. This complementary nature ensures that a wider range of structural features within the data are considered, leading to a more comprehensive analysis. Spatial sequencing analysis The expression differences in invasive ductal cancer and ductal cancer in situ were analyzed using spatial sequencing data with SpatialDB ([140]http://www.spatialomics.org/SpatialDB/index.php). Spatial sequencing, also known as spatial transcriptomics or spatially resolved transcriptomics, is a technology that allows researchers to study gene expression patterns within tissues while preserving spatial information. Spatial sequencing methods integrate high-throughput sequencing techniques with spatially resolved imaging approaches to generate spatially resolved gene expression profiles. These methods enable us to analyze gene expression patterns within intact breast cancer tissue sections, providing insights into the spatial organization of cells and tissues and their functional implications. The spatial sequencing data used in this study were published in a previous paper^[141]74. The invasive ductal cancer and ductal cancer in situ were labeled by a licensed clinical pathologist Dr Jielin Weng in the Department of Pathology, The Second Affiliated Hospital of Guangzhou Medical University. The data were visualized using the SpatialDB tools^[142]75. Results CD8Tex marker gene identification based on single immune cell sequencing The UMAP plots were conducted for each single-cell sequence data set respectively (Fig. [143]1A). These cells were annotated and sorted into 17 cell types as shown in Table [144]2. Marker genes of CD8Tex cells were identified for each data set respectively, and the marker genes were cross-validated by intersection analysis of the marker gene sets obtained from different data sets (Fig. [145]1B left panel). The Jaccard index was calculated as shown in different colors in Fig. [146]1B right panel. Eventually, we obtained 145 marker genes for CD8Tex (Fig. [147]1B). Figure 1. [148]Figure 1 [149]Open in a new tab CD8Tex marker genes identification based on single immune cell sequencing. (A) UMAP plot of breast cancer single immune cell sequencing data sets. (B) Intersection of marker genes identified by single immune cell sequencing data sets. Left panel: the Venn diagram. Right panel: pairwise intersections analysis. Table 2. Cell type abbreviation in single-cell data. Cell type Abbreviation Full name Immune cells B B Cells CD4T CD4 T Cells CD4Tconv Conventional CD4 T Cells CD8T CD8 T Cells CD8Tex Exhausted CD8 T Cells DC Dendritic Cells Mono/Macro Monocytes or Macrophages Mast Mast Cells Neutrophils Neutrophils NK Natural Killer Cells Tprolif Proliferating T Cells Treg Regulatory T Cells Stromal cells Endothelial Endothelial CELLS Fibroblasts Fibroblasts Myofibroblasts Myofibroblasts Cancer cells Malignant Malignant cells Other cells Oligodendrocyte Oligodendrocytes [150]Open in a new tab Clustering of CD8Tex-based subtypes To further study the CD8Tex cells in breast cancer, we collected TCGA cohort to join-analyze the CD8Tex marker genes. The basic clinical information was provided in the [151]Supplementary Table. All of these marker genes indeed have potential value for clinical application, however, the survival association of these genes revealed their practical value as a molecule that remarkably affects the progress of the tumor, hence, we believe that those genes that significantly impact patient survival are more likely to make a clinical difference when used as biomarkers or therapeutic targets. First, we excluded genes that do not affect the survival of breast cancer by conducting an overall survival comparison between high and low-gene expression groups (divided by expression medium). The univariate analysis was performed and the hazard ratio of significant genes was shown in Fig. [152]2A. Only four genes were risk factors for breast cancer patients, while the others were protective factors. Figure 2. [153]Figure 2 [154]Open in a new tab CD8Tex-based subtype clustering. (A) Survival screening of the CD8Tex marker genes. The forest plot shows the hazard ratio (HR). (B1) Consensus Cumulative Distribution Function (CDF) plot of subtype numbers (k = 2–6). (B2) Delta area plot of the consensus CDF plot. (B3) Consensus matrix and cluster trees of subtypes. (B4) Gene expression heatmap of the subtypes. (B5) Principal Component Analysis (PCA) plot of the consensus clustering. (C1) Overall survival Kaplan–Meier plot (KM plot) of the subtypes. (C2) Progression-free interval KM plot of the subtypes. We admitted that not all markers are highly specific for exhausted T-cells, the criteria to collect them is because their expression is significantly different from other cell types, thus, the expression of these marker genes can help define the distinctive expression patterns of cell infiltration features in bulk sequencing data. We aimed to investigate breast cancer with different CD8Tex infiltration levels, thus, consensus clustering was used to cluster TCGA BRCA patients into CD8Tex-based subtypes using the identified marker genes. Based on the consensus cumulative distribution function (CDF) plotting, the number of clusters (K) = 4 was the optimum cluster number for all of these clustering (Fig. [155]2B1, 2). The subtyping approach has been applied to help understand the role of a gene set^[156]76,[157]77. By the NMF method, which is an effective dimension reduction method for cancer subtype identification, patients were clustered into four distinct subtypes (Fig. [158]2B3). The heatmap in Fig. [159]2B4 illustrates the differential expression patterns of marker genes across the breast cancer subtypes (C1, C2, C3, and C4), while the PCA plot in Fig. [160]2B5 demonstrates the clustering of patients based on their gene expression profiles within each subtype. (C1, C2, C3, and C4) (Fig. [161]2B4, 5). The heatmap and PCA plots in Fig. [162]2B4, B5 visually depict the distinct expression patterns among the identified breast cancer subtypes. The heatmap color scale ranges from low expression (blue) to high expression (red), facilitating the interpretation of differential expression patterns across subtypes. In Fig. [163]2B4, the heatmap visually represents the relative expression levels of marker genes across the identified breast cancer subtypes (C1, C2, C3, and C4). Each row in the heatmap corresponds to a gene, and each column represents a patient sample, with color intensity indicating the expression level of each gene. This visualization allows us to observe patterns of differential expression among the subtypes, highlighting potential molecular distinctions. In contrast, Fig. [164]2B5 utilizes PCA (Principal Component Analysis) to explore the overall variation in gene expression profiles among patients within each subtype. PCA is a dimensionality reduction technique that identifies patterns in data and visualizes these patterns by projecting patients onto a reduced-dimensional space defined by principal components. The PCA plot helps to visualize how patients cluster based on their gene expression profiles, providing insights into the similarities and differences among subtypes beyond individual gene expression levels. We analyzed the overall survival and progress-free interval of the subtypes and found that the subtyping failed to distinguish different survival patients except for the progress-free interval of C3 versus C2 (Fig. [165]2C1,2). Immune difference of CD8Tex-based subtypes In addition, we also display the association between molecular subtypes and CD8Tex-based subtypes with a Sankey diagram (Fig. [166]3A). Generally, CD8Tex-based subtypes were not associated with the molecular subtypes. In addition, we also calculated the immune cell infiltration levels of the CD8Tex-based subtypes using multiple algorithms as provided in the [167]supplementary materials. The XCELL algorithms suggested that the CD8Tex-based subtypes separated samples with different immune profiles. The immune score and microenvironment score were remarkably different among CD8Tex-based subtypes. The stroma scores were also significantly different among CD8Tex-based subtypes. The C2 subtypes had a very high immune score and microenvironment score. (Fig. [168]3B). Figure 3. [169]Figure 3 [170]Open in a new tab Immune association of the CD8Tex-based subtypes. (A) Sankey diagram showing the association between breast cancer molecular subtypes and CD8Tex-based subtypes. (B) The XCELL scores in CD8Tex-based subtypes. (C) Tumor Immune Dysfunction and Exclusion (TIDE) scores in CD8Tex-based subtypes. (D) Predicted immune checkpoint blockade (ICB) response of the subtypes based on the TIDE score. To demonstrate the potential clinical value of these subtyping systems, we calculated the TIDE score to predict the response of the samples for ICB therapy. Results revealed that the C2 subtype had a significantly higher TIDE score compared to the other subtypes (Fig. [171]3C). This indicates that C2 subtypes had low T cell response. Based on the TIDE score, we predicted the response of each sample for ICB and calculated the response ratio for each subtype. Results showed that C1 had a 70% response rate, C3 had a 63% response rate, and C4 had a 65% response rate, yet, C2 had an 84% response rate (Fig. [172]3D). It is not surprising that C2 with the highest immune score and microenvironment score also has highest response rate. We believe that this is because, although the results are derived from two distinct algorithms, there are common parameter genes utilized in their calculation. These results suggested that the subtyping systems performed well in the separation of patients with different immune relevance, especially identified C2 subtypes as the responder subtype for ICB, indicating that these marker-gene sets potentially provided clinical value for breast cancer immune therapies. Expression difference of CD8Tex-based subtypes To further investigate the features of patients in C2 subtypes compared to the other patients, we conducted a differential expression gene analysis comparing C2 and the other subtypes. The analysis revealed 1552 up-regulated genes and 980 down-regulated genes in the C2 subtype as shown in the volcano plot (Fig. [173]4A) and heatmap with clustering (Fig. [174]4B). Figure [175]4B depicts a heatmap with hierarchical clustering illustrating the differential expression patterns of genes between the C2 subtype and other subtypes. The rows represent genes, while the columns represent patient samples. The color scale in the legend indicates gene expression levels, ranging from low (blue) to high (red), facilitating the interpretation of the heatmap. This visualization highlights distinct gene expression clusters associated with the C2 subtype, suggesting potential biomarkers or pathways specific to this subgroup. Figure 4. [176]Figure 4 [177]Open in a new tab Differential expression gene enrichment. (A) Volcano plot of the differential genes in CD8Tex-based subtypes C2. TCGA BRCA cohorts were analyzed. C2 samples were compared with the other samples to identify C2 differential genes. (B) Heat map and clustering of the differential genes in CD8Tex-based subtypes C2. (C) GO and KEGG enrichment analysis^[178]78–[179]80 of the differential genes in CD8Tex-based subtypes C2. Subsequently, we enriched these genes in the GO database and KEGG pathways database. Results showed that the up-regulated genes were enriched in multiple terms that are related to the T cell activity. The KEGG pathway enrichment analysis revealed that the up-regulated genes were associated with cytokine interaction and Th1/Th2 differentiation. On the other hand, the down-regulated genes were associated with protein secretion and hormone secretion as well as the PI3K-Akt signaling pathway. (Fig. [180]4C) Although these DEGs and enrichment might related to the previously identified 145 biomarkers, the objective of this enrichment analysis is to broadly investigate the disparities within the C2 cluster and discern which pathways might underlie the diverse clinical phenotypes observed. Rather than pinpointing specific markers, this analysis aims to offer general insights into the biological mechanisms driving these clinical differences. Construction of a CD8Tex-based survival model To further explore the prognostic value of the CD8Tex marker gene set and obtain critical genes for breast cancer patients, we trained a machine-learning prognostic model using the LASSO algorithm and TCGA BRCA cohort. The model suggested that the minimum lambda was 0.0048. When lambda was 0.0048, the model achieved the best performance. The risk score formula and the included genes are presented in Fig. [181]5A–C. The model includes risky genes (CLEC2D, CRTAM, EZR, HLA-DRB1, NKG7, SLA2, SLC7A5, and SYTL3) and protective genes (CTSW, GBP2, IFNG, KLRB1, MT-ND1, PSME2, SH2D2A, and TOX) which are discussed later. The KM plot suggested that the high-risk group had a significantly worse survival than the low-risk group (Fig. [182]5D). This model helped us narrow down the critical genes for CD8Tex in breast cancer. The time-dependent ROC analysis also revealed that the AUC of the ROC was over 7 for different years of prediction, indicating a good accuracy of the model for survival prediction (Fig. [183]5E, [184]F). The model is validated with another independent cohort (Fig. [185]5G). The validation cohort results are consistent with those of the training cohort, demonstrating that the model can distinguish between patients with long survival and those with short survival. Figure 5. [186]Figure 5 [187]Open in a new tab Machine learning prognostic models based on CD8Tex marker genes and hub genes identification. (A-C) The λ and coefficients of the model. (D) Overall survival KM plots of high- and low-risk groups in TCGA-BRAC. (E) Representative time-dependent receiver operating characteristic (ROC) curve of the risk model. (F) Area Under Curve (AUC) of the time-dependent ROC curve of the risk model. (G) Overall survival KM plots of high- and low-risk groups in a validation cohort from Xena-hubs Breast Cancer (Caldas 2007). (H) The protein–protein interaction network with hub genes identified. Identification of the hub genes for the CD8Tex regulation network The genes in the LASSO model were critical CD8Tex marker genes for breast cancer patients. We constructed a protein–protein interaction network using these genes and calculated the top three hub genes using MCC, DMNC, MNC, and Degree algorithms. The protein–protein interaction network presented 31 edges and the average node degree was 3.88 (Fig. [188]5H). The MCC, DMNC, MNC, and Degree of each gene were calculated as presented in [189]Supplementary Table. We then ranked the scores and obtained the average rank of each gene. Eventually, CRTAM, CLEC2D, and KLRB1 stood out as the top three hub genes in critical CD8Tex marker genes for breast cancer patients. CD8 T cell infiltration relevant to CD8Tex hub genes To evaluate the clinical value of the hub genes identified, CRTAM, CLEC2D, and KLRB1, for clinical evaluation of CD8 + T-cell infiltration levels in breast cancer, we calculated the different CD8 + T-cell infiltration levels and analyzed their correlation with CRTAM, CLEC2D, and KLRB1 in breast cancer and molecular subtypes. To avoid bias in some immune cell scores, we have employed multiple algorithms. Results showed that CRTAM, CLEC2D, and KLRB1 were positively correlated with T cell CD8 + in MCPCOUNTER, CIBERSORT, CIBERSORT-ABS, EPIC, and QUANTISEQ, but not in TIMER. (Fig. [190]6A). Figure 6. [191]Figure 6 [192]Open in a new tab CD8Tex hub marker genes for breast cancer. (A) Correlation analysis between CD8Tex hub marker genes and CD8 + T cell infiltration scores. (B) Expression of CD8Tex hub marker genes in breast cancer and normal breast tissues. TCGA and GTEx data were analyzed. (C) Expression of CD8Tex hub marker genes in breast cancer and normal breast tissues. TCGA-paired samples were analyzed. (D) KM plot of the CD8Tex hub marker genes in breast cancer. Rolls 1–3 display overall survival, disease-free survival, and progress-free interval respectively. Columns 1–3 display CLEC2D, CRTAM, and KLRB1 respectively. CD8Tex hub genes expression and survival analysis Data also suggested that CLEC2D expression slightly decreased in tumors compared to normal breast tissue, while CRTAM expression increased in tumor tissue and KLRB1 expression had no difference in tumors compared to normal tissues (Fig. [193]6B). However, in the cancer-noncancer paired samples, the comparison suggested that CRTAM expression increased in tumor tissue while KLRB1 expression decreased in the tumor (Fig. [194]6C). Given the expression analysis results for CRTAM and KLRB1 are not consistent, we cannot determine the cancer-non-cancer expression pattern of these two genes definitively. However, based on consistent findings, we can conclude that tumors exhibit elevated expression of CRTAM. This suggests that tumors may possess a distinct CD8 + T-cell infiltration feature compared to normal tissues. Moreover, the expression levels of CRTAM and KLRB1 in normal breast tissue are comparable to those in breast cancer tissue. This suggests a potentially similar infiltration pattern of exhausted CD8 + T-cells in both cancer and non-cancer tissues. This similarity in pattern could hinder the development of cancer-specific targets for treatment. KM survival analysis revealed that CRTAM, CLEC2D, and KLRB1 were all significantly associated with better overall survival of breast cancer patients. For disease-free survival, CRTAM was not associated, while CLEC2D and KLRB1 were associated with better disease-free survival. As for the progress-free interval, all three genes were associated. (Fig. [195]6D) Survival association of CD8Tex hub marker genes for breast cancer subtypes was also analyzed. Results showed that CRTAM was a good prognostic biomarker for Luminal B, HER-enriched, and basal-like breast cancer, but not for Luminal A breast cancer. CLEC2D was associated with overall survival and progress-free interval in all subtypes of breast cancer. Similar to CRTAM, KLRB1 was a good prognostic biomarker for Luminal B, HER-enriched, and basal-like breast cancer, but not for Luminal A breast cancer. However, KLRB1 was not significant in the overall survival analysis of basal-like breast cancer. (Fig. [196]7). Figure 7. [197]Figure 7 [198]Open in a new tab Survival association of CD8Tex hub marker genes for breast cancer subtypes. Clinical association of the CD8Tex hub genes in breast cancers The expression of CD8Tex hub genes has been demonstrated to be critical markers for CD8 + T cells in breast cancer. While CD8 + T cells are critical for breast cancer development, we proposed that CLEC2D, CRTAM, and KLRB1 are associated with clinical characteristics. The comparison of clinical information between high and low-expression groups suggested that CLEC2D was associated with T stage, age, and radiation therapy of breast cancer patients. CRTAM was associated with the N stage, race, age, PAM50 (molecular subtype), and radiation therapy of breast cancer patients. KLRB1 was associated with the M stage, race, age, PAM50, menopause status, and radiation therapy of breast cancer patients ([199]Supplementary Table). In the high-expression group of CLEC2D, CRTAM, and KLRB1, more patients underwent radiation therapy. In general, radiation therapy is often recommended for patients who are in relatively good condition and able to tolerate treatment. However, it can also be administered to patients with more advanced disease or compromised health status, particularly if the potential benefits of treatment outweigh the risks. Therefore, the association inferred from our data suggests that high expression of CLEC2D, CRTAM, and KLRB1 may be associated with less severe cases where radiation therapy is recommended, possibly because patients are resistant to other types of treatment such as chemotherapy. CLEC2D was associated with the T stage indicating that it results in smaller tumors, which might reflect the impact of T cells on tumor growth. Tumor-infiltrating T cells have been associated with smaller tumor size^[200]81. CRTAM was associated with a slightly higher lymph node metastasis (N stage), which might reflect the impact of T cells on lymph node metastasis. Tumor-infiltrating lymphocytes, including cytotoxic CD8 + T cells, can recognize and eliminate cancer cells within lymph nodes, thereby inhibiting the spread of metastatic disease^[201]82. The association of KLRB1 with decreased tumor metastasis (M stage) suggests a potential role for CD8Tex cells in limiting breast cancer metastasis. Taken together, these results suggest that patients with CD8Tex may have smaller tumors with locoregional metastases, a clinical scenario commonly addressed with post-surgical radiotherapy. The association of CD8Tex hub genes with clinical characteristics, particularly the link with radiotherapy, raises intriguing questions about the immune activity status in breast cancer patients. The observed associations between CLEC2D, CRTAM, and KLRB1 expression levels and various clinical parameters, including tumor stage, lymph node involvement, molecular subtype, menopause status, and receipt of radiotherapy, suggest potential implications for patient management and treatment outcomes. However, it is important to note that the observed differences were slight and may not have significant clinical implications. These findings underscore the importance of considering the immune context of tumors in clinical decision-making and the potential implications for personalized treatment approaches. However, further investigation, including validation in independent cohorts and functional studies, is needed to fully elucidate the role of exhausted T-cells and their associated biomarkers in breast cancer progression and response to therapy. CD8Tex hub genes and breast cancer heterogeneity To investigate the CD8Tex hub genes and breast cancer extra-tumour heterogeneity, we plotted the expression level of CLEC2D, CRTAM, and KLRB1 in PAM50 breast cancer subtypes. Results revealed that CLEC2D and CRTAM do not have differences among PAM50 breast cancer subtypes, KLRB1 expression in Luminal B was significantly lower than the other subtypes. (Fig. [202]8A) To explore the CD8Tex hub genes and breast cancer intro-tumour heterogeneity, we also compared the levels of CLEC2D, CRTAM, and KLRB1 in invasive ductal cancer and ductal cancer in situ using spatial sequencing data. The invasive ductal cancer and ductal cancer in situ of breast cancer tissue were delineated by an experienced clinical pathologist and the expression of CLEC2D, CRTAM, and KLRB1 in invasive ductal cancer and ductal cancer in situ of breast cancer tissue were measured at 4 separate layers. We summed the results of 4 layers for analysis and results showed that CRTAM and KLRB1 expression was not significantly different between invasive ductal cancer and ductal cancer in situ, while CLEC2D expression was significantly higher in invasive ductal cancer over ductal cancer in situ(Fig. [203]8B). Images of the spatial sequencing slide were shown in Fig. [204]8C. These data suggested that KLRB1 could mark the difference between cancer types while CLEC2D could mark the difference between breast cancer tissue types within a sample. As CLEC2D and KLRB1 are two critical T cell genes, these data also reflect the potential role of T cell infiltration in breast cancer. Figure 8. [205]Figure 8 [206]Open in a new tab Tumour heterogeneity association of CD8Tex hub marker genes. (A) Expression of CD8Tex hub marker genes in breast cancer subtypes. (B) Expression comparison of CD8Tex hub marker genes in invasive ductal cancer and ductal cancer in situ by spatial sequencing. C. Immages of the spatial expression of CD8Tex hub marker genes in invasive ductal cancer and ductal cancer in situ. Pan-cancer expression and survival analysis of CD8Tex hub genes To further expand the application of the CD8Tex hub genes for pan-cancer use, we systematically explored the expression of CLEC2D, CRTAM, and KLRB1 across cancer types, which is provided in the [207]Supplementary results. Discussion Tumor-associated immune cells may promote or inhibit cancer cells depending on their function and role in immunity. Many previous studies have suggested the impact of the immune environment on cancers^[208]30,[209]32,[210]34,[211]36–[212]38,[213]61,[214]83–[215]86 . The novelty of this study lies in our approach to identifying marker genes for CD8Tex using multiple immune single-cell sequencing datasets. By conducting intersection analysis across multiple independent single-cell datasets, we aimed to ensure the stability and reproducibility of the identified biomarker gene sets. Furthermore, we identified hub genes for these biomarkers, namely CLEC2D, CRTAM, and KLRB1, which have been less explored in the context of breast cancer. Of significant importance is our discovery of the association between these hub genes and various aspects of breast cancer immunity, clinical features, and tumor heterogeneity. This comprehensive analysis sheds light on previously overlooked molecular players in breast cancer and highlights their potential relevance for understanding disease progression and treatment response. Overall, our study offers valuable insights into the immune landscape of breast cancer and uncovers potential targets for further investigation and therapeutic intervention. In the CD8Tex-based survival model we constructed, the risky genes include CLEC2D, CRTAM, EZR, HLA-DRB1, NKG7, SLA2, SLC7A5, and SYTL3, while the protective genes include CTSW, GBP2, IFNG, KLRB1, MT-ND1, PSME2, SH2D2A, and TOX. Most of these genes are reported for the first time as prognostic biomarkers in breast cancer, although some have been previously associated with breast cancer treatment or the breast cancer microenvironment. CRTAM enhances the immune response against tumors in triple-negative breast cancer by increasing the infiltration of CD8 + T cells^[216]87. An integrative multi-omics analysis identifies a gene signature related to metastasis, highlighting the potential oncogenic role of EZR in breast cancer^[217]88. Breast cancer is associated with increased allele frequencies of HLA-DRB1*11:01 and HLA-DRB1*10:01 in a population of patients from Central Italy^[218]89; however, this pertains to genetic mutation rather than gene expression. Targeting the glutamine metabolic reprogramming mediated by SLC7A5 enhances the effectiveness of anti-PD-1 therapy in triple-negative breast cancer^[219]90. GBP2 serves as a prognostic biomarker and is linked to immunotherapeutic responses in gastric cancer^[220]91; however, its association with breast cancer has not been reported. As a single-cell signature, low expression of KLRB1 predicts poor survival outcomes and is associated with immune infiltration in breast cancer^[221]92. It is also reported that inhibiting KLRB1 expression is associated with impaired cancer immunity, leading to cancer progression and poor prognosis in patients with breast invasive carcinoma^[222]93. Data suggested that PSME2 was associated with immune-hot tumors in breast cancer and with a favorable therapeutic response to immunotherapy^[223]94. Overall, our results align with previous studies regarding some of these genes. However, we emphasize their prognostic value and have included additional novel biomarker genes to enhance the overall prognostic accuracy of our model. The cell types identified with biomarker gene sets all play important roles in the immunity of cancers. Tumor-infiltrating B cells are an essential part of the antitumor action of T cells^[224]95. CD8 + T cells (CD8T) and CD4 + T cells have a similar differentiation process but once differentiated, the CD4 + cells or the CD8 + cells are fixed and function differently^[225]96. Conventional CD4 + T cells (CD4Tconv) lack phenotypic markers that distinguish these cells from FoxP3 + T regulatory cells (Treg)^[226]95. CD8T includes cytotoxic T cells, which directly target cancer cells and induce apoptosis of cancer cells^[227]97. During cancer progression, cancer-associated fibroblasts, M2 macrophage, and Treg might negatively regulate anti-cancer immune responses mediated by CD8 + T cells^[228]98. Macrophages are monocytes that have migrated from the bloodstream into tumor tissues, which serve as scavengers, maintaining homeostasis in tumors ^[229]99. The effect of myofibroblast on cancer immunity has been less reported. A study has shown that myofibroblast regulated type I collagen thereby impacting cancer immunity^[230]100. It was reported that the tumor-infiltrating T cells directly in contact with the cancer cells proliferate more frequently compared with T cells in the stroma^[231]101, hence, proliferating T cells (Tprolif) were thought to be a subline of T cells. Exhausted CD8 T Cells (CD8Tex) is a set of sub-cell lines of CD8 T Cells that persist in cancer but are unsuccessful in killing cancer cells^[232]102. The presence and the type of CD8Tex have been thought to be critical for the response of cancer to some immune checkpoint blockades^[233]103. Therefore, the abundance of these immune-related cells in tumors is a critical factor for tumorigenesis as well as cancer immune therapy. The analysis of the immune association of subtypes based on marker genes for different cell types suggested that the subtyping systems performed well in the separation of patients with different immune relevance. The subtyping identified C2 as standing out from the other patients. Interestingly, the subtypes also performed well in the separation between ICB responders and ICB non-responders. These ICB predictions were based on TIDE scores. However, the TIDE score was designed to predict response to immune checkpoint blockade, including anti-PD1 and anti-CTLA4, for melanoma and NSCLC. The use of TIDE in this study is based on the assumption that breast cancer has a similar immune system as melanoma and NSCLC. Hence, one should be cautious in interpreting these results. Further study to define a prediction model specific to breast cancer is required when more breast cancer immune therapy cohort data are available for model training. In the comparison of survival models, we concluded that the Tprolif model was the best overall. However, the other model also provided a rather comparative prognostic power. Nevertheless, the main purpose of this analysis is to narrow down the critical genes for the subsequent study. Another critical limitation is the observations in special sequencing data might be subjected to bias. Firstly, it's essential to note that the resolution of the spatial sequencing results is relatively low, and some identified blocks may encompass both invasive ductal carcinoma and ductal carcinoma in situ within breast cancer tissue. Secondly, the classification of invasive ductal carcinoma and ductal carcinoma in situ is determined by a licensed clinical pathologist using her clinical expertise and knowledge, rather than relying on a standard computational pipeline for detection. Additionally, drawing conclusions solely based on these biomarkers for T cell infiltration levels may carry significant risk. Hence, our data offer only initial insights into the biomarkers, and further investigation into the association of CD8Tex heterogeneity is warranted. In this study, three critical hub genes were identified, including CLEC2D, CRTAM, and KLRB1. NK cells play a central role in the immune system, particularly in the recognition and elimination of malignant and infected cells. 2B4 (CD244, SLAMF4) and CS1 (CD319, SLAMF7) are NK cell receptors that control their cytotoxic function. Lectin-like transcript 1 (LLT1), a member of the C-type lectin-like domain family 2 (CLEC2D), induces IFN-g production but does not directly regulate cytolysis. LLT1, which is expressed in other cells, acts as a ligand for the NK cell inhibitory receptor NKRP1A (CD161) and suppresses NK cytolytic activity. Research has been conducted on novel therapies that target these receptors to enhance NK cell effector functions^[234]104. Hence, CLEC2D is associated with the NK-CD8 + cell regulation. On the other hand, CD4( +)T cells possessing CRTAM can differentiate into CD4( +)CTLs. A study found that after activation, CRTAM( +) CD4( +) T cells secrete IFN-γ and express CTL-related genes such as eomesodermin, Granzyme B, and perforin, which suggest that CRTAM( +) T cells are the precursor of CD4( +)CTLs^[235]105. Additionally, ectopic expression of CRTAM in T cells induces IFN-γ production, expression of CTL-related genes, and cytotoxic activity. This is further supported by the fact that CRTAM-mediated intracellular signaling is required for the induction of both CD4( +)CTLs and IFN-γ production^[236]105. Furthermore, these CRTAM( +) T cells traffic to mucosal tissues and inflammatory sites and develop into CD4( +)CTLs, which can either mediate protection against infection or induce inflammatory response depending on the situation. These results demonstrate that CRTAM is a key factor in the differentiation of CD4( +)CTLs through the induction of Eomes and CTL-related genes^[237]105. As for KLRB1, this is a gene that encodes CD161 protein. CD161 has been proposed as a pan-cancer immune checkpoint^[238]106. Therefore, it is not surprising that, in this study, we demonstrated that these two genes were closely associated with the cancer immune environment and can be used to predict the immune therapy response. The mining and analysis of multi-omic profiling data enable bioinformatic study with a rather comprehensive understanding of genes in cancers. In this study, combining single-cell RNA sequencing (scRNA-seq), bulk RNA sequencing (bulk RNA-seq), and spatial transcriptomics data would indeed constitute a multi-omic approach, provided that all of these datasets are derived from RNA sequencing (RNA-seq) technologies. While each type of RNA-seq data captures gene expression information at different scales and resolutions, integrating them allows for a more comprehensive understanding of biological systems. Integrating scRNA-seq, bulk RNA-seq, and spatial transcriptomics data allows researchers to analyze gene expression dynamics at multiple scales, from individual cells to tissue-level spatial organization. This multi-omic approach enables the identification of cell types, subpopulations, and spatially defined gene expression patterns, facilitating a deeper understanding of complex biological processes and disease mechanisms. The perspective of this work is to deepen our understanding of the role of CD8Tex in breast cancer. While immune therapy is not yet widely utilized in breast cancer treatment, there is promise in using T cell-based immune therapy for some refractory breast cancer cases. CD8 + Tex cells might also be developed as a drug target, as numerous studies have reported interactions involving genes, drugs^[239]107, diseases, cells, and even the microbiome^[240]108–[241]112. Whether CD8 + Tex cells play a role in these interactions can be explored in future research. A recent study found an association between breast cancer and menopause^[242]113. Further investigation is needed to determine whether menopause affects CD8 + T cells in breast cancer. One of the biggest issues in breast cancer is drug resistance^[243]114. Addressing the challenges in this field necessitates exploration. Our identification of hub genes for these biomarkers, such as CLEC2D, CRTAM, and KLRB1, and their potential association with known biomarkers of breast cancer cells, such as CHEK2^[244]115 and TP53^[245]116, lays the groundwork for potential clinical applications in breast cancer management. Further research is needed to explore how these biomarkers can be effectively incorporated into clinical practice to improve patient outcomes. Additionally, non-invasive early detection of breast cancer has recently shown promising advancements^[246]117. The biomarkers discovered may have the potential to contribute to the non-invasive detection of this cancer type. Supplementary Information [247]Supplementary Information.^ (3.1MB, docx) Acknowledgements