Abstract Recent advances in single-cell transcriptome sequencing and computational analysis methods have improved our understanding of cellular heterogeneity. However, associating different cell subsets with phenotypes remains challenging. Recently, Ren et al. introduced PENCIL, a supervised learning framework incorporating gene selection to discern phenotype-relevant cells. To assess PENCIL’s reproducibility and transferability, we conducted a comprehensive evaluation across 12 single-cell RNA sequencing datasets representing four distinct phenotypes. We identified a few caveats with the original version of PENCIL, such as sensitivity to input perturbation, the correction of which contributed to PENCIL’s enhanced reproducibility. We highlight that boosting PENCIL’s cell subsets identification with gene set variation analysis creates a cytotoxic T cell immunotherapy response signature (CyTIR) predictive of immune checkpoint blockade response in skin cancer across multiple datasets, with an area under curve >0.75 and accuracy >0.71. Overall, our assessments enhance PENCIL’s reproducibility and utility, further extending its potential for identifying phenotype-relevant cell subsets in diverse biomedical applications. __________________________________________________________________ The advancements in single-cell transcriptome sequencing protocols have been accompanied by remarkable progress in computational analysis methods. These computational methods have facilitated the exploration of complex aspects of biology, including cell type composition, gene regulation, transcriptional intra-tumour heterogeneity and developmental dynamics^[31]1–[32]4. However, the identification of phenotype-relevant cell subsets still poses an important challenge. To address this issue, Ren et al. developed PENCIL^[33]5, a pioneering supervised learning framework that incorporates a supervised gene selection step in discerning cells associated with categorical or continuous phenotypes. This new approach holds promise for improving the classification of cells into distinct phenotype-relevant groups compared to previous methods relying on unsupervised gene selection (for example, see refs. [34]^6^–[35]^8). Nevertheless, a rigorous evaluation of its reproducibility and transferability is crucial to establishing PENCIL’s reliability and broader applicability. In this study, we address this gap by comprehensively assessing PENCIL’s performance across four distinct phenotypes, focusing on the CD8^+ T cells from 12 single-cell RNA sequencing datasets. Additionally, we introduce a new biomedical application that utilizes gene set variation analysis (GSVA) to generate phenotype-predicting signatures from the differentially expressed genes found in phenotype-relevant cell subsets. Our assessment seeks to validate the important findings previously reported in Ren et al.^[36]5, while additionally exploring the potential of this approach to identify phenotype-relevant cell subsets for various future potential biomedical applications of interest. Sensitivity of PENCIL to model parameters and input data To assess the reproducibility of PENCIL in identifying phenotype-relevant cell subsets using real data, we conducted our first analysis on the primary dataset utilized by Ren et al. in their original study^[37]5. This dataset comprised single-cell RNA sequencing data from 32 melanoma patients treated with immune checkpoint blockade (ICB) therapy, resulting in a total of 48 samples, including 17 responder samples and 31 non-responder samples^[38]9. In this dataset, there were 6,350 manually annotated CD8^+ T cells, with 4,301 from non-responders and 2,049 from responders ([39]Fig. 1a, left), which are used as input of PENCIL in the original study by Ren et al. Fig. 1 |. Sensitivity testing of PENCIL to model parameters and input data. [40]Fig. 1 | [41]Open in a new tab a, Sensitivity of PENCIL to the class_weights parameter setting: left, true phenotypic labels of cells; middle, predicted cell labels using default class_weights setting (ratio of non-responder (NR) to responder (R) is 1:1); right, changes of predicted number of ‘R cells’, ‘NR cells’ and ‘rejected (Rej.) cells’ with a series of different values of the class_weights parameter. The value x of class_weights represents the class weights set in PENCIL as 1:x (NR:R). The dashed line indicates the optimal class_weights value (NR:R = 1:2). b, Sensitivity of PENCIL to the input gene order. The diagram, by the technique of Uniform Manifold Approximation and Projection (UMAP), illustrates the predicted R, NR and Rej cells with input genes ordered by gene name (A to Z). The Venn plots from left to right illustrate the overlap of predicted cell labels with input genes ordered by gene name (A to Z), gene expression variance (low to high) and gene expression level (low to high), compared to the original gene order (gene expression variance from high to low). All simulations were conducted with class_weights set to 1:2. c, Sensitivity of PENCIL to the input cell identity: left, the overlap of CD8^+ T cells annotated by three different automatic tools, compared to original manually annotated CD8^+ T cells; right, the overlap of predicted R, NR and Rej cells when using CD8^+ T cells annotated by CellTypist, SingleR and ScType as input, compared to original manually annotated CD8^+ T cells. The data are from ref. [42]9. Sensitivity of PENCIL to the class_weights parameter setting In our initial test, with the default model parameter settings, the PENCIL algorithm predicted an unexpectedly low number of responder-related cells (n = 9; [43]Fig. 1a, middle), which substantially differed from the count reported by the original authors (n = 1,243). To resolve this discrepancy, we reached out to the original authors, who kindly shared their source code with us. We found that the model parameter class_weights, utilized to account for the imbalance between the number of cells from different true phenotypic labels, was set to 1:2 (non-responder:responder, which is approximately inversely proportional to the cell counts of both categories, that is, 4,301:2,049), rather than the default setting of 1:1. By adopting this adjusted setting, we achieved a more reasonable outcome, even though the exact number of non-responder and responder-related cells still deviated slightly from the authors’ reported results ([44]Fig. 1a, right). Several factors contribute to this discrepancy. First, the inherent randomness in deep learning could lead to variations when running on different machines. Second, there was a flaw in the data preprocessing of the authors’ code, as they directly applied log normalization on the log[2](TPM + 1) data without prior conversion to TPM expression values, where TPM is transcripts per million. Finally, we explored the influence of the class_weights parameter on the resulting cell numbers over a range of values and found that this parameter had a considerable impact on the predicted number of cells for different categories ([45]Fig. 1a, right). Sensitivity of PENCIL to the input gene order Although PENCIL accepts the gene expression data in both input formats—CSV and H5A—we were surprised to find that the results obtained were contingent on the format used. Studying this further, we found that the data given in these two formats differed in the gene order. Specifically, in the H5AD format, genes were sorted alphabetically by their names by default (A to Z), whereas in the CSV format, they were arranged based on their expression variance (from high to low). Altering gene order in the CSV input data enabled us to reproduce the fluctuations in the algorithm’s predictions ([46]Fig. 1b, left). We further explored alternative gene orderings, such as sorting by low-to-high expression variance or low-to-high expression level. Interestingly, we observed that the gene order in the input file could indeed impact the identity and count of responder cells, non-responder cells and rejected cells predicted by PENCIL ([47]Fig. 1b). Upon examining the code of the PENCIL algorithm and consulting with the original authors, we concluded that the observed discrepancy is attributed to the fact that, in the PENCIL framework, input genes are assigned initial random weights for the subsequent optimization step. Consequently, altering the gene order results in different weights being assigned to each gene, thereby yielding varied outcomes. Sensitivity of PENCIL to the input cell identity Given that there are many different cell annotation methods and that different methods may assign different identities to cells, we investigated PENCIL’s sensitivity to input cell identity. To this end, we compared CD8^+ T cells annotated by three automatic tools—CellTypist^[48]10, SingleR^[49]11 and ScType^[50]12—with manually annotated CD8^+ T cells. Although all methods identified a substantial number of CD8^+ T cells, CellTypist demonstrated the highest agreement with the original manual annotation (overlap >65%; [51]Fig. 1c, left) and the highest concordance in PENCIL-predicted responder, non-responder and rejector labels (overlap >40%; [52]Fig. 1c, right). Thus, CellTypist was selected for annotating datasets lacking cell type information. As more annotation tools emerge, benchmarking and selecting the most suitable one will become crucial for downstream analysis, as demonstrated in ref. [53]13. Learning sample phenotypes via phenotype-relevant cells Next, we utilized PENCIL to investigate the underlying mechanisms driving phenotypic differences and predict sample-level phenotypes based on single-cell-level phenotypic assignments. Our focus was on CD8^+ T cells in each dataset; to ensure robustness, samples with fewer than 50 CD8^+ T cells were excluded from subsequent sample-level analyses. Predicting ICB response in skin cancer While numerous signatures have been developed for predicting cancer ICB response, the majority operate on a bulk level. In this study, we investigated the feasibility of leveraging PENCIL to develop a single-cell-level signature for predicting ICB response. To achieve this, we trained PENCIL using the aforementioned melanoma samples, with ICB response as the phenotype of interest. Subsequently, we employed the trained model to predict the number of ICB-response-relevant cells in two distinct test sets: (1) tumour CD8^+ T cells from 15 patients with basal or squamous cell carcinoma ([54]GSE123813)^[55]14 and (2) circulating CD8^+ T cells from 20 metastatic melanoma patients ([56]GSE166181)^[57]15. On the training data, PENCIL showed an outstanding ability to predict sample-level ICB response status (responders versus non-responders) from the ratio of responder-related cells, achieving a perfect area under curve (AUC) of 1 and accuracy of 1 ([58]Fig. 2a, left). However, on the test sets, performance was only moderate. While responder samples had significantly higher ratios of responder-associated cells (P = 0.014–0.016), accuracy in identifying responder samples using a 50% cutoff for this ratio ranged just 0.48–0.54 ([59]Fig. 2a, middle, right). Fig. 2 |. Predicting ICB response in skin cancer using a newly curated signature derived from ICB-response-related CD8 + T cells. [60]Fig. 2 | [61]Open in a new tab a, Ratios of predicted non-responder-related (NR, blue) and responder-related CD8^+ T cells (R, red) by PENCIL in each sample. The AUCs of prediction were calculated using the proportion of R cells in each sample. The accuracy (acc.) was calculated using a cutoff of 50% of the proportion of R cells in each sample. b, Dot plots showing the expression levels of the top 5 significantly differentially expressed genes of PENCIL-predicted cell labels. The size of the dot encodes the percentage of cells expressing each gene, and the colour encodes the average expression level. An ICB response signature was curated and defined as the difference of the GSVA scores between the two sets of differentially expressed genes. c, The receiver operating characteristic curves and AUCs using this score to predict ICB response in the training and test sets. d, The ICB response signature score distribution in R- and NR-samples in different datasets. the accuracy for prediction of ICB response was calculated using cutoff of ICB response signature score at 0. The blue bars are signature scores of non-responders, the red bars are signature scores of responders. In a and d, two-tail Wilcoxon test P values are displayed. In d, the upper and lower boundaries signify the first and third quartiles, respectively, the central line denotes the median, and the whiskers stretch to the most distant data points not classified as outliers (within 1.5 times the interquartile range). Despite this moderate performance, we explored whether PENCIL’s phenotype-relevant cell classifications could still be utilized to build a more accurate predictor, by analysing differential gene expression rather than relative abundance of the phenotype-relevant cells. To this end, we identified genes significantly differentially expressed between the PENCIL-predicted responder and non-responder cell subsets. We then curated a cytotoxic T cell immunotherapy response signature (CyTIR), defined as the difference in GSVA scores between the top 5 responder-related genes (IL7R, CCR7, TSPYL2, GPR183 and HSPH1) and the top five non-responder-related genes (IFI6, WARS, GZMB, CCL3 and ISG15), ranked by fold change from the differential expression analysis ([62]Fig. 2b). Remarkably, the newly curated ICB response signature, CyTIR, successfully predicted sample-level ICB response from single-cell data in both the training and independent test sets (AUC = 0.75–0.86; [63]Fig. 2c). With a cutoff of 0, the CyTIR score achieved an accuracy of identifying ICB responder samples ranging from 0.71 to 0.78 ([64]Fig. 2d). In contrast, attempting to use the significantly differentially expressed genes identified in the original PENCIL work (Figure 5e in Ren et al.^[65]5) to calculate the GSVA ICB response signature score resulted in lower predictive performance (AUC = 0.66–0.81, accuracy = 0.60–0.73; see [66]Supplementary Fig. 1a), attributed to the aforementioned flawed data preprocessing in the original analysis. We also tested the GSVA ICB response signature using top differentially expressed genes in CD8^+ T cells from responder samples versus non-responder samples, without using PENCIL. The performance was poorer (AUC = 0.59–0.86, accuracy = 0.57–0.78; see [67]Supplementary Fig. 1b), indicating that by rejecting irrelevant cells, PENCIL can recognize stronger gene signals, yielding more accurately predicted outcomes. Additionally, we compared CyTIR with multiple well-established signatures for ICB response prediction, including the exhausted T cell signature^[68]16, the cytotoxic T cell signature^[69]17, the tumour inflammation signature^[70]18, the interferon-γ signature^[71]19, the CXCL13 + CD8^+ T cell signature^[72]20, the tumour-reactive CD8^+ T cell signature^[73]21 and the tumour-reactive tissue-resident memory signature^[74]22 ([75]Supplementary Fig. 1c–[76]i). Among these, CyTIR emerged as the most predictive and robust one. Interestingly, the cytotoxic T cell signature in CD8^+ T cells was negatively correlated with the ICB response (AUC = 0.17–0.28, accuracy = 0.27–0.39; see [77]Supplementary Fig. 1d), contrasting with that in bulk RNA sequencing data. This observation suggests that CD8^+ T cells expressing high levels of cytolytic effectors may exhibit characteristics of highly exhausted T cells, including the depletion of memory phenotypes, potentially leading to unfavourable outcomes. Consistently, the cytolytic effector GZMB indeed ranked among the top significantly upregulated genes in PENCIL-predicted non-responder cells compared to responder cells ([78]Fig. 2b). Predicting cell origin in head and neck squamous cell carcinoma To further assess PENCIL’s performance on another cancer and phenotype, we applied it to CD8^+ T cells from tumour and peripheral blood mononuclear cell (PBMC) samples of head and neck squamous cell carcinoma (HNSCC) patients. In this case, the objective was to identify tumour-originated and PBMC-originated CD8^+ T cell samples. Following training on one dataset ([79]GSE139324; [80]Figs. 3a and [81]3b, top)^[82]23, we tested PENCIL’s performance on five independent test datasets, including [83]GSE164690 (ref. [84]^24), [85]GSE162025 (ref. [86]^25), [87]GSE180268 (ref. [88]^26), [89]GSE182227 (ref. [90]^27) and [91]GSE200996 (ref. [92]^28). Impressively, PENCIL has demonstrated a reassuringly overall high discrimination ability between tumour and PBMC-originated samples in both the training set and all five test sets, using a cutoff of a 50% ratio of tumour-originated cells (accuracy = 0.89–1; [93]Fig. 3b). While human papillomavirus (HPV) infection is a major subtype driver in HNSCC, our analysis did not reveal a significant influence of HPV infection status on distinguishing CD8^+ T cell origin by PENCIL ([94]Supplementary Fig. 2). Additionally, when using significantly differentially expressed genes directly from CD8^+ T cells in tumour versus PBMC samples without employing PENCIL, the predictive power for CD8^+ T cell origin was poorer. This was particularly evident in datasets [95]GSE180268 and [96]GSE182227, where accuracy hovered around 0.55 ([97]Supplementary Fig. 3a). Fig. 3 |. Characterizing tumour-originated and PBMC-originated CD8^+ T cells. [98]Fig. 3 | [99]Open in a new tab a, UMAP showing the cells using the top 2000 high variable genes for the training data ([100]GSE139324). Cell numbers are shown in parentheses. b, Ratios of PENCILpredicted tumour-originated CD8^+ T cells (tumour, red) and PBMC-originated CD8^+ T cells (PBMC, blue) in each sample. c, the top ten enriched pathways of significantly highly expressed genes in PENCIL-predicted tumour-cells (left) and PBMC-cells (right). d, UCell score of three well-known anti-tumour immune signatures. e, Expression levels of three immune checkpoint genes. In d and e, two-tail Wilcoxon test P values are displayed. logFC, log fold change; acc., accuracy; adj. P, adjusted P; ER, endoplasmic reticulum; GTP, guanosine-5′-triphosphate; TCR, T cell receptor. To study the biological differences between tumour- and PBMC-originated CD8^+ T cells in the dataset described above, we performed a pathway enrichment analysis on the PENCIL-identified relevant cells. As expected, tumour-originated CD8^+ T cells exhibited enrichment in anti-tumour functions such as T cell receptor signalling, antigen processing and signalling by interleukins. In contrast, PBMC-originated CD8^+ T cells displayed greater enrichment in fundamental cellular functions, such as translation ([101]Fig. 3c). Additionally, we investigated the enrichment of well-known anti-tumour immune signatures and immune checkpoint genes. Compared to the PBMC, tumour-originated CD8^+ T cells displayed significantly higher enrichment of the tissue-resident memory (TRM), exhaustion and interferon-gamma (IFNG) signatures ([102]Fig. 3d). Notably, the tumour-originated CD8^+ T cells have a significantly higher expression of immune checkpoint genes TIGIT, LAG3 and activation surface marker HLA-DRB1 than the PBMC-originated ones annotated by PENCIL, capturing a pronounced activated yet exhausted phenotype of these cells in the tumour microenvironment ([103]Fig. 3e). In addition, PENCIL-predicted tissue-originated CD8^+ T cells exhibit significantly higher UCell scores for TRM, exhaustion and IFNG signatures, along with significantly elevated expression levels of the three abovementioned genes, compared to using CD8^+ T cells directly from the tumour samples without using PENCIL ([104]Supplementary Fig. 3b). Collectively, these results show that, by excluding a proportion of irrelevant cells, PENCIL can discern stronger gene signals, leading to more robust outcomes. PENCIL may overfit data in some cases We further extended the application of PENCIL to explore additional phenotypes and datasets. In one experiment, we evaluated PENCIL’s ability to predict sex-related CD8^+ T cells in skin cancer, and in another, we tested its ability to predict HPV-infection-related CD8^+ T cells in HNSCC. Intriguingly, on the training data, PENCIL consistently demonstrated outstanding performance in predicting phenotype-relevant cells, quantified by the fraction of cells labelled with a matching label to that of their host samples (AUC = 0.99–1; accuracy = 0.96–1; [105]Supplementary Fig. 4). However, its performance on the test data in both experiments was notably suboptimal, as it revealed statistically insignificant cell ratio differences between samples with different phenotypic labels (P = 0.39–1) and exhibited variable predictive power for the sample-level phenotype (AUC = 0.32–0.67, accuracy = 0–0.76; [106]Supplementary Fig. 4). These findings suggest that, while PENCIL excels in predicting phenotype-relevant cells on the training data, its generalization to independent test datasets requires further investigation and improvement to enhance its predictive accuracy. One suggestion for its further improvement could be retraining a model on PENCIL selected cells and incorporating additional measures to mitigate the risk of overfitting, even at the cost of reducing the training accuracy. Discussion In this study, we successfully reproduced the main results of Ren et al.^[107]5 using the provided code and data. While bearing this in mind, our results indicate that PENCIL may be sensitive to changes in input parameters and data, emphasizing the importance of carefully handling the latter when applying the algorithm. Our recommendations for those going forward include: (1) set the class_weights parameter for each category to be inversely proportional to the cell counts of that category, (2) arrange the order of genes in the expression input file by expression variance from high to low and (3) annotate the cell types carefully by employing proper annotation tools and cross-referencing the findings with biological knowledge. Taken together, these recommendations may aid researchers in maximizing the potential of PENCIL for single-cell analysis. Our study has showcased the utility of PENCIL in learning sample-level phenotypic differences by leveraging phenotype-relevant single cells across different datasets. Notably, augmenting PENCIL’s cell subset identification with GSVA led to the creation of the CyTIR signature for predicting ICB response in skin cancer. CyTIR is defined as the difference in GSVA scores between the top five responder-related genes (IL7R, CCR7, TSPYL2, GPR183, HSPH1) and the top five non-responder-related genes (IFI6, WARS, GZMB, CCL3, ISG15). Remarkably, compared to predictions based on differential cell abundance from PENCIL alone, CyTIR achieved gains of 4–5% in AUC and 31–48% in accuracy. Interestingly, our analysis revealed that GZMB, typically a responder-related gene, is down-regulated in responder-related CD8^+ T cells. This observation aligns with a seminal study demonstrating significant down-regulation of GZMB in CD8^+ T cells within tumours having tertiary lymphoid structures—a feature correlated with enhanced immunotherapy outcomes and survival^[108]29; conversely, the upregulation of GZMB, alongside other immune checkpoint genes like TIM3 and PD1, often marks a dysfunctional state in CD8^+ T cells^[109]9,[110]29. Importantly, although initially derived from melanoma tumour samples, CyTIR demonstrated high effectiveness in predicting responses in distinct datasets, including tumour samples from basal or squamous cell carcinoma and PBMC samples of metastatic melanoma. This cross-dataset applicability underscores CyTIR’s robustness. Further validation of CyTIR across additional skin cancer datasets and other cancer types will be instrumental in establishing its utility and adaptability in clinical settings. Finally, our study highlights a few practical considerations when using PENCIL. First, the computational runtime can vary substantially depending on hardware. While PENCIL runs in minutes with graphics processing unit acceleration for most datasets, it may take hours without such acceleration. Second, the performance of PENCIL is sensitive to model parameters and input data, necessitating careful parameter tuning and data handling. Third, PENCIL can generate predictions for various combinations of input cell types and phenotypic labels, but in certain cases, it may overfit the data due to the supervised gene selection during model training. To address these challenges, we recommend a stepwise approach. Primarily, as much as possible, researchers should curate multiple datasets and thoroughly evaluate PENCIL’s performance on independent test data before proceeding with downstream analyses on the resulting phenotype-relevant cells. By exercising prudence and diligence in its application, PENCIL can be successfully harnessed effectively to uncover important and interesting phenotype-relevant biology from single-cell RNA sequencing data. Methods Data preprocessing For all the analyses in this study, we focused on CD8^+ T cells. CD8^+ T cells of [111]GSE120575 were annotated by the original study of ref. [112]9. Only protein-coding genes were used for downstream analyses. As a result, 6,350 CD8^+ T cells and 20,645 genes were retained. The datasets [113]GSE123813 (ref. [114]^14), [115]GSE166181 (ref. [116]^15), [117]GSE115978 (ref. [118]^30), [119]GSE144236 (ref. [120]^31), [121]GSE145328 (ref. [122]^32), [123]GSE139324 (ref. [124]^23), [125]GSE162025 (ref. [126]^25) and [127]GSE180268 (ref. [128]^26) were processed, and the annotation of CD8^+ T cells was carried out using the Tumour Immune Single Cell Hub^[129]33. The annotated data were downloaded from [130]http://tisch.comp-genomics.org/home/. The datasets [131]GSE164690 (ref. [132]^24), [133]GSE182227 (ref. [134]^27) and [135]GSE200996 (ref. [136]^28) were downloaded from The Gene Expression Omnibus (GEO) ([137]https://www.ncbi.nlm.nih.gov/geo/) and the CD8^+ T cells were annotated with CellTypist^[138]10 with the majority-voting model parameter setting, using Immune_All_Low as the reference model. In the initial experiment testing different cell annotation methods, SingleR^[139]11 and ScType^[140]12 based cell type annotation was also used, using celldex::BlueprintEncodeData() and ScTypeDB_full::Immune system as the reference datasets, respectively. After obtaining the CD8^+ T cells, we normalized the raw counts with log normalization using the NormalizeData() function followed by scaling and centring the data using the ScaleData() function from the Seurat package ([141]https://satijalab.org/seurat/; v.4.3.0) in R ([142]https://www.r-project.org/; v.4.3.0). The phenotypic information for each dataset was provided with the original study. Statistical analysis Differential cell abundance analysis. For the binarized phenotypic labels (denoted as P1 and P2 below), PENCIL classified the input cells into three groups: P1 cells, P2 cells and rejected cells. To ensure robustness, we removed samples containing less than 50 CD8^+ T cells. For each dataset, the original sample size and sample size after filtering are summarized in [143]Supplementary Table 1. Subsequently, we calculated and plotted the ratios of P1 cells and P2 cells for the remaining samples. To test the differential abundance of P2 cells between the P1- and P2-samples, we conducted a two-tail Wilcoxon test. For predicting sample-level phenotypic labels based on phenotype-relevant cell abundance, we utilized the P2-cell ratio in each sample as a predictor. To assess the accuracy of prediction, we employed a 50% P2-cell ratio cutoff, following the approach used by Ren et al.^[144]5. Gene set variation analysis. In our analysis, we used the FindAllMarkers() function from the Seurat package in R to identify marker genes in the P1 cells, P2 cells and rejected cells. The function was set with parameters: only.pos = T, min.pct = 0.25 and logfc.threshold = 0.25. Subsequently, we filtered the genes with adjusted P values of less than 0.05. To rank the significant marker genes, we calculated their average log fold change. Next, we calculated the aggregated gene expression of all CD8^+ T cells for each sample using the AverageExpression() function in the Seurat package. For GSVA analysis, we calculated GSVA scores for the top five P1- and P2-marker genes separately on the scaled aggregated gene expression of all CD8^+ T cells in each sample, utilizing the GSVA^[145]34 package (v.1.48.2). To test the differential GSVA scores between the P1- and P2-samples, we conducted a two-tail Wilcoxon test. ICB response signature analysis. We compared our newly derived GSVA signature with seven established ICB response signatures. The established signatures and their definitions are elaborated below. (1) The exhausted T cell signature^[146]16 is defined as the GSVA score of the PDCD1, CTLA4, LAG3, HAVCR2 and TIGIT genes. (2) The cytotoxic T cell signature^[147]17 is defined as the GSVA score of the GZMA and PRF1 genes. (3) The tumour inflammation signature^[148]18 is defined as the GSVA score of the CD276, HLA-DQA1, CD274, IDO1, HLA-DRB1, HLA-E, CMKLR1, PDCD1LG2, PSMB10, LAG3, CXCL9, STAT1, CD8A, CCL5, NKG7, TIGIT, CD27 and CXCR6 genes. (4) The interferon-γ signature^[149]19 is defined as the GSVA score of the IDO1, CXCL10, CXCL9, HLA-DRA, STAT1 and IFNG genes. (5) The CXCL13 + CD8^+ T cell signature^[150]20 is defined as the ratio of CXCL13 + CD8^+ T cells to all CD8^+ T cells. (6) The tumour-reactive CD8^+ T cell signature^[151]21 is defined as the ratio of CD39 + CD8^+ T cells to all CD8^+ T cells. (7) The tumour-reactive TRM signature is defined as the ratio of CD39 + CD103 + CD8^+ T cells to all CD8^+ T cells. Pathway enrichment analysis. We conducted pathway enrichment analysis on the significant marker genes identified for the P1 cells and P2 cells, as described earlier. For this analysis, we utilized the enrich-Pathway() function from the ReactomePA^[152]35 package (v.1.44.0) with a P-value cutoff of 0.05. We presented the top ten significantly enriched pathways in both the P1 cells and P2 cells. Anti-tumour immune signature enrichment analysis. During the analysis of tumour-related CD8^+ T cells in HNSCC, we conducted an examination of the enrichment of anti-tumour immune signatures in both tumour-related CD8^+ T cells and PBMC-related CD8^+ T cells. Specifically, we defined the TRM signature using positive marker genes such as IL17A, IL10, IL2, IFNG, DUSP6, PDCD1, CRTAM, ITGA1, ITGAE, CD69, CD101 and CXCR6, along with negative marker genes including SELL, S1PR1, S1PR5, CX3CR1, MKI67, KLF2 and KLF3, following the approach used in ref. [153]36. Moreover, we defined the T cell exhaustion signature using positive marker genes such as PDCD1, CTLA4, LAG3, HAVCR2 and TIGIT, as outlined by ref. [154]16. Additionally, the interferon-gamma signature (IFNG) was established using positive marker genes like IDO1, CXCL10, CXCL9, HLA-DRA, IFNG and STAT1, following ref. [155]37. To calculate the enrichment scores for these signatures, we utilized the AddModuleScore_UCell() function from the UCell^[156]38 package (v.2.4.0). Supplementary Material Supplementary Information [157]NIHMS2065244-supplement-Supplementary_Information.pdf^ (729.1KB, pdf) Acknowledgements