Abstract

   Recent advances in single-cell transcriptome sequencing and
   computational analysis methods have improved our understanding of
   cellular heterogeneity. However, associating different cell subsets
   with phenotypes remains challenging. Recently, Ren et al. introduced
   PENCIL, a supervised learning framework incorporating gene selection to
   discern phenotype-relevant cells. To assess PENCIL’s reproducibility
   and transferability, we conducted a comprehensive evaluation across 12
   single-cell RNA sequencing datasets representing four distinct
   phenotypes. We identified a few caveats with the original version of
   PENCIL, such as sensitivity to input perturbation, the correction of
   which contributed to PENCIL’s enhanced reproducibility. We highlight
   that boosting PENCIL’s cell subsets identification with gene set
   variation analysis creates a cytotoxic T cell immunotherapy response
   signature (CyTIR) predictive of immune checkpoint blockade response in
   skin cancer across multiple datasets, with an area under curve >0.75
   and accuracy >0.71. Overall, our assessments enhance PENCIL’s
   reproducibility and utility, further extending its potential for
   identifying phenotype-relevant cell subsets in diverse biomedical
   applications.
     __________________________________________________________________

   The advancements in single-cell transcriptome sequencing protocols have
   been accompanied by remarkable progress in computational analysis
   methods. These computational methods have facilitated the exploration
   of complex aspects of biology, including cell type composition, gene
   regulation, transcriptional intra-tumour heterogeneity and
   developmental dynamics^[31]1–[32]4. However, the identification of
   phenotype-relevant cell subsets still poses an important challenge. To
   address this issue, Ren et al. developed PENCIL^[33]5, a pioneering
   supervised learning framework that incorporates a supervised gene
   selection step in discerning cells associated with categorical or
   continuous phenotypes. This new approach holds promise for improving
   the classification of cells into distinct phenotype-relevant groups
   compared to previous methods relying on unsupervised gene selection
   (for example, see refs. [34]^6^–[35]^8). Nevertheless, a rigorous
   evaluation of its reproducibility and transferability is crucial to
   establishing PENCIL’s reliability and broader applicability.

   In this study, we address this gap by comprehensively assessing
   PENCIL’s performance across four distinct phenotypes, focusing on the
   CD8^+ T cells from 12 single-cell RNA sequencing datasets.
   Additionally, we introduce a new biomedical application that utilizes
   gene set variation analysis (GSVA) to generate phenotype-predicting
   signatures from the differentially expressed genes found in
   phenotype-relevant cell subsets. Our assessment seeks to validate the
   important findings previously reported in Ren et al.^[36]5, while
   additionally exploring the potential of this approach to identify
   phenotype-relevant cell subsets for various future potential biomedical
   applications of interest.

Sensitivity of PENCIL to model parameters and input data

   To assess the reproducibility of PENCIL in identifying
   phenotype-relevant cell subsets using real data, we conducted our first
   analysis on the primary dataset utilized by Ren et al. in their
   original study^[37]5. This dataset comprised single-cell RNA sequencing
   data from 32 melanoma patients treated with immune checkpoint blockade
   (ICB) therapy, resulting in a total of 48 samples, including 17
   responder samples and 31 non-responder samples^[38]9. In this dataset,
   there were 6,350 manually annotated CD8^+ T cells, with 4,301 from
   non-responders and 2,049 from responders ([39]Fig. 1a, left), which are
   used as input of PENCIL in the original study by Ren et al.

Fig. 1 |. Sensitivity testing of PENCIL to model parameters and input data.

   [40]Fig. 1 |
   [41]Open in a new tab

   a, Sensitivity of PENCIL to the class_weights parameter setting: left,
   true phenotypic labels of cells; middle, predicted cell labels using
   default class_weights setting (ratio of non-responder (NR) to responder
   (R) is 1:1); right, changes of predicted number of ‘R cells’, ‘NR
   cells’ and ‘rejected (Rej.) cells’ with a series of different values of
   the class_weights parameter. The value x of class_weights represents
   the class weights set in PENCIL as 1:x (NR:R). The dashed line
   indicates the optimal class_weights value (NR:R = 1:2). b, Sensitivity
   of PENCIL to the input gene order. The diagram, by the technique of
   Uniform Manifold Approximation and Projection (UMAP), illustrates the
   predicted R, NR and Rej cells with input genes ordered by gene name (A
   to Z). The Venn plots from left to right illustrate the overlap of
   predicted cell labels with input genes ordered by gene name (A to Z),
   gene expression variance (low to high) and gene expression level (low
   to high), compared to the original gene order (gene expression variance
   from high to low). All simulations were conducted with class_weights
   set to 1:2. c, Sensitivity of PENCIL to the input cell identity: left,
   the overlap of CD8^+ T cells annotated by three different automatic
   tools, compared to original manually annotated CD8^+ T cells; right,
   the overlap of predicted R, NR and Rej cells when using CD8^+ T cells
   annotated by CellTypist, SingleR and ScType as input, compared to
   original manually annotated CD8^+ T cells. The data are from ref.
   [42]9.

Sensitivity of PENCIL to the class_weights parameter setting

   In our initial test, with the default model parameter settings, the
   PENCIL algorithm predicted an unexpectedly low number of
   responder-related cells (n = 9; [43]Fig. 1a, middle), which
   substantially differed from the count reported by the original authors
   (n = 1,243). To resolve this discrepancy, we reached out to the
   original authors, who kindly shared their source code with us. We found
   that the model parameter class_weights, utilized to account for the
   imbalance between the number of cells from different true phenotypic
   labels, was set to 1:2 (non-responder:responder, which is approximately
   inversely proportional to the cell counts of both categories, that is,
   4,301:2,049), rather than the default setting of 1:1. By adopting this
   adjusted setting, we achieved a more reasonable outcome, even though
   the exact number of non-responder and responder-related cells still
   deviated slightly from the authors’ reported results ([44]Fig. 1a,
   right). Several factors contribute to this discrepancy. First, the
   inherent randomness in deep learning could lead to variations when
   running on different machines. Second, there was a flaw in the data
   preprocessing of the authors’ code, as they directly applied log
   normalization on the log[2](TPM + 1) data without prior conversion to
   TPM expression values, where TPM is transcripts per million. Finally,
   we explored the influence of the class_weights parameter on the
   resulting cell numbers over a range of values and found that this
   parameter had a considerable impact on the predicted number of cells
   for different categories ([45]Fig. 1a, right).

Sensitivity of PENCIL to the input gene order

   Although PENCIL accepts the gene expression data in both input
   formats—CSV and H5A—we were surprised to find that the results obtained
   were contingent on the format used. Studying this further, we found
   that the data given in these two formats differed in the gene order.
   Specifically, in the H5AD format, genes were sorted alphabetically by
   their names by default (A to Z), whereas in the CSV format, they were
   arranged based on their expression variance (from high to low).
   Altering gene order in the CSV input data enabled us to reproduce the
   fluctuations in the algorithm’s predictions ([46]Fig. 1b, left). We
   further explored alternative gene orderings, such as sorting by
   low-to-high expression variance or low-to-high expression level.
   Interestingly, we observed that the gene order in the input file could
   indeed impact the identity and count of responder cells, non-responder
   cells and rejected cells predicted by PENCIL ([47]Fig. 1b). Upon
   examining the code of the PENCIL algorithm and consulting with the
   original authors, we concluded that the observed discrepancy is
   attributed to the fact that, in the PENCIL framework, input genes are
   assigned initial random weights for the subsequent optimization step.
   Consequently, altering the gene order results in different weights
   being assigned to each gene, thereby yielding varied outcomes.

Sensitivity of PENCIL to the input cell identity

   Given that there are many different cell annotation methods and that
   different methods may assign different identities to cells, we
   investigated PENCIL’s sensitivity to input cell identity. To this end,
   we compared CD8^+ T cells annotated by three automatic
   tools—CellTypist^[48]10, SingleR^[49]11 and ScType^[50]12—with manually
   annotated CD8^+ T cells. Although all methods identified a substantial
   number of CD8^+ T cells, CellTypist demonstrated the highest agreement
   with the original manual annotation (overlap >65%; [51]Fig. 1c, left)
   and the highest concordance in PENCIL-predicted responder,
   non-responder and rejector labels (overlap >40%; [52]Fig. 1c, right).
   Thus, CellTypist was selected for annotating datasets lacking cell type
   information. As more annotation tools emerge, benchmarking and
   selecting the most suitable one will become crucial for downstream
   analysis, as demonstrated in ref. [53]13.

Learning sample phenotypes via phenotype-relevant cells

   Next, we utilized PENCIL to investigate the underlying mechanisms
   driving phenotypic differences and predict sample-level phenotypes
   based on single-cell-level phenotypic assignments. Our focus was on
   CD8^+ T cells in each dataset; to ensure robustness, samples with fewer
   than 50 CD8^+ T cells were excluded from subsequent sample-level
   analyses.

Predicting ICB response in skin cancer

   While numerous signatures have been developed for predicting cancer ICB
   response, the majority operate on a bulk level. In this study, we
   investigated the feasibility of leveraging PENCIL to develop a
   single-cell-level signature for predicting ICB response. To achieve
   this, we trained PENCIL using the aforementioned melanoma samples, with
   ICB response as the phenotype of interest. Subsequently, we employed
   the trained model to predict the number of ICB-response-relevant cells
   in two distinct test sets: (1) tumour CD8^+ T cells from 15 patients
   with basal or squamous cell carcinoma ([54]GSE123813)^[55]14 and (2)
   circulating CD8^+ T cells from 20 metastatic melanoma patients
   ([56]GSE166181)^[57]15. On the training data, PENCIL showed an
   outstanding ability to predict sample-level ICB response status
   (responders versus non-responders) from the ratio of responder-related
   cells, achieving a perfect area under curve (AUC) of 1 and accuracy of
   1 ([58]Fig. 2a, left). However, on the test sets, performance was only
   moderate. While responder samples had significantly higher ratios of
   responder-associated cells (P = 0.014–0.016), accuracy in identifying
   responder samples using a 50% cutoff for this ratio ranged just
   0.48–0.54 ([59]Fig. 2a, middle, right).

Fig. 2 |. Predicting ICB response in skin cancer using a newly curated
signature derived from ICB-response-related CD8 + T cells.

   [60]Fig. 2 |
   [61]Open in a new tab

   a, Ratios of predicted non-responder-related (NR, blue) and
   responder-related CD8^+ T cells (R, red) by PENCIL in each sample. The
   AUCs of prediction were calculated using the proportion of R cells in
   each sample. The accuracy (acc.) was calculated using a cutoff of 50%
   of the proportion of R cells in each sample. b, Dot plots showing the
   expression levels of the top 5 significantly differentially expressed
   genes of PENCIL-predicted cell labels. The size of the dot encodes the
   percentage of cells expressing each gene, and the colour encodes the
   average expression level. An ICB response signature was curated and
   defined as the difference of the GSVA scores between the two sets of
   differentially expressed genes. c, The receiver operating
   characteristic curves and AUCs using this score to predict ICB response
   in the training and test sets. d, The ICB response signature score
   distribution in R- and NR-samples in different datasets. the accuracy
   for prediction of ICB response was calculated using cutoff of ICB
   response signature score at 0. The blue bars are signature scores of
   non-responders, the red bars are signature scores of responders. In a
   and d, two-tail Wilcoxon test P values are displayed. In d, the upper
   and lower boundaries signify the first and third quartiles,
   respectively, the central line denotes the median, and the whiskers
   stretch to the most distant data points not classified as outliers
   (within 1.5 times the interquartile range).

   Despite this moderate performance, we explored whether PENCIL’s
   phenotype-relevant cell classifications could still be utilized to
   build a more accurate predictor, by analysing differential gene
   expression rather than relative abundance of the phenotype-relevant
   cells. To this end, we identified genes significantly differentially
   expressed between the PENCIL-predicted responder and non-responder cell
   subsets. We then curated a cytotoxic T cell immunotherapy response
   signature (CyTIR), defined as the difference in GSVA scores between the
   top 5 responder-related genes (IL7R, CCR7, TSPYL2, GPR183 and HSPH1)
   and the top five non-responder-related genes (IFI6, WARS, GZMB, CCL3
   and ISG15), ranked by fold change from the differential expression
   analysis ([62]Fig. 2b). Remarkably, the newly curated ICB response
   signature, CyTIR, successfully predicted sample-level ICB response from
   single-cell data in both the training and independent test sets (AUC =
   0.75–0.86; [63]Fig. 2c). With a cutoff of 0, the CyTIR score achieved
   an accuracy of identifying ICB responder samples ranging from 0.71 to
   0.78 ([64]Fig. 2d). In contrast, attempting to use the significantly
   differentially expressed genes identified in the original PENCIL work
   (Figure 5e in Ren et al.^[65]5) to calculate the GSVA ICB response
   signature score resulted in lower predictive performance (AUC =
   0.66–0.81, accuracy = 0.60–0.73; see [66]Supplementary Fig. 1a),
   attributed to the aforementioned flawed data preprocessing in the
   original analysis. We also tested the GSVA ICB response signature using
   top differentially expressed genes in CD8^+ T cells from responder
   samples versus non-responder samples, without using PENCIL. The
   performance was poorer (AUC = 0.59–0.86, accuracy = 0.57–0.78; see
   [67]Supplementary Fig. 1b), indicating that by rejecting irrelevant
   cells, PENCIL can recognize stronger gene signals, yielding more
   accurately predicted outcomes.

   Additionally, we compared CyTIR with multiple well-established
   signatures for ICB response prediction, including the exhausted T cell
   signature^[68]16, the cytotoxic T cell signature^[69]17, the tumour
   inflammation signature^[70]18, the interferon-γ signature^[71]19, the
   CXCL13 + CD8^+ T cell signature^[72]20, the tumour-reactive CD8^+ T
   cell signature^[73]21 and the tumour-reactive tissue-resident memory
   signature^[74]22 ([75]Supplementary Fig. 1c–[76]i). Among these, CyTIR
   emerged as the most predictive and robust one. Interestingly, the
   cytotoxic T cell signature in CD8^+ T cells was negatively correlated
   with the ICB response (AUC = 0.17–0.28, accuracy = 0.27–0.39; see
   [77]Supplementary Fig. 1d), contrasting with that in bulk RNA
   sequencing data. This observation suggests that CD8^+ T cells
   expressing high levels of cytolytic effectors may exhibit
   characteristics of highly exhausted T cells, including the depletion of
   memory phenotypes, potentially leading to unfavourable outcomes.
   Consistently, the cytolytic effector GZMB indeed ranked among the top
   significantly upregulated genes in PENCIL-predicted non-responder cells
   compared to responder cells ([78]Fig. 2b).

Predicting cell origin in head and neck squamous cell carcinoma

   To further assess PENCIL’s performance on another cancer and phenotype,
   we applied it to CD8^+ T cells from tumour and peripheral blood
   mononuclear cell (PBMC) samples of head and neck squamous cell
   carcinoma (HNSCC) patients. In this case, the objective was to identify
   tumour-originated and PBMC-originated CD8^+ T cell samples. Following
   training on one dataset ([79]GSE139324; [80]Figs. 3a and [81]3b,
   top)^[82]23, we tested PENCIL’s performance on five independent test
   datasets, including [83]GSE164690 (ref. [84]^24), [85]GSE162025 (ref.
   [86]^25), [87]GSE180268 (ref. [88]^26), [89]GSE182227 (ref. [90]^27)
   and [91]GSE200996 (ref. [92]^28). Impressively, PENCIL has demonstrated
   a reassuringly overall high discrimination ability between tumour and
   PBMC-originated samples in both the training set and all five test
   sets, using a cutoff of a 50% ratio of tumour-originated cells
   (accuracy = 0.89–1; [93]Fig. 3b). While human papillomavirus (HPV)
   infection is a major subtype driver in HNSCC, our analysis did not
   reveal a significant influence of HPV infection status on
   distinguishing CD8^+ T cell origin by PENCIL ([94]Supplementary Fig.
   2). Additionally, when using significantly differentially expressed
   genes directly from CD8^+ T cells in tumour versus PBMC samples without
   employing PENCIL, the predictive power for CD8^+ T cell origin was
   poorer. This was particularly evident in datasets [95]GSE180268 and
   [96]GSE182227, where accuracy hovered around 0.55 ([97]Supplementary
   Fig. 3a).

Fig. 3 |. Characterizing tumour-originated and PBMC-originated CD8^+ T cells.

   [98]Fig. 3 |
   [99]Open in a new tab

   a, UMAP showing the cells using the top 2000 high variable genes for
   the training data ([100]GSE139324). Cell numbers are shown in
   parentheses. b, Ratios of PENCILpredicted tumour-originated CD8^+ T
   cells (tumour, red) and PBMC-originated CD8^+ T cells (PBMC, blue) in
   each sample. c, the top ten enriched pathways of significantly highly
   expressed genes in PENCIL-predicted tumour-cells (left) and PBMC-cells
   (right). d, UCell score of three well-known anti-tumour immune
   signatures. e, Expression levels of three immune checkpoint genes. In d
   and e, two-tail Wilcoxon test P values are displayed. logFC, log fold
   change; acc., accuracy; adj. P, adjusted P; ER, endoplasmic reticulum;
   GTP, guanosine-5′-triphosphate; TCR, T cell receptor.

   To study the biological differences between tumour- and PBMC-originated
   CD8^+ T cells in the dataset described above, we performed a pathway
   enrichment analysis on the PENCIL-identified relevant cells. As
   expected, tumour-originated CD8^+ T cells exhibited enrichment in
   anti-tumour functions such as T cell receptor signalling, antigen
   processing and signalling by interleukins. In contrast, PBMC-originated
   CD8^+ T cells displayed greater enrichment in fundamental cellular
   functions, such as translation ([101]Fig. 3c). Additionally, we
   investigated the enrichment of well-known anti-tumour immune signatures
   and immune checkpoint genes. Compared to the PBMC, tumour-originated
   CD8^+ T cells displayed significantly higher enrichment of the
   tissue-resident memory (TRM), exhaustion and interferon-gamma (IFNG)
   signatures ([102]Fig. 3d). Notably, the tumour-originated CD8^+ T cells
   have a significantly higher expression of immune checkpoint genes
   TIGIT, LAG3 and activation surface marker HLA-DRB1 than the
   PBMC-originated ones annotated by PENCIL, capturing a pronounced
   activated yet exhausted phenotype of these cells in the tumour
   microenvironment ([103]Fig. 3e). In addition, PENCIL-predicted
   tissue-originated CD8^+ T cells exhibit significantly higher UCell
   scores for TRM, exhaustion and IFNG signatures, along with
   significantly elevated expression levels of the three abovementioned
   genes, compared to using CD8^+ T cells directly from the tumour samples
   without using PENCIL ([104]Supplementary Fig. 3b). Collectively, these
   results show that, by excluding a proportion of irrelevant cells,
   PENCIL can discern stronger gene signals, leading to more robust
   outcomes.

PENCIL may overfit data in some cases

   We further extended the application of PENCIL to explore additional
   phenotypes and datasets. In one experiment, we evaluated PENCIL’s
   ability to predict sex-related CD8^+ T cells in skin cancer, and in
   another, we tested its ability to predict HPV-infection-related CD8^+ T
   cells in HNSCC. Intriguingly, on the training data, PENCIL consistently
   demonstrated outstanding performance in predicting phenotype-relevant
   cells, quantified by the fraction of cells labelled with a matching
   label to that of their host samples (AUC = 0.99–1; accuracy = 0.96–1;
   [105]Supplementary Fig. 4). However, its performance on the test data
   in both experiments was notably suboptimal, as it revealed
   statistically insignificant cell ratio differences between samples with
   different phenotypic labels (P = 0.39–1) and exhibited variable
   predictive power for the sample-level phenotype (AUC = 0.32–0.67,
   accuracy = 0–0.76; [106]Supplementary Fig. 4). These findings suggest
   that, while PENCIL excels in predicting phenotype-relevant cells on the
   training data, its generalization to independent test datasets requires
   further investigation and improvement to enhance its predictive
   accuracy. One suggestion for its further improvement could be
   retraining a model on PENCIL selected cells and incorporating
   additional measures to mitigate the risk of overfitting, even at the
   cost of reducing the training accuracy.

Discussion

   In this study, we successfully reproduced the main results of Ren et
   al.^[107]5 using the provided code and data. While bearing this in
   mind, our results indicate that PENCIL may be sensitive to changes in
   input parameters and data, emphasizing the importance of carefully
   handling the latter when applying the algorithm. Our recommendations
   for those going forward include: (1) set the class_weights parameter
   for each category to be inversely proportional to the cell counts of
   that category, (2) arrange the order of genes in the expression input
   file by expression variance from high to low and (3) annotate the cell
   types carefully by employing proper annotation tools and
   cross-referencing the findings with biological knowledge. Taken
   together, these recommendations may aid researchers in maximizing the
   potential of PENCIL for single-cell analysis.

   Our study has showcased the utility of PENCIL in learning sample-level
   phenotypic differences by leveraging phenotype-relevant single cells
   across different datasets. Notably, augmenting PENCIL’s cell subset
   identification with GSVA led to the creation of the CyTIR signature for
   predicting ICB response in skin cancer. CyTIR is defined as the
   difference in GSVA scores between the top five responder-related genes
   (IL7R, CCR7, TSPYL2, GPR183, HSPH1) and the top five
   non-responder-related genes (IFI6, WARS, GZMB, CCL3, ISG15).
   Remarkably, compared to predictions based on differential cell
   abundance from PENCIL alone, CyTIR achieved gains of 4–5% in AUC and
   31–48% in accuracy. Interestingly, our analysis revealed that GZMB,
   typically a responder-related gene, is down-regulated in
   responder-related CD8^+ T cells. This observation aligns with a seminal
   study demonstrating significant down-regulation of GZMB in CD8^+ T
   cells within tumours having tertiary lymphoid structures—a feature
   correlated with enhanced immunotherapy outcomes and survival^[108]29;
   conversely, the upregulation of GZMB, alongside other immune checkpoint
   genes like TIM3 and PD1, often marks a dysfunctional state in CD8^+ T
   cells^[109]9,[110]29. Importantly, although initially derived from
   melanoma tumour samples, CyTIR demonstrated high effectiveness in
   predicting responses in distinct datasets, including tumour samples
   from basal or squamous cell carcinoma and PBMC samples of metastatic
   melanoma. This cross-dataset applicability underscores CyTIR’s
   robustness. Further validation of CyTIR across additional skin cancer
   datasets and other cancer types will be instrumental in establishing
   its utility and adaptability in clinical settings.

   Finally, our study highlights a few practical considerations when using
   PENCIL. First, the computational runtime can vary substantially
   depending on hardware. While PENCIL runs in minutes with graphics
   processing unit acceleration for most datasets, it may take hours
   without such acceleration. Second, the performance of PENCIL is
   sensitive to model parameters and input data, necessitating careful
   parameter tuning and data handling. Third, PENCIL can generate
   predictions for various combinations of input cell types and phenotypic
   labels, but in certain cases, it may overfit the data due to the
   supervised gene selection during model training. To address these
   challenges, we recommend a stepwise approach. Primarily, as much as
   possible, researchers should curate multiple datasets and thoroughly
   evaluate PENCIL’s performance on independent test data before
   proceeding with downstream analyses on the resulting phenotype-relevant
   cells. By exercising prudence and diligence in its application, PENCIL
   can be successfully harnessed effectively to uncover important and
   interesting phenotype-relevant biology from single-cell RNA sequencing
   data.

Methods

Data preprocessing

   For all the analyses in this study, we focused on CD8^+ T cells. CD8^+
   T cells of [111]GSE120575 were annotated by the original study of ref.
   [112]9. Only protein-coding genes were used for downstream analyses. As
   a result, 6,350 CD8^+ T cells and 20,645 genes were retained. The
   datasets [113]GSE123813 (ref. [114]^14), [115]GSE166181 (ref.
   [116]^15), [117]GSE115978 (ref. [118]^30), [119]GSE144236 (ref.
   [120]^31), [121]GSE145328 (ref. [122]^32), [123]GSE139324 (ref.
   [124]^23), [125]GSE162025 (ref. [126]^25) and [127]GSE180268 (ref.
   [128]^26) were processed, and the annotation of CD8^+ T cells was
   carried out using the Tumour Immune Single Cell Hub^[129]33. The
   annotated data were downloaded from
   [130]http://tisch.comp-genomics.org/home/. The datasets [131]GSE164690
   (ref. [132]^24), [133]GSE182227 (ref. [134]^27) and [135]GSE200996
   (ref. [136]^28) were downloaded from The Gene Expression Omnibus (GEO)
   ([137]https://www.ncbi.nlm.nih.gov/geo/) and the CD8^+ T cells were
   annotated with CellTypist^[138]10 with the majority-voting model
   parameter setting, using Immune_All_Low as the reference model. In the
   initial experiment testing different cell annotation methods,
   SingleR^[139]11 and ScType^[140]12 based cell type annotation was also
   used, using celldex::BlueprintEncodeData() and ScTypeDB_full::Immune
   system as the reference datasets, respectively. After obtaining the
   CD8^+ T cells, we normalized the raw counts with log normalization
   using the NormalizeData() function followed by scaling and centring the
   data using the ScaleData() function from the Seurat package
   ([141]https://satijalab.org/seurat/; v.4.3.0) in R
   ([142]https://www.r-project.org/; v.4.3.0). The phenotypic information
   for each dataset was provided with the original study.

Statistical analysis

Differential cell abundance analysis.

   For the binarized phenotypic labels (denoted as P1 and P2 below),
   PENCIL classified the input cells into three groups: P1 cells, P2 cells
   and rejected cells. To ensure robustness, we removed samples containing
   less than 50 CD8^+ T cells. For each dataset, the original sample size
   and sample size after filtering are summarized in [143]Supplementary
   Table 1. Subsequently, we calculated and plotted the ratios of P1 cells
   and P2 cells for the remaining samples. To test the differential
   abundance of P2 cells between the P1- and P2-samples, we conducted a
   two-tail Wilcoxon test. For predicting sample-level phenotypic labels
   based on phenotype-relevant cell abundance, we utilized the P2-cell
   ratio in each sample as a predictor. To assess the accuracy of
   prediction, we employed a 50% P2-cell ratio cutoff, following the
   approach used by Ren et al.^[144]5.

Gene set variation analysis.

   In our analysis, we used the FindAllMarkers() function from the Seurat
   package in R to identify marker genes in the P1 cells, P2 cells and
   rejected cells. The function was set with parameters: only.pos = T,
   min.pct = 0.25 and logfc.threshold = 0.25. Subsequently, we filtered
   the genes with adjusted P values of less than 0.05. To rank the
   significant marker genes, we calculated their average log fold change.
   Next, we calculated the aggregated gene expression of all CD8^+ T cells
   for each sample using the AverageExpression() function in the Seurat
   package. For GSVA analysis, we calculated GSVA scores for the top five
   P1- and P2-marker genes separately on the scaled aggregated gene
   expression of all CD8^+ T cells in each sample, utilizing the
   GSVA^[145]34 package (v.1.48.2). To test the differential GSVA scores
   between the P1- and P2-samples, we conducted a two-tail Wilcoxon test.

ICB response signature analysis.

   We compared our newly derived GSVA signature with seven established ICB
   response signatures. The established signatures and their definitions
   are elaborated below. (1) The exhausted T cell signature^[146]16 is
   defined as the GSVA score of the PDCD1, CTLA4, LAG3, HAVCR2 and TIGIT
   genes. (2) The cytotoxic T cell signature^[147]17 is defined as the
   GSVA score of the GZMA and PRF1 genes. (3) The tumour inflammation
   signature^[148]18 is defined as the GSVA score of the CD276, HLA-DQA1,
   CD274, IDO1, HLA-DRB1, HLA-E, CMKLR1, PDCD1LG2, PSMB10, LAG3, CXCL9,
   STAT1, CD8A, CCL5, NKG7, TIGIT, CD27 and CXCR6 genes. (4) The
   interferon-γ signature^[149]19 is defined as the GSVA score of the
   IDO1, CXCL10, CXCL9, HLA-DRA, STAT1 and IFNG genes. (5) The CXCL13 +
   CD8^+ T cell signature^[150]20 is defined as the ratio of CXCL13 +
   CD8^+ T cells to all CD8^+ T cells. (6) The tumour-reactive CD8^+ T
   cell signature^[151]21 is defined as the ratio of CD39 + CD8^+ T cells
   to all CD8^+ T cells. (7) The tumour-reactive TRM signature is defined
   as the ratio of CD39 + CD103 + CD8^+ T cells to all CD8^+ T cells.

Pathway enrichment analysis.

   We conducted pathway enrichment analysis on the significant marker
   genes identified for the P1 cells and P2 cells, as described earlier.
   For this analysis, we utilized the enrich-Pathway() function from the
   ReactomePA^[152]35 package (v.1.44.0) with a P-value cutoff of 0.05. We
   presented the top ten significantly enriched pathways in both the P1
   cells and P2 cells.

Anti-tumour immune signature enrichment analysis.

   During the analysis of tumour-related CD8^+ T cells in HNSCC, we
   conducted an examination of the enrichment of anti-tumour immune
   signatures in both tumour-related CD8^+ T cells and PBMC-related CD8^+
   T cells. Specifically, we defined the TRM signature using positive
   marker genes such as IL17A, IL10, IL2, IFNG, DUSP6, PDCD1, CRTAM,
   ITGA1, ITGAE, CD69, CD101 and CXCR6, along with negative marker genes
   including SELL, S1PR1, S1PR5, CX3CR1, MKI67, KLF2 and KLF3, following
   the approach used in ref. [153]36. Moreover, we defined the T cell
   exhaustion signature using positive marker genes such as PDCD1, CTLA4,
   LAG3, HAVCR2 and TIGIT, as outlined by ref. [154]16. Additionally, the
   interferon-gamma signature (IFNG) was established using positive marker
   genes like IDO1, CXCL10, CXCL9, HLA-DRA, IFNG and STAT1, following ref.
   [155]37. To calculate the enrichment scores for these signatures, we
   utilized the AddModuleScore_UCell() function from the UCell^[156]38
   package (v.2.4.0).

Supplementary Material

   Supplementary Information
   [157]NIHMS2065244-supplement-Supplementary_Information.pdf^ (729.1KB,
   pdf)

Acknowledgements