Graphical abstract graphic file with name fx1.jpg [56]Open in a new tab Highlights * • Immune-related gene sets (irGS) identified via factorization of RNA-seq database * • The irGS refines pan-cancer immune subtypes * • The irGS enhances immunotherapy response classification * • The irGS improves functional annotation of spatial transcriptomic data __________________________________________________________________ Bioinformatics; Cancer; Expression study; Immunity; Molecular network; Transcriptomics Introduction Despite recent breakthroughs in immunotherapy such as immune checkpoint blockade (ICB)[57]^1 and adoptive cell therapy (ACT),[58]^2 only a limited fraction of patients benefits from these biomedical advancements. Challenges related to immune-related adverse events and resistance remain and hinder positive outcomes, especially in solid tumors.[59]^3 It is essential for the field of immunotherapy to understand molecular mechanisms underlying the treatment resistance and responses better predict treatment outcomes and develop targeted therapeutics. A systematic way to address these challenges is through statistical analysis of transcriptional programs of human patient samples,[60]^4 which can uncover useful gene expression patterns associated with immunotherapeutic responses. Interpretation of transcriptional programs and their functional significance, however, relies on utilizing previously established gene signature knowledgebases[61]^5^,[62]^6^,[63]^7^,[64]^8 such as Molecular Signature Database (MSigDB) Hallmarks (Hallmark),[65]^9 Kyoto Encyclopedia of Genes and Genomes (KEGG),[66]^10 Gene Ontology (GO),[67]^11 etc. Unfortunately, these knowledgebases are largely cell-type agnostic and lack granularity or immune specificity.[68]^7 Such biases severely hamper data interpretation in immunological studies utilizing high-throughput RNA profiling, resulting in many missed opportunities. To further advance cancer immunotherapy, there is an urgent need to discover and expand immune-related gene signatures/sets (irGSs) in current knowledgebases to comprehensively represent diverse immune-specific transcriptional states and to apply them to interpret molecular functions. A closer inspection of the aforementioned knowledgebases reveals important limitations that are inherent to each source. For example, Hallmark[69]^9 contains only 7 immune-related pathways, with an average of 160 member genes each. The large gene sets in this database can potentially inflate biological significance, without genuine and specific relevance to the designated functions. On the other hand, KEGG[70]^10 contains mostly metabolism related pathways that are not specific to immune systems. The GO[71]^11 terms, with its hierarchical nature, result in collective enrichment of general terms with overlapping functions, diminishing statistical power to detect specific terms. Moreover, although there are some immune-related pathways in the aforementioned knowledgebases, they are too broadly defined. For example, in KEGG, the term T cell receptor (TCR) signaling does not distinguish receptor formation by T cells from antigen presentation by antigen-presenting cells, two complimentary functions in TCR signaling. In Hallmark, “inflammatory response” is too broad to inform detailed molecular programs or mechanisms. Therefore, there is a need to define more specific gene sets and refined annotations to achieve higher granularity. To address this knowledge gap, some immune specific knowledgebases have been constructed. In particular, ImmuneSigDB,[72]^12 consisting of over 5,000 irGSs, was derived from RNA-expression datasets. These RNA expression datasets are very rich, derived from nearly 400 noncancer-specific immunological experiments, containing samples challenged with pathogen infections, cytokines or immunological perturbations of different kinds and magnitudes, thus possessing rich immunological states. Some immunological states may be shared by a variety of conditions. For example, the severe immune dysregulation and inflammation,[73]^13^,[74]^14 observed in sepsis samples may inform immune cell proliferation, inflammatory signaling, and cellular dysfunction and exhaustion in tumor immune microenvironment (TIME). However, the ImmuneSigDB study, conducted nearly a decade ago, applied empirical approaches that may have resulted in some limitations. First, the study examined sets of genes that are differentially expressed (DEGs) across different comparisons of interest, and these comparison groups were subjectively decided by researchers as important and immune-relevant.[75]^12 Second, differential gene expression analysis tends to output large numbers of DEGs (200 ± 70, max = 1,992), with 50% of the gene set size ranges from 190 to 196 member genes. Third, the derived irGSs tend to have limited translational implications as they can only be used to differentiate major states (such as cell lineages) instead of more granular substates. Fourth, given that these comparison groups are defined based on empirical knowledge, ImmuneSigDB irGSs are likely incomplete. Lastly, the ImmuneSigDB irGSs are simply annotated using the GEO ID of the dataset and the comparison groups, thus containing limited annotations of immune functions, hampering results interpretation. In this paper, we aim to construct concise and well-annotated immune-specific gene sets using non-negative matrix factorization (NMF),[76]^15 a powerful factor analysis technique[77]^16^,[78]^17^,[79]^18^,[80]^19^,[81]^20 that has not yet been applied to ImmuneSigDB data. By decomposing gene expression of samples from ImmuneSigDB into composite programs, followed by addition statistical analysis to consolidate and annotate these programs, we have identified 28 irGSs describing various immunological processes. These irGSs can be applied to study and characterize various immune-related transcriptional states such as those in play in TIMEs. We have systematically validated gene set annotations on 12 different datasets and explored their translational implications on 15 different datasets (see [82]STAR Methods Data Sources). These datasets include various modalities, CITE-seq, BulkRNAseq, scRNA-seq, and spatial transcriptomics, and were profiled from a variety of immunological contexts, including but not limited to cytokine perturbation experiments, infectious disease, and treatment naive or immunotherapy-treated cancer patients. Through these exploratory analyses, we showed that these gene sets can better characterize pan-cancer tumor-immune subtypes, separate ICB response in melanoma, liver, and lung cancer and reveal spatial heterogeneity in ovarian, intestinal, and breast cancer samples. Results Twenty eight gene sets covering diverse immunological functions in the TIME The ImmuneSigDB datasets comprise 83 human bulk gene expression profiles with 1,826 RNA samples from studies published in leading immunology journals ([83]Figure S1A). The datasets included in the ImmuneSigDB were derived from patients affected by severe infections and sepsis or from cell lines challenged with experimental manipulations[84]^12: gamma delta T cells activated by IL2,[85]^21 and CD14^+ monocytes[86]^22 cultured with IFN-gamma and stimulated with Toll-like receptor 2 ligand. Such a large collection of datasets entails transcriptional profiles from diverse cell states, making them suitable to derive generalizable irGSs. The datasets encompass gene expression data obtained from human and mouse samples ([87]Figure S1B) that originated from samples deriving from different cell types ([88]Figure S1C). Among datasets profiled from Homo sapiens, the number of genes and samples profiled varies among datasets ([89]Figures S1D and S1E). We partitioned the 83 ImmuneSigDB human datasets into 47 lymphoid-derived and 36 myeloid-derived datasets. To obtain unique irGSs, we carried out the analytic workflow, illustrated in [90]Figure 1A and detailed in [91]Figure S1F and [92]STAR Methods, separately for the lymphoid and the myeloid data. We applied NMF on each qualifying dataset (see [93]STAR Methods) and obtained the top 50 genes from each latent factor based on the NMF loadings, producing a total of 529 gene sets. To ensure generalizability, gene sets derived from one dataset were kept only if they overlap at least 20% with at least one gene set derived from another dataset. To ensure non-redundancy, we excluded gene sets with over 20% overlap with gene sets from the same dataset, resulting in 115 gene sets. To further reduce redundancy, we employed the algorithm proposed by Tirosh et al.,[94]^16 whereby gene sets are progressively merged into meta programs (MP) based on Jaccard distances (see [95]STAR Methods). Figure 1. [96]Figure 1 [97]Open in a new tab Analytic workflow and constructed gene sets (A) Gene set construction analysis workflow: we downloaded 83 immunology relevant human studies and performed NMF on each of the qualifying dataset. We curated robust NMF programs and merged them into meta gene sets. We performed over-representation enrichment analysis for each meta gene set using KEGG pathways, Hallmarks, and biological process terms from GO. The name and annotation of each meta gene set was determined based on counts and false discovery rate (FDR) adjusted p values of the highly enriched terms. We seek relevant multimodal datasets to functionally name and validate each gene set, and further examined the translational utilities of these gene sets. (B) The roles of the constructed gene sets in central immunity: constructed gene sets overlaid near the most relevant processes during central immune activities in tumor-immune microenvironment. Created with [98]BioRender.com. Using the workflow described previously, we derived 19 immune-related MPs (irMPs) from lymphoid samples (L_MP1-19) and 9 irMPs from myeloid samples (M_MP1-9) ([99]Table 1). To annotate each irMP with immune-relevant functions, we performed over-representation analysis (ORA) using biological processes from GO, KEGG, and Hallmark and sorted the enrichment results first by counts of core enrichment then by ascending adjusted p values. Enrichment results and constituent genes for each L_MP and M_MP were presented in [100]Tables S1, [101]S2, [102]S3, and [103]S4. We named the irMPs based on top enriched terms and refined the annotations by going over the constituent genes and the stringDB[104]^23 protein-protein interaction (PPI) network, followed by meticulous scrutiny by multiple immunologists (Matthew Gubin Ph.D.; Hind Rafei M.D.; Rafet Basar M.D.; Katy Rezvani M.D/Ph.D.; and Weiyi Peng M.D/Ph.D.) from MD Anderson Cancer Center. Rationales for each gene set annotation have been documented in supplementary notes. Table 1. Annotations for the 9 myeloid-derived irMPs and 19 lymphoid-derived irMPs Myeloid-derived gene sets Lymphoid-derived gene sets 1 Antiviral defense network Cell cycle progression 2 Antigen processing TCR anchoring 3 Cytokine production Histone-associated lipid antigen presentation 4 Cell adhesion 1 Interleukin-induced Tregs 5 Modulation of cell migration Antigen presentation 6 Cell adhesion 2 Cell cycle mitosis 7 Eosinophil chemotaxis Chemokine-mediated T cell activation 8 MHC2-mediated lymphocyte activation MHC-mediated immunity 9 Chemokine activity Metabolic process 10 Lipid localization TCR synapse 11 Cell cycle immune response 12 Lymphocyte activation 13 Interferon induced antiviral defense 14 Type-I interferon 15 Cell adhesion 1 16 Cell adhesion 2 (CD52^+) 17 IFN-gamma-induced cytotoxicity 18 Cytokine receptor signaling 19 T cell activation [105]Open in a new tab Overall, the L_MPs are highly specific to immune functions, consisting of pathways such as lymphocyte activation, antigen processing, TCR anchoring, and cytotoxic functions. M_MPs are less diverse, including pathways related to cell migration, adhesion, and antiviral response ([106]Figure 1B; [107]Table 1). A few MPs are related to core cellular processes, such as cell cycle and metabolism. To benchmark our irMPs with existing database, we compared and quantified their constituent genes overlap (see [108]STAR Methods) with 231 manually curated Spectra immune gene sets,[109]^24 5,000 ImmuneSigDB pathways, 50 Hallmark pathways, and 29 immune-related KEGG (irKEGG). We observed significant distinction (false discovery rate [FDR]-adjusted p value < 0.05) in 99.5%, 98%, 99%, and 87% of the comparison with Spectra gene sets ([110]Figure S2), ImmuneSigDB ([111]Figures S3 and [112]S4), Hallmark ([113]Figure S5), and irKEGGs ([114]Figure S6), respectively. Functional characterization of the irMPs To validate the functional annotation of irMPs and demonstrate their general applicability, we compiled 12 publicly accessible gene expression datasets encompassing a diverse range of immune contexts, captured through various sequencing techniques (see [115]STAR Methods Data Sources). These datasets were derived from well-designed immunological experiments with well-defined phenotypes, providing independent and orthogonal evidence required to validate annotations of irMPs. Compared to RNA expression, protein expression is often a more direct indicator of biological activity and reflects real physiological processes. We, therefore, leveraged two CITE-seq datasets,[116]^25^,[117]^26 in which the whole transcriptome and ∼200 surface antibodies were simultaneously profiled in the peripheral blood mononuclear cells (PBMCs) of healthy human and COVID-19 patients, respectively. irMPs were scored for each cell in the RNA level using single sample gene set enrichment analysis (ssGSEA)[118]^27 (see [119]STAR Methods). For each irMP of interest, cells were dichotomized into MP-high or MP-low (see [120]STAR Methods) and the abundance of proteins indicative of the function described by the irMP was compared between MP-high and MP-low groups to show consistency between the irMP and proteomic measurement. For irMP without relevant proteins in the CITE-seq data, we defined highly relevant phenotypes in appropriate scRNA-seq data or The Cancer Genome Atlas (TCGA) pan-cancer data and checked for gene set enrichment in a pre-defined phenotype. In addition, we utilized PPI network constructed by the StringDB database[121]^28 to elucidate the activity cascade of each irMP ([122]Figures S7 and [123]S8). Interestingly, the irMPs we obtained can be broadly classified into categories that align with the sequential progression of a typical immune response: cytokine/chemokine production, cell adhesion, cell chemotaxis, antigen processing, antigen presentation, lymphocyte activation, and cytotoxicity ([124]Figure 1B), indicating the key importance of these functional components and the rich molecular programs that have been under-explored. In the following, we describe the function of each irMP and evidence from orthogonal validations. Cytokine/chemokine production There are two cytokine/chemokine related irMPs from myeloid data (M_MP3 and M_MP9) and two from lymphoid data (L_MP7 and L_MP18). To validate their annotations, we examined scRNA-seq data profiled from two severe COVID-19 patients, with symptoms consistent with inflammatory cytokine storms (ICS).[125]^29 If the irMPs are in fact related to pathogenic cytokine/chemokine activities, they should be enriched in cells deriving from patients rather than health controls. As myeloid cells were concluded to be the source of ICS,[126]^29 we extracted myeloid cells and performed gene set enrichment analysis (GSEA) on fold-changes of genes expression between ICS and healthy controls. Indeed, we found that all four irMPs were significantly enriched in ICS patients relative to healthy controls ([127]Figure 2A). Figure 2. [128]Figure 2 [129]Open in a new tab Functional characterization of the generic immune gene sets (A) Barcode enrichment plot of 4 cytokine/chemokine release irMPs; all genes from the data are ranked along x axis according to enrichment score (y axis). There is a significant upregulation in all 4 irMPs in samples with severe COVID-19 patients with cytokine storm, and the ranked position of each gene within a signature is shown in the x axis. (B) Density plot comparing the normalized enrichment score of three antiviral irMPs in different immune cell types when stimulated with interferon vs. non-interferon. (C) Violin plot comparing cell cycle MP activities calculated using GS density across different cell types identified from COVID-19 single cell atlas. ANOVA test was performed for multi-group comparison. (D) UMAPs labeled with T cell activation stages, L_MP9 activity, glycolysis, and oxidative phosphorylation activities from Hallmark using GS density. Upon further examining the PPI network constructed from the genes in M_MP9 ([130]Figure S7.9), we found that M_MP9 has a central module of chemokine genes (CXCL10, CXCL9, CCL4, CCL2, CCRL2, CCL7, and CCL8), connected with antigen presenting cell (APC)-specific human leukocyte antigen (HLA) class II genes (HLA-DPA1, HLA-DMA, and HLA-DRA) by CD74, a chaperone responsible for antigen presentation.[131]^30 This further supports the hypothesis that M_MP9 captures the signaling cascade via chemokine secretion leading to the activation of APCs. Interferon-signaling pathways There is one antiviral response irMP (M_MP1) from myeloid data and two interferon-mediated viral defense irMPs (L_MP13 and L_MP14) from lymphoid data. To validate these annotations, we used cytokine dictionary curated by Cui et al.,[132]^31 in which the authors have profiled scRNA-seq on 17 different immune cell types extracted from lymph nodes (LNs) of mice treated with 86 different cytokines in vivo or phosphate buffered saline (PBS) as control. As these gene sets are associated with interferon signaling, they should be induced upon stimulation by interferon rather than other cytokines. For each cytokine treatment in each cell-type, we performed differential expression relative to corresponding PBS controls followed by GSEA. Across cell-types, we observed that all three irMPs show a general trend of significantly higher enrichment under interferon treatments (IFN-α1, IFN-β, IFN-ε, IFN-κ, IFN-γ, and IFN-λ2) relative to other cytokine treatments ([133]Figure 2B), indicating that these three irMPs are all signaling pathways downstream of interferon stimulation. Cell adhesion/cell migration There are two cell adhesion (M_MP4 and M_MP6) and two chemotaxis related (M_MP5 and M_MP7) irMPs from myeloid data, and two cell adhesion irMPs (L_MP15 and L_MP16) from lymphoid data. Cell adhesion and migration are essential steps in the immune response, as they facilitate downstream immune cell trafficking, activation, and effector function.[134]^32 In general, the three steps in cell adhesion are rolling, weak, and strong adhesion.[135]^33 Rolling involves selectin molecules (L-selectin, E-selectin, and P-selectin) loosely moving immune cells on endothelial surface. Weak adhesion occurs when integrin molecules (CD49d, CD29, CD11a, CD11b, and CD11c) weakly bind to ligands (ICAM) on endothelial cells, slowing down the immune cell movement. Strong adhesion occurs when integrin molecules firmly anchor the immune cells to the endothelial surface, allowing them to extravasate through the endothelium and into surrounding tissues. If the irMPs are associated with cell-adhesion and chemotaxis, their activity should be concordant with the expression of relevant protein markers discussed previously. To validate the functional annotation of these irMPs, we dichotomized cells in the PBMC CITE-seq data[136]^34 by their RNA-based MP activity levels and examined the abundance levels of relevant proteins (see [137]STAR Methods). We found that, among myeloid cells, the activity of M_MP4 and M_MP6 are associated with the abundance of cell adhesion proteins related to both rolling and weak adhesion ([138]Figure S9A). In particular, myeloid cells with higher M_MP4 have the highest CD62L abundance, suggesting that M_MP4 is driven by L-selectin, a cell adhesion molecule on the surfaces of leukocytes. In contrast, the activity levels of M_MP5 and M_MP7 that are associated with cellular migration are associated with lower abundance of cell adhesion marker proteins ([139]Figure S9A), consistent with their contrasting functional annotation. Among lymphoid cells, we found that L_MP15 shows strong association with integrins (CD49d, CD29, CD11a, CD11b, and CD11c), and L_MP16 with selectins (CD62P and CD62L) ([140]Figure S9B). These two irMPs differ from myeloid-derived cell adhesion irMPs in that L_MP15 and L_MP16 are enriched in HLA genes, suggesting possible cell adhesion through antigen presentation. Core cellular process Besides highly immune-specific gene sets, there are also gene sets related to core cellular processes. In particular, there are three irMPs (L_MP1, 6, and 11) annotated as cell cycle irMPs from lymphoid data. To validate their functions, we used a COVID-19 single cell gene expression atlas,[141]^35 where author identified a cluster of proliferative CD8^+ effector T cells. Since proliferative lymphocytes undergo clonal expansion, we expect cell cycle pathways to be upregulated in this cluster.[142]^28 Indeed, we observed significantly higher gene set activities[143]^26 for all three cell-cycle irMPs in the proliferative lymphocyte cluster ([144]Figure 2C) than in the other clusters. There is one metabolism-related gene set (L_MP9), enriched in glycolysis genes (ENO1, PGK1, ALDOA, PGAM1, TPI1, STMN1, LDHA, HMMR, ENO2, and PPIA). To confirm its relevancy to metabolism, we utilized a study in which authors perform in vitro stimulation of naive CD8^+ T cells and characterized metabolic programs at different activation stages,[145]^36 showing activation of T cells is accompanied with increased metabolism. We observed that the activity of L_MP9 increases as T cells become more activated, which also coincided with the trend of glycolysis and oxidative phosphorylation ([146]Figure 2D). Besides the aforementioned gene sets that represent shared functions in both myeloid and lymphoid compartments, we have also discovered irMPs that are highly specific to a cell type or a contact interface between two immune cells. Leukocyte activation Two of the lymphoid irMPs (L_MP12 and L_MP19) were annotated as lymphocyte activation and T cell activation, respectively. The activation of these pathways should coincide with high expression of protein markers reflective of lymphocyte activation. As COVID-19 induces immune cell activation, we leveraged the COVID-19 PBMC CITE-seq data.[147]^26 We extracted natural killer (NK) and CD8 T cells based on SingleR[148]^37 annotation (see [149]STAR Methods) and observed that NK/T cells with higher L_MP12 activity have higher abundance of CD16, CD56, KLRG1, CD8, and CD69 ([150]Figure 3A), which are known to be expressed by activated T and NK cells. It is also known that CD8 can be expressed by cytotoxic NK cells.[151]^38 Cells with higher L_MP19 activity have higher abundance of CD3, CD4, CD8, and CD69 ([152]Figure 3A), which are known to be expressed by activated T cells. In addition, L_MP12 and L_MP19 appear to be associated with checkpoint proteins, TIGIT, CD152 (CTLA4), and CD279 (PD-1), which also indicate activation and regulation of immune homeostasis.[153]^39 Figure 3. [154]Figure 3 [155]Open in a new tab Functional characterization for the specific irMPs (A) COVID-19 PBMC cells annotated as NK or CD8^+ T cells were dichotomized into two groups (high vs. low) based on the activity levels of 2 lymphoid-derived gene sets: L_MP12 and 19, estimated respectively from the RNA expression data. Shown in the heatmap table (and colored accordingly) are the average Z-transformed protein abundance of surface proteins: CD16, CD56, KLRG1, CD8, CD3, CD4, CD69, TIGIT, CD279, and CD152 in each of the respective cell groups. (B) Shown in violin plots are the abundance of 6 surface proteins: CD16, CD8, CD69, CD152, CD279, TIGIT in two dichotomized COVID-19 PBMC cells (annotated as NK or CD8^+ T cells) groups with respectively low and high L_MP17 activity levels, determined from the RNA expression data. In all the cases, dichotomization was performed using Gaussian mixture models. Activities between two groups were compared using t test with significance level 0.05. (C) Violin plot comparing two TCR-related irMPs across different T cell activation states using ssGSEA. Activities between two groups were compared using t test with significance level of 0.05. (D) Barcode enrichment plot of L_MP3 and L_MP5; All genes from the TCGA are ranked along x axis according to enrichment score (y axis). There is a significant upregulation in both L_MP3 and L_MP5 in samples with a high persistent tumor mutational burden, and the ranked position of each gene within a signature is shown in the x axis. Statistical significance was determined by a rank-based test. Interferon-induced cytotoxicity pathway One lymphoid irMP (L_MP17) was annotated as IFN-gamma induced cytotoxicity. As COVID-19 induces cytotoxicity, we leveraged the COVID-19 PBMC CITE-seq data.[156]^26 Extracting NK and CD8 T cells based on SingleR annotation, we observed that cells with higher L_MP17 activity have significantly higher abundance of CD8, CD16, and CD69 (FDR adjusted p value < 0.05), which are protein markers for cytotoxic CD8 T cells and cytotoxic NK cells ([157]Figure 3B). In addition, higher abundance of TIGIT, CD152 (CTLA4), and CD279 (PD-1) in L_MP17 high cells indicate activation, required for achieving cytotoxicity and regulating immune response to avoid over-activation.[158]^39 The PPI network for this MP suggests the following activity cascade ([159]Figure S7.17): Activated NK cells (NKG7 and GNLY) secrete IFNG, inducing the activation of granzyme-producing (GZMH, GZMB, GZMK, and GZMA) cytotoxic macrophages (LYZ). Moreover, downstream targets of leukocyte activation are also captured by L_MP17, specifically NFKBIA, KLF6, and nuclear receptor genes (NR4A2 and NR4A3), which work harmoniously to ensure T cell homeostasis and proper proliferation.[160]^40^,[161]^41 To further validate the cytotoxicity nature of this irMP, we used a COVID-19 single cell atlas data.[162]^35 We observed that lymphoid cells with higher L_MP17 activities correspond to cells with higher cytotoxicity scores ([163]Figure S9C), calculated using GSDensity with the following genes: PRF1, GZMA, GZMB, GZMH, and GNLY.[164]^35 In addition to cytotoxicity, L_MP17 scored high in the CD8 effector T cell cluster, a cluster defined as highly proliferating T cells in the paper, suggesting that L_MP17 describes not only a cytotoxic but also a proliferative cell state. Indeed, immediately connected to the granzyme modules ([165]Figure S7.17) are TUBA4A and TUBA1C, which are known to be involved in forming and stabilizing microtubules in proliferating T cells during various stages of cell cycling, contributing to spindle formation during mitosis and ensuring proper chromosome segregation and accurate cell division.[166]^42 TCRs Two lymphoid irMPs (L_MP2 and L_MP10) were annotated as TCR-anchoring and lipid localization at TCR synapse, respectively. L_MP2 shows significant enrichment in antigen receptor-mediated signaling pathways ([167]Table S1.2). In addition, L_MP2 is enriched in genes related to T cell surface glycoprotein and transmembrane adapter (CD247, CD2, CD28, CD3D, TRAT1, CD52, and HLA-E) and genes that have key roles in T cell antigen receptor-linked signal transduction pathways (ICOS, SLAMF1, LCK, and LAT)[168]^43^,[169]^44^,[170]^45^,[171]^46 ([172]Figure S7.2). As we expect TCR signaling to be upregulated when T cells are activated, we examined scRNA-seq from T cells isolated from bone marrow (BM) and LNs of healthy donors that were cultured in resting condition vs. activated by anti-CD3/anti-CD28 antibodies.[173]^47 As expected, T cells showed higher L_MP2 and L_MP10 activities, indicating cell-type specificity ([174]Figure 3C). Further, the activities of these gene sets were significantly higher in activated T cells indicating their relevance to TCR engagement upon T cell activation ([175]Figure 3C). Antigen presentation and processing pathway Two of the 19 lymphoid irMPs (L_MP3 and L_MP5) are related to antigen-presentation mediating immunity. We named L_MP3 as histone-associated lipid antigen presentation and L_MP5 as antigen presentation. L_MP3 is enriched in chromatin organization histones genes with lipid antigen presentation genes connected by a V(D)J recombination gene RAG1. Histone genes play an important factor in V(D)J recombination and formation of antigen receptors on lymphocytes.[176]^48 The PPI network of L_MP3 ([177]Figure S7.3) shows a HISTONE gene module and a CD1 gene module connected by RAG1. L_MP5 is enriched in genes responsible for cell cycle regulation (DLGAP5, BUB1B, and CCNB1) and genome stability (TUBB, HMGB4, HNRNPA2B1, PTTG1, and HMGB1), each of which is an important element for V(D)J recombination fidelity[178]^49 (RAG1 and RAG2) ([179]Figure S7.5). To validate the antigen presentation function of L_MP3 and L_MP5, we used pan-cancer mutational and gene-expression data from TCGA.[180]^50 Persistent tumor mutational burden (pTMB), defined as the number of single copy and multiple copy mutations,[181]^51 informs immune activation. Tumors with higher pTMB burdens are more likely to be visible to the immune system and are associated with sustained anti-tumor immune responses and improved response to ICB.[182]^51 These tumors should therefore have higher activity of antigen presentation pathways. We performed differential gene expression analysis (DGE) between TCGA samples with high vs. low pTMB, regardless of cancer type and stage (see [183]STAR Methods). GSEA with 19 lymphoid irMPs was performed. Both L_MP3 and L_MP5 are significantly upregulated among samples with high pTMB ([184]Figure 3D), confirming the antigen presentation function of these two irMPs. Notably, L_MP3 and L_MP5 are the only two irMPs (out of the 19) that are significantly enriched among high pTMB samples in the full GSEA profiles (FDR adjusted p value < 0.05) ([185]Figure S9D). pMHC-TCR contact interface Among myeloid irMPs, M_MP2 was annotated as antigen processing. [186]Figure S10A depicts a typical antigen uptake by an APC, such as, dendritic cell (DC), in which an immature DC (iDC) transforms into a mature DC (mDC) upon detecting pathogen-associated molecular patterns (PAMPs). This activation process involves the DC recognizing PAMPs, leading to upregulation of major histocompatibility complex (MHC) molecules for antigen presentation and the co-stimulatory molecules CD80/CD86, crucial for T cell activation.[187]^52^,[188]^53 Simultaneously, there is a change in cytoskeleton organization, notably the F-actin, to facilitate processing and presentation of the peptide-MHC (p-MHC) complexes on their surface for downstream T cell activation. Concordant with [189]Figure S10A, we observed that APCs (dendritic cells and macrophages, identified by SingleR[190]^37) with higher ssGSEA in M_MP2 also have significantly (p value < 0.05) higher abundance in MHC proteins in the COVID-19 CITE-seq data ([191]Figure S10A), such as HLA-F, HLA-A-B-C, and HLA-DR, suggesting APC activation. In addition, these cells with higher M_MP2 activity also have significantly (p value < 0.05) higher abundance in CD11a/CD18, which facilitate downstream T cell binding, and in CD86, a co-stimulatory molecule for T cells activation. Among lymphoid irMPs, L_MP8 was annotated as MHC mediated immunity. [192]Figure S10B depicts a typical interaction between an mDC and a T cell at the TCR synapse: Upon recognition of the p-MHC complex by TCR, the T cell upregulates CD69, indicating that the T cell has been successfully engaged and activated. Concordant with [193]Figure 3F, we observed that cells with higher M_MP2 activity also have significantly (p value < 0.05) higher abundance in MHC molecules, CD69, and TCRab from COVID-19 CITE-seq data, indicating successful TCR engagement at the immunological synapse. Among myeloid irMP, M_MP8 was annotated as MHC-II-mediated lymphocyte activation because it is enriched in MHC-II markers (HLA-DRB4, HLA-DQB1, HLA-DMA, and HLA-DMB) and TCR complex (TRAC and TRBC1), suggesting MHC-II dependent CD4 T cell activation (as MHC-II presents to helper T cells).[194]^54 To confirm if this gene set is describing functions at the interface between macrophages and CD4 T cells, we extracted macrophages and CD4 T cells based on SingleR annotation and observed that cells with high activity in M_MP8 have higher protein abundance in M2-like macrophages (CD163 and CD206), T cell activation (CD45RO and TCR), and MHC-II (HLA-DR) ([195]Figure S10C), suggesting CD4 T cell activation via MHC-II on M2-like macrophages. Interleukin-induced Tregs There is one lymphoid irMP (L_MP4) annotated as interleukin induced Tregs because it is enriched in Treg activation markers (CTLA4, IL2RA, TIGIT, TNFRSF9, and IL1R2). [196]Figure S10D depicts one of the mechanisms for Treg survival: binding between IL2 and CD25 (IL2RA) promotes the growth and suppressive functions of Tregs, upregulating CD39[197]^55^,[198]^56, eliciting a more suppressive Treg phenotype, defined by upregulation of TNFRSF9 and checkpoint transcripts.[199]^56 Using COVID-19 CITE-seq data, we extracted all Tregs based on SingleR annotation and found that Tregs with higher L_MP4 activity do have significantly (p value < 0.05) higher abundance in IL2 receptors (IL2RA and IL2RB), CD39, CTLA4, TIGIT, and TNFRS9 ([200]Figure S10D), which are all protein markers over-expressed in effector Tregs, suggesting possible interleukin-induced Treg activation. irMPs redefine immune subtypes in TCGA pan-cancer data In TCGA pan-cancer immunity study, tumor samples were previously categorized into six major immune subtypes (C1: wound healing, C2: IFN-gamma dominant, C3: inflammatory, C4: lymphocyte depleted, C5: immunologically quiet, and C6: TGF-beta dominant), based on 160 immune signatures.[201]^57 We performed Uniform Manifold Approximation and Projection (UMAP) based on ssGSEA[202]^27 scores with the 28 irMPs ([203]Figure S11A). Since C4 and C6 are composed of less than 10% and 2% of TCGA samples, respectively, we did not see a clear separation of these two clusters, but C4 is mostly distributed in the bottom left and C6 is mostly distributed in the top right of the UMAP ([204]Figure S11A). Nevertheless, the UMAP well assigned the tumor samples into C1+C2, C3, or C5 subtypes with significant (log rank p value < 0.0001) differential overall survival between C1+C2 and C3+C5 ([205]Figures 4A and 4B), but the overall survival difference between C3 and C5 is insignificant (p value = 0.623), despite having significantly (p value < 2.2e−16) different levels of lymphocyte infiltration. Through k-mean clustering with k = 6 based on the same UMAP, we re-clustered the same set of TCGA samples into 6 new subtypes ([206]Figure 4C), each of which is enriched in different types of cancer ([207]Figure 4D). Figure 4. [208]Figure 4 [209]Open in a new tab irMPs redefine immune subtypes in TCGA pan-cancer data (A) UMAP of subsets of TCGA samples (n = 7,768) using single sample gene-set enrichment (ssGSEA) scores calculated from the 28 irMPs. Each dot is a TCGA sample colored by its immune subtypes defined by Thorsson et al. (B) Kaplan-Meier curves for overall survival between TCGA immune subtypes. Log rank test is performed between comparisons of interests, with corresponding lymphocyte infiltration scores compared on the top right (p value is calculated with one-way ANOVA test). (C) UMAP of all TCGA samples (n = 10,128) using ssGSEA scores calculated from the 28 irMPs. Each dot is a TCGA sample colored by clusters identified using k-mean clustering. (D) Stacked plot showing the proportions of cancer types across k-mean clusters. (E) Sankey River plot showing the shuffling of each TCGA immune subtype into each of the k-mean clusters. (F) Kaplan-Meier overall survival curves across the different k-mean clusters. Log rank test is performed between pairs of interests (2 vs. 5, 2 vs. 4, 4 vs. 5). Risk table is shown below with clusters ordered from best survival (top) to worst survival (bottom). (G) Violin plot comparing tumor intrinsic characteristics between cluster 2 and the rest of the clusters by scoring programs with ssGSEA. Activities between two groups were compared using t test with significance level 0.05. (H) Violin plot comparing the gene expression levels of 3 classical activation and exhaustion markers across the k-mean clusters. Activites across multiple groups were compared using anova with significance level 0.05. Upon re-clustering, C3 and C5 are resolved into new clusters 2, 4, and 5 ([210]Figure 4E) that well delineate patients by their overall survival ([211]Figure 4F), with cluster 2 having the best overall survival. To further study the molecular characteristics of cluster 2, we assessed selected TCGA signature scores on tumor characteristics[212]^57 (see [213]STAR Methods) across our clusters and found that cluster 2 has significantly (p value < 0.05) lower scores, compared to other k-mean clusters in terms of aneuploidy score, neoantigen loads, intratumor heterogeneity, and hallmark pathways indicative of proliferation (mitotic spindle, G2M checkpoint, and MYC targets V1) ([214]Figure 4G), suggesting that cluster 2 contains more indolent tumors, corroborated with the observed low cell proliferation. In addition, cluster 2 has significantly lower expression in cytotoxic markers (GZMK, PRF1, and CD8A) and exhaustion markers (CTLA4, PDCD1, and TIGIT) ([215]Figure 4H), suggesting less immunogenic environment. Cluster 2 is also enriched in lower grade glioma, prostate adenocarcinoma, and thyroid carcinoma ([216]Figure 4D), which are often lower grade cancers with indolent to slower growth[217]^58 and are less likely to invade nearby tissues or metastasize to distant organs. These tumors generally have better prognosis[218]^59 compared to higher-grade cancers. Therefore, the superior survival of cluster 2 is mainly driven by its simple tumor composition, low proliferation, and low immunogenicity, showing that our irMPs can also discern tumor intrinsic differences via their associated immunological states. Clusters 4 and 5 are both characterized by low tumor proliferation and high tumor development ([219]Figure S11B). In term of immune landscape, cluster 4 has significantly (p value < 0.05) higher immune infiltration (Th1 cells and Gamma-Delta T cells) than clusters 2 or 5 and cluster 5 has significantly (p value < 0.05) higher level of activated NK cells compared to clusters 2 or 4 ([220]Figures S11B and S11C). The k-mean algorithm also redistributed C1 and C2 into clusters 1, 3, and 6 ([221]Figure 4E). Specifically, we identified a transitional subtype between C1 and C2 (cluster 3 in [222]Figure 4C)[223]^30^,[224]^31 characterized by in-between expression levels of characteristic IFNG-subtype (C2) markers (IFNG, JAK1, and STAT1)[225]^60 and classical wound-healing (C1) markers (S100A1, E2F1, and APC)[226]^61 ([227]Figure S11D). Cluster 3 has significantly (p value < 0.05) worse overall survival compared to clusters 1 and 6, respectively ([228]Figure S11E), and it is characterized by high tumor proliferation, significantly (p value < 0.05) higher tumor glycolysis and oxidative phosphorylation than cluster 6 ([229]Figure S11B). Clusters 1 and 6 are both characterized by high tumor proliferation ([230]Figure S11B). In term of immune landscape, cluster 1 has significantly (p value < 0.05) higher infiltration of CD8 T cells and M1 macrophages ([231]Figure S11C) than the other clusters, suggesting high level of immune activation, which is related to the observed high expression of exhaustion markers[232]^62 ([233]Figure S11F). Cluster 6 is characterized by significantly (p value < 0.05) lower hallmark immune pathway scores than other clusters ([234]Figure S11B). In summary, our irGSs redefine TCGA subtypes with significantly distinct survival patterns, unveiling not only the intrinsic traits of tumors such as their proliferation and metabolic profiles, but also the dynamic immune activities within the tumor microenvironment. Compared with the immune archetypes defined by Combes et al.,[235]^63 we are defining differential functional states rather than cell type compositions. [236]Figures S11B and S11C contain detailed information on hallmark pathway scores and the relative abundance of immune cell infiltration for each cluster. irMPs better distinguish ICB response We investigated if the irMP expression at baseline could be used to differentiate ICB responders and non-responders. We utilized four ICB melanoma cohorts,[237]^64^,[238]^65^,[239]^66^,[240]^67 details in [241]Table 2. To account for differences in sequencing depth and coverage, we calculated the fragments per kilobase of transcript per million mapped reads (FPKM) for each of the four RNA-seq data and obtained 104 RNA samples at baseline. We calculated ssGSEA score for each RNA expression sample in the integrated ICB cohort using (1) 50 Hallmark pathways,[242]^9 (2) 29 immune-relevant KEGG (irKEGG) pathways,[243]^10 (3) lymphoid irMPs (L_MPs), (4) myeloid irMPs (M_MPs), or (5) lymphoid-myeloid combined irMPs. We randomly divided 104 baseline samples into 70% training (n[train] = 73) and 30% testing (n[test] = 21). To avoid over-fitting, we fitted five generalized linear models (GLM) with LASSO regularization adjusting for the aforementioned gene set variables. Using the best model selected from 10-fold cross-validation, we fitted the model on the test data and evaluated model performance by comparing classification accuracy, which is the average of sensitivity and specificity. To deal with randomness in data splitting, we performed the procedure 1,000 times and computed the range for classification accuracy from each of the five models (see [244]STAR Methods and [245]Figure 5A). Our gene sets resulted in the highest mean accuracy of 0.71, followed by Hallmark (0.69) and irKEGG (0.62), with most of the gene sets being negatively associated with response ([246]Figure S12A). The classification accuracy is comparable (p value = 0.13) between Hallmark and combined irMPs ([247]Figure 5A). However, Hallmark contains 50 gene sets with relatively large (146 ± 67) gene set sizes, whereas the combined irMPs has only 28 gene sets with 50 genes each, possessing comparable predictive power using fewer parameters. We also compared the classification accuracies with the 18-gene tumor inflammation signature (TIS),[248]^68 which has been approved for use as an investigational use-only (IUO) criteria to stratify patients based on potential to respond to ICB, and the classification accuracy with TIS only on the test data is 52.83%. Table 2. Number of samples contained in each of the melanoma ICB cohort GEO ID Disease Treatment Pre (n) Post (n) Responder (n) Non-responder (n) 91061 Metastatic melanoma Anti PD-1 51 58 60 49 115821 Metastatic melanoma Anti PD-1 9 18 1 26 78220 Metastatic melanoma Anti PD-1 25 0 13 12 145996 Metastatic melanoma Anti PD-1 14 0 9 5 Total (n) 99 76 83 92 [249]Open in a new tab Figure 5. [250]Figure 5 [251]Open in a new tab irMPs better separate ICB responses (A) Distribution of ICB classification accuracy in each of the five model across 1,000 iterations. Distribution of accuracies was compared using t test with significance level 0.05. (B) The average fitted coefficients for top 10 selected irMPs across 1,000 iterations. (C) UMAP derived from 10 most selected irMPs activities (left), 50 Hallmarks (middle), and 29 irKEGGs (right) in lung (top), melanoma (middle), and liver (bottom) cancer. Color represents different responses. (D and E) Baseline cell-cell communications for responders (D) and non-responders (E) inferred from cell chat. Nodes are the 11 cell types annotated in the original single cell data, edges represent significant interactions (adjusted p value < 0.05) and the thickness of the edges represents the probability of interactions. To corroborate if our gene sets can better separate ICB response with baseline gene expression only, we extracted the top 10 most selected irMPs across the 1,000 LASSO models ([252]Figure 5B) and calculated their activities at baseline using BulkRNAseq profiled from lung cancer,[253]^69 melanoma,[254]^70 and liver cancer.[255]^71 We again observed that these 10 irMPs can better separate ICB response in comparison to irKEGGs or Hallmarks ([256]Figure 5C). These data indicate that the irMPs can be useful to develop tumor type agnostic general prognostic signatures to predict ICB response. To understand what these irMPs explain, we calculated the ssGSEA for exhaustion, cytotoxicity, and tissue resident memory (T[RM]) using their respective marker gene expression (full marker list detailed in [257]STAR Methods). Computing the correlation between irMPs activity and these signature expressions at baseline ([258]Figure S12B), we observed significant (FDR adjusted p value < 0.05) positive correlation between most of the irMPs and exhaustion, cytotoxicity, and T[RM], suggesting that at baseline non-responders show phenotype of activation induced exhaustion, a result of prior antigen stimulation. To confirm our observation, we deconvoluted each of the bulk RNA sample into exhausted (T[EX]), tissue resident memory (T[RM]), tissue effector memory (T[EM]), naive (T[N]), and effector T cells (T[EFF]) using CibersortX[259]^72 and pan-cancer T cell signature matrix[260]^73 as reference (see [261]STAR Methods). Indeed, non-responders show higher T[EX] and T[RM] and lower T[EM] and T[N] at baseline ([262]Figure S12C). However, when we fit the model with only exhaustion, cytotoxicity and T[RM] score, the classification accuracy is only 54%, suggesting that irMPs have captured addition factors that dictate response to ICB in addition abundance of T cell states. As cellular communication plays an important role in orchestrating the immune response,[263]^74^,[264]^75 we hypothesize that irMPs have also captured differential cellular interactions between responders and non-responders. Utilizing an independent scRNA-seq dataset[265]^76 from baseline melanoma patients later treated with anti-PD1, we converted the data into pseudo-bulk expression (see [266]STAR Methods), calculated activity for selected irMPs (top 10 most frequently selected irMPs from 1,000 LASSO regressions in the previous analysis), Hallmarks and irKEGGs pathways, and projected the 12 samples onto a UMAP. Once again, we observed that irMPs more effectively separated out responders and non-responders ([267]Figure S12D) and these 10 irMPs are also associated with exhaustion, cytotoxicity and T[RM] ([268]Figure S12E). We then performed CellChat[269]^77 analysis (see [270]STAR Methods) to infer cell-cell interactions based on ligand-receptor expression and observed that responders are enriched in interactions surrounding memory T cells, while non-responders are enriched in interactions surrounding exhausted T cells at baseline ([271]Figure 5D). To offer additional evidence of these differential interactions, we used scrublet[272]^78 to identify doublets and double-annotated the doublets (see [273]STAR Methods). Although doublets in droplet-based scRNA-seq data may result from multiple sources, they have been shown to reflect true physical interactions between cell types.[274]^79 Responders were characterized by a limited repertoire of doublet phenotypes with a marked enrichment for doublets with memory T cells. In contrast, non-responders presented with a more varied doublet repertoire with frequent doublets involving various exhausted lymphocytes ([275]Figure S12F). These trends mirror results of the cell-cell communication analysis ([276]Figures 5D and 5E). To confirm whether our irMPs have captured these interactions, we calculated the activities of the 10 important irMPs in each type of doublets at baseline. We observed that the top 10 selected irMPs all have high activities in doublets that are enriched in non-responders at baseline ([277]Figure S13). Taken together, the irMPs may capture interactions between exhausted lymphocytes and other cell types at baseline, which have been associated with poor response in anti-PD1 immunotherapy.[278]^80^,[279]^81^,[280]^82 irMPs granularly segment spatial transcriptomics data irMPs can potentially enhance segmentation and phenotyping of spatially resolved transcriptomic (SRT) data. For illustration, we obtained 10× Visium data from an FFPE human breast tissue section.[281]^83 The H&E slide from the tissue section was segmented by pathologists into major cell types ([282]Figure 6A). We calculated the activity scores of the irMPs from the SRT sample and observed that irMPs consistently isolate out the tumor regions (dark blue), aligning with pathologist’s annotation ([283]Figures 6B and [284]S14). Besides identifying tumor regions, our irMPs can also be characterized by the infiltrated immunological patterns on the H&E image. We used CellTrek[285]^84 to project single cells from a reference breast cancer patient profiled by scRNA-seq in Tumor Immune Single-cell Hub 2 (TISCH2)[286]^85 onto the spatial spots[287]^84 (see [288]STAR Methods).[289]^83 We then visualized the average irMP activity score for each projected cell type in [290]Figure 6C using a clustered dot plot and observed that clustering of irMPs was largely driven by lineages (3 clusters from top to bottom): myeloid, lymphoid, and cycling. On one hand, some irMPs are strongly lineage specific. For example, the “TCR anchoring” score is prominent among T cells, nearly absent in other cell types ([291]Figure 6C). On the other hand, some irMPs also provide functional insight across the TIME, not limited to a cell type. For example, “IFN-gamma induced cytotoxicity” exhibits the highest score in T cells and a notable score in macrophages ([292]Figure 6C), suggesting a co-enforcement feedback loop between activated T/NK cells and cytotoxic macrophages mediated by potent interferon gamma, a hub gene in the PPI network ([293]Figure S7.17), which aligns with the annotation of this irMP. Benchmarking with immune-related KEGG pathways and Hallmarks, we observed that Hallmark pathways are mostly specific to cancer and endothelial regions ([294]Figure S15A) and irKEGG pathways are myeloid-specific ([295]Figure S15B). To examine the generalizability of our results, we repeated the aforementioned experiments on an FFPE human ovarian tumor tissue and an FFPE human intestine tumor tissue using ovarian cancer patients[296]^86 and colorectal cancer[297]^87 patients as single cell references, respectively. We obtained coherent