Graphical abstract

   graphic file with name fx1.jpg
   [42]Open in a new tab

Highlights

     * •
       Computational approaches for detecting paired targets using CRISPR
       screen data
     * •
       Validation of paralog gene pairs' effect on T cell killing using
       CRISPR double knockouts
     * •
       A machine learning model for predicting paralog gene pairs that
       modulate immunotherapy

The bigger picture

   Immunotherapies leverage the patient’s own immune system to fight
   cancer, and emerging therapies in this domain have revolutionized
   cancer treatment; however, resistance and relapse remain challenges.
   Recent efforts to identify combinatorial targets, or pairs of genes
   that could be used together for cancer treatment, are hindered by
   limitations in current high-throughput screening approaches. This type
   of screening often misses promising target pairs. In this study, we
   leverage public genomic datasets to identify and validate gene pairs
   that synergistically enhance effector T cell function, ultimately
   promoting immune-mediated tumor eradication. Our findings underscore
   the overlooked potential of paralog gene pairs in driving functional
   synergy and offer a framework for designing more effective
   combinatorial screen libraries. While this study focuses on cancer
   immunotherapy, the methods we developed for identifying combinatorial
   targets may be broadly applicable to other viability functional
   screening asks, where large-scale datasets are more readily available.
     __________________________________________________________________

   Combinatorial targeting has shown promise in cancer therapies; however,
   the experimental detection of paired functional targets remains
   challenging due to the vast number of possible combinations. In this
   study, the authors introduce a method to repurpose single-gene screen
   data to identify functional paralog pairs that enhance immunotherapy,
   extending the search through machine learning approaches. The authors
   further demonstrate experimentally that synergistic paralog pairs can
   improve immunotherapy efficacy, even when neither individual target is
   effective, a scenario often underestimated.

Introduction

   Emerging immunotherapies, especially immune checkpoint blockade (ICB)
   and adoptive cell therapy, have revolutionized cancer treatment for
   multiple cancer types; however, a significant fraction of patients fail
   to respond to immunotherapies or relapse following treatment.[43]^1
   Recent efforts have focused on using CRISPR knockout (KO) or activation
   screening to identify targets that enhance T cell effector function and
   augment immune killing capability.[44]^2 Nevertheless, overcoming
   resistance by manipulating a single gene remains challenging due to the
   compensatory effects of other genes. Combination therapies have been
   proposed as a promising strategy to overcome tumor resistance to
   monotherapy when single agents are ineffective.[45]^3

   Paralogs, genes that originate from the same ancestors and share
   similar functions, often work in concert to promote cell survival.
   Numerous focused studies have utilized genomic data and pooled
   screening techniques to explore how targeting paralogous genes can
   impact cancer cell viability.[46]^4^,[47]^5^,[48]^6 Recent research has
   shown that targeting these genes simultaneously in cancer cells can
   enhance the effectiveness of immunotherapy by preventing the cells from
   evading T cell-mediated destruction. For instance, Park et al.
   demonstrated that treating mice with both anti-PD-L1 and anti-PD-L2
   antibodies significantly reduced tumor growth in specific mouse
   models.[49]^7 Additionally, signals from both PARP-1 and PARP-2 are
   essential for initiating an effective T cell response against breast
   cancer in mice.[50]^8 While pooled CRISPR screening is popular for
   identifying single genes that modulate specific responses, including
   those relevant to immunotherapy, double-knockout screens aiming to
   identify paired targets remain rare. Recent efforts in double-knockout
   screens, covering dozens to hundreds of gene pairs, have been conducted
   to identify novel gene combinations that lead to synthetic lethality
   and boost T cell function.[51]^5^,[52]^9^,[53]^10 However, a
   comprehensive, high-throughput strategy to screen all potential paralog
   pairs related to immunotherapy is challenging because a very large
   screening library that includes hundreds of thousands of gene
   combinations would be both costly and unfeasible. To solve this
   problem, we have developed a computational approach to identify the
   most promising paralog gene combinations, which can then be used to
   build a focused and manageable library for experimental validation.

   Here, we used an single guide RNA (sgRNA) set enrichment method to
   identify cancer-intrinsic paralog pairs that enhance T cell killing
   from prior pooled screening data. We show that our method identifies
   novel synergistic paralog pairs that are successfully validated
   experimentally. We have also constructed an ensemble learning XGBoost
   classifier to predict true-positive paralog pairs enhancing
   immunotherapy by incorporating features such as gene characteristics,
   sequence and structural similarity, protein-protein interaction (PPI)
   networks, and gene co-evolution data. To enhance the reliability of the
   XGBoost classifier predictions, we conducted further experimental
   validations on the highest-ranked paralog pairs identified. This study
   is expected to uncover novel therapeutic targets and inspire new
   combination immunotherapies to aid cancer treatment.

Results

Identifying cancer-intrinsic paired paralogs that enhance T cell-mediated
killing using CRISPR screen data

   A large proportion of genes have paralogs that perform redundant
   functions. Inactivation of functionally important paralog pairs can
   lead to synthetic lethality, as well as more efficient modulation of
   biological pathways.[54]^9^,[55]^11 Unfortunately the identification of
   these paralog pairs is limited in single-gene perturbation screens due
   to compensation between paralogous genes,[56]^12 and dual gene
   screening is often unfeasible due to the large size of CRISPR libraries
   required to encompass all potential combinations of paralogous gene
   pairs. Current double-knockout screens cover only a limited number of
   genes and require a substantial number of gRNA pairs ([57]Table S1).
   Even the largest paired gene screens to date cover less than 0.1% of
   all possible
   combinations.[58]^5^,[59]^9^,[60]^12^,[61]^13^,[62]^14^,[63]^15^,[64]^1
   6^,[65]^17^,[66]^18 To overcome this limitation, we proposed an
   unbiased computational approach to filter and predict functional gene
   pairs from prior published genome-wide CRISPR screen data.

   We initially attempted to search for immunotherapy-aiding paralog pairs
   using conventional MAGeCK results in Lawson’s CRISPR screen
   dataset,[67]^19 which was originally aimed at identifying genes
   involved in evading T cell-mediated cytotoxicity across six murine
   tumor cell lines. Based on a strict dual-hit gene criterion, we
   observed that few or no suitable pairs identified ([68]Figure S1A);
   however, relaxing the criteria to include single hit genes leads to a
   significant increase in uncertain pairs. To overcome this, we developed
   an enrichment-based methodology named paralog sgRNA set enrichment
   analysis (pSSEA), which integrates the sgRNAs of gene pairs to enhance
   the identification of high-potential paralog pairs. We provide a
   schematic of the pSSEA algorithm in [69]Figure 1B. Briefly, the process
   begins with the results of differential expression analysis (using
   DESeq2, MAGeCK, etc.) as input. The pSSEA evaluates the enrichment
   score by employing a Kolmogorov-Smirnov (KS)-like random walk
   statistic, a method similar to those used in GSVA[70]^20 and
   ASSESS.[71]^21 The detailed steps include
   [MATH: <mrow><mi>ν</mi><mrow><mo>(</mo><mi
   mathvariant="script">l</mi><mo>)</mo></mrow><mo
   linebreak="badbreak">=</mo><mfrac><mrow><munderover><mo>∑</mo><mrow><mi
   >i</mi><mo linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="script">l</mi></munderover><mrow><mspace
   width="0.25em"></mspace><mrow><mo
   stretchy="true">|</mo><msub><mi>r</mi><mi>i</mi></msub><mo
   stretchy="true">|</mo></mrow><mspace
   width="0.25em"></mspace><mi>Ι</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>g</mi><mi>i</mi></msub><mo
   linebreak="badbreak">∈</mo><mi>γ</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mrow><mrow><munderover><mo>∑</mo>
   <mrow><mi>i</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>p</mi></munderover><mro
   w><mspace width="0.25em"></mspace><mrow><mo
   stretchy="true">|</mo><msub><mi>r</mi><mi>i</mi></msub><mo
   stretchy="true">|</mo></mrow><mspace
   width="0.25em"></mspace><mi>Ι</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>g</mi><mi>i</mi></msub><mo
   linebreak="badbreak">∈</mo><mi>γ</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mrow></mfrac><mo
   linebreak="goodbreak">−</mo><mfrac><mrow><munderover><mo>∑</mo><mrow><m
   i>i</mi><mo linebreak="badbreak">=</mo><mn>1</mn></mrow><mi
   mathvariant="script">l</mi></munderover><mrow><mspace
   width="0.25em"></mspace><mrow><mo
   stretchy="true">|</mo><msub><mi>r</mi><mi>i</mi></msub><mo
   stretchy="true">|</mo></mrow><mspace
   width="0.25em"></mspace><mi>Ι</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>g</mi><mi>i</mi></msub><mo
   linebreak="badbreak">∉</mo><mi>γ</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mrow><mrow><mi>p</mi><mo
   linebreak="badbreak">−</mo><mrow><mo>|</mo><mi>γ</mi><mo>|</mo></mrow><
   /mrow></mfrac></mrow> :MATH]
   (Equation 1)

   where
   [MATH: <mrow><msub><mi>r</mi><mi>i</mi></msub></mrow> :MATH]
   is the rank order of sgRNAs, sorted by the log fold changes;
   [MATH: <mrow><mi>γ</mi></mrow> :MATH]
   is the sgRNA set that belongs to the paralog gene pair;
   [MATH: <mrow><mi>Ι</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>g</mi><mi>i</mi></msub><mo
   linebreak="badbreak">∈</mo><mi>γ</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   is the indicator function on whether the i-th sgRNA is in paralog sgRNA
   set
   [MATH: <mrow><mi>γ</mi></mrow> :MATH]
   ,
   [MATH: <mrow><mo>|</mo><mi>γ</mi><mo>|</mo></mrow> :MATH]
   is the number of sgRNAs belonging to the gene pairs, and p is the
   number of total sgRNAs in the library.
   [MATH: <mrow><mi>E</mi><mi>S</mi><mo
   linebreak="badbreak">=</mo><mi>ν</mi><mrow><mo
   stretchy="true">[</mo><mrow><mi
   mathvariant="italic">arg</mi><munder><mi>max</mi><mrow><mi
   mathvariant="script">l</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn><mo>,</mo><mo>.</mo><mo>.</mo><mo>
   .</mo><mo>,</mo><mi>p</mi></mrow></munder><mspace
   width="0.25em"></mspace><mrow><mo>|</mo><mrow><mi>ν</mi><mrow><mo>(</mo
   ><mi
   mathvariant="script">l</mi><mo>)</mo></mrow></mrow><mo>|</mo></mrow><ms
   pace width="0.25em"></mspace></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   (Equation 2)

Figure 1.

   [72]Figure 1
   [73]Open in a new tab

   Identification of paralog pairs that enhance T cell killing using pSSEA

   (A) Protein-coding paralogous genes retrieved from Ensembl Compara
   database using bioMart.

   (B) Workflow for identifying paralog pair hits using the paralog sgRNA
   set enrichment analysis (pSSEA).

   (C) Identification of paralog pairs that enhance T cell killing in six
   cancer models using pSSEA. CRISPR screen data were taken from GEO:
   [74]GSE149933
   . The volcano plot of pSSEA outputs, with the x axis representing the
   enrichment score and the y axis showing the negative logarithm (base
   10) of the p value from pSSEA, is shown.

   (D) Scatterplot illustrating the characteristics of identified
   T cell-enhancing paralog pairs using a single-gene enrichment model.
   The x and y axes represent the negative logarithm (base 10) of the p
   values from the enrichment results of gene1 and gene2, respectively.

   The enrichment scores for paralog pairs were calculated by identifying
   the maximum deviation from zero during a random walk analysis.

   To assess the significance of enrichment, we created an empirical null
   distribution by randomly selecting an equivalent number of
   non-targeting control (NTC) sgRNAs and repeating this process 10,000
   times to ensure robustness. The significance of enrichment for each
   gene pair was then determined by comparing their sgRNA distributions
   against both the positive and negative tails of the NTC null
   distribution.

   We re-analyzed Lawson’s CRISPR screen dataset[75]^19 to identify
   cancer-intrinsic paralog pairs that enhance T cell-mediated killing
   function using the pSSEA method. From the Ensembl Compara database
   (Ensembl, v.102), a total of 139,654 protein-coding gene paralog pairs
   were retrieved, covering 16,202 protein-coding genes, as shown in
   [76]Figure 1A. First, DESeq2 was used to quantify the selective
   depletion of sgRNAs within tumor populations under immune selection
   from co-cultured OT-1 T cells. Subsequently, using the pSSEA framework,
   we identified paralogous gene pairs that potentially influence
   T cell-mediated cytotoxicity based on the differential sgRNA results
   obtained from DESeq2. Our analysis identified 7,310 significant paralog
   pairs (5.2% of 139,654) with p values below 0.05 in B16F10 melanoma
   cell data. Among these, 5,632 pairs were positively enriched,
   suggesting vulnerability to T cell killing, and 1,678 pairs were
   negatively enriched, suggesting resistance to T cell killing
   ([77]Figure 1D). Using these same methods, we identified paralog pairs
   associated with vulnerability and resistance to T cell killing for
   CT26, MC38, 4T1, EMT6, and Renca cell lines, as shown in [78]Figure 1C.

   To investigate the properties of paralog pairs identified as enriched
   through pSSEA, we assessed individual gene enrichment using pSSEA in
   single-gene mode. Among the 5,632 pairs associated with vulnerability
   to T cell killing in B16F10 melanoma cells, 3,620 pairs (64.3%) were
   single hit, with only one gene showing significance; 520 pairs (9.2%)
   were dual hits, with both genes significant; and the remaining 1,492
   pairs (26.5%) showed no hits of either of the genes within the
   identified paralog pair ([79]Figure 1D; [80]Table S2). Notably, we
   observed that over a quarter of the paralog pairs exhibited a
   synergistic effect despite no hits at the single-gene level. This
   observation was confirmed using data from Parrish et al., which
   includes both single knockout (SKO) and double knockout (DKO) screens
   for tumor viability. By analyzing DKO alongside SKO screens using
   MaGeck, we found that out of 354 synthetic lethality pairs identified
   in the DKO screen, 95 were not significant as individual genes in the
   SKO screen (26.8%; [81]Figure S1B). We also observed that pSSEA was
   more effective in identifying synthetic gene pairs compared to
   selecting pairs based on the significance of individual genes or both
   genes in SKO screens ([82]Figure S1C). This aligns with our previous
   findings in [83]Figure 1D that many potential synergistic gene pairs
   may not be significant as individual genes, highlighting a broad
   spectrum of potential targets that might be overlooked by obtaining
   gene pairs according to individual gene results.

Validation of synergistic cancer-intrinsic paralog pairs that, together,
boost T cell killing

   To determine if the gene pairs identified by pSSEA truly function
   synergistically, we focused on the subset of paralog pairs that
   exhibited a synergistic effect despite neither gene being individually
   significant. This subset constituted over 25% of the functional paralog
   pairs identified ([84]Figure 1D). For our experimental validation using
   the B16F10 murine melanoma cell model, we first screened for paralog
   pairs in the B16F10 model, where neither gene showed a significant
   effect individually but demonstrated a significant effect when both
   genes were perturbed together. This ensured that we could efficiently
   validate potential gene pairs in this tumor model, Then, to enhance the
   robustness of our selection, we applied pSSEA to five additional tumor
   models, including CT26, MC38, 4T1, EMT6, and Renca. We aimed to
   identify paralog pairs that consistently enhanced T cell killing across
   multiple models, thereby reducing the likelihood that observed effects
   were model-specific anomalies. We observed that no paralog pair was
   significant across all six models. However, some pairs appeared
   frequently, being significant in up to four models ([85]Figure 2A;
   [86]Table S3). Ultimately, four paralog pairs, Rbm45+Rbms2,
   Ppp2r2a+Ppp2r2d, Elf2+Etv6, and Adam10+Adam15, were selected for
   further experimental assessment of their potential to enhance T cell
   killing efficiency ([87]Figures S1D–S1G).

Figure 2.

   [88]Figure 2
   [89]Open in a new tab

   Synergistic paralog pairs that, together, boost T cell killing function
   in B16F10 tumor model

   (A) The T cell synergistic paralog pairs identified in the B16F10
   melanoma model and the intersection across the other 5 tumor models.
   T cell-enhanced paralog pairs were identified using pSSEA, and
   synergistic pairs were denoted as pairs where neither of the two genes
   could reach significance using a single-gene enrichment approach. Four
   paralog pairs were identified in 4 out of 6 tumor models. Left: UpSet
   plot of overlap synergistic paralog pairs identified from the B16F10
   melanoma model that were also significant in the 5 other cancer cell
   models in the Lawson study.

   (B) Experimental design for the validation of selected paralog pairs
   predicted to enhance T cell killing. Naive CD8a+ T cells were isolated
   from the spleen of OT-I mice and stimulated with anti-mouse CD3/CD28
   antibody for 2 days. Then, B16F10-Cas9-mCherry-Ova gene knockout cells
   were co-cultured with OT-I T cells at an E:T = 1:1 ratio. Tumor cell
   killing were tested at 24 h by flow cytometry.

   (C) Bar chart of mean viability of B16F10 cells with Adam10 and Adam15
   single- or double-gene knockout(s) following culture with or without
   OT-1 T cells. Individual data points are shown as black dots. The p
   values come from two-sided heteroscedastic t tests (ns, p > 0.05, ∗p ≤
   0.05, ∗∗p ≤ 0.01, and ∗∗∗p ≤ 0.001).

   (D) Bar chart of mean viability of Ppp2r2a and Ppp2r2d knockout B16F10
   cells following culture with or without OT-1 T cells.

   (E) Bar chart of mean viability of Elf2 and Etv6 knockout B16F10 cells
   following culture with or without OT-1 T cells.

   (F) Bar chart of mean viability of Rbm45 and Rbms2 knockout B16F10
   cells following culture with or without OT-1 T cells.

   To perform the paired gene KO, we utilized the same gRNAs from the
   original gRNA-library pool used in the CRISPR screen. For the first
   gene within the paralog pair, the targeting gRNA was cloned into a
   GFP-expressing construct, while for the second gene within the paralog
   pair, the targeting gRNA was inserted into a puromycin-expressing
   construct ([90]Figure 2B). To evaluate the synergistic function of each
   gene pair, we established four experimental groups with targeted KOs:
   NTC paired with NTC (NTC-NTC), gene1 KO paired with NTC (gene1KO-NTC),
   gene2 KO paired with NTC (gene2KO-NTC), and a double knockout (DKO) of
   both genes (gene1KO-gene2KO). To generate cells with these specific
   gene targets, we performed dual transduction of B16F10-Cas9-mCherry-OVA
   tumor cells, which express the OVA antigen, using lentiviruses carrying
   the gRNAs with GFP or puromycin markers. GFP-positive cells were sorted
   and then selected for puromycin-resistant cells to obtain either
   single- or double-knockout cells. Subsequently, gene editing
   was verified, and KO efficiency was evaluated using the T7E1 assay. To
   investigate the role of the paired genes in T cell-mediated
   cytotoxicity, we co-cultured B16F10-Cas9-mCherry-OVA tumor cells from
   the four experimental groups described above with OT-I T cells, whose
   T cell receptor (TCR) recognizes the SIINFEKL peptide presented by Ova
   antigen-expressing tumor cells, and assessed tumor cell survival.

   In these co-culture experiments, we found that simultaneous
   inactivation of Adam10 and Adam15 significantly increased
   T cell-mediated tumor cell death, while single knockouts of Adam15 or
   Adam10 did not affect T cell activity ([91]Figures 2C and [92]S1H).
   Similarly, Ppp2r2a+Ppp2r2d double knockouts displayed significantly
   greater vulnerability to T cell cytotoxicity, whereas individual
   knockouts of Ppp2r2a or Ppp2r2d showed only modestly enhanced T cell
   killing ([93]Figures 2D and [94]S1I). The Elf2 and Etv6 combination
   showed significantly enhanced T cell killing compared to the NTC
   control but not compared to the single-gene Elf2 and Etv6 KO cells
   ([95]Figures 2E and [96]S1J). The Rbm45/Rbms2 pair showed no
   discernible impact on T cell efficiency in either single- or
   double-knockout cells compared to the NTC group ([97]Figures 2F and
   [98]S1K). Furthermore, we evaluated the synergistic effects of
   combination targets in response to T-cells mediated killing, synergy
   analysis showed that Adam10+Adam15, Ppp2r2a+Ppp2r2d, and Elf2+Etv6
   demonstrated positive synergy scores across the HSA, Loewe, and delta
   log fold change (dLFC) models (see the [99]methods section and
   [100]Table S4). In total, three out of four predicted paralog pairs
   were successfully confirmed to exhibit increased T cell vulnerability,
   highlighting previously overlooked synergistic gene pairs. These
   findings also provide fundamental experimental evidence for the
   accuracy of using the pSSEA method to effectively identify paralog
   pairs that enhance T cell killing efficacy.

Building an ensemble XGBoost classifier to predict paralog pairs that enhance
T cell-mediated cytotoxic killing

   Using a CRISPR screen dataset, we have demonstrated that paralog pairs
   can work together to enhance T cell-mediated cytotoxicity. However,
   this CRISPR screen dataset was derived from in vitro co-culture of
   cancer cell lines with cytotoxic T cells, an assay with significant
   limitations and potential experimental biases. Therefore, we developed
   a system for deeper characterization of paralog pairs and prediction of
   their potential for enhancing cancer immunotherapy. Utilizing
   T cell-enhancing paralog pairs identified in B16F10 melanoma data with
   pSSEA as true positives and those identified as insignificant as false
   positives, we compiled 32 features representing various aspects of
   paralogous gene interactions. These features include sequence
   characterization, expression abundance and correlation, shared PPIs,
   complex membership, co-evolution, tumor microenvironment (TME)
   associations, perturbation similarities, and combined survival
   associations in patients with cancer ([101]Figure 3A; [102]Table S5).
   Then, an ensemble classifier was trained, utilizing the identified 32
   features, to predict potential paralog pairs. In order to ensure the
   quality of model training and testing, the paralog pairs were filtered
   using several criteria: paralog pairs identified as T cell-enhancing
   pairs or insignificant pairs in B16 data; a sequence BLAST identity
   larger than 30% with an e-value < 1e−5; measurements in structural
   similarity, synthetic lethality, and survival associations; and paralog
   pairs with family sizes of 20 or fewer to prevent overrepresentation of
   specific ortholog families. A total of 3,204 paralog pairs were
   selected from the initial set of 139,654 paralog pairs. Furthermore,
   80% of these 3,204 pairs were randomly assigned to the training dataset
   (n = 2,547), which includes negative pairs (n = 2,274) and positive
   pairs (n = 273). The remaining 20% were saved for internal testing
   later ([103]Figures S2A and S2B). Prior to building the ensemble model,
   our analysis revealed that certain individual features exhibited a
   stronger predictive effect in the training data. The top influential
   feature was gene expression abundance with the highest ROC AUC (area
   under receiver operating characteristic curve) values, indicating that
   gene expression itself could be a critical indicator of a paralog pair
   ([104]Figure 3B). Notably, the features ranked second to fifth were
   four PPI-network features ([105]Figure 3A), highlighting the crucial
   role of gene networks in determining whether two genes could interact
   in an immunotherapy context. Synthetic lethality was identified as a
   mid-level feature, suggesting that the collaborative roles of paralogs
   of cancer cells in response to immune context might differ from their
   roles in cell viability.

Figure 3.

   [106]Figure 3
   [107]Open in a new tab

   Development and evaluation of XGBoost classifier for predicting T cell
   enhancing synergistic paralog pairs

   (A) Schematic for constructing a machine learning classifier to predict
   T cell-enhancing paralog pairs.

   (B) ROC AUC and PRC AUC measures of 32 paralog features for predicting
   paralog pairs that boost T cell killing.

   (C) ROC curves of XGBoost and top individual features in the internal
   test dataset. The XGBoost classifier outperformed the traditional
   individual features.

   (D) Precision recall curves of the XGBoost classifier and individual
   features.

   Next, the XGBoost classifier, an ensemble-boosted tree learner, was
   trained, utilizing 5-fold cross-validation for hyperparameter tuning.
   We then applied the optimal parameters to re-train the classifier using
   the entire training dataset (details in [108]methods). Variable
   importance analysis revealed that complex, co-localization, and PPI
   features were the top-ranking features ([109]Figure S2C). We then
   evaluated the XGBoost classifier’s performance using testing data.
   Notably, the XGBoost classifier significantly outperformed all single
   features used as a baseline for comparison ([110]Figure 3C).
   Specifically, the classifier achieved an AUC of 0.743, surpassing the
   best-performing individual-feature average paralog gene expression in
   cancer cell lines, which had an AUC of 0.612. Considering the
   imbalanced nature of the training dataset, with more negative pairs
   than positive ones, we also assessed the classifier’s performance using
   the precision-recall curve (PRC; [111]Figure 3D). Similarly, the
   XGBoost classifier showed better performance compared to all individual
   features. Overall, the results underscore the enhanced predictive
   capability of our XGBoost classifiers in effectively capturing
   interactions between features and non-linear relationships, thereby
   enhancing T cell killing potential.

   Furthermore, we evaluated the classifier’s performance on paralog pairs
   in cells other than B16F10 cells. ROC analysis revealed that the
   classifier trained on B16F10 cell data could not predict paralog pairs
   enhancing T cell killing in other cell types, except for CT26 colon
   cells ([112]Figures S3A–S3E), indicating inherent heterogeneity among
   cancer cell models.

Interpreting the prediction of immunotherapy-favorable paralog pairs from the
machine learning classifier

   Using the XGBoost classifier, we comprehensively analyzed the
   probability of enhancing T cell cytotoxicity for additional paralog
   pairs outside the training and internal testing sets. By filtering with
   criteria including measurements in structural similarity, synthetic
   lethality, and survival associations and excluding pairs from the
   training and testing sets, we obtained 15,414 paralog pairs out of the
   total 139,654 pairs for prediction. Each paralog pair was calculated
   for its potential impact, assigned a probability score, and ranked
   accordingly. The Shapley additive explanations (SHAP) tree explainer
   method assessed the contribution of each feature to the final
   predictions.

   The paralog pairs were ranked by their prediction scores, and
   subsequent pathway enrichment analysis revealed that genes in the top
   5% of paralog pairs were significantly enriched in functions related to
   protein phosphorylation and ubiquitination modification
   ([113]Figure S4A). The top three ranked predicted paralog pairs were
   Rapgef1+Papgef2, Cct8+Cct3, and Syk+Itk ([114]Figure 4A;
   [115]Table S5). Notably the expression of Syk in cancer cells has shown
   to be associated with both tumor promotion and suppression.[116]^22 Itk
   has been reported to enhance ICB response in solid tumors,[117]^23 and
   Rapgef1 was identified as essential for melanocyte growth through a
   CRISPR proliferation screen.[118]^24 Analysis of The Cancer Genome
   Atlas (TCGA) cohort revealed that four genes (Rapgef1, Rapgef2, Itk,
   and Syk) had mutation rates exceeding 1% in the pan-cancer cohort. At
   the individual cancer-type level, four genes (Cct3, Rapgef1, Rapgef2,
   and Syk) had mutation rates exceeding 5% in highly mutated endometrial
   carcinoma, while in melanoma, three genes (Rapgef1, Rapgef2, and Syk)
   had mutation rates exceeding 3%; Itk had a rate of 7.3%
   ([119]Figure 4B). The SHAP profile revealed differential feature
   attributions among the top pairs. For example, sequence identity and
   paralog family size showed negative contributions to the prediction in
   Rapgef1+Rapgef2 ([120]Figure 4C) and Itk+Syk ([121]Figure 4D) but
   positive contributions in Cct3+Cct8 ([122]Figure S4B).

Figure 4.

   [123]Figure 4
   [124]Open in a new tab

   Visualization and validation of top 3 predicted paralogs from XGBoost
   classifier

   (A) Paralog pairs with the top 3 prediction scores are displayed in a
   scatterplot.

   (B) Mutation profile of top 3 paralog pairs RAPGEF1/RAPGEF2, SYK/ITK,
   and CCT8/CCT3 in The Cancer Genome Atlas (TCGA) cohorts.

   (C and D) SHAP profile for RAPGEF1/RAPGEF2 (C) and SYK/ITK (D).

   (E and F) Bar chart of mean viability of Rapgef1/Rapgef2 (E) and
   Itk/Syk (F) knockout treated B16F10 melanoma cells following culture
   with or without OT-1 T cells. The p values come from two-sided
   heteroscedastic t tests. (ns, p > 0.05, ∗p ≤ 0.05, ∗∗p ≤ 0.01, and ∗∗∗p
   ≤ 0.001).

   (G and H) Gene sets enrichment analysis (GSEA) revealed that patients
   with low expression of RAPGEF1/RAPGEF2 (G) and SYK/ITK (H) were
   associated with downregulation of PD-L1/PD-1 signaling pathway. The
   GSEA was performed using the differential expression output between
   patients with low (<25%) expression of both paralog genes versus all
   remaining patients.

   To evaluate the synergistic effect of the paralog genes identified by
   the XGBoost classifier, we conducted DKO-B16F10-OT1 T cell co-culture
   assays. Knocking out Rapgef1 alone did not affect T cell-mediated tumor
   cell killing, while knocking out Rapgef2 alone increased tumor cell
   resistance to T cell killing. However, a combined KO of Rapgef1 and
   Rapgef2 significantly improved T cell efficiency in killing tumor cells
   ([125]Figures 4E and [126]S4C). A similar pattern was observed with the
   Syk+Itk gene pair: knocking out individual genes increased resistance
   to T cell-mediated killing, but a double knockout enhanced T cell
   killing activity ([127]Figures 4F and [128]S4D). In contrast, double
   knockout of the Cct8+Cct3 gene pair impaired T cell killing ability
   against tumor cells ([129]Figures S4E and S4F). Synergy analysis
   revealed that Rapgef1+Papgef2 and Syk+Itk exhibited positive synergy
   effects (see the [130]methods section and [131]Table S4).

   To further investigate why Cct3/Cct8 might fail while the other two
   pairs work, we performed gene set enrichment analysis (GSEA) on
   TCGA-SKCM data, comparing patients with low expression in both paired
   genes (≤25%) to those with high expression (≥75%). We observed
   significant downregulation of the PD-L1/PD-1 signaling pathway in
   patients with melanoma with low Rapgef1+Rapgef2 ([132]Figure 4G) and
   Itk+Syk ([133]Figure 4H) expression. Downregulation of the PD-L1/PD-1
   signaling pathway augments T cell killing by interrupting inhibitory
   signals that suppress T cell activity. Conversely, patients with low
   Cct3+Cct8 expression ([134]Figure S4G) showed insignificant
   upregulation of this pathway compared to others. This finding suggests
   that the failure of Cct3+Cct8 to enhance T cell cytotoxicity may be due
   to the activation of other immunosuppressive pathways and the elevated
   expression of inhibitory molecules.

Discussion

   Although numerous CRISPR screen studies aim to identify novel cancer
   cell targets for enhancing cancer immunotherapies, few have focused on
   functional gene pairs, thus limiting the development of combination
   strategies. To address this, we developed a computational
   enrichment-based approach, pSSEA, to identify potential paralog gene
   pairs using genome-wide CRISPR screen datasets by combining sgRNAs from
   two paralogs. We demonstrated that pSSEA enables the identification of
   cancer-intrinsic paralog pairs that can synergistically enhance T cell
   killing. Subsequently, we constructed an ensemble-learning XGBoost
   classifier to predict additional cancer-intrinsic, immunotherapy-aiding
   paralog pairs and experimentally tested the top predictions. Notably,
   all analyses in this study were based on CRISPR screen data. We
   observed that only a small subset of the screen-derived
   T cell-enhancing paralog pairs demonstrated a combined effect in
   predicting ICB responses when using patient-derived transcriptomic data
   (data not shown). This suggests that these two data types may capture
   different aspects of immunotherapy-related signals.

   We developed an enrichment-based approach pSSEA to identify key paralog
   gene pairs, building on two main considerations: (1) single-gene
   screens well correlated with combinatorial screens effects from
   previously studies[135]^9^,[136]^14^,[137]^16 and (2) although
   combinatorial targeting remains a compelling area of interest,
   designing a library that comprehensively covers all possible gene pairs
   is technically unfeasible.

Limitations of the study

   Limitations of this study should also be acknowledged. First, the
   paralog pairs selected and used for modeling were not derived from
   actual combinatorial screens targeting immunotherapy. The absence of
   ground-truth data may influence the accurate identification of true
   paired hits, potentially impacting the model’s performance. Secondly,
   we did not incorporate a synergistic effect module, as used in other
   combinatorial screening studies,[138]^13^,[139]^25^,[140]^26 which
   defines synergy as the difference between a gene pair’s observed
   effects and the expected sum of their individual effects. While
   concerns remain about the comparability of enrichment scores generated
   from sgRNA sets of varying sizes, we plan to further investigate this
   issue and evaluate the feasibility of integrating similar
   functionality. Nevertheless, this study has the potential to offer
   valuable insights into the rapidly advancing field of immunotherapy and
   harness the power of well-established single-gene screens to uncover
   meaningful gene pair interactions.

   In summary, we envision our work providing novel methods to leverage
   genome-wide screen data for selecting combination targets in cancer
   immunotherapy. We believe that the computational framework in our study
   could be adapted to prioritize non-paralogous protein-coding gene
   pairs, broadening its applicability beyond paralogs.

Methods

Paralog pair information

   Information on protein-coding paralog pairs was obtained from the
   Ensembl Compara database via biomaRt (hsapiens_gene_ensembl, v.102)
   using custom R scripts. Duplicated paralog pairs were filtered out. A
   total of 139,654 protein-coding gene paralog pairs, covering 16,202
   protein-coding genes, were used for the analyses in this study.

Collect features for paralog pairs

Sequence similarity

   Amino acid sequence of paralog genes were obtained from Ensembl
   (release 102).[141]^27 Paralog sequence similarities were measured
   using BLASTP command line tools.[142]^28 The longest translated peptide
   was used for genes with multiple transcripts. Gene pair sequence
   identity, paralog family size, and whether they are the closest pair in
   the same gene family were included as sequence-related features for
   paralog pairs.

   We also used high-resolution structure PDB files from
   [143]https://alphafold.ebi.ac.uk/.[144]^29 The structural similarity of
   paralogous proteins was measured using template modeling score
   (TM-score) software.[145]^30 The TM-score and root-mean-square
   deviation (RMSD) were included as additional sequence-related features.

Shared gene network, complex, and co-localization

   We collected shared PPI features to characterize the role of paralogs
   in the gene interaction network, including total PPIs, shared PPIs, and
   two statistical measures: the Jaccard index and Fisher’s exact test
   −log10(p value) for shared PPIs. Gene subcellular location data were
   obtained from the Human Protein Atlas (v.22.0, file:
   subcellular_location.tsv). The co-localization of paralogous genes was
   also measured using the Jaccard index. The complex membership of
   paralog pairs and gene-protein complex membership data were obtained
   from CORUM[146]^31 via [147]https://maayanlab.cloud/Harmonizome/.

Gene co-evolution

   Gene ages were obtained from ProteinHistorian
   ([148]https://proteinhistorian.docpollard.org/),[149]^32 and we used
   the average age of the two paralogs as the age for the pair (Wagner age
   reconstruction algorithm). The conservation level of an individual gene
   was calculated as the number of species with orthologs (BLASTP
   e-value < 1-e−5, identity > 30%) out of 269 species from Ensembl
   genomes (v.102). The shared homologs of a gene pair were calculated as
   the number of species having both gene homologs. The ratio of the
   number of non-synonymous substitutions per non-synonymous site to the
   number of synonymous substitutions per synonymous site (Ka/Ks) for gene
   pairs was calculated using the codeml program in the PAML package
   (v.4.10.6).[150]^33^,[151]^34 We adopted the phylogenetic distance
   method developed by Tabach et al. to measure the co-evolution of a pair
   of genes.[152]^35 A non-negative matrix factorization (NMF)-derived
   distance considering phylogenetic tree information was calculated using
   the phylogenetic profiles of the paralogous genes.[153]^36^,[154]^37

Essentiality and synthetic lethality

   Paralogs with available evidence of essentiality from S. pombe and
   S. cerevisiae ortholog data from OGEE v.3 were marked as essential
   pairs.[155]^38 We adopted Kegel’s method to identify synthetic
   lethality pairs[156]^6 using DepMap CRISPR data
   ([157]https://depmap.org/portal, 23Q2 release). Paralog pairs were
   considered synthetic if the A1 gene was an essential gene (Chronos
   score[158]^39 < −0.6 in 1% or 10% of cell lines), and its dependency
   was significantly associated with A2 loss status with a p value less
   than 0.05.

Co-expression pattern

   Gene pair co-expression was measured using Spearman correlation
   coefficient ρ at multiple levels: cancer cell lines, normal tissues,
   and cancer scenarios. Cancer Cell Line Encyclopedia (CCLE) gene
   expression data were downloaded from DepMap.[159]^40 Tissue-specific
   gene expression profiles from healthy donors were downloaded from
   Genotype-Tissue Expression (GTEx; [160]https://gtexportal.org/) public
   releases.[161]^41 Cancer-specific gene expression profiles for 33
   cancer types were downloaded from TCGA GDC Portal
   ([162]https://portal.gdc.cancer.gov/).[163]^42

Perturbation similarity

   Due to the limited number of gene perturbation experiments in
   Connectivity Map (L1000, [164]https://clue.io/data/),[165]^43 we
   conducted pseudo perturbation using TCGA mRNA sequence data to mimic
   the transcriptomic gene expression changes associated with single-gene
   manipulation. Practically, the differential expression profile of
   specific gene inhibition was calculated as the log fold change in gene
   expression between patients with expression levels in the lowest 25th
   percentile versus those in the highest 75th percentile of the query
   gene. The perturbation similarity of paralogous genes was measured
   using recovery AUC with the paralog gene differential expression
   profiles.

Microenvironment associations

   The TME plays an important role in cancer progression and treatment
   outcome. We attempted to uncover features quantifying the association
   of the paralogs with multiple microenvironment indicators. CD8^+ T cell
   infiltration was measured by the single-stranded GSEA (ssGSEA)
   method.[166]^20 We calculated T cell exclusion and dysfunction scores
   using another software, the TIDE algorithm developed by Jiang
   et al.[167]^44 The association with gene pairs and microenvironmental
   characteristics was calculated by Kendall’s rank correlation
   test.[168]^45

Survival associations

   To determine gene pairs with a synergistic effect on survival outcomes,
   we calculated the survival associations to assess whether a paralog
   pair acts in concert to influence cancer prognosis. First, patients
   were categorized into three groups based on the expression levels of
   the paired paralog genes: groups were designated as 1 for both genes
   showing low expression, 3 for both showing high expression, and 2 for
   any other expression combination. The survival association for paralog
   pairs was then calculated using the formula
   [MATH: <mrow><mtext>survival</mtext><mspace
   width="0.25em"></mspace><mtext>association</mtext><mo
   linebreak="badbreak">=</mo><mo
   linebreak="goodbreak">−</mo><mi>log</mi><mspace
   width="0.25em"></mspace><mn>10</mn><mrow><mo>(</mo><mi>p</mi><mo>)</mo>
   </mrow><mspace width="0.25em"></mspace><mo>∗</mo><mspace
   width="0.25em"></mspace><mtext>sign</mtext><mrow><mo>(</mo><mi
   mathvariant="normal">β</mi><mo>_</mo><mtext>gene</mtext><mn>1</mn><mo>∗
   </mo><mi
   mathvariant="normal">β</mi><mo>_</mo><mtext>gene</mtext><mn>2</mn><mo>)
   </mo></mrow><mtext>.</mtext></mrow> :MATH]

   In this formula, p refers to the p value from the Cox regression
   analysis of the paralog group, adjusted for sex and age, while β_gene1
   and β_gene2 represent the exponentiated coefficients of the individual
   genes from the Cox proportional hazards model. Clinical data and
   genomics data for TCGA-SKCM were downloaded from the GDC Portal
   ([169]https://portal.gdc.cancer.gov/).[170]^46

XGBoost ensemble classifier

   The popular supervised-learning algorithm XGBoost—a gradient boosting
   ensemble learner with regularization parameters—was used in this study
   (implementation in the XGBoost package).[171]^47 We performed a grid
   search based on 5-fold cross-validation of our paralog pair dataset to
   explore the space of potential hyperparameters for the XGBoost
   classifier. The parameters used in the final model are as follows:
   objective = reg: logistic, n_estimators = 400, max_depth = 10,
   min_child_weight = 1, min_split_loss = 0.1, subsample = 1,
   colsample_bytree = 1, learning_rate = 0.01, and random_state = 8 (for
   reproducibility). All parameters not listed used default settings.

Performance evaluation

   The ROC AUC and PRC AUC for the feature values as well as the
   classifiers were computed to evaluate the performance of individual
   features and trained models.

Lentivirus purification

   For lentivirus production, 20 μg of lenti-U6-sgRNA-EFS-GFP-WPRE or
   20 μg of lenti-U6-sgRNA-EFS-Puro-WPRE, 10 μg of pMD2.G, and 15 μg of
   psPAX2 were co-transfected into LentiX-293 cells plated in a 150 mm
   dish at 80%–90% confluency using 130 μg polyethyleneimine (PEI). 6 h
   later, the media were replaced with fresh DMEM+10% FBS. Virus
   supernatant was collected 48 h post-transfection and centrifuged at
   1,500g for 15 min to remove the cell debris. The virus supernatant was
   concentrated by Lenti-X concentrator at 4°C for 30 min (1 vol of
   Lenti-X concentrator with 3 vol of supernatant), followed by
   centrifugation at 1,500g for 45 min at 4°C. Finally, the virus was
   resuspended in DMEM, aliquoted, and stored at −80°C.

Generation of single-gene-knockout or double-gene-knockout cells

   B16F10 cells stably expressing Cas9 were generated by transducing
   B16F10 cells with lentiviral EF1a-NLS-Cas9-2A-Blast-WPRE, followed by
   5 days of selection under 20 μg/mL blasticidin. B16F10-Cas9 cells were
   further transfected with lentiviral EF1a-mCherry-2A-OVA-WPRE and sorted
   for mCherry-positive cells to generate B16f10-Cas9-mCherry-OVA cells.
   The two gRNAs targeting each gene within the paralog pair were
   transduced into the B16F10-Cas9-mCherry-Ova cells simultaneously. gRNA
   virus-infected cells were cultured at 37°C for more than 24 h, and then
   GFP-positive cells were sorted and selected under 5 μg/mL puromycin for
   3 days to generate the double-knockout cells.

T7E1 assay

   T7E1 was used to estimate gRNA cutting efficiency. In brief, gDNA was
   extracted by using Genomic DNA prep with QuickExtract (QE) buffer. PCR
   amplification of the genomic regions flanking the CRISPR RNA (crRNA)
   was performed using the primers listed in [172]Table S3. Using Phusion
   Flash High Fidelity Master Mix (Thermo Fisher Scientific), the
   thermocycling parameters for PCR were 98°C for 1 min; 35 cycles of 98°C
   for 1 s, 60°C for 5 s, and 72°C for 15 s); and 72°C for 2 min. The PCR
   amplicons were then used for T7E1 assays according to the
   manufacturer’s protocol.

Naive OT-I CD8a+ T cell isolation and co-culture assay

   The naive OT-I CD8a+ T cells were isolated using the Naive CD8a+ T Cell
   Isolation Kit (Miltenyi Biotec) following the manufacturer’s
   instructions. Naive CD8a+ T cells were isolated from the spleen of OT-I
   mice and stimulated with anti-mouse CD3/CD28 antibody for 2 days. For
   the tumor cell and OT-I T cell co-culture assay,
   B16F10-Cas9-mCherry-Ova gene KO cells were co-cultured with OT1 cells
   at effector-to-target cell ratios (E:T) = 1:1 ratios. Tumor cell
   killing was tested at 24 h by flow cytometry. LIVE/DEAD near-infrared
   (IR) was diluted at 1:1,000 to identify the dead cells. Tumor cells
   were identified as mCherry-positive cells.

Synergy analysis

   Statistical analysis for evaluating synergistic interactions between
   paralog pairs was calculated by the SynergyFinder R package[173]^48
   using four models, including the highest single agent (HSA), Bliss,
   Loewe, and zero interaction potency (ZIP) models. The p values for the
   average synergy score were computed using bootstrapping of the
   dose-response matrix. We also adopted the dLFC method to measure the
   additive effect of gene pairs.[174]^13 The dLFC was calculated as the
   observed mean relative killing rate of the DKO minus the expected sum
   of the relative killing rates of the individual SKO. The relative
   killing rate was determined by subtracting the values from the NTC
   group.

Resource availability

Lead contact

   Requests for information and resources used in this article should be
   addressed to the lead contact, Sidi Chen (sidi.chen@yale.edu).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

   The source code for processed data and modeling is available under
   GitHub ([175]https://github.com/cpdong/imParalog) and Zenodo
   ([176]https://doi.org/10.5281/zenodo.14759518)[177]^49 and included in
   this article and its [178]supplemental information files. An
   interactive web application of this study is deployed at
   [179]https://sidichenlab.shinyapps.io/imparalog for free visiting and
   querying. Any additional information required to re-analyze the data
   reported in this paper is available from the lead contact upon request.

Acknowledgments