Graphical abstract
graphic file with name fx1.jpg
[42]Open in a new tab
Highlights
* •
Computational approaches for detecting paired targets using CRISPR
screen data
* •
Validation of paralog gene pairs' effect on T cell killing using
CRISPR double knockouts
* •
A machine learning model for predicting paralog gene pairs that
modulate immunotherapy
The bigger picture
Immunotherapies leverage the patient’s own immune system to fight
cancer, and emerging therapies in this domain have revolutionized
cancer treatment; however, resistance and relapse remain challenges.
Recent efforts to identify combinatorial targets, or pairs of genes
that could be used together for cancer treatment, are hindered by
limitations in current high-throughput screening approaches. This type
of screening often misses promising target pairs. In this study, we
leverage public genomic datasets to identify and validate gene pairs
that synergistically enhance effector T cell function, ultimately
promoting immune-mediated tumor eradication. Our findings underscore
the overlooked potential of paralog gene pairs in driving functional
synergy and offer a framework for designing more effective
combinatorial screen libraries. While this study focuses on cancer
immunotherapy, the methods we developed for identifying combinatorial
targets may be broadly applicable to other viability functional
screening asks, where large-scale datasets are more readily available.
__________________________________________________________________
Combinatorial targeting has shown promise in cancer therapies; however,
the experimental detection of paired functional targets remains
challenging due to the vast number of possible combinations. In this
study, the authors introduce a method to repurpose single-gene screen
data to identify functional paralog pairs that enhance immunotherapy,
extending the search through machine learning approaches. The authors
further demonstrate experimentally that synergistic paralog pairs can
improve immunotherapy efficacy, even when neither individual target is
effective, a scenario often underestimated.
Introduction
Emerging immunotherapies, especially immune checkpoint blockade (ICB)
and adoptive cell therapy, have revolutionized cancer treatment for
multiple cancer types; however, a significant fraction of patients fail
to respond to immunotherapies or relapse following treatment.[43]^1
Recent efforts have focused on using CRISPR knockout (KO) or activation
screening to identify targets that enhance T cell effector function and
augment immune killing capability.[44]^2 Nevertheless, overcoming
resistance by manipulating a single gene remains challenging due to the
compensatory effects of other genes. Combination therapies have been
proposed as a promising strategy to overcome tumor resistance to
monotherapy when single agents are ineffective.[45]^3
Paralogs, genes that originate from the same ancestors and share
similar functions, often work in concert to promote cell survival.
Numerous focused studies have utilized genomic data and pooled
screening techniques to explore how targeting paralogous genes can
impact cancer cell viability.[46]^4^,[47]^5^,[48]^6 Recent research has
shown that targeting these genes simultaneously in cancer cells can
enhance the effectiveness of immunotherapy by preventing the cells from
evading T cell-mediated destruction. For instance, Park et al.
demonstrated that treating mice with both anti-PD-L1 and anti-PD-L2
antibodies significantly reduced tumor growth in specific mouse
models.[49]^7 Additionally, signals from both PARP-1 and PARP-2 are
essential for initiating an effective T cell response against breast
cancer in mice.[50]^8 While pooled CRISPR screening is popular for
identifying single genes that modulate specific responses, including
those relevant to immunotherapy, double-knockout screens aiming to
identify paired targets remain rare. Recent efforts in double-knockout
screens, covering dozens to hundreds of gene pairs, have been conducted
to identify novel gene combinations that lead to synthetic lethality
and boost T cell function.[51]^5^,[52]^9^,[53]^10 However, a
comprehensive, high-throughput strategy to screen all potential paralog
pairs related to immunotherapy is challenging because a very large
screening library that includes hundreds of thousands of gene
combinations would be both costly and unfeasible. To solve this
problem, we have developed a computational approach to identify the
most promising paralog gene combinations, which can then be used to
build a focused and manageable library for experimental validation.
Here, we used an single guide RNA (sgRNA) set enrichment method to
identify cancer-intrinsic paralog pairs that enhance T cell killing
from prior pooled screening data. We show that our method identifies
novel synergistic paralog pairs that are successfully validated
experimentally. We have also constructed an ensemble learning XGBoost
classifier to predict true-positive paralog pairs enhancing
immunotherapy by incorporating features such as gene characteristics,
sequence and structural similarity, protein-protein interaction (PPI)
networks, and gene co-evolution data. To enhance the reliability of the
XGBoost classifier predictions, we conducted further experimental
validations on the highest-ranked paralog pairs identified. This study
is expected to uncover novel therapeutic targets and inspire new
combination immunotherapies to aid cancer treatment.
Results
Identifying cancer-intrinsic paired paralogs that enhance T cell-mediated
killing using CRISPR screen data
A large proportion of genes have paralogs that perform redundant
functions. Inactivation of functionally important paralog pairs can
lead to synthetic lethality, as well as more efficient modulation of
biological pathways.[54]^9^,[55]^11 Unfortunately the identification of
these paralog pairs is limited in single-gene perturbation screens due
to compensation between paralogous genes,[56]^12 and dual gene
screening is often unfeasible due to the large size of CRISPR libraries
required to encompass all potential combinations of paralogous gene
pairs. Current double-knockout screens cover only a limited number of
genes and require a substantial number of gRNA pairs ([57]Table S1).
Even the largest paired gene screens to date cover less than 0.1% of
all possible
combinations.[58]^5^,[59]^9^,[60]^12^,[61]^13^,[62]^14^,[63]^15^,[64]^1
6^,[65]^17^,[66]^18 To overcome this limitation, we proposed an
unbiased computational approach to filter and predict functional gene
pairs from prior published genome-wide CRISPR screen data.
We initially attempted to search for immunotherapy-aiding paralog pairs
using conventional MAGeCK results in Lawson’s CRISPR screen
dataset,[67]^19 which was originally aimed at identifying genes
involved in evading T cell-mediated cytotoxicity across six murine
tumor cell lines. Based on a strict dual-hit gene criterion, we
observed that few or no suitable pairs identified ([68]Figure S1A);
however, relaxing the criteria to include single hit genes leads to a
significant increase in uncertain pairs. To overcome this, we developed
an enrichment-based methodology named paralog sgRNA set enrichment
analysis (pSSEA), which integrates the sgRNAs of gene pairs to enhance
the identification of high-potential paralog pairs. We provide a
schematic of the pSSEA algorithm in [69]Figure 1B. Briefly, the process
begins with the results of differential expression analysis (using
DESeq2, MAGeCK, etc.) as input. The pSSEA evaluates the enrichment
score by employing a Kolmogorov-Smirnov (KS)-like random walk
statistic, a method similar to those used in GSVA[70]^20 and
ASSESS.[71]^21 The detailed steps include
[MATH: ν(l)=∑i=1l|ri|Ι(gi∈γ)∑
i=1p|ri|Ι(gi∈γ)−∑i=1l|ri|Ι(gi∉γ)p−|γ|<
/mrow> :MATH]
(Equation 1)
where
[MATH: ri :MATH]
is the rank order of sgRNAs, sorted by the log fold changes;
[MATH: γ :MATH]
is the sgRNA set that belongs to the paralog gene pair;
[MATH: Ι(gi∈γ) :MATH]
is the indicator function on whether the i-th sgRNA is in paralog sgRNA
set
[MATH: γ :MATH]
,
[MATH: |γ| :MATH]
is the number of sgRNAs belonging to the gene pairs, and p is the
number of total sgRNAs in the library.
[MATH: ES=ν[argmaxl=1,..
.,p|ν(l)|] :MATH]
(Equation 2)
Figure 1.
[72]Figure 1
[73]Open in a new tab
Identification of paralog pairs that enhance T cell killing using pSSEA
(A) Protein-coding paralogous genes retrieved from Ensembl Compara
database using bioMart.
(B) Workflow for identifying paralog pair hits using the paralog sgRNA
set enrichment analysis (pSSEA).
(C) Identification of paralog pairs that enhance T cell killing in six
cancer models using pSSEA. CRISPR screen data were taken from GEO:
[74]GSE149933
. The volcano plot of pSSEA outputs, with the x axis representing the
enrichment score and the y axis showing the negative logarithm (base
10) of the p value from pSSEA, is shown.
(D) Scatterplot illustrating the characteristics of identified
T cell-enhancing paralog pairs using a single-gene enrichment model.
The x and y axes represent the negative logarithm (base 10) of the p
values from the enrichment results of gene1 and gene2, respectively.
The enrichment scores for paralog pairs were calculated by identifying
the maximum deviation from zero during a random walk analysis.
To assess the significance of enrichment, we created an empirical null
distribution by randomly selecting an equivalent number of
non-targeting control (NTC) sgRNAs and repeating this process 10,000
times to ensure robustness. The significance of enrichment for each
gene pair was then determined by comparing their sgRNA distributions
against both the positive and negative tails of the NTC null
distribution.
We re-analyzed Lawson’s CRISPR screen dataset[75]^19 to identify
cancer-intrinsic paralog pairs that enhance T cell-mediated killing
function using the pSSEA method. From the Ensembl Compara database
(Ensembl, v.102), a total of 139,654 protein-coding gene paralog pairs
were retrieved, covering 16,202 protein-coding genes, as shown in
[76]Figure 1A. First, DESeq2 was used to quantify the selective
depletion of sgRNAs within tumor populations under immune selection
from co-cultured OT-1 T cells. Subsequently, using the pSSEA framework,
we identified paralogous gene pairs that potentially influence
T cell-mediated cytotoxicity based on the differential sgRNA results
obtained from DESeq2. Our analysis identified 7,310 significant paralog
pairs (5.2% of 139,654) with p values below 0.05 in B16F10 melanoma
cell data. Among these, 5,632 pairs were positively enriched,
suggesting vulnerability to T cell killing, and 1,678 pairs were
negatively enriched, suggesting resistance to T cell killing
([77]Figure 1D). Using these same methods, we identified paralog pairs
associated with vulnerability and resistance to T cell killing for
CT26, MC38, 4T1, EMT6, and Renca cell lines, as shown in [78]Figure 1C.
To investigate the properties of paralog pairs identified as enriched
through pSSEA, we assessed individual gene enrichment using pSSEA in
single-gene mode. Among the 5,632 pairs associated with vulnerability
to T cell killing in B16F10 melanoma cells, 3,620 pairs (64.3%) were
single hit, with only one gene showing significance; 520 pairs (9.2%)
were dual hits, with both genes significant; and the remaining 1,492
pairs (26.5%) showed no hits of either of the genes within the
identified paralog pair ([79]Figure 1D; [80]Table S2). Notably, we
observed that over a quarter of the paralog pairs exhibited a
synergistic effect despite no hits at the single-gene level. This
observation was confirmed using data from Parrish et al., which
includes both single knockout (SKO) and double knockout (DKO) screens
for tumor viability. By analyzing DKO alongside SKO screens using
MaGeck, we found that out of 354 synthetic lethality pairs identified
in the DKO screen, 95 were not significant as individual genes in the
SKO screen (26.8%; [81]Figure S1B). We also observed that pSSEA was
more effective in identifying synthetic gene pairs compared to
selecting pairs based on the significance of individual genes or both
genes in SKO screens ([82]Figure S1C). This aligns with our previous
findings in [83]Figure 1D that many potential synergistic gene pairs
may not be significant as individual genes, highlighting a broad
spectrum of potential targets that might be overlooked by obtaining
gene pairs according to individual gene results.
Validation of synergistic cancer-intrinsic paralog pairs that, together,
boost T cell killing
To determine if the gene pairs identified by pSSEA truly function
synergistically, we focused on the subset of paralog pairs that
exhibited a synergistic effect despite neither gene being individually
significant. This subset constituted over 25% of the functional paralog
pairs identified ([84]Figure 1D). For our experimental validation using
the B16F10 murine melanoma cell model, we first screened for paralog
pairs in the B16F10 model, where neither gene showed a significant
effect individually but demonstrated a significant effect when both
genes were perturbed together. This ensured that we could efficiently
validate potential gene pairs in this tumor model, Then, to enhance the
robustness of our selection, we applied pSSEA to five additional tumor
models, including CT26, MC38, 4T1, EMT6, and Renca. We aimed to
identify paralog pairs that consistently enhanced T cell killing across
multiple models, thereby reducing the likelihood that observed effects
were model-specific anomalies. We observed that no paralog pair was
significant across all six models. However, some pairs appeared
frequently, being significant in up to four models ([85]Figure 2A;
[86]Table S3). Ultimately, four paralog pairs, Rbm45+Rbms2,
Ppp2r2a+Ppp2r2d, Elf2+Etv6, and Adam10+Adam15, were selected for
further experimental assessment of their potential to enhance T cell
killing efficiency ([87]Figures S1D–S1G).
Figure 2.
[88]Figure 2
[89]Open in a new tab
Synergistic paralog pairs that, together, boost T cell killing function
in B16F10 tumor model
(A) The T cell synergistic paralog pairs identified in the B16F10
melanoma model and the intersection across the other 5 tumor models.
T cell-enhanced paralog pairs were identified using pSSEA, and
synergistic pairs were denoted as pairs where neither of the two genes
could reach significance using a single-gene enrichment approach. Four
paralog pairs were identified in 4 out of 6 tumor models. Left: UpSet
plot of overlap synergistic paralog pairs identified from the B16F10
melanoma model that were also significant in the 5 other cancer cell
models in the Lawson study.
(B) Experimental design for the validation of selected paralog pairs
predicted to enhance T cell killing. Naive CD8a+ T cells were isolated
from the spleen of OT-I mice and stimulated with anti-mouse CD3/CD28
antibody for 2 days. Then, B16F10-Cas9-mCherry-Ova gene knockout cells
were co-cultured with OT-I T cells at an E:T = 1:1 ratio. Tumor cell
killing were tested at 24 h by flow cytometry.
(C) Bar chart of mean viability of B16F10 cells with Adam10 and Adam15
single- or double-gene knockout(s) following culture with or without
OT-1 T cells. Individual data points are shown as black dots. The p
values come from two-sided heteroscedastic t tests (ns, p > 0.05, ∗p ≤
0.05, ∗∗p ≤ 0.01, and ∗∗∗p ≤ 0.001).
(D) Bar chart of mean viability of Ppp2r2a and Ppp2r2d knockout B16F10
cells following culture with or without OT-1 T cells.
(E) Bar chart of mean viability of Elf2 and Etv6 knockout B16F10 cells
following culture with or without OT-1 T cells.
(F) Bar chart of mean viability of Rbm45 and Rbms2 knockout B16F10
cells following culture with or without OT-1 T cells.
To perform the paired gene KO, we utilized the same gRNAs from the
original gRNA-library pool used in the CRISPR screen. For the first
gene within the paralog pair, the targeting gRNA was cloned into a
GFP-expressing construct, while for the second gene within the paralog
pair, the targeting gRNA was inserted into a puromycin-expressing
construct ([90]Figure 2B). To evaluate the synergistic function of each
gene pair, we established four experimental groups with targeted KOs:
NTC paired with NTC (NTC-NTC), gene1 KO paired with NTC (gene1KO-NTC),
gene2 KO paired with NTC (gene2KO-NTC), and a double knockout (DKO) of
both genes (gene1KO-gene2KO). To generate cells with these specific
gene targets, we performed dual transduction of B16F10-Cas9-mCherry-OVA
tumor cells, which express the OVA antigen, using lentiviruses carrying
the gRNAs with GFP or puromycin markers. GFP-positive cells were sorted
and then selected for puromycin-resistant cells to obtain either
single- or double-knockout cells. Subsequently, gene editing
was verified, and KO efficiency was evaluated using the T7E1 assay. To
investigate the role of the paired genes in T cell-mediated
cytotoxicity, we co-cultured B16F10-Cas9-mCherry-OVA tumor cells from
the four experimental groups described above with OT-I T cells, whose
T cell receptor (TCR) recognizes the SIINFEKL peptide presented by Ova
antigen-expressing tumor cells, and assessed tumor cell survival.
In these co-culture experiments, we found that simultaneous
inactivation of Adam10 and Adam15 significantly increased
T cell-mediated tumor cell death, while single knockouts of Adam15 or
Adam10 did not affect T cell activity ([91]Figures 2C and [92]S1H).
Similarly, Ppp2r2a+Ppp2r2d double knockouts displayed significantly
greater vulnerability to T cell cytotoxicity, whereas individual
knockouts of Ppp2r2a or Ppp2r2d showed only modestly enhanced T cell
killing ([93]Figures 2D and [94]S1I). The Elf2 and Etv6 combination
showed significantly enhanced T cell killing compared to the NTC
control but not compared to the single-gene Elf2 and Etv6 KO cells
([95]Figures 2E and [96]S1J). The Rbm45/Rbms2 pair showed no
discernible impact on T cell efficiency in either single- or
double-knockout cells compared to the NTC group ([97]Figures 2F and
[98]S1K). Furthermore, we evaluated the synergistic effects of
combination targets in response to T-cells mediated killing, synergy
analysis showed that Adam10+Adam15, Ppp2r2a+Ppp2r2d, and Elf2+Etv6
demonstrated positive synergy scores across the HSA, Loewe, and delta
log fold change (dLFC) models (see the [99]methods section and
[100]Table S4). In total, three out of four predicted paralog pairs
were successfully confirmed to exhibit increased T cell vulnerability,
highlighting previously overlooked synergistic gene pairs. These
findings also provide fundamental experimental evidence for the
accuracy of using the pSSEA method to effectively identify paralog
pairs that enhance T cell killing efficacy.
Building an ensemble XGBoost classifier to predict paralog pairs that enhance
T cell-mediated cytotoxic killing
Using a CRISPR screen dataset, we have demonstrated that paralog pairs
can work together to enhance T cell-mediated cytotoxicity. However,
this CRISPR screen dataset was derived from in vitro co-culture of
cancer cell lines with cytotoxic T cells, an assay with significant
limitations and potential experimental biases. Therefore, we developed
a system for deeper characterization of paralog pairs and prediction of
their potential for enhancing cancer immunotherapy. Utilizing
T cell-enhancing paralog pairs identified in B16F10 melanoma data with
pSSEA as true positives and those identified as insignificant as false
positives, we compiled 32 features representing various aspects of
paralogous gene interactions. These features include sequence
characterization, expression abundance and correlation, shared PPIs,
complex membership, co-evolution, tumor microenvironment (TME)
associations, perturbation similarities, and combined survival
associations in patients with cancer ([101]Figure 3A; [102]Table S5).
Then, an ensemble classifier was trained, utilizing the identified 32
features, to predict potential paralog pairs. In order to ensure the
quality of model training and testing, the paralog pairs were filtered
using several criteria: paralog pairs identified as T cell-enhancing
pairs or insignificant pairs in B16 data; a sequence BLAST identity
larger than 30% with an e-value < 1e−5; measurements in structural
similarity, synthetic lethality, and survival associations; and paralog
pairs with family sizes of 20 or fewer to prevent overrepresentation of
specific ortholog families. A total of 3,204 paralog pairs were
selected from the initial set of 139,654 paralog pairs. Furthermore,
80% of these 3,204 pairs were randomly assigned to the training dataset
(n = 2,547), which includes negative pairs (n = 2,274) and positive
pairs (n = 273). The remaining 20% were saved for internal testing
later ([103]Figures S2A and S2B). Prior to building the ensemble model,
our analysis revealed that certain individual features exhibited a
stronger predictive effect in the training data. The top influential
feature was gene expression abundance with the highest ROC AUC (area
under receiver operating characteristic curve) values, indicating that
gene expression itself could be a critical indicator of a paralog pair
([104]Figure 3B). Notably, the features ranked second to fifth were
four PPI-network features ([105]Figure 3A), highlighting the crucial
role of gene networks in determining whether two genes could interact
in an immunotherapy context. Synthetic lethality was identified as a
mid-level feature, suggesting that the collaborative roles of paralogs
of cancer cells in response to immune context might differ from their
roles in cell viability.
Figure 3.
[106]Figure 3
[107]Open in a new tab
Development and evaluation of XGBoost classifier for predicting T cell
enhancing synergistic paralog pairs
(A) Schematic for constructing a machine learning classifier to predict
T cell-enhancing paralog pairs.
(B) ROC AUC and PRC AUC measures of 32 paralog features for predicting
paralog pairs that boost T cell killing.
(C) ROC curves of XGBoost and top individual features in the internal
test dataset. The XGBoost classifier outperformed the traditional
individual features.
(D) Precision recall curves of the XGBoost classifier and individual
features.
Next, the XGBoost classifier, an ensemble-boosted tree learner, was
trained, utilizing 5-fold cross-validation for hyperparameter tuning.
We then applied the optimal parameters to re-train the classifier using
the entire training dataset (details in [108]methods). Variable
importance analysis revealed that complex, co-localization, and PPI
features were the top-ranking features ([109]Figure S2C). We then
evaluated the XGBoost classifier’s performance using testing data.
Notably, the XGBoost classifier significantly outperformed all single
features used as a baseline for comparison ([110]Figure 3C).
Specifically, the classifier achieved an AUC of 0.743, surpassing the
best-performing individual-feature average paralog gene expression in
cancer cell lines, which had an AUC of 0.612. Considering the
imbalanced nature of the training dataset, with more negative pairs
than positive ones, we also assessed the classifier’s performance using
the precision-recall curve (PRC; [111]Figure 3D). Similarly, the
XGBoost classifier showed better performance compared to all individual
features. Overall, the results underscore the enhanced predictive
capability of our XGBoost classifiers in effectively capturing
interactions between features and non-linear relationships, thereby
enhancing T cell killing potential.
Furthermore, we evaluated the classifier’s performance on paralog pairs
in cells other than B16F10 cells. ROC analysis revealed that the
classifier trained on B16F10 cell data could not predict paralog pairs
enhancing T cell killing in other cell types, except for CT26 colon
cells ([112]Figures S3A–S3E), indicating inherent heterogeneity among
cancer cell models.
Interpreting the prediction of immunotherapy-favorable paralog pairs from the
machine learning classifier
Using the XGBoost classifier, we comprehensively analyzed the
probability of enhancing T cell cytotoxicity for additional paralog
pairs outside the training and internal testing sets. By filtering with
criteria including measurements in structural similarity, synthetic
lethality, and survival associations and excluding pairs from the
training and testing sets, we obtained 15,414 paralog pairs out of the
total 139,654 pairs for prediction. Each paralog pair was calculated
for its potential impact, assigned a probability score, and ranked
accordingly. The Shapley additive explanations (SHAP) tree explainer
method assessed the contribution of each feature to the final
predictions.
The paralog pairs were ranked by their prediction scores, and
subsequent pathway enrichment analysis revealed that genes in the top
5% of paralog pairs were significantly enriched in functions related to
protein phosphorylation and ubiquitination modification
([113]Figure S4A). The top three ranked predicted paralog pairs were
Rapgef1+Papgef2, Cct8+Cct3, and Syk+Itk ([114]Figure 4A;
[115]Table S5). Notably the expression of Syk in cancer cells has shown
to be associated with both tumor promotion and suppression.[116]^22 Itk
has been reported to enhance ICB response in solid tumors,[117]^23 and
Rapgef1 was identified as essential for melanocyte growth through a
CRISPR proliferation screen.[118]^24 Analysis of The Cancer Genome
Atlas (TCGA) cohort revealed that four genes (Rapgef1, Rapgef2, Itk,
and Syk) had mutation rates exceeding 1% in the pan-cancer cohort. At
the individual cancer-type level, four genes (Cct3, Rapgef1, Rapgef2,
and Syk) had mutation rates exceeding 5% in highly mutated endometrial
carcinoma, while in melanoma, three genes (Rapgef1, Rapgef2, and Syk)
had mutation rates exceeding 3%; Itk had a rate of 7.3%
([119]Figure 4B). The SHAP profile revealed differential feature
attributions among the top pairs. For example, sequence identity and
paralog family size showed negative contributions to the prediction in
Rapgef1+Rapgef2 ([120]Figure 4C) and Itk+Syk ([121]Figure 4D) but
positive contributions in Cct3+Cct8 ([122]Figure S4B).
Figure 4.
[123]Figure 4
[124]Open in a new tab
Visualization and validation of top 3 predicted paralogs from XGBoost
classifier
(A) Paralog pairs with the top 3 prediction scores are displayed in a
scatterplot.
(B) Mutation profile of top 3 paralog pairs RAPGEF1/RAPGEF2, SYK/ITK,
and CCT8/CCT3 in The Cancer Genome Atlas (TCGA) cohorts.
(C and D) SHAP profile for RAPGEF1/RAPGEF2 (C) and SYK/ITK (D).
(E and F) Bar chart of mean viability of Rapgef1/Rapgef2 (E) and
Itk/Syk (F) knockout treated B16F10 melanoma cells following culture
with or without OT-1 T cells. The p values come from two-sided
heteroscedastic t tests. (ns, p > 0.05, ∗p ≤ 0.05, ∗∗p ≤ 0.01, and ∗∗∗p
≤ 0.001).
(G and H) Gene sets enrichment analysis (GSEA) revealed that patients
with low expression of RAPGEF1/RAPGEF2 (G) and SYK/ITK (H) were
associated with downregulation of PD-L1/PD-1 signaling pathway. The
GSEA was performed using the differential expression output between
patients with low (<25%) expression of both paralog genes versus all
remaining patients.
To evaluate the synergistic effect of the paralog genes identified by
the XGBoost classifier, we conducted DKO-B16F10-OT1 T cell co-culture
assays. Knocking out Rapgef1 alone did not affect T cell-mediated tumor
cell killing, while knocking out Rapgef2 alone increased tumor cell
resistance to T cell killing. However, a combined KO of Rapgef1 and
Rapgef2 significantly improved T cell efficiency in killing tumor cells
([125]Figures 4E and [126]S4C). A similar pattern was observed with the
Syk+Itk gene pair: knocking out individual genes increased resistance
to T cell-mediated killing, but a double knockout enhanced T cell
killing activity ([127]Figures 4F and [128]S4D). In contrast, double
knockout of the Cct8+Cct3 gene pair impaired T cell killing ability
against tumor cells ([129]Figures S4E and S4F). Synergy analysis
revealed that Rapgef1+Papgef2 and Syk+Itk exhibited positive synergy
effects (see the [130]methods section and [131]Table S4).
To further investigate why Cct3/Cct8 might fail while the other two
pairs work, we performed gene set enrichment analysis (GSEA) on
TCGA-SKCM data, comparing patients with low expression in both paired
genes (≤25%) to those with high expression (≥75%). We observed
significant downregulation of the PD-L1/PD-1 signaling pathway in
patients with melanoma with low Rapgef1+Rapgef2 ([132]Figure 4G) and
Itk+Syk ([133]Figure 4H) expression. Downregulation of the PD-L1/PD-1
signaling pathway augments T cell killing by interrupting inhibitory
signals that suppress T cell activity. Conversely, patients with low
Cct3+Cct8 expression ([134]Figure S4G) showed insignificant
upregulation of this pathway compared to others. This finding suggests
that the failure of Cct3+Cct8 to enhance T cell cytotoxicity may be due
to the activation of other immunosuppressive pathways and the elevated
expression of inhibitory molecules.
Discussion
Although numerous CRISPR screen studies aim to identify novel cancer
cell targets for enhancing cancer immunotherapies, few have focused on
functional gene pairs, thus limiting the development of combination
strategies. To address this, we developed a computational
enrichment-based approach, pSSEA, to identify potential paralog gene
pairs using genome-wide CRISPR screen datasets by combining sgRNAs from
two paralogs. We demonstrated that pSSEA enables the identification of
cancer-intrinsic paralog pairs that can synergistically enhance T cell
killing. Subsequently, we constructed an ensemble-learning XGBoost
classifier to predict additional cancer-intrinsic, immunotherapy-aiding
paralog pairs and experimentally tested the top predictions. Notably,
all analyses in this study were based on CRISPR screen data. We
observed that only a small subset of the screen-derived
T cell-enhancing paralog pairs demonstrated a combined effect in
predicting ICB responses when using patient-derived transcriptomic data
(data not shown). This suggests that these two data types may capture
different aspects of immunotherapy-related signals.
We developed an enrichment-based approach pSSEA to identify key paralog
gene pairs, building on two main considerations: (1) single-gene
screens well correlated with combinatorial screens effects from
previously studies[135]^9^,[136]^14^,[137]^16 and (2) although
combinatorial targeting remains a compelling area of interest,
designing a library that comprehensively covers all possible gene pairs
is technically unfeasible.
Limitations of the study
Limitations of this study should also be acknowledged. First, the
paralog pairs selected and used for modeling were not derived from
actual combinatorial screens targeting immunotherapy. The absence of
ground-truth data may influence the accurate identification of true
paired hits, potentially impacting the model’s performance. Secondly,
we did not incorporate a synergistic effect module, as used in other
combinatorial screening studies,[138]^13^,[139]^25^,[140]^26 which
defines synergy as the difference between a gene pair’s observed
effects and the expected sum of their individual effects. While
concerns remain about the comparability of enrichment scores generated
from sgRNA sets of varying sizes, we plan to further investigate this
issue and evaluate the feasibility of integrating similar
functionality. Nevertheless, this study has the potential to offer
valuable insights into the rapidly advancing field of immunotherapy and
harness the power of well-established single-gene screens to uncover
meaningful gene pair interactions.
In summary, we envision our work providing novel methods to leverage
genome-wide screen data for selecting combination targets in cancer
immunotherapy. We believe that the computational framework in our study
could be adapted to prioritize non-paralogous protein-coding gene
pairs, broadening its applicability beyond paralogs.
Methods
Paralog pair information
Information on protein-coding paralog pairs was obtained from the
Ensembl Compara database via biomaRt (hsapiens_gene_ensembl, v.102)
using custom R scripts. Duplicated paralog pairs were filtered out. A
total of 139,654 protein-coding gene paralog pairs, covering 16,202
protein-coding genes, were used for the analyses in this study.
Collect features for paralog pairs
Sequence similarity
Amino acid sequence of paralog genes were obtained from Ensembl
(release 102).[141]^27 Paralog sequence similarities were measured
using BLASTP command line tools.[142]^28 The longest translated peptide
was used for genes with multiple transcripts. Gene pair sequence
identity, paralog family size, and whether they are the closest pair in
the same gene family were included as sequence-related features for
paralog pairs.
We also used high-resolution structure PDB files from
[143]https://alphafold.ebi.ac.uk/.[144]^29 The structural similarity of
paralogous proteins was measured using template modeling score
(TM-score) software.[145]^30 The TM-score and root-mean-square
deviation (RMSD) were included as additional sequence-related features.
Shared gene network, complex, and co-localization
We collected shared PPI features to characterize the role of paralogs
in the gene interaction network, including total PPIs, shared PPIs, and
two statistical measures: the Jaccard index and Fisher’s exact test
−log10(p value) for shared PPIs. Gene subcellular location data were
obtained from the Human Protein Atlas (v.22.0, file:
subcellular_location.tsv). The co-localization of paralogous genes was
also measured using the Jaccard index. The complex membership of
paralog pairs and gene-protein complex membership data were obtained
from CORUM[146]^31 via [147]https://maayanlab.cloud/Harmonizome/.
Gene co-evolution
Gene ages were obtained from ProteinHistorian
([148]https://proteinhistorian.docpollard.org/),[149]^32 and we used
the average age of the two paralogs as the age for the pair (Wagner age
reconstruction algorithm). The conservation level of an individual gene
was calculated as the number of species with orthologs (BLASTP
e-value < 1-e−5, identity > 30%) out of 269 species from Ensembl
genomes (v.102). The shared homologs of a gene pair were calculated as
the number of species having both gene homologs. The ratio of the
number of non-synonymous substitutions per non-synonymous site to the
number of synonymous substitutions per synonymous site (Ka/Ks) for gene
pairs was calculated using the codeml program in the PAML package
(v.4.10.6).[150]^33^,[151]^34 We adopted the phylogenetic distance
method developed by Tabach et al. to measure the co-evolution of a pair
of genes.[152]^35 A non-negative matrix factorization (NMF)-derived
distance considering phylogenetic tree information was calculated using
the phylogenetic profiles of the paralogous genes.[153]^36^,[154]^37
Essentiality and synthetic lethality
Paralogs with available evidence of essentiality from S. pombe and
S. cerevisiae ortholog data from OGEE v.3 were marked as essential
pairs.[155]^38 We adopted Kegel’s method to identify synthetic
lethality pairs[156]^6 using DepMap CRISPR data
([157]https://depmap.org/portal, 23Q2 release). Paralog pairs were
considered synthetic if the A1 gene was an essential gene (Chronos
score[158]^39 < −0.6 in 1% or 10% of cell lines), and its dependency
was significantly associated with A2 loss status with a p value less
than 0.05.
Co-expression pattern
Gene pair co-expression was measured using Spearman correlation
coefficient ρ at multiple levels: cancer cell lines, normal tissues,
and cancer scenarios. Cancer Cell Line Encyclopedia (CCLE) gene
expression data were downloaded from DepMap.[159]^40 Tissue-specific
gene expression profiles from healthy donors were downloaded from
Genotype-Tissue Expression (GTEx; [160]https://gtexportal.org/) public
releases.[161]^41 Cancer-specific gene expression profiles for 33
cancer types were downloaded from TCGA GDC Portal
([162]https://portal.gdc.cancer.gov/).[163]^42
Perturbation similarity
Due to the limited number of gene perturbation experiments in
Connectivity Map (L1000, [164]https://clue.io/data/),[165]^43 we
conducted pseudo perturbation using TCGA mRNA sequence data to mimic
the transcriptomic gene expression changes associated with single-gene
manipulation. Practically, the differential expression profile of
specific gene inhibition was calculated as the log fold change in gene
expression between patients with expression levels in the lowest 25th
percentile versus those in the highest 75th percentile of the query
gene. The perturbation similarity of paralogous genes was measured
using recovery AUC with the paralog gene differential expression
profiles.
Microenvironment associations
The TME plays an important role in cancer progression and treatment
outcome. We attempted to uncover features quantifying the association
of the paralogs with multiple microenvironment indicators. CD8^+ T cell
infiltration was measured by the single-stranded GSEA (ssGSEA)
method.[166]^20 We calculated T cell exclusion and dysfunction scores
using another software, the TIDE algorithm developed by Jiang
et al.[167]^44 The association with gene pairs and microenvironmental
characteristics was calculated by Kendall’s rank correlation
test.[168]^45
Survival associations
To determine gene pairs with a synergistic effect on survival outcomes,
we calculated the survival associations to assess whether a paralog
pair acts in concert to influence cancer prognosis. First, patients
were categorized into three groups based on the expression levels of
the paired paralog genes: groups were designated as 1 for both genes
showing low expression, 3 for both showing high expression, and 2 for
any other expression combination. The survival association for paralog
pairs was then calculated using the formula
[MATH: survivalassociation=−log10(p)
∗sign(β_gene1∗
β_gene2)
. :MATH]
In this formula, p refers to the p value from the Cox regression
analysis of the paralog group, adjusted for sex and age, while β_gene1
and β_gene2 represent the exponentiated coefficients of the individual
genes from the Cox proportional hazards model. Clinical data and
genomics data for TCGA-SKCM were downloaded from the GDC Portal
([169]https://portal.gdc.cancer.gov/).[170]^46
XGBoost ensemble classifier
The popular supervised-learning algorithm XGBoost—a gradient boosting
ensemble learner with regularization parameters—was used in this study
(implementation in the XGBoost package).[171]^47 We performed a grid
search based on 5-fold cross-validation of our paralog pair dataset to
explore the space of potential hyperparameters for the XGBoost
classifier. The parameters used in the final model are as follows:
objective = reg: logistic, n_estimators = 400, max_depth = 10,
min_child_weight = 1, min_split_loss = 0.1, subsample = 1,
colsample_bytree = 1, learning_rate = 0.01, and random_state = 8 (for
reproducibility). All parameters not listed used default settings.
Performance evaluation
The ROC AUC and PRC AUC for the feature values as well as the
classifiers were computed to evaluate the performance of individual
features and trained models.
Lentivirus purification
For lentivirus production, 20 μg of lenti-U6-sgRNA-EFS-GFP-WPRE or
20 μg of lenti-U6-sgRNA-EFS-Puro-WPRE, 10 μg of pMD2.G, and 15 μg of
psPAX2 were co-transfected into LentiX-293 cells plated in a 150 mm
dish at 80%–90% confluency using 130 μg polyethyleneimine (PEI). 6 h
later, the media were replaced with fresh DMEM+10% FBS. Virus
supernatant was collected 48 h post-transfection and centrifuged at
1,500g for 15 min to remove the cell debris. The virus supernatant was
concentrated by Lenti-X concentrator at 4°C for 30 min (1 vol of
Lenti-X concentrator with 3 vol of supernatant), followed by
centrifugation at 1,500g for 45 min at 4°C. Finally, the virus was
resuspended in DMEM, aliquoted, and stored at −80°C.
Generation of single-gene-knockout or double-gene-knockout cells
B16F10 cells stably expressing Cas9 were generated by transducing
B16F10 cells with lentiviral EF1a-NLS-Cas9-2A-Blast-WPRE, followed by
5 days of selection under 20 μg/mL blasticidin. B16F10-Cas9 cells were
further transfected with lentiviral EF1a-mCherry-2A-OVA-WPRE and sorted
for mCherry-positive cells to generate B16f10-Cas9-mCherry-OVA cells.
The two gRNAs targeting each gene within the paralog pair were
transduced into the B16F10-Cas9-mCherry-Ova cells simultaneously. gRNA
virus-infected cells were cultured at 37°C for more than 24 h, and then
GFP-positive cells were sorted and selected under 5 μg/mL puromycin for
3 days to generate the double-knockout cells.
T7E1 assay
T7E1 was used to estimate gRNA cutting efficiency. In brief, gDNA was
extracted by using Genomic DNA prep with QuickExtract (QE) buffer. PCR
amplification of the genomic regions flanking the CRISPR RNA (crRNA)
was performed using the primers listed in [172]Table S3. Using Phusion
Flash High Fidelity Master Mix (Thermo Fisher Scientific), the
thermocycling parameters for PCR were 98°C for 1 min; 35 cycles of 98°C
for 1 s, 60°C for 5 s, and 72°C for 15 s); and 72°C for 2 min. The PCR
amplicons were then used for T7E1 assays according to the
manufacturer’s protocol.
Naive OT-I CD8a+ T cell isolation and co-culture assay
The naive OT-I CD8a+ T cells were isolated using the Naive CD8a+ T Cell
Isolation Kit (Miltenyi Biotec) following the manufacturer’s
instructions. Naive CD8a+ T cells were isolated from the spleen of OT-I
mice and stimulated with anti-mouse CD3/CD28 antibody for 2 days. For
the tumor cell and OT-I T cell co-culture assay,
B16F10-Cas9-mCherry-Ova gene KO cells were co-cultured with OT1 cells
at effector-to-target cell ratios (E:T) = 1:1 ratios. Tumor cell
killing was tested at 24 h by flow cytometry. LIVE/DEAD near-infrared
(IR) was diluted at 1:1,000 to identify the dead cells. Tumor cells
were identified as mCherry-positive cells.
Synergy analysis
Statistical analysis for evaluating synergistic interactions between
paralog pairs was calculated by the SynergyFinder R package[173]^48
using four models, including the highest single agent (HSA), Bliss,
Loewe, and zero interaction potency (ZIP) models. The p values for the
average synergy score were computed using bootstrapping of the
dose-response matrix. We also adopted the dLFC method to measure the
additive effect of gene pairs.[174]^13 The dLFC was calculated as the
observed mean relative killing rate of the DKO minus the expected sum
of the relative killing rates of the individual SKO. The relative
killing rate was determined by subtracting the values from the NTC
group.
Resource availability
Lead contact
Requests for information and resources used in this article should be
addressed to the lead contact, Sidi Chen (sidi.chen@yale.edu).
Materials availability
This study did not generate new unique reagents.
Data and code availability
The source code for processed data and modeling is available under
GitHub ([175]https://github.com/cpdong/imParalog) and Zenodo
([176]https://doi.org/10.5281/zenodo.14759518)[177]^49 and included in
this article and its [178]supplemental information files. An
interactive web application of this study is deployed at
[179]https://sidichenlab.shinyapps.io/imparalog for free visiting and
querying. Any additional information required to re-analyze the data
reported in this paper is available from the lead contact upon request.
Acknowledgments