Abstract
Resistance to EGFR inhibitors (EGFRi) presents a major obstacle in
treating non-small cell lung cancer (NSCLC). One of the most exciting
new ways to find potential resistance markers involves running
functional genetic screens, such as CRISPR, followed by manual triage
of significantly enriched genes. This triage process to identify ‘high
value’ hits resulting from the CRISPR screen involves manual curation
that requires specialized knowledge and can take even experts several
months to comprehensively complete. To find key drivers of resistance
faster, we build a recommendation system on top of a heterogeneous
biomedical knowledge graph integrating pre-clinical, clinical, and
literature evidence. The recommender system ranks genes based on
trade-offs between diverse types of evidence linking them to potential
mechanisms of EGFRi resistance. This unbiased approach identifies 57
resistance markers from >3,000 genes, reducing hit identification time
from months to minutes. In addition to reproducing known resistance
markers, our method identifies previously unexplored resistance
mechanisms that we prospectively validate.
Subject terms: Cancer genomics, Non-small-cell lung cancer,
Computational models, Machine learning, Target identification
__________________________________________________________________
Resistance to EGFR inhibitors presents a major obstacle in treating
non-small cell lung cancer. Here, the authors develop a recommender
system ranking genes based on trade-offs between diverse types of
evidence linking them to potential mechanisms of EGFRi resistance.
Introduction
In this study we explore how a biological question can be translated
into a recommendation problem^[52]1. Traditionally recommendation
systems have been used to help users discover relevant items among an
overwhelming number of options^[53]2. By interacting with
recommendations users provide either implicit or explicit feedback,
which recommendation models use to personalize and improve
predictions^[54]3. Information overload is particularly common in
e-commerce^[55]4, streaming^[56]5 and social media applications^[57]6,
hence recommendation systems play a key role in these industries. The
biomedical domain, on the other hand, is not seen as a typical
application area for recommendation systems. Their usage so far is
limited to a handful of recent studies: Ozsoy et al. applied
collaborative filtering to the drug repositioning problem^[58]7; Frainay
et al. developed a network-based recommendation solution to enrich and
interpret metabolomic data^[59]8; Suphavilai et al. built a matrix
factorization-based recommendation system to predict response to cancer
drugs^[60]9; Radivojevic et al. developed an automated recommendation
solution for synthetic biology^[61]10. The success of these pilot
studies suggests there are potential opportunities for recommendation
systems in the biomedical domain; the amount of biomedical data is
growing exponentially and scientists could benefit from recommendation
solutions that help them to navigate the data and reason about it.
Naturally, direct transfer of classic recommendation approaches to the
biomedical domain is not trivial. Specifics of the problem space impose
numerous challenges for a recommendation system practitioner, to name a
few:
* an elementary unit of recommendation is not a simple self-contained
item (e.g., a gene), but rather a research direction accompanied by a
biologically sound hypothesis;
* ultimate validation of recommendations is complex and often
requires expensive and time-consuming laboratory experiments, as
opposed to users just “selecting” an item in a common
non-biological recommendation scenario;
* unlike traditional applications, in a biomedical setting both
implicit and explicit feedback is scarce, making it harder to tune
and train models;
* ground truths are scarce and in most cases context-dependent, which
renders training challenging;
* due to the high cost associated with accepting a recommendation, an
increased emphasis is placed on explainability and exposing causal
reasoning paths behind a recommendation.
Despite these challenges, wider adoption of recommendation approaches
holds plenty of opportunities to support and accelerate biological
research. To illustrate this point, in this study we focused on the
problem of drug resistance in lung cancer. Our goal was to build a
recommendation solution that finds key genes driving drug resistance.
Similar problems are also often formulated as gene prioritization tasks
and have been previously addressed with network-based
methods^[62]11,[63]12, kernel-based learning^[64]13, and, most
recently, deep learning approaches^[65]14,[66]15, to name a few. In this
study we were interested in exploring the lung cancer resistance problem
through the lens of a recommendation approach.
Drug resistance is a complex biological phenomenon that hinders
development of efficient and lasting cancer treatments^[67]16. Tumors
recruit diverse strategies to escape selective pressure induced by
drugs, such as changes in drug metabolism^[68]17, inhibition of cell
death^[69]18, epigenetic alterations^[70]19 or acquired mutations in
drug targets^[71]20. Enhanced DNA repair and increased amplification of
tumor driver genes also contribute to secondary resistance^[72]21.
Genetic and epidemiological diversity of patients^[73]22 further
complicates the resistance landscape.
In this study we focus on non-small cell lung cancer (NSCLC) carrying
activating mutations of the epidermal growth factor receptor (EGFR). It
accounts for 15-20% of lung cancer patients^[74]23. Treatment with
first or second generation EGFR tyrosine kinase inhibitors such as
gefitinib, erlotinib or afatinib results in impressive response rates
in patients initially^[75]24; however, tumors quickly develop
resistance to treatment. The majority of resistant cases are driven by
accumulation of secondary mutations of the EGFR gene, such as T790M,
that prevent binding of EGFR tyrosine kinase inhibitor (TKI)
compounds^[76]25. Development of osimertinib, a third generation EGFR
TKI, provided the ability to target such secondary EGFR
mutations^[77]26. In fact, treatment with osimertinib significantly
improved patient survival in first-line therapy setting^[78]27.
However, therapy resistance prevails. Acquired mutations of EGFR such
as C797S drive osimertinib resistance in 6–26% of cases. Bypass
pathway activation, amplifications of MET or mutations in PIK3CA have
also been shown to contribute to resistance^[79]28. Still, in half of
the cases the molecular resistance mechanisms remain unknown and
promising markers could reside in the so-called “dark matter” of the
human genome^[80]29.
A common strategy to find key drivers of acquired resistance is based
on functional genomic screens, such as CRISPR screens^[81]30.
CRISPR-Cas9 genome-wide knock-out, knock-down and knock-in screens have
recently emerged as an efficient high-throughput technology to
systematically investigate resistance mechanisms^[82]30. CRISPR screens
can be applied in two ways to understand drug response and drug
resistance. First, they can be used to identify alterations in genes
that increase sensitivity of a cell to drug treatment. Here,
researchers measure negative selection of modified genes in drug
treatment. This approach can help to define therapeutic combinations
that might increase response to treatment. Second, CRISPR screens are
used to identify genes that drive drug resistance if altered. In this
case the experimental set-up mimics treatment scenarios in the clinic.
In this approach, outgrowth or positive selection of drug resistant
cells is measured and used to define mechanisms that drive resistance.
These can be potentially targeted once resistance is established.
In these settings, a typical CRISPR screen may identify many hundreds
of candidate resistance genes. To narrow down the list to the most
promising, biologically plausible and actionable resistance genes,
researchers have to perform manual triage and validation. During this
process experts aggregate prior knowledge about a disease with
additional evidence available from clinical and pre-clinical studies
and decide which genes to prioritize for experimental validation. The
selection process is tedious and time-consuming. It also relies on deep
specialized knowledge, hence the results can be prone to individual
bias. Our goal was to replace such manual triage with a recommendation
solution, which could efficiently integrate diverse types of evidence
and identify the most promising candidate genes driving drug
resistance.
By moving the problem to a recommendation domain we encounter two major
challenges. The first is the lack of training data. Here we are dealing
with a highly specific molecular phenotype of a poorly understood
origin, which prevents us from using information on resistance markers
relevant for other, even closely related, diseases as training data.
Second, unlike a typical recommendation scenario, in our case both
explicit and implicit feedback are lacking. This fact limits our
ability to gradually train and improve models. Given these constraints
we followed an unsupervised recommendation approach, which relies on
content-based filtering. We formalized re-ranking of CRISPR hits as a
multi-objective optimization problem^[83]31, where diverse and
conflicting types of evidence supporting a gene’s relevance are mapped to
objectives. During the optimization procedure feasible solutions
(genes) are identified and compared until no better ones can be found. A
crucial component of such a framework is a set of hybrid features. Each
feature represents a distinct type of evidence, such as literature
support, clinical and pre-clinical evidence.
Along with the purely biological features, our recommendation system
relied on data derived from a specially constructed heterogeneous
biomedical knowledge graph^[84]32. Knowledge graphs provide a
convenient conceptual representation of relationships (edges) between
entities (nodes). In the recommendation context knowledge graphs are
gaining popularity as a way to introduce content-specific information and also
to provide explanations for the resulting recommendations^[85]33. In
addition, graph-based recommendations were shown to achieve higher
precision and accuracy compared to alternative
approaches^[86]34–[87]36. We used graph structural information together
with graph-based representations to express relevance of a gene in the
resistance context. Our assumption was that by combining graph-derived
features with clinical ones we could discover non-obvious genes that
drive drug resistance in lung cancer.
In summary, in this study we explored how a question of finding drivers
of secondary EGFR TKI resistance could be addressed as a recommendation
problem. We demonstrate that a recommendation system based on
a multi-objective optimization approach can be used to re-rank CRISPR
hits in the context of secondary drug resistance. The proposed
framework, together with an automated feature generation flow and
interactive re-ranking interface, helped to reduce gene hit
prioritization time from months to a few minutes.
Results
Re-ranking of CRISPR hits can be approached as multi-objective optimization
We framed re-ranking of CRISPR hits as a multi-objective optimization
problem. In this setting, diverse lines of evidence that support a gene’s
relevance are treated as multiple objectives (Fig. [88]1). In other
words, the formal goal is to simultaneously optimize k objectives,
reflected in k objective functions: f_1(x), f_2(x), ..., f_k(x).
Individual functions form a vector function F(x):
$$F(x) = [f_1(x), f_2(x), \ldots, f_k(x)]^T \qquad (1)$$
where x = [x_1, x_2, ..., x_m] ∈ Ω; x represents the decision
variable and Ω the decision variable space. Therefore, multi-objective
optimization can be defined as minimization (or maximization) of the
objective function set F(x). With multiple competing objectives, a
single best solution often cannot be found. However, one can identify
a set of optimal solutions based on the notion of Pareto
dominance^[89]37. A solution x_1 dominates solution x_2 if the
following two conditions are true:
* solution x_1 is not worse than x_2 according to all objectives;
* solution x_1 is strictly better than solution x_2 according to at
least one objective.
Fig. 1. Recommendation system takes into account diverse types of evidence to
suggest promising drivers of drug resistance in NSCLC.
The evidence from specially built knowledge graph, literature, clinical
and pre-clinical datasets is aggregated and formalized as objectives in
multi-objective optimization (MOO) task. Recommended solutions (genes)
represent the optimal trade-offs between the conflicting objectives. A
subset of recommended genes is passed for the experimental validation.
If both conditions are true, we can say that x_1 dominates x_2, which
is equivalent to x_2 being dominated by x_1. In other words,
non-dominated solutions cannot be improved in any objective without
compromising at least one of the other objectives. The set of such
non-dominated solutions forms a Pareto front, which combines the best
trade-offs between competing objectives. Therefore, by computing Pareto
fronts on diverse
sets of objectives defined based on CRISPR screen data and additional
supporting evidence we can narrow down the number of promising markers
of EGFR TKI resistance (Fig. [91]1).
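The dominance test and the resulting Pareto front can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the gene names, the two objectives and their values are hypothetical, and all objectives are assumed to be maximized.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a Pareto-dominates b, assuming every objective is maximized."""
    not_worse = all(x >= y for x, y in zip(a, b))       # condition 1
    strictly_better = any(x > y for x, y in zip(a, b))  # condition 2
    return not_worse and strictly_better


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the genes whose score vectors no other gene dominates."""
    return [
        gene
        for gene, vec in scores.items()
        if not any(dominates(other, vec)
                   for name, other in scores.items() if name != gene)
    ]


# Hypothetical toy data: (CRISPR consistency, literature support).
toy_scores = {
    "GENE_A": (0.9, 0.2),
    "GENE_B": (0.5, 0.8),
    "GENE_C": (0.4, 0.7),  # dominated by GENE_B
    "GENE_D": (0.9, 0.1),  # dominated by GENE_A
}

front = pareto_front(toy_scores)
print(front)
```

In this toy example GENE_A and GENE_B remain on the front because each is better than the other in one objective, which is the kind of trade-off the recommendation system exposes. The pairwise comparison is quadratic in the number of genes, which is acceptable for a few thousand hits.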
A hybrid set of features supports recommendation system
To support the recommendation system we assembled a hybrid set of rich
features (Fig. [92]1 and Supplementary Table [93]1), with an idea that
each feature represents an objective. The selected features were
relevant for EGFR inhibitor resistance in NSCLC and corresponded to
distinct lines of evidence. Key feature types and rationale to consider
them for re-ranking of CRISPR hits are summarized below.
CRISPR
CRISPR screen data served as a starting point for re-ranking. In this
study we relied on screens that were set up to resemble clinical
treatment scenarios for EGFR-mutant lung cancer: NSCLC cell lines
harboring EGFR mutations commonly found in patient populations were
treated with first- or third-generation EGFR inhibitors. In total we
identified a starting list of 1550 candidate drug resistance
genes^[94]38 that were labeled as significant after the screen
analysis. We further aggregated CRISPR data by computing consistency
metrics, which reflected the stability of a gene’s performance across
experimental conditions. Normally genes showing consistent behavior in
multiple relevant conditions, e.g., related cell lines or treatments,
are ranked higher by domain experts.
Altogether, seven consistency-based features were incorporated in the
feature set: (1) three features based on loss-of-function part of the
screen; (2) three features based on gain-of-function part of the
screen; (3) a summary metric reflecting overall consistency in the full
screen (Supplementary Table [95]1).
Literature-based metrics
Literature search is routinely used as a first step to confirm
experimental findings and to find support for a potential mechanistic
hypothesis. For the EGFR inhibitor resistance problem we were primarily
interested in the overall literature support for a gene. As a proxy of
literature support we calculated the total number of publications that
mention a gene in a relevant context, such as “cancer”, “resistance”,
“EGFR”, “NSCLC”. Conveniently, the same metric, when reversed, can
be interpreted as the novelty of a particular target. To extract literature
mentions we analyzed a total corpus of >180,000 PubMed papers published
between 2000 and 2019. We included aggregated literature metrics, based
on two terms of interest: EGFR and NSCLC. For each gene we computed the
number of papers that mention a gene together with one of these terms
(Supplementary Table [96]1). To account for the fact that mentions in
research papers vary drastically between genes, we also included
normalized literature-based frequencies.
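As a rough illustration of how raw co-mention counts and normalized frequencies could be derived, consider the sketch below. The text does not specify the normalization, so dividing by the total number of papers mentioning the gene is an assumption, and the paper identifiers and feature names are hypothetical.

```python
from typing import Dict, Set


def literature_features(gene_papers: Set[str],
                        term_papers: Dict[str, Set[str]]) -> Dict[str, float]:
    """Co-mention counts and frequencies for one gene.

    gene_papers: IDs of papers mentioning the gene.
    term_papers: for each term of interest, IDs of papers mentioning it.
    Normalizing by the gene's total paper count is an assumed choice.
    """
    total = len(gene_papers)
    features: Dict[str, float] = {}
    for term, papers in term_papers.items():
        co_mentions = len(gene_papers & papers)
        features[f"n_{term}"] = float(co_mentions)
        features[f"freq_{term}"] = co_mentions / total if total else 0.0
    return features


# Hypothetical paper identifiers.
gene_papers = {"p1", "p2", "p3", "p4"}
term_papers = {"EGFR": {"p1", "p2", "p9"}, "NSCLC": {"p2"}}

lit = literature_features(gene_papers, term_papers)
print(lit)
```

The normalized frequency keeps heavily studied genes from dominating purely because of their overall publication volume.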
Graph-derived features
In this study we used a custom knowledge graph (KG) as a source of side
information for the recommendation system. Our KG contained 11 million
nodes and 84 million edges and was composed of 37 public and internal
datasets, such as Hetionet, OpenTargets, ChEMBL and Ensembl^[97]32. In
general, patterns of interactions between biological entities captured
by knowledge graphs can be translated into features and consumed by
recommendation systems in a number of ways (Fig. [98]1 and
Supplementary Table [99]1). One way is to compute features directly on
the graph. This includes metrics such as node degree, reflecting the
importance of a node; PageRank, a measure of a node’s popularity^[100]39;
and betweenness, which captures the amount of influence a node has over
the flow of information in the graph. An alternative approach involves
projecting the graph into a low-dimensional space, so that every node
is transformed into its vector representation, or embedding. Embeddings
capture critical structural properties of the graph^[101]40, so that
the nodes that were close in the graph also remain close in the
embedding space. In this study we computed distances in the embedding
space between each gene and two key entities of interest: “EGFR” and
“NSCLC”. The assumption is that genes most relevant to the EGFR TKI
resistance phenotype should be close to either lung cancer or EGFR gene
nodes.
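The embedding-distance features can be sketched as follows. The distance metric is not stated in the text, so cosine distance is an assumption here, and the two-dimensional vectors and gene names are hypothetical stand-ins for real node embeddings.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def distance_features(gene: str,
                      embeddings: Dict[str, Sequence[float]],
                      anchors: Tuple[str, ...] = ("EGFR", "NSCLC")) -> Dict[str, float]:
    """Distance from a gene node to each anchor node in embedding space."""
    return {f"dist_to_{a}": cosine_distance(embeddings[gene], embeddings[a])
            for a in anchors}


# Hypothetical 2-D embeddings; real embeddings have many more dimensions.
embeddings = {
    "EGFR": [1.0, 0.0],
    "NSCLC": [0.0, 1.0],
    "GENE_X": [2.0, 0.0],
    "GENE_Y": [1.0, 1.0],
}

feats_x = distance_features("GENE_X", embeddings)
feats_y = distance_features("GENE_Y", embeddings)
print(feats_x, feats_y)
```

Under the stated assumption, smaller distances would be treated as stronger graph-based evidence, so these features would be minimized during optimization.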
Clinical enrichment scores
To ensure the recommendation system captures clinical evidence, we
included genomic data from osimertinib-treated EGFR-mutant lung cancer
patients in the feature set. We prioritized five clinical trials:
AURAext^[102]41, AURA2^[103]42, AURA3, FLAURA^[104]27, and
ORCHARD^[105]43. The prevalence of genomic alterations in
non-responders vs. responders across 355 patients treated with
osimertinib was calculated and added to the feature set as “clinical
enrichment score features” (Supplementary Table [106]1).
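One simple way to compare prevalence between the two patient groups is sketched below. The exact definition of the clinical enrichment score is not given in this section, so the difference in prevalence used here is only an assumed stand-in, and the patient flags are hypothetical.

```python
from typing import Sequence


def prevalence(altered: Sequence[bool]) -> float:
    """Fraction of patients in a group carrying an alteration in the gene."""
    if not altered:
        raise ValueError("group must contain at least one patient")
    return sum(altered) / len(altered)


def enrichment_score(non_responders: Sequence[bool],
                     responders: Sequence[bool]) -> float:
    """Assumed score: prevalence in non-responders minus prevalence in responders."""
    return prevalence(non_responders) - prevalence(responders)


# Hypothetical alteration flags for one gene, one flag per patient.
non_responders = [True, True, False, True]
responders = [False, True, False, False]

score = enrichment_score(non_responders, responders)
print(score)
```

A positive value indicates that alterations in the gene are more common among non-responders; a ratio or a statistical test could be substituted without changing how the feature enters the optimization.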
Tractability and gene essentiality
Traditionally drug resistance in cancer is addressed by developing
compounds or combination therapies that modulate the activity of key
driver genes (targets). When a target is prioritized for drug
development one needs to ensure that: (1) the gene is tractable in
principle, i.e., it is shown or predicted to bind to commonly used drug
modalities with high affinity; (2) the gene is not essential,
since knock-out of an essential gene can be detrimental to other cells
in the organism, not just the tumor ones. To support the first
consideration, we included bucket tractability estimates^[107]44 for
three modalities: antibodies, small molecules and other modalities
(enzymes, oligonucleotides, etc.). In support of the second consideration
we integrated DepMap^[108]45 essentiality estimates.
In summary, the final hybrid set contained 27 rich features, supporting
diverse criteria taken into consideration during validation of CRISPR
hits by domain experts (Supplementary Table [109]1). The hybrid set was
also augmented by graph-derived features and literature-based metrics.
Correlation analysis of the hybrid feature space indicated expected
patterns: (1) strong positive correlation between structural graph
features, such as degree, PageRank and betweenness; (2) negative
correlation between CRISPR features derived from knock-out and
activation screens (Supplementary Fig. [110]7).
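The two patterns can be reproduced on toy data with a plain Pearson correlation. The correlation measure used for the analysis is not named in this section, so Pearson is an assumption, and the feature values below are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long feature columns."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two columns of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / (sd_x * sd_y)


# Hypothetical feature columns for four genes.
degree = [1.0, 2.0, 3.0, 4.0]
pagerank = [0.1, 0.2, 0.3, 0.4]
knockout_score = [1.0, 2.0, 3.0, 4.0]
activation_score = [4.0, 3.0, 2.0, 1.0]

r_graph = pearson(degree, pagerank)
r_crispr = pearson(knockout_score, activation_score)
print(r_graph, r_crispr)
```

Strongly correlated features carry overlapping evidence, which is worth keeping in mind when several of them are selected as objectives at once.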
Interactive interface allows experts to re-rank CRISPR hits
So far, we have defined a basic model for multi-objective optimization
and demonstrated how to build a hybrid set of features to support
re-ranking of CRISPR hits in the EGFR TKI context.
In a real-world scenario, decision-making can be both iterative and
subjective. The choice of objectives, and the direction of optimization
for a given objective, varies from expert to expert. Each combination
of objectives and corresponding directions of optimization might
result in a different shape of the Pareto front and, therefore, in a
different set of top recommended genes.
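The effect of flipping the direction of a single objective can be seen in a small Python sketch. It is an illustration only, not the SkywalkR code: the gene names, feature names and values are hypothetical, and minimized objectives are handled by negation so that larger is always better.

```python
from typing import Dict, List


def orient(row: Dict[str, float], directions: Dict[str, str]) -> List[float]:
    """Keep only the selected objectives; negate those to be minimized."""
    values = []
    for name, direction in directions.items():
        if direction not in ("max", "min"):
            raise ValueError(f"unknown direction: {direction}")
        values.append(row[name] if direction == "max" else -row[name])
    return values


def pareto_front(table: Dict[str, Dict[str, float]],
                 directions: Dict[str, str]) -> List[str]:
    """Non-dominated genes for the chosen objectives and directions."""
    oriented = {gene: orient(row, directions) for gene, row in table.items()}

    def dominates(a: List[float], b: List[float]) -> bool:
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return sorted(
        gene for gene, vec in oriented.items()
        if not any(dominates(other, vec)
                   for name, other in oriented.items() if name != gene)
    )


# Hypothetical feature table.
table = {
    "GENE_A": {"crispr_consistency": 0.9, "n_papers": 120.0},
    "GENE_B": {"crispr_consistency": 0.7, "n_papers": 5.0},
    "GENE_C": {"crispr_consistency": 0.6, "n_papers": 40.0},
}

# One expert favors well-supported genes, another favors novel ones.
well_supported = pareto_front(table, {"crispr_consistency": "max", "n_papers": "max"})
novel = pareto_front(table, {"crispr_consistency": "max", "n_papers": "min"})
print(well_supported, novel)
```

With literature support maximized only GENE_A is recommended, whereas minimizing it, which treats the same metric as novelty, adds GENE_B to the front.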
To accommodate diversity of opinions and to enable domain scientists to
explore complex trade-offs between the objectives we built an
interactive application, SkywalkR
[111]https://github.com/AstraZeneca/skywalkR^[112]46 (Fig. [113]2).
SkywalkR is a Shiny app^[114]47, which operates on top of the
pre-assembled hybrid feature set (see Supplementary Table [115]1).
The SkywalkR app combines diverse facets of knowledge to guide
re-prioritization of CRISPR hits for experimental validation. In
addition, it allows domain experts to explore various trade-offs
between objectives. In doing so, it stimulates exploration of
possibilities, highlights gaps in the existing knowledge and motivates
users to adjust their expectations about optimal solutions.
Fig. 2. SkywalkR interactive interface allows users to re-rank CRISPR hits
based on various combinations of objectives.
(A) On the sidebar panel each objective is represented with a slider.
Users can decide which objectives to include in the optimization and
can also specify the direction of optimization (minimize or maximize).
(B) Additional tools to explore the results. The relative view shows
profiles of recommended genes. Bar plots demonstrate standardized values
across objectives and top recommended genes. The co-occurrence heatmap
demonstrates clusters of genes frequently mentioned together in the EGFR
TKI resistance context.
Automated engineering of rich features coupled with multi-objective
optimization realized through the SkywalkR interactive interface
dramatically reduced the time required for gene prioritization from a
few weeks to minutes.
Evaluation demonstrates majority of top recommendations labeled as credible
by experts
To evaluate the recommendation framework against expert opinions we
fixed a default set of preferences. Preferences were defined by a