Abstract

   Resistance to EGFR inhibitors (EGFRi) presents a major obstacle in
   treating non-small cell lung cancer (NSCLC). One of the most exciting
   new ways to find potential resistance markers involves running
   functional genetic screens, such as CRISPR, followed by manual triage
   of significantly enriched genes. This triage process to identify ‘high
   value’ hits resulting from the CRISPR screen involves manual curation
   that requires specialized knowledge and can take even experts several
   months to comprehensively complete. To find key drivers of resistance
   faster, we build a recommendation system on top of a heterogeneous
   biomedical knowledge graph integrating pre-clinical, clinical, and
   literature evidence. The recommender system ranks genes based on
   trade-offs between diverse types of evidence linking them to potential
   mechanisms of EGFRi resistance. This unbiased approach identifies 57
   resistance markers from >3,000 genes, reducing hit identification time
   from months to minutes. In addition to reproducing known resistance
   markers, our method identifies previously unexplored resistance
   mechanisms that we prospectively validate.

   Subject terms: Cancer genomics, Non-small-cell lung cancer,
   Computational models, Machine learning, Target identification
     __________________________________________________________________

   Resistance to EGFR inhibitors presents a major obstacle in treating
   non-small cell lung cancer. Here, the authors develop a recommender
   system ranking genes based on trade-offs between diverse types of
   evidence linking them to potential mechanisms of EGFRi resistance.

Introduction

   In this study we explore how a biological question can be translated
   into a recommendation problem^[52]1. Traditionally, recommendation
   systems have been used to help users discover relevant items among an
   overwhelming number of options^[53]2. By interacting with
   recommendations, users provide either implicit or explicit feedback,
   which recommendation models use to personalize and improve
   predictions^[54]3. Information overload is particularly common in
   e-commerce^[55]4, streaming^[56]5 and social media applications^[57]6,
   hence recommendation systems play a key role in these industries. The
   biomedical domain, on the other hand, is not seen as a typical
   application area for recommendation systems. Their usage so far is
   limited to a handful of recent studies: Ozsoy et al. applied
   collaborative filtering to the drug repositioning problem^[58]7; Frainay
   et al. developed a network-based recommendation solution to enrich and
   interpret metabolomic data^[59]8; Suphavilai et al. built a matrix
   factorization-based recommendation system to predict response to cancer
   drugs^[60]9; Radivojevic et al. developed an automated recommendation
   solution for synthetic biology^[61]10. The success of these pilot
   studies suggests there are opportunities for recommendation
   systems in the biomedical domain: the amount of biomedical data is
   growing exponentially, and scientists could benefit from recommendation
   solutions that help them navigate the data and reason about it.

   Naturally, direct transfer of classic recommendation approaches to the
   biomedical domain is not trivial. Specifics of the problem space impose
   numerous challenges for a recommendation system practitioner, to name a
   few:
     * an elementary unit of recommendation is not a simple self-contained
       item (e.g., a gene), but rather a research direction accompanied by a
       biologically sound hypothesis;
     * ultimate validation of recommendations is complex and often
       requires expensive and time-consuming laboratory experiments, as
       opposed to users just “selecting” an item in a common
       non-biological recommendation scenario;
     * unlike traditional applications, in a biomedical setting both
       implicit and explicit feedback are scarce, making it harder to tune
       and train models;
     * ground truths are scarce and in most cases context-dependent, which
       renders training challenging;
     * due to the high cost associated with accepting a recommendation, an
       increased emphasis is placed on explainability and exposing causal
       reasoning paths behind a recommendation.

   Despite these challenges, wider adoption of recommendation approaches
   holds plenty of opportunities to support and accelerate biological
   research. To illustrate this point, in this study we focused on the
   problem of drug resistance in lung cancer. Our goal was to build a
   recommendation solution that finds key genes driving drug resistance.
   Similar problems are also often formulated as gene prioritization tasks
   and have been previously addressed with network-based
   methods^[62]11,[63]12, kernel-based learning^[64]13, and, most
   recently, deep learning approaches^[65]14,[66]15, to name a few. In this
   study we set out to explore the lung cancer resistance problem
   through the lens of a recommendation approach.

   Drug resistance is a complex biological phenomenon that hinders
   development of efficient and lasting cancer treatments^[67]16. Tumors
   recruit diverse strategies to escape selective pressure induced by
   drugs, such as changes in drug metabolism^[68]17, inhibition of cell
   death^[69]18, epigenetic alterations^[70]19 or acquired mutations in
   drug targets^[71]20. Enhanced DNA repair and increased amplification of
   tumor driver genes also contribute to secondary resistance^[72]21.
   Genetic and epidemiological diversity of patients^[73]22 further
   complicates the resistance landscape.

   In this study we focus on non-small cell lung cancer (NSCLC) carrying
   activating mutations of the epidermal growth factor receptor (EGFR).
   This subtype accounts for 15–20% of lung cancer patients^[74]23.
   Treatment with first- or second-generation EGFR tyrosine kinase
   inhibitors (TKIs) such as gefitinib, erlotinib or afatinib initially
   results in impressive response rates^[75]24; however, tumors quickly
   develop resistance to treatment. The majority of resistant cases are
   driven by accumulation of secondary mutations of the EGFR gene, such as
   T790M, that prevent binding of EGFR TKI compounds^[76]25. Development
   of osimertinib, a third-generation EGFR TKI, provided the ability to
   target such secondary EGFR mutations^[77]26. In fact, treatment with
   osimertinib significantly improved patient survival in the first-line
   therapy setting^[78]27. However, therapy resistance prevails. Acquired
   mutations of EGFR such as C797S drive osimertinib resistance in 6–26%
   of cases. Bypass pathway activation, amplifications of MET or mutations
   in PIK3CA have also been shown to contribute to resistance^[79]28.
   Still, in half of the cases the molecular resistance mechanisms remain
   unknown, and promising markers could reside in the so-called “dark
   matter” of the human genome^[80]29.

   A common strategy to find key drivers of acquired resistance is based
   on functional genomic screens, such as CRISPR screens^[81]30.
   CRISPR-Cas9 genome-wide knock-out, knock-down and knock-in screens have
   recently emerged as an efficient high-throughput technology to
   systematically investigate resistance mechanisms^[82]30. CRISPR screens
   can be applied in two ways to understand drug response and drug
   resistance. First, they can be used to identify alterations in genes
   that increase sensitivity of a cell to drug treatment. Here,
   researchers measure negative selection of modified genes in drug
   treatment. This approach can help to define therapeutic combinations
   that might increase response to treatment. Second, CRISPR screens are
   used to identify genes that drive drug resistance if altered. In this
   case the experimental set-up mimics treatment scenarios in the clinic.
   In this approach, outgrowth or positive selection of drug resistant
   cells is measured and used to define mechanisms that drive resistance.
   These can be potentially targeted once resistance is established.

   In these settings, a typical CRISPR screen may identify
   many hundreds of resistance genes. To narrow down the list to the most
   promising, biologically plausible and actionable resistance genes,
   researchers have to perform manual triage and validation. During this
   process experts aggregate prior knowledge about a disease with
   additional evidence available from clinical and pre-clinical studies
   and decide which genes to prioritize for experimental validation. The
   selection process is tedious and time-consuming. It also relies on deep
   specialized knowledge, so the results can be prone to individual
   bias. Our goal was to replace such manual triage with a recommendation
   solution, which could efficiently integrate diverse types of evidence
   and identify the most promising candidate genes driving drug
   resistance.

   By moving the problem to a recommendation domain, we encounter two major
   challenges. The first is the lack of training data. Here we are dealing
   with a highly specific molecular phenotype of a poorly understood
   origin, which prevents us from using information on resistance markers
   relevant for other, even closely related, diseases as training data.
   Second, unlike a typical recommendation scenario, in our case both
   explicit and implicit feedback are lacking. This fact limits our
   ability to gradually train and improve models. Given these constraints
   we followed an unsupervised recommendation approach, which relies on
   content-based filtering. We formalized re-ranking of CRISPR hits as a
   multi-objective optimization problem^[83]31, where diverse and
   conflicting types of evidence supporting a gene’s relevance are mapped to
   objectives. During the optimization procedure, feasible solutions
   (genes) are identified and compared until no better ones can be found. A
   crucial component of such a framework is a set of hybrid features. Each
   feature represents a distinct type of evidence, such as literature
   support or clinical and pre-clinical evidence.

   Along with the purely biological features, our recommendation system
   relied on data derived from a specially constructed heterogeneous
   biomedical knowledge graph^[84]32. Knowledge graphs provide a
   convenient conceptual representation of relationships (edges) between
   entities (nodes). In the recommendation context, knowledge graphs are gaining
   popularity as a way to introduce content-specific information and also
   to provide explanations for the resulting recommendations^[85]33. In
   addition, graph-based recommendations were shown to achieve higher
   precision and accuracy compared to alternative
   approaches^[86]34–[87]36. We used graph structural information together
   with graph-based representations to express relevance of a gene in the
   resistance context. Our assumption was that by combining graph-derived
   features with clinical ones we could discover non-obvious genes that
   drive drug resistance in lung cancer.

   In summary, in this study we explored how a question of finding drivers
   of secondary EGFR TKI resistance could be addressed as a recommendation
   problem. We demonstrate that a recommendation system based on a
   multi-objective optimization approach can be used to re-rank CRISPR
   hits in the context of secondary drug resistance. The proposed
   framework, together with an automated feature generation flow and
   interactive re-ranking interface, helped to reduce gene hit
   prioritization time from months to a few minutes.

Results

Re-ranking of CRISPR hits can be approached as multi-objective optimization

   We framed re-ranking of CRISPR hits as a multi-objective optimization
   problem. In this setting, diverse lines of evidence that support a gene’s
   relevance are treated as multiple objectives (Fig. [88]1). In other
   words, the formal goal is to simultaneously optimize k objectives,
   reflected in k objective functions: f[1](x), f[2](x), . . . , f[k](x).
   Individual functions form a vector function F(x):
   $$F(x) = \left[ f_1(x), f_2(x), \ldots, f_k(x) \right]^{T} \qquad (1)$$

   where x = [x[1], x[2], . . . , x[m]] ∈ Ω; x represents the decision
   variable and Ω the decision variable space. Therefore, multi-objective
   optimization can be defined as minimization (or maximization) of the
   objective function set F(x). With multiple competing objectives a
   singular best solution often cannot be found. However, one can identify
   a set of optimal solutions based on the notion of Pareto
   dominance^[89]37. A solution x[1] dominates solution x[2] if the
   following two conditions are true:
     * solution x[1] is not worse than x[2] according to all objectives;
     * solution x[1] is strictly better than solution x[2] according to at
       least one objective.

Fig. 1. Recommendation system takes into account diverse types of evidence to
suggest promising drivers of drug resistance in NSCLC.

   Evidence from a specially built knowledge graph, literature, and clinical
   and pre-clinical datasets is aggregated and formalized as objectives in
   a multi-objective optimization (MOO) task. Recommended solutions (genes)
   represent the optimal trade-offs between the conflicting objectives. A
   subset of recommended genes is passed on for experimental validation.

   If both conditions are true, we can say that x[1] dominates x[2], which
   is equivalent to x[2] being dominated by x[1]. A solution that is not
   dominated by any other solution cannot be improved in one objective
   without compromising at least one of the others. The set of such
   non-dominated solutions forms a Pareto front, which contains the best
   trade-offs between
   competing objectives. Therefore, by computing Pareto fronts on diverse
   sets of objectives defined based on CRISPR screen data and additional
   supporting evidence we can narrow down the number of promising markers
   of EGFR TKI resistance (Fig. [91]1).
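
   The two dominance conditions and the resulting Pareto front can be
   sketched in a few lines of Python. This is a minimal illustration, not
   the implementation used in the study: the gene names, the two objectives
   and their scores are hypothetical, and all objectives are assumed to be
   maximized.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if solution a Pareto-dominates solution b (all objectives maximized)."""
    not_worse = all(x >= y for x, y in zip(a, b))       # condition 1: not worse anywhere
    strictly_better = any(x > y for x, y in zip(a, b))  # condition 2: better at least once
    return not_worse and strictly_better


def pareto_front(scores: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of all non-dominated solutions, sorted alphabetically."""
    return sorted(
        gene
        for gene, s in scores.items()
        if not any(dominates(other, s) for name, other in scores.items() if name != gene)
    )


# Hypothetical genes scored on two objectives (both to be maximized).
toy_scores = {
    "GENE_A": (0.9, 0.2),
    "GENE_B": (0.5, 0.8),
    "GENE_C": (0.4, 0.7),  # dominated by GENE_B
    "GENE_D": (0.9, 0.1),  # dominated by GENE_A
}
print(pareto_front(toy_scores))  # ['GENE_A', 'GENE_B']
```

   Neither GENE_A nor GENE_B can be replaced by another gene without losing
   on one of the objectives, so both remain on the front.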

A hybrid set of features supports recommendation system

   To support the recommendation system we assembled a hybrid set of rich
   features (Fig. [92]1 and Supplementary Table [93]1), with an idea that
   each feature represents an objective. The selected features were
   relevant for EGFR inhibitor resistance in NSCLC and corresponded to
   distinct lines of evidence. Key feature types and rationale to consider
   them for re-ranking of CRISPR hits are summarized below.

CRISPR

   CRISPR screen data served as a starting point for re-ranking. In this
   study we relied on screens that were set up to resemble clinical
   treatment scenarios for EGFR-mutant lung cancer: NSCLC cell lines
   harboring EGFR mutations commonly found in patient populations were
   treated with first- or third-generation EGFR inhibitors. In total we
   identified a starting list of
   1550 candidate drug resistance genes^[94]38 that were labeled as
   significant after the screen analysis. We further aggregated CRISPR
   data by computing consistency metrics, which reflected the stability of a
   gene’s performance across experimental conditions. Normally, genes
   showing consistent behavior in multiple relevant conditions, e.g.,
   related cell lines or treatments, are ranked higher by domain experts.
   Altogether, seven consistency-based features were incorporated in the
   feature set: (1) three features based on the loss-of-function part of the
   screen; (2) three features based on the gain-of-function part of the
   screen; (3) a summary metric reflecting overall consistency in the full
   screen (Supplementary Table [95]1).

Literature-based metrics

   Literature search is routinely used as a first step to confirm
   experimental findings and to find support for a potential mechanistic
   hypothesis. For the EGFR inhibitor resistance problem we were primarily
   interested in the overall literature support for a gene. As a proxy of
   literature support we calculated the total number of publications that
   mention a gene in a relevant context, such as “cancer”, “resistance”,
   “EGFR”, “NSCLC”. Conveniently, the same metric, when reversed, can
   be interpreted as the novelty of a particular target. To extract literature
   mentions we analyzed a total corpus of >180,000 PubMed papers published
   between 2000 and 2019. We included aggregated literature metrics, based
   on two terms of interest: EGFR and NSCLC. For each gene we computed the
   number of papers that mention a gene together with one of these terms
   (Supplementary Table [96]1). To account for the fact that mentions in
   research papers vary drastically between genes, we also included
   normalized literature-based frequencies.
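
   A minimal sketch of how such co-mention counts and normalized
   frequencies could be derived is shown below. The corpus format (one set
   of mentioned entities per paper), the function name and the choice to
   normalize by the total number of papers mentioning the gene are
   assumptions made for illustration; the exact normalization used in the
   study is not specified here.

```python
from typing import Dict, Iterable, Set


def literature_features(
    papers: Iterable[Set[str]], gene: str, terms: Iterable[str]
) -> Dict[str, float]:
    """Count papers mentioning `gene` together with each term, plus a normalized frequency.

    Each paper is a set of mentioned entities (hypothetical format). Dividing by
    the gene's total paper count is an assumption of this sketch.
    """
    gene_papers = [p for p in papers if gene in p]
    features: Dict[str, float] = {"n_total": float(len(gene_papers))}
    for term in terms:
        n_co = sum(1 for p in gene_papers if term in p)
        features[f"n_{term}"] = float(n_co)
        features[f"freq_{term}"] = n_co / len(gene_papers) if gene_papers else 0.0
    return features


# Toy corpus: each set lists the entities mentioned in one hypothetical paper.
toy_corpus = [
    {"MET", "EGFR", "NSCLC"},
    {"MET", "EGFR"},
    {"MET"},
    {"KRAS", "NSCLC"},
]
met = literature_features(toy_corpus, "MET", ["EGFR", "NSCLC"])
print(met)
```

   Under this normalization a heavily studied gene and a rarely studied one
   become comparable: the frequency expresses what share of a gene's
   literature falls into the context of interest, not how large that
   literature is.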

Graph-derived features

   In this study we used a custom knowledge graph (KG) as a source of side
   information for the recommendation system. Our KG contained 11 million
   nodes and 84 million edges and was composed of 37 public and internal
   datasets, such as Hetionet, OpenTargets, ChEMBL and Ensembl^[97]32. In
   general, patterns of interactions between biological entities captured
   by knowledge graphs can be translated into features and consumed by
   recommendation systems in a number of ways (Fig. [98]1 and
   Supplementary Table [99]1). One way is to compute features directly on
   the graph. This includes metrics such as node degree, which reflects the
   importance of a node; PageRank, a measure of a node’s popularity^[100]39;
   and betweenness, which captures the amount of influence a node has over
   the flow of information in the graph. An alternative approach involves
   projecting the graph into a low-dimensional space, so that every node
   is transformed into a vector representation, or embedding. Embeddings
   capture critical structural properties of the graph^[101]40, so that
   nodes that were close in the graph also remain close in the
   embedding space. In this study we computed distances in the embedding
   space between each gene and two key entities of interest: “EGFR” and
   “NSCLC”. The assumption is that genes most relevant to the EGFR TKI
   resistance phenotype should be close to either lung cancer or EGFR gene
   nodes.
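
   The distance features can be illustrated with a short sketch. The
   two-dimensional vectors and the gene name below are hypothetical, and
   cosine distance is an assumption: the distance measure used in the
   embedding space is not stated in this section.

```python
import math
from typing import Dict, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_features(
    embeddings: Dict[str, Sequence[float]], gene: str, anchors: Sequence[str]
) -> Dict[str, float]:
    """Distance from a gene's embedding to each anchor entity's embedding."""
    return {
        f"dist_to_{anchor}": cosine_distance(embeddings[gene], embeddings[anchor])
        for anchor in anchors
    }


# Hypothetical 2-D embeddings; real graph embeddings have many more dimensions.
toy_embeddings = {
    "EGFR": (1.0, 0.0),
    "NSCLC": (0.0, 1.0),
    "GENE_X": (2.0, 0.0),  # points in the same direction as EGFR
}
features = distance_features(toy_embeddings, "GENE_X", ["EGFR", "NSCLC"])
print(features)
```

   In this toy example GENE_X sits next to the EGFR node and far from the
   NSCLC node, so minimizing either distance feature would treat it
   differently.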

Clinical enrichment scores

   To ensure the recommendation system captures clinical evidence, we
   included genomic data from osimertinib-treated EGFR-mutant lung cancer
   patients in the feature set. We prioritized five clinical trials:
   AURAext^[102]41, AURA2^[103]42, AURA3, FLAURA^[104]27, and
   ORCHARD^[105]43. The prevalence of genomic alterations in
   non-responders vs. responders across 355 patients treated with
   osimertinib was calculated and included in the feature set as “clinical
   enrichment score features” (Supplementary Table [106]1).

Tractability and gene essentiality

   Traditionally, drug resistance in cancer is addressed by developing
   compounds or combination therapies that modulate the activity of its key
   driver genes (targets). When a target is prioritized for drug
   development, one needs to ensure that: (1) the gene is tractable in
   principle, i.e., it is shown or predicted to bind to commonly used drug
   modalities with high affinity; (2) the gene is not essential,
   since knock-out of an essential gene can be detrimental to other cells
   in the organism, not just tumor cells. To support the first
   consideration, we included bucket tractability estimates^[107]44 for
   three modalities: antibodies, small molecules and other modalities
   (enzymes, oligonucleotides, etc.). In support of the second consideration
   we integrated DepMap^[108]45 essentiality estimates.

   In summary, the final hybrid set contained 27 rich features, supporting
   diverse criteria taken into consideration during validation of CRISPR
   hits by domain experts (Supplementary Table [109]1). The hybrid set was
   also augmented by graph-derived features and literature-based metrics.
   Correlation analysis of the hybrid feature space indicated expected
   patterns: (1) strong positive correlation between structural graph
   features, such as degree, PageRank and betweenness; (2) negative
   correlation between CRISPR features derived from knock-out and
   activation screens (Supplementary Fig. [110]7).

Interactive interface allows experts to re-rank CRISPR hits

   So far, we have defined a basic model for multi-objective optimization
   and demonstrated how to build a hybrid set of features to support
   re-ranking of CRISPR hits in the EGFR TKI context.

   In a real-world scenario, decision-making can be both iterative and
   subjective. The choice of a particular set of objectives, and of the
   direction of optimization for a given objective, varies from expert to
   expert. Each combination of objectives and corresponding directions of
   optimization might result in a differently shaped Pareto front,
   and therefore in a different set of top recommended genes.
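
   The effect of the chosen direction can be made concrete with a small
   sketch. It is an illustration under stated assumptions, not the SkywalkR
   code: the gene names, feature names and values are hypothetical, and
   minimization is handled by negating an objective so that every
   objective can be treated as maximized.

```python
from typing import Dict, List


def pareto_front(rows: Dict[str, Dict[str, float]], directions: Dict[str, str]) -> List[str]:
    """Non-dominated genes under per-objective directions ("max" or "min").

    Objectives marked "min" are negated; any other value is treated as "max".
    Only objectives listed in `directions` take part in the optimization.
    """
    signs = {name: -1.0 if d == "min" else 1.0 for name, d in directions.items()}
    oriented = {
        gene: [signs[name] * values[name] for name in signs]
        for gene, values in rows.items()
    }

    def dominates(a: List[float], b: List[float]) -> bool:
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return sorted(
        g for g, s in oriented.items()
        if not any(dominates(t, s) for h, t in oriented.items() if h != g)
    )


# Hypothetical feature table with two objectives.
toy = {
    "GENE_A": {"crispr_consistency": 0.9, "n_papers": 120.0},
    "GENE_B": {"crispr_consistency": 0.8, "n_papers": 3.0},
    "GENE_C": {"crispr_consistency": 0.6, "n_papers": 40.0},
}
# Expert 1 prefers well-supported genes; expert 2 prefers novel ones.
well_supported = pareto_front(toy, {"crispr_consistency": "max", "n_papers": "max"})
novel = pareto_front(toy, {"crispr_consistency": "max", "n_papers": "min"})
print(well_supported)  # ['GENE_A']
print(novel)           # ['GENE_A', 'GENE_B']
```

   Flipping the direction of the literature objective alone brings the
   rarely published GENE_B onto the front, which is the kind of preference
   change the two experts in this example would disagree on.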

   To accommodate diversity of opinions and to enable domain scientists to
   explore complex trade-offs between the objectives, we built an
   interactive application, SkywalkR
   [111]https://github.com/AstraZeneca/skywalkR^[112]46 (Fig. [113]2).
   SkywalkR is a Shiny app^[114]47 that operates on top of the
   pre-assembled hybrid feature set (see Supplementary Table [115]1).
   The SkywalkR app combines diverse facets of knowledge to guide
   re-prioritization of CRISPR hits for experimental validation. In
   addition, it allows domain experts to explore various trade-offs
   between objectives. It thereby stimulates exploration of
   possibilities, highlights gaps in existing knowledge and motivates
   users to adjust their expectations about optimal solutions.

Fig. 2. SkywalkR interactive interface allows users to re-rank CRISPR hits
based on various combinations of objectives.

   (A) On the sidebar panel, each objective is represented by a slider.
   Users can decide which objectives to include in the optimization and
   can also specify the direction of optimization (minimize or maximize).
   (B) Additional tools to explore the results. Relative view shows profiles
   of recommended genes. Bar plots demonstrate standardized values across
   objectives and top recommended genes. Co-occurrence heatmap
   demonstrates clusters of genes frequently mentioned together in EGFR
   TKI resistance context.

   Automated engineering of rich features, coupled with multi-objective
   optimization realized through the SkywalkR interactive interface,
   dramatically reduced the time required for gene prioritization from a
   few weeks to minutes.

Evaluation demonstrates majority of top recommendations labeled as credible
by experts

   To evaluate the recommendation framework against expert opinions we
   fixed a default set of preferences. Preferences were defined by a