Abstract

   Low-cost multi-omics sequencing is expected to become clinical routine
   and transform precision oncology. Viable computational methods that can
   facilitate tailored intervention while tolerating sequencing biases are
   in high demand. Here we propose a class of transparent and
   interpretable computational methods called integral genomic signature
   (iGenSig) analyses, that address the challenges of cross-dataset
   modeling through leveraging information redundancies within
   high-dimensional genomic features, averaging feature weights to prevent
   overweighing, and extracting unbiased genomic information from large
   tumor cohorts. Using genomic dataset of chemical perturbations, we
   develop a battery of iGenSig models for predicting cancer drug
   responses, and validate the models using independent cell-line and
   clinical datasets. The iGenSig models for five drugs demonstrate
   predictive values in six clinical studies, among which the Erlotinib
   and 5-FU models significantly predict therapeutic responses in three
   studies, offering clinically relevant insights into their inverse
   predictive signature pathways. Together, iGenSig provides a
   computational framework to facilitate tailored cancer therapy based on
   multi-omics data.

   Subject terms: Cancer models, Targeted therapies, Genome informatics,
   Cancer genomics, Statistical methods
     __________________________________________________________________

   Predicting drug responses in cancer patients requires robust
   computational frameworks. Here, the authors develop an integral genomic
   signature —iGenSig— approach to predict drug responses using
   multi-omics data from tumour samples, and validate this approach using
   genomic datasets from multiple clinical studies.

Introduction

   Precision oncology, defined as molecular profiling of tumors to achieve
   customized patient care, has entered the mainstream of cancer patient
   care^[32]1. The current standard practices for precision oncology
   include detecting actionable mutations via genetic testing (i.e., EGFR
   mutation, ALK rearrangements), or detecting small-sized predictive or
   prognostic gene signatures via targeted expression assays (i.e.,
   Oncotype DX, MammaPrint). Such assays, however, require at least one
   assay per decision, which limit their cost-effectiveness. On the other
   hand, the past ten years have observed a stunning reduction of
   sequencing costs for a human genome from $300,000 to $1000, with $100
   whole-genome sequencing expected soon^[33]2. With this rate, it is
   expected that transcriptome and genome sequencing will become the
   clinical routine for patients. With the advent of low-cost genome
   sequencing, precision oncology is at the cusp of a deep transformation
   via leveraging the big data to provide a wide array of clinical
   decision supports which is deemed to be cost-effective. The
   computational approaches that can leverage these big data to facilitate
   clinical decisions and provide tailored health care are in high demand.
   For example, in metastatic lung cancer, the target therapies prescribed
   based on the current modeling of genomic sequencing data produced only
   minimal gain of quality-adjusted life year^[34]3. Innovative and robust
   clinical big data-based decision support models for precision oncology
   will be of vital importance.

   In recent years, there has been great enthusiasm about the potential of
   artificial intelligence-based clinical decision support systems for big
   data-based precision medicine, however, to date only few examples exist
   that impact clinical practice^[35]4. The main challenge is that
   multi-OMIC big data typically contain daunting amounts of
   high-dimensional features but a limited number of subjects which poses
   great challenges to the computational power and training process of
   artificial intelligence (AI) -based methods. In addition, AI approaches
   are “black box” tools, so the algorithmic and biological mechanisms
   underlying the models are largely unknown. The modeling process is
   controlled by AI which makes it difficult to interpret complex model
   predictions and is often plagued with the problems of overfitting and
   overweighing. In addition, there is a lack of big data-based methods
   specifically addressing the insufficient performance of the prediction
   models for crossing dataset modeling resulting from the common biases
   in detected genomic features across different datasets arising from
   sequencing errors, different library preparation methods and platforms,
   discordant sequencing depth and read-length, heterogeneous sample
   qualities, and experimental variations, etc. This calls for robust,
   transparent, and explainable methods that can predict clinical
   treatment outcomes from multi-OMIC data with substantially improved
   tolerance of sequencing biases.

   In this work, we propose a class of methods for big data-based
   precision medicine called integral genomic signature (iGenSig)
   analysis, which is designed to provide more robust clinical decision
   support with higher transparency and cross-dataset applicability
   (Fig. [36]1). Due to the computational challenge of the high
   dimensionality of genomic features, a common practice for big
   data-based modeling is to reduce the dimensionality of genomic features
   via removing redundant variables highly correlative with each other for
   gene expression signature panels, or creating synthetic features for
   machine learning approaches^[37]5 (Supplementary Fig. [38]1). Here we
   propose that the redundancies within high-dimensional features can in
   fact overcome sequencing errors and bias especially when there is a
   loss of detection of a subset of correlates. Here we define the genomic
   features significantly predicting a clinical phenotype (such as
   therapeutic response) as genomic correlates, and an integral genomic
   signature as the integral set of redundant high-dimensional genomic
   correlates for a given clinical phenotype such as therapeutic response.
   The iGenSig analysis generates prediction scores based on the set of
   redundant genomic features from labeled genomic datasets of therapeutic
   responses, and then reduces the effect of feature redundancy via
   adaptively penalizing the redundant features detected in specific
   samples based on their co-occurrence assessed using unlabeled genomic
   datasets for large cohorts of human cancers from The Cancer Genome
   Atlas (TCGA) (Fig. [39]1a). This allows for preserving redundant
   genomic features and introducing de novo redundant genomic features
   during the modeling while preventing the feature redundancy from
   flattening the scoring system. With this method, we speculate that if a
   subset of the genomic features was lost due to sequencing biases or
   experimental variations, the redundant genomic features would help
   sustain the prediction score. More importantly, we expect that
   the unbiased genomic information obtained from unlabeled large cancer
   cohorts will substantially improve the cross-dataset applicability of
   the iGenSig models, particularly on clinical trial datasets. On the
   other hand, iGenSig modeling utilizes the average correlation
   intensities of significant genomic features detected in specific
   samples to diminish the effect of false-positive detection resulting
   from sequencing errors and overweighing. This method also prevents
   overfitting by dynamically adjusting the feature weights for training
   subjects. Thus, iGenSig is a simple, white-box solution with an
   integral design to tolerate sequencing errors and bias for big
   data-based precision medicine. The principle and key features of
   iGenSig modeling are summarized in Fig. [40]1b.

Fig. 1. The principle, workflow, and algorithms of the integral genomic
signature modeling approach.

   [41]Fig. 1
   [42]Open in a new tab

   a The workflow and algorithms of integral genomic signature analysis.
   The upper panel shows the calculation of the weights for significant
   genomic features that predict drug sensitivity or resistance based on
   weighted K-S tests of Act Area or AUC for each drug respectively, and
   the lower panel shows the computation of a similarity matrix for
   genomic features based on TCGA Pan-Cancer dataset to penalize the
   redundancy between the genomic features associated with each cell line
   x. The resulting sensitive or resistant genomic signature scores are
   calculated separately using the weights predictive of sensitivity or
   resistance respectively based on the indicated formula. The dot plot
   shows the sensitive and resistant iGenSig scores for all cell line
   subjects, with red and blue colors indicating sensitive and resistant
   cell lines. b Schematic showing the principle and key features of
   iGenSig modeling: (i) the iGenSig approach intentionally retains and
   creates redundant genomic features, a concept like the use of redundant
   steel rods to reinforce the pillars of a building. (ii) iGenSig
   modeling utilizes the average correlation intensities of significant
   genomic features detected in specific samples to diminish the effect of
   false-positive detection resulting from sequencing errors and prevent
   overweighing. (iii) iGenSig modeling extracts the second genomic
   information from unlabeled genomic datasets for large cohorts of human
   cancers, in addition to the labeled genomic datasets of drug
   sensitivity, which will substantially improve its cross-dataset
   applicability, particularly on clinical trial datasets. (iv) iGenSig
   modeling is a white-box approach, thus will be more interpretable and
   controllable than machine learning or deep learning approaches.

Results

Development of the integral genomic signature approach based on a genomic
dataset of drug sensitivity

   To develop the iGenSig modeling, we utilized the drug sensitivity
   measurements of chemical perturbations, gene expression profiling data,
   and exome sequencing data for 989 cancer cell lines released by Genomic
   Datasets of Drug Sensitivity^[43]6 (GDSC, Supplementary Table [44]1).
   For the drug response measurements, we used high Act Area, the area
   above the fitted dose-response curve (or 1-AUC), to define a sensitive
   drug response, and high AUC, the area under the dose curve, to define a
   resistant response. According to the literature, the AUC and Act Area
   are much better quantifiers of drug responses than IC50^[45]7. To
   uniform multi-OMIC features, we formulated a Genomic-feature Matrix
   Transposed (GMT) format for compiling binary multi-OMIC features,
   similar to that used for compiling gene concepts^[46]8,[47]9. Using
   this format, we analyzed the expression profiling data and exome
   sequencing data from GDSC and compiled an integrated dataset combining
   the genomic features including upregulated genes, downregulated genes,
   mutated genes, and mutation hotspots. To increase the cross-dataset
   applicability of the iGenSig models, we intentionally introduced de
   novo feature redundancy by generating twelve overlapping levels of
   differentially expressed gene lists (Fig. [48]1a). We then selected
   significant genomic correlates using a weighted Kolmogorov–Smirnov
   (K-S) test that ranks the enrichment of each genomic feature in the
   cell line panel sorted decreasingly by Act Area or AUC, similar to that
   implemented by Gene Set Enrichment Analysis (GSEA)^[49]10. Next, we
   leveraged the TCGA Pan-Cancer RNAseq and exome dataset for 9532 tumors
   to quantify the co-occurrence between genomic features associated with
   each cell line based on similarity measures, which were then used to
   calculate a redundancy penalty score for each genomic feature. This
   provides unbiased information about the feature redundancy based on
   unlabeled extra-large patient cohorts.

   To prevent the bias from overfitting, we used a random collection of
   80% GDSC cell lines as a train set and the rest 20% as an internal test
   set for assessing the performance of the model. A total of five
   train/test sets are generated for modeling through random permutations.
   We then performed iGenSig modeling for 364 drugs that elicit a
   negatively skewed drug response distribution in cancer cell lines
   indicating the narrow effect of outstanding responses and have at least
   20 sensitive cell line subjects indicating the availability of
   outstanding responders. To benchmark the performance of the models, we
   discretized the cell lines into drug-sensitive and non-sensitive groups
   based on a waterfall method established in a previous study^[50]11, and
   calculated the Area Under ROC Curve (AUROC) for each drug. As a result,
   196 drugs showed an AUROC > 0.75 on the testing sets (54%), and 20
   drugs showed a ROCAUC >0.85 (Fig. [51]2a and Supplementary Data [52]1).
   Many of the top-performing drugs are FDA-approved chemotherapy or
   targeted therapy agents for cancer treatment, such as Lapatinib,
   Vincristine, Venetoclax, Epirubicin, Niraparib, and Afatinib. The
   top-performing drug models include targeted therapies against
   well-known cancer targets such as ERBBs, HDAC, BCL2, JAKs, PARP, ERK,
   CDKs, etc., and Lapatinib, Vincristine, and CAY10603 presented the best
   performing models with an average AUROC more than 0.9. The predictive
   powers of the iGenSig models appear to obviously correlate with the
   number of available genomic correlates for each drug (Spearman
   R = 0.60, Fig. [53]2b), suggesting that the iGenSig models rely on the
   available genomic information that can predict drug responses. The
   iGenSig scores negatively correlate with the AUC drug measurements in
   cell lines with a similar trend in both training and testing sets as
   exemplified by the Lapatinib model (Fig. [54]2c), suggesting that
   iGenSig modeling do not overfit toward the training set. Next, we
   clustered the drug's target kinase signaling based on their iGenSig
   scores in GDSC cell lines, which resulted in distinctive clustering of
   the drugs targeting the same or similar kinases (Fig. [55]2d).
   Interestingly, the Pan-cancer cell lines form five distinctive
   sensitivity clusters toward the drugs targeting the five kinase
   pathways. Outstanding response predictions for BRAF/MEK inhibitors are
   preferentially enriched in melanoma cell lines, while other drugs such
   as EGFR inhibitors exhibit cancer-type agnostic iGenSig scores,
   consistent with the tumor-type related clinical activities of these
   drugs.

Fig. 2. The performance of iGenSig models in predicting the drug responses of
GDSC cell lines.

   [56]Fig. 2
   [57]Open in a new tab

   a The performance of the iGenSig models for GDSC profiled drugs was
   assessed by their average AUROC. About 364 drugs that show a negatively
   skewed drug response distribution in cancer cell lines and have at
   least 20 sensitive cell lines are included in the analyses. The drugs
   with top-performing models (AUROC >0.85) are shown in bar chart on the
   right. The average AUROC for each drug was calculated based on five
   train/test sets. b Correlating the performance of the iGenSig models
   for 364 drugs with their average number of significant genomic
   features. The drug models assessed on the six clinical trial datasets
   are highlighted in red. c the performance of the iGenSig model for
   Lapatinib in predicting the response of GDSC cell lines based on a
   representative training and testing set. Left, sensitive, and resistant
   GenSig scores for GDSC cell lines. Middle, the correlation of the
   iGenSig scores with AUC measurements for Lapatinib. Right, the receiver
   operating characteristic (ROC) curve for predicting sensitive responses
   to Lapatinib. As the golden standard for the ROC curve, the cell line
   subjects in the test set are divided into sensitive and non-sensitive
   groups based on the AUC measurements for Lapatinib using the cutoff
   determined by the waterfall method (see Methods). d Ward D2
   Hierarchical Clustering for GDSC cancer cell lines and targeted kinase
   drugs of five RTK signalings based on iGenSig scores. The drugs
   targeting different kinases or different kinase families form
   distinctive clusters.

iGenSig models show no performance loss on the independent validation dataset
for drug sensitivity

   To assess the cross-dataset performance of our iGenSig models, we
   analyzed the RNAseq and exome sequencing data from the Cancer Cell Line
   Encyclopedia (CCLE)^[58]11. In total there are 14 drugs measured by
   both CCLE and GDSC datasets. Our result showed that the predictive
   performance of iGenSig models on the CCLE dataset appears to correlate
   with their performance on the testing sets of the GDSC dataset (Pearson
   R = 0.58, Fig. [59]3a). Using GDSC as a training set and CCLE as a
   validation set, the models for four drugs achieved AUROC of more than
   0.8. These include Irinotecan, Nilotinib, Lapatinib, and Erlotinib, for
   which the AUROC for prediction are 0.902, 0.873, 0.857, and 0.812
   respectively (Fig. [60]3b). Plotting the significant genomic features
   for Erlotinib in the two datasets revealed a consistent integral
   genomic signature correlating with drug-sensitive or resistant
   responses (Fig. [61]3c), as opposed to the modest consistency of the
   drug sensitivity measurements^[62]12. This is in line with the previous
   report about the greater agreement of prediction models than raw
   sensitivity values^[63]13, which could be attributed to the number of
   cell lines screened by both GDSC and CCLE for which insufficient
   sensitive cell lines were screened in both projects^[64]14, or due to
   the difference in cellular states under different cell culture
   conditions. It is interesting to note that the predictive performance
   of iGenSig models resulting from the permutated training sets on the
   CCLE validation dataset showed much lower deviations compared to that
   on the GDSC testing dataset (Fig. [65]3a). This may be attributed to
   the much smaller number of sensitive subjects in the GDSC testing
   datasets compared to the CCLE validation dataset.

Fig. 3. Predictive values of iGenSig models developed from the GDSC
pharmacogenomic dataset on the sensitivity of CCLE profiled cell lines to the
shared drugs.

   [66]Fig. 3
   [67]Open in a new tab

   a The performance of GDSC iGenSig models in predicting the responses of
   GDSC testing cell lines and CCLE cell lines to 14 drugs shared between
   the two datasets. 80% of GDSC cell lines are used for building the
   iGenSig models and 20% of GDSC cell lines are used for testing. 100% of
   CCLE cell lines are used for cross-dataset validations. If the same
   drug is profiled by both GDSC batch 1 and 2, the drug sensitivity data
   from batch 1 are used in the analysis. The error bars show standard
   deviations. b The predictive values of the iGenSig models developed
   from GDSC data on the CCLE cell lines treated with Irinotecan,
   Nilotinib, Lapatinib, or Erlotinib. The upper panel shows the
   correlation between the iGenSig scores and the Act Areas of the
   respective drugs for CCLE cell lines. The horizontal dashed lines show
   the cut-offs for sensitive (red) and resistant (blue) calls. The
   vertical dashed line shows the optimal cut-off for iGenSig scores
   determined based on AUROC. The lower panel shows the ROC curves of
   iGenSig scores in determining the sensitive cell lines vs non-sensitive
   cell lines. c GDSC and CCLE cell lines show consistent integral genomic
   signature that correlates with Erlotinib responses. The significant
   genomic features (n = 8540) based on K-S tests are shown in the figure.
   The GDSC and CCLE cell lines are first sorted by their sensitive
   iGenSig scores; the cell lines with sensitive iGenSig scores less than
   the median are then sorted by the resistant iGenSig cores. The cell
   lines that have been tested for Erlotinib chemical perturbations are
   shown in the figure, and the sensitive and resistant cell line subjects
   are indicated as yellow and blue bars.

   To test if the iGenSig predictions rely on the genomic features of the
   primary drug targets, we removed the drug target genes for Erlotinib,
   Lapatinib, or Nilotinib from GDSC and CCLE genomic feature sets. We
   then built the iGenSig models for these drugs based on the genomic
   features devoid of drug targets and assessed their performance on GDSC
   internal test set or the CCLE validation set. Our result showed that
   the performances of these iGenSig models are not affected by deleting
   the genomic features for known drug targets (Supplementary
   Fig. [68]2a). Furthermore, we examined if excluding the hematologic
   cancer cell lines such as leukemia and lymphoma from the GDSC training
   dataset can improve the prediction performances of iGenSig models on
   the drug sensitivity of CCLE solid cancer cell lines. Our results,
   however, did not significantly improve the performance of the fourteen
   drug models, but instead, this approach slightly decreased the overall
   performance (Supplementary Fig. [69]2b). This suggests that there may
   be predictive genomic information gained from these hematologic cancer
   cell lines as well. We thus used the models developed from the
   Pan-cancer cell line dataset in the following analysis.

The iGenSig model predicts the response of patient subjects to Erlotinib
treatment in the BATTLE trial and SAKK 19/05 trial

   Next, we sought to test the applicability of the GDSC iGenSig models in
   predicting therapeutic responses of patient subjects in clinical
   trials. We first focused on EGFR inhibitors for which our iGenSig
   models showed excellent cross-dataset performance. Most of the clinical
   trials for targeted drugs assessed their combinations with
   chemotherapies instead of monotherapies, which we postulate could
   confound the outcome of drug response prediction. Our further
   literature investigation revealed that a genomic study of the BATTLE
   trial ([70]GSE33072) profiled non-small cell lung cancer (NSCLC) tumors
   from 131 patients by gene expression array, among which 28 patients are
   treated with Erlotinib monotherapy, 47 patients are treated with
   Sorafenib monotherapy, and 20 patients are treated with vandetanib.
   Overall, the patient responses to Erlotinib in this trial are limited,
   and all patients treated with Erlotinib progressed within six months.
   This may be due to the selection of pretreated chemorefractory NSCLC
   patients as enrollment criteria^[71]15. Despite this, progression-free
   survival (PFS) analysis suggested that our GDSC iGenSig model for
   Erlotinib significantly predicted the favorable response of these
   patients in the Erlotinib arm, with a hazard ratio of 0.2 (p = 0.005,
   Fig. [72]4a, left). Among the three major treatment arms of this trial,
   the GDSC Erlotinib model showed a specific predictive effect on the
   Erlotinib arm compared to the Sorafenib or Vandetinib arms
   (Fig. [73]4a, right).

Fig. 4. Predictive values of the iGenSig model for Erlotinib developed from
GDSC cell line pharmacogenomic data on the survival of patient subjects from
the US BATTLE trial and Swiss SAKK 19/05 trial.

   [74]Fig. 4
   [75]Open in a new tab

   a Left, Kaplan–Meier plot showing the predictive values of GDSC iGenSig
   model for Erlotinib on the patients from the U.S. BATTLE trial. A
   data-driven cut point of high iGenSig scores was determined as
   described in Methods. The P value is based on the log-rank test. Right,
   the differences in iGenSig scores among patients that achieved (Y) or
   did not achieve (N) 8-week disease control in the Erlotinib, Sorafenib,
   and Vandetanib treatment arms. Patients with EGFR or KRAS mutations are
   depicted with red or blue colors. b The predictive values of the GDSC
   iGenSig model for Erlotinib on the patient subjects from the Swiss SAKK
   19/05 trial. Left, the ROC curve showing the performance of sensitive
   iGenSig scores on predicting the objective responses of patient
   subjects at 12 weeks following Erlotinib and Avastin treatment in the
   Swiss SAKK 19/05 clinical trial. Right, the predictive value of the
   iGenSig model for Erlotinib does not depend on EGFR mutation status.
   The box plots in a, b show the minima, first quartile, median, third
   quartile, and the maxima. c The network of upregulated and
   downregulated pathways characteristic of Erlotinib sensitive GDSC cell
   line subjects. The top upregulated and downregulated pathways clustered
   in the respective interconnected networks are shown in the figure. The
   CSEA enrichment score for each pathway in the Erlotinib sensitive
   signature is depicted by the size of each node. The pathway
   associations are depicted by the thickness of the edge. The pathway
   associations are calculated based on CSEA association scores between
   each pair of pathways. d Heatmap showing the associations of EMT
   markers and master transcription factors, as well as ZEB1 and MYC
   target genes with the sensitive iGenSig scores for Erlotinib in the
   GDSC, BATTLE, and SAKK 10/05 clinical trial datasets. The cell lines
   and patient subjects are sorted based on their sensitive iGenSig
   scores. The boxplot elements in 4a and 4b indicate the max, 75th
   percentile, median, 25th percentile, and min. The p values shown in the
   box plots of 4a, b are based on one-sided student’s t-tests.

   Next, we examined the predictive value of this model on a Swiss SAKK
   19/05 trial that tested the combination of Erlotinib and bevacizumab
   (Avastin)^[76]16. Recent evidence suggested that the addition of
   Bevacizumab to Erlotinib exhibits increased therapeutic efficacy. As
   bevacizumab alone is known to lack efficacy in lung cancer, this effect
   is thought to be the result of enhanced erlotinib activity^[77]16. The
   SAKK 19/05 trial is a multi-center single-arm trial in previously
   untreated patients. The endpoint provided by this study is objective
   response at 12 weeks after Erlotinib and bevacizumab treatment, and no
   survival data are available. Our result showed that the GDSC iGenSig
   model for Erlotinib showed a predictive AUROC of 0.795 (Fig. [78]4b,
   left), and this predictive value is independent of EGFR mutation status
   (Fig. [79]4b, right). On the other hand, out of the four patients with
   EGFR mutated tumors, only the tumor showing the highest iGenSig score
   exhibited an objective response. This suggests that while EGFR
   inhibition is indicated for EGFR mutated patients, a subgroup of EGFR
   wild-type patients may derive significant benefit from EGFR inhibitors
   as well, which could be identifiable by the iGenSig model.
   Consistently, clinical studies suggest that EGFR inhibition should not
   be confined to EGFR mutated lung cancer patients^[80]17. In addition to
   clinical trial datasets, we also applied our iGenSig model to a set of
   PDX models treated with Erlotinib and profiled with RNAseq and WXS,
   which revealed significant predictive value as well (HR = 0.12,
   p = 0.0001, Supplementary Fig. [81]3). Taken together, these results
   support the utility of integral genomic signature modeling in
   predicting therapeutic responses of EGFR inhibition and its excellent
   cross-dataset performance.

Biological interpretation of the iGenSig model yielded insights into the
predictive signature pathways

   Since epithelial-mesenchymal transition (EMT) has been previously
   reported to mediate EGFR resistance in the BATTLE trial study^[82]18,
   we wonder if the EMT signature contributes to the iGenSig predictions.
   We thus examined the pathways characteristic of the integral genomic
   signature for Erlotinib sensitivity in our iGenSig model. This can be
   achieved by extracting the genes contributing to the genomic features
   predicting sensitive responses in our GDSC iGenSig model. The resulting
   gene list can be then used to explore the enriched pathways based on
   the concept signature enrichment analysis (CSEA) developed in our
   previous study, which is designed for deep functional assessment of the
   pathways enriched in an experimental gene list^[83]9. Our result showed
   that the most significantly downregulated pathways characteristic of
   Erlotinib sensitive responses include MYC and E2F target gene
   signatures (Fig. [84]4c). Consistent with this, amplification of MYC
   has been found to mediate resistance to EGFR inhibitors, and targeting
   MYC was proposed as a promising strategy to overcome acquired
   resistance^[85]19. On the other hand, the EMT pathway is ranked as one
   of the most significantly upregulated pathways in the Erlotinib
   sensitive signature identified from GDSC cell lines, which contradicts
   the previous report^[86]18.

   We postulated that this may be attributed to the content of the EMT
   signature that mixed both upregulated and downregulated genes in EMT.
   We thus compiled an upregulated EMT signature and a downregulated EMT
   signature based on a previous report^[87]20. Correlating these EMT
   signatures with the Erlotinib iGenSig scores revealed that the
   downregulated and upregulated EMT signatures are indeed enriched in the
   subjects with high or low iGenSig scores respectively in the BATTLE
   trial dataset (Supplementary Fig. [88]4). However, in the GDSC
   Pan-Cancer cell line panel, both upregulated and downregulated EMT
   signatures are repressed in Erlotinib-resistant cell lines. This
   suggests that the repressions of both EMT signatures are characteristic
   of the Erlotinib-resistant cell lines at the Pan-cancer scale and
   explains the contradicting pathway enrichment results from CSEA
   analysis. Among the known EMT markers and transcription factors,
   overexpression of E-cadherin (CDH1) was observed in both sensitive cell
   lines and patient responders from BATTLE trial (Fig. [89]4d). Whereas
   overexpression of EMT markers such as N-cadherin (CDH2), Vimentin
   (VIM), and β-catenin (CTNNB1) are characteristic of the cell lines with
   intermediate sensitivity, and overexpression of ZEB1 is characteristic
   of Erlotinib-resistant cell lines. In the BATTLE trial, overexpression
   of β-catenin and/or ZEB1 are characteristic of subjects with low
   iGenSig scores. As ZEB1 is a transcriptional repressor, we assessed the
   correlation of ZEB1 target genes with the iGenSig scores. This revealed
   that the downregulation of ZEB1 target genes is characteristic of both
   resistant cell lines and patient subjects in the two clinical trials
   (Fig. [90]4d). On the other hand, MYC target genes appear to be
   upregulated in the most Erlotinib-resistant cell lines. Together, our
   results suggest that EMT is associated with reduced but intermediate
   response to Erlotinib whereas repression of ZEB1 signature and
   upregulation of MYC signature is associated with tumor-type agnostic
   resistance at the Pan-cancer scale. It is interesting to note that the
   iGenSig scores for the BATTLE trial appear to accentuate the epithelial
   signature on high-scored subjects, whereas the iGenSig scores for the
   SAKK 10/05 trial accentuate the MYC signature on low-scored subjects.
   This may explain their better performance in predicting sensitive or
   resistant subjects respectively (see Discussion).

The iGenSig model predicted patient response to 5-FU treatment in a French
CIT multi-center study

   Next, we sought to examine the utility of iGenSig modeling in
   predicting chemotherapy response. While most clinical studies of
   chemo-agents focus on testing combination regimens, we identified a
   multicenter clinical study carried out by the French Cartes d’Identité
   des Tumeurs (CIT) program that tested 5- Fluorouracil (5-FU)
   monotherapy on postsurgical colon cancer patient^[91]21. In addition,
   this study also tested combination chemotherapy regimens such as
   FOLFIRI, FOLFOX, and FUFOL. 5-FU is an antimetabolite drug, and is one
   of the most commonly used drugs for cancer treatment, particularly for
   colorectal cancer^[92]22. The GDSC iGenSig model for 5-FU significantly
   predicted patient overall survival in the 5-FU monotherapy arm
   (p = 0.002), with a hazard ratio of 0.27 (Fig. [93]5a). Whereas this
   predictive effect was diminished in the treatment arms testing
   combination chemotherapies containing 5-FU (Fig. [94]5b). To examine if
   this is due to the therapeutic effect exerted by other chemo-agents, we
   examined the FOLFIRI arm testing the combination of folinic acid, 5-FU,
   and irinotecan, for which the iGenSig models for the latter two drugs
   are available. Among the alive patients in this arm, two patients
   showed high iGenSig scores by both models, whereas the other three
   patients showed high scores by either of these two models
   (Fig. [95]5c). This suggests that in the two alive patients with low
   5-FU iGenSig scores, the therapeutic effects may be derived from
   irinotecan.

Fig. 5. Predictive values of the iGenSig model for 5-FU developed from GDSC
cell line pharmacogenomic dataset on patient survival in the French CIT
multicenter postsurgical colon cancer patient cohort.

   [96]Fig. 5
   [97]Open in a new tab

   a Kaplan–Meier plot showing the predictive values of the GDSC iGenSig
   model for 5-FU on the overall survival of patients from the French CIT
   cohort treated with 5-FU monotherapy. A data-driven cut point of high
   iGenSig scores was determined as described in Methods and the P value
   is calculated based on the log-rank test. b The predictive values of
   the GDSC iGenSig model for 5-FU on the overall survival of all patient
   subjects from the CIT cohort treated with different adjuvant
   chemotherapy regimens or untreated. The BRAF and KRAS mutation status
   for each subject are indicated by colored dots. The boxplot elements
   indicate the maxima, 75th percentile, median, 25th percentile, and
   minima. The p values are based on one-sided student’s t-tests. c The
   predictive values of the GDSC iGenSig models for irinotecan and 5-FU on
   the patient subjects treated with the FOLFIRI regimen in the CIT study.
   The patients are stratified based on their overall survival and their
   sensitive iGenSig scores based on the two models are plotted. d The
   network of upregulated and downregulated pathways characteristic of
   5-FU sensitive GDSC cell line subjects. The top upregulated and
   downregulated pathways clustered in the respective interconnected
   networks are shown in the figure. The CSEA enrichment score for each
   pathway in the Erlotinib sensitive signature is depicted by the size of
   each node. The pathway associations are depicted by the thickness of
   the edge. The pathway associations are calculated based on CSEA
   association scores between each pair of pathways. e Heatmap showing the
   associations of EMT markers and master transcription factors, ZEB1 and
   MYC target genes, and interferon γ responsive genes with the sensitive
   iGenSig scores for 5-FU in the GDSC and CIT datasets. The cell lines
   and patient subjects are sorted based on their sensitive iGenSig scores
   for 5-FU.

   Next, we examined the pathways enriched in the 5-FU sensitive iGenSig
   signature. Interestingly, as opposed to the Erlotinib signature, the
   EMT pathway is ranked as the most downregulated pathway in the
   sensitive GDSC cell lines, whereas the MYC target gene signature and
   interferon γ signature are revealed as top upregulated pathways
   associated with sensitive responses (Fig. [98]5d, e). This suggests
   that the tumors that are resistant to EGFR inhibitors may be sensitive
   to the 5-FU treatment. Fascinatingly, consistent with this, it is known
   that EGFR wild-type tumors show higher sensitivity to uracil-tegafur
   than EGFR mutated tumors in lung cancer^[99]23, whereas EGFR inhibition
   has been found to sensitize 5-Fu-resistant colon cancer cells^[100]24.
   Moreover, the interferon γ signature is associated with an inflammatory
   response triggered by the double-strand breaks resulting from the DNA
   damaging effect of 5-FU. The upregulation of Interferon γ regulated
   genes in cancer cells may confer better therapeutic effects through the
   interferon γ induced growth arrest and apoptosis in cancer
   cells^[101]25,[102]26, and this signature appears to be captured from
   leukemia and lymphoma cell lines in the GDSC panel (Supplementary
   Fig. [103]5).

Clinical studies of combination drug therapies revealed confounding factors
of iGenSig models

   We then went on to further explore the predictive values of GDSC
   iGenSig models on the clinical trials testing the combination of
   targeted therapy with chemotherapy, or combinatory chemotherapy
   regimens. To achieve this, we identified three large gene expression
   datasets from breast cancer clinical studies testing the drug
   combinations containing the drugs with favorable GDSC iGenSig models.
   We first examined the predictive value of the lapatinib model on the
   CALGB40601 trial, a neoadjuvant phase III trial that tested the
   combination of Lapatinib with Paclitaxel in Her2-positive breast cancer
   in one of the three treatment arms^[104]27. As we expected, our results
   showed that the GDSC iGenSig models derived from lapatinib or
   paclitaxel single-drug treatment has limited predictive effects on
   these combination treatments, with an AUROC of 0.61 and 0.52
   respectively. Since overexpression of estrogen receptor (ER) is known
   to confer resistance to HER2-targeted therapy through driving
   HER2-independent signaling^[105]28, thus may confound the predictive
   effect of our models, we stratified the patients based on their ER
   positivity. Following stratification, the lapatinib model showed
   significant predictive value in ER-negative patients (p = 0.04,
   Fig. [106]6a). In contrast, the paclitaxel model showed only a modest
   association with the pathological response (pCR) (Fig. [107]6b). This
   may be explained by the major clinical benefits derived from the
   HER2-targeted agents in HER2-positive patients. Next, we examined the
   predictive contributions of lapatinib and paclitaxel models as well as
   other clinical variables such as tumor stage, PR status, menopausal
   status, and patient age on patient outcome. By calculating Pseudo-R^2
   using logistic regression models (see Methods), the lapatinib model
   ranks the highest in explaining the variation in patient response,
   followed by PR status, and the composite model based on both lapatinib
   and paclitaxel models produced a higher predictive value than the
   individual models (Fig. [108]6c).

Fig. 6. The predictive values of the GDSC iGenSig models on the survival of
patient subjects from the CALGB40601 trial treated with lapatinib and
paclitaxel, and from neoadjuvant clinical studies testing
taxane-anthracycline-based combination chemotherapies.

   [109]Fig. 6
   [110]Open in a new tab

   a The predictive values of the GDSC iGenSig model for lapatinib on the
   pathological responses (breast and axilla) of the patient subjects from
   arm 3 (lapatinib + paclitaxel) of the CALGB40601 trial. The patients
   are stratified based on their ER positivity. b The associations of the
   GDSC iGenSig models for lapatinib and paclitaxel with the pathological
   responses (breast and axilla) of the ER-negative patient subjects from
   arm 3 of the CALGB40601 trial. c Pseudo R2 showing the percentage of
   contribution explaining the variance in pathological responses of
   ER-negative patient subjects for GDSC lapatinib and paclitaxel models,
   their composite model, and other confounding clinical variables. d
   Kaplan–Meier plot showing the predictive values of the GDSC iGenSig
   model for Paclitaxel on the distant recurrence-free survival of
   patients with basal-like TNBC treated with neoadjuvant
   taxane-anthracycline chemotherapy. A data-driven cut point of high
   iGenSig scores was determined as described in Methods and the P value
   is determined based on the log-rank test. e The interactions of
   confounding clinical variables with the GDSC paclitaxel model were
   assessed based on multiple logistic regression models for overall
   survival status. The bar plot shows the −log10 transformed p values of
   Chi-Square tests comparing the pair-wise multiple logistic regression
   models with the simple logistic regression model for the GDSC
   paclitaxel model. f The predictive value of the GDSC iGenSig model for
   paclitaxel and Epirubicin on the pathological responses of the enrolled
   patient subjects stratified based on tumor stages. The source cohorts
   of patient subjects are indicated by colored dots. g The predictive
   values of the GDSC paclitaxel model on the pathological complete
   response of breast cancer patients to P-FEC chemotherapy stratified
   based on tumor stage. The receptor status of each patient is indicated
   by colored dots. h The AUROC of the GDSC models for Paclitaxel,
   Epirubicin, and 5-FU or their composite model on predicting the
   pathological complete responses of stage I-II breast cancer patients
   treated with P-FEC in the OUH study. The p values annotated on panels
   a, f, and g are based on one-sided student’s t-tests. The boxplot
   elements in 6a, 6f, and 6g indicate the maxima, 75th percentile,
   median, 25th percentile, and minima.

   Next, we examined the predictive effect of the paclitaxel model on a
   prospective neoadjuvant multicenter clinical study for stage I-III
   ER-negative breast cancer patient cohort treated with chemotherapy
   containing sequential taxane and anthracycline-based regimens^[111]29.
   The GDSC paclitaxel model showed a moderate predictive effect on
   distant recurrence-free survival (p = 0.085), with a hazard ratio of
   0.65 (Fig. [112]6d). Next, we sought to examine if anthracycline
   treatment and other clinical variables such as AJCC stage, tumor grade,
   node status, receptor status, etc. may confound the iGenSig models.
   Among anthracyclines, epirubicin is preferred for treating ER-negative
   breast cancer patients and has the best GDSC iGenSig model. We thus
   used the iGenSig model for epirubicin as a predictor for the
   anthracycline treatment. The interactions of the paclitaxel model with
   these confounding variables are assessed using logistic regression (see
   Methods). Our result showed that the most confounding variable appear
   to be AJCC stage (p = 0.048), followed by the Epirubicin model
   (p = 0.069) and tumor grade (p = 0.093) (Fig. [113]6e). When stratified
   by the AJCC stage, both the Lapatinib model and the Epirubicin model
   showed significant predictive values in stage I-II tumors, which were
   diminished in stage III tumors (Fig. [114]6f).

   To verify this result, we examined another neoadjuvant clinical study
   carried out in a Japanese breast cancer patient cohort at Osaka
   University Hospital (OUH) testing neoadjuvant paclitaxel followed by
   5-fluorouracil, epirubicin, and cyclophosphamide (P-FEC)^[115]30. This
   study included both ER-positive and negative patients, and only pCR is
   provided as a clinical endpoint. Consistently, the paclitaxel model
   showed significant predictive value on pCR in the stage I-II breast
   cancer patients only (p = 0.0004). Within the Stage I-II patient group,
   the iGenSig models for paclitaxel, 5-fluorouracil, and epirubicin
   showed an AUROC of 0.74, 0.71, and 0.67 respectively, and the composite
   predictive models based on all three models achieved an AUROC of 0.81.
   Taken together, these data suggest that in combination therapy, the
   iGenSig models derived from single-drug treatment data can be
   confounded by the therapeutic benefits derived from other drugs in the
   combination, as well as confounding drug resistance factors, such as ER
   overexpression and late tumor stage. Stratifying the patients based on
   known confounding factors and combining the models for different drugs
   included in the regimen will help better observe the predictive effect
   of the iGenSig models on patient response to combination drug
   therapies.

Comparing the iGenSig algorithm with standard machine learning algorithms on
modeling drug responses

   Next, we sought to compare the performance of the iGenSig algorithm
   with standard machine learning or deep learning algorithms. First, we
   compared the iGenSig algorithm with the ridge regression and Support
   Vector Regression (SVR), which are the limited available machine
   learning algorithms capable of carrying out predictive modeling using
   ultra-high-dimensional features. In addition, we also performed ridge
   regression and Support Vector Machine (SVM) modeling based on binomial
   drug sensitivity labels. For AI-based methods, we computed the
   unsupervised representation of the genomic features for dimensionality
   reduction based on the autoencoder deep learning method, as previously
   reported^[116]31,[117]32. The resulting synthetic features were then
   fed to the machine learning methods for supervised learning on drug
   responses, such as elastic net, support vector machine (SVM), or Random
   Forest (RF) (Supplementary Fig. [118]6). We then applied these
   algorithms to model cancer cell sensitivity to the fourteen drugs
   shared by GDSC and CCLE datasets, and to model patient responses in the
   six clinical trial datasets, including BATTLE, SAKK 19/05, French CIT,
   CALGB40601, neoadjuvant taxane-anthracycline, and P-FEC studies. Our
   results showed that iGenSig models have the highest overall performance
   on clinical trial datasets among all modeling methods (Fig. [119]7).
   Ridge regression and SVR/SVM models worked better than AI-based models
   and are comparable to the iGenSig models on the CCLE validation dataset
   (Fig. [120]7a). This suggests that dimensionality reductions during
   AI-based modeling may reduce the cross-dataset prediction performance.
   The better performance of iGenSig models on clinical trial datasets
   supports the utility of our algorithm design in improving cross-dataset
   modeling for patient subjects via leveraging the redundancy within the
   integral genomic signature and extracting unbiased genomic information
   from unlabeled large tumor cohorts (Fig. [121]7b–g). More important,
   iGenSig is a white-box algorithm that has the much-needed transparency
   required for clinical application compared to the machine learning
   algorithms based on complex mathematical systems, which allows for
   easier interpretation of the biological mechanisms underlying the
   models.

Fig. 7. Comparisons between the iGenSig algorithm and standard machine
learning algorithms on modeling drug responses.

   [122]Fig. 7
   [123]Open in a new tab

   a The Prediction performance of the iGenSig algorithm and machine
   learning methods on the CCLE validation dataset for 14 drugs. If a drug
   is profiled by both GDSC batch 1 and 2, the drug sensitivity data from
   batch 1 are used in the analysis. b–g The Prediction performance of the
   iGenSig algorithm and machine learning methods on US BATTLE trial (b),
   Swiss SAKK 19/05 trial (c), CALGB40601 trial (d), neoadjuvant
   taxane-anthracycline study (e), neoadjuvant P-FEC study (f), and the
   French CIT colon cancer clinical study (g). The predictive models for
   the respective drugs annotated under each plot were generated based on
   80% cell lines from GDSC with five permutated training sets. For ML
   methods, the supervised learning was directly performed on the original
   high-dimensional genomic features using Gaussian family ridge
   regression (G-ridge) and support vector regression (SVR) based on drug
   sensitivity measurements or using Binary family ridge regression
   (B-ridge) and support vector machine (SVM) based on binary sensitivity
   labels. For AI methods, the unsupervised learning was performed by
   autoencoder (AE) and supervised learning was performed using various
   machine learning tools including elastic net (EN), random forest (RF),
   and support vector machine (SVM) based on binary sensitivity labels.
   The p values are based on two-sided student’s t-tests. The AUROC for
   predicting patient binary responses (pCR for breast cancer clinical
   studies or objective response for the SAKK 19/05 trial) or the
   concordance of the predictive scores with patient survival (PFS for the
   BATTLE trial and OS for the CIT cohort) are shown in the figure
   depending on the best available clinical endpoints for these studies.
   The mean and standard deviations are shown in the violin plots.

   Moreover, we also compared iGenSig with machine learning algorithms on
   a PDX genomic dataset that measured the sensitivity of Pan-cancer PDX
   tumors against 36 single drugs^[124]33. These include nine GDSC
   profiled drugs, seven of which have more than one PDX responder in this
   dataset. Our results show that the GDSC models for these seven drugs
   showed diverse predictive values on these PDX models so all modeling
   methods showed large deviations in predictive results for different
   drugs (Supplementary Fig. [125]7). The iGenSig, ridge regression, and
   SVR/SVM models showed comparable predictive performances, all of which
   are better than AI-based methods. We speculate that the observed large
   deviations could be in part attributed to the tumor microenvironment
   factors important for some drugs that cannot be modeled using in vitro
   cultured cell lines (as discussed below) or the lack of responders in
   this dataset.

Discussion

   Here we introduce a class of white-box methods called integral genomic
   signature analyses that leverage the high-dimensional redundant genomic
   features as an integral genomic signature to enhance the
   transferability of multi-omics-based modeling for precision oncology, a
   concept like the use of redundant steel rods to reinforce the pillars
   of a building (Fig. [126]1b). The iGenSig method is designed to address
   the transparency, cross-dataset applicability, and interpretability
   issues for big data-based modeling. The iGenSig models demonstrated
   improved performances in cross-applicability to clinical trial
   datasets, tolerating the experimental variations and bias in the
   genomic data. IGenSig models can be managed in every detailed step, and
   the underlying pathways can be readily biologically interpreted through
   the concept signature enrichment analysis we developed^[127]9. The
   performance of iGenSig models appears to at least in part depend on the
   availability of significant genomic correlates, which provided insights
   into the different performances of iGenSig models on different drugs.
   We expect that iGenSig as a class of big data-based modeling methods
   will have broad application in modeling therapeutic responses based on
   pharmacogenomics and clinical trial datasets. Further, iGenSig can also
   be potentially applied to predict other cancer behaviors to facilitate
   clinical decisions such as the aggressiveness of carcinoma in situ, or
   metastatic potential of clinically localized tumors.

   It is interesting to note that our iGenSig model for Erlotinib
   demonstrated predictive value on both preclinical tests on
   patient-derived xenografts and two clinical trials on human subjects,
   including the US BATTLE trial and the Swiss SAKK 19/05 trial.
   Interpretation of this model revealed that upregulation of MYC target
   gene signature and downregulation of ZEB1 target genes are
   characteristic of Erlotinib resistance signature, whereas induction of
   EMT is associated with reduced but intermediate responses. While both
   EMT and ZEB1 has been found to medicate acquired resistance to EGFR
   inhibitors in non-small cell lung cancer^[128]34,[129]35, our iGenSig
   signature suggested the discordance of ZEB1 overexpression with EMT
   induction and their differential contribution to Erlotinib resistance
   in Pan-cancer cell lines. This implies that the phenotypic changes
   other than EMT induced by ZEB1 may contribute to Erlotinib resistance
   at the Pan-cancer scale. Consistent with this finding, ZEB1 has been
   reported to exert more critical functional consequences than EMT
   itself. For example, it has been proposed that specific EMT inducers
   such as ZEB1, but not the EMT state, determine cancer stem cell
   properties^[130]36. Future experimental studies will be required to
   determine the differential contribution of ZEB1 and EMT to Erlotinib
   resistance in human cancers.

   On the other hand, our GDSC iGenSig model for sorafenib failed to
   predict patient response in the sorafenib treatment arm in the BATTLE
   trial dataset. To explore if this is attributable to the differences in
   the in vitro and in vivo microenvironments, we examined the primary
   targets of this drug. Sorafenib is a multi-kinase inhibitor of Raf-1,
   B-Raf, and VEGFR-2. Among these, VEGFR-2 is a VEGF receptor involved in
   angiogenesis. In light of this, we wonder if the iGenSig models based
   on in vitro cell line responses failed to model in vivo tumor responses
   to VEGFR inhibition. Based on the literature, VEGFA amplification is a
   known biomarker for Sorafenib response^[131]37, whereas CXCL8 (IL8) are
   known to induce VEGF overexpression in endothelial cells and promote
   angiogenesis^[132]38. Correlating these biomarkers with the iGenSig
   scores revealed that in the Sorafenib treated arm of the BATTLE trial,
   the sensitive tumors with low iGenSig scores appear to overexpress
   VEGFA and CXCL8, as well as the AMP-activated protein kinases
   (PRKAA1/2) important for stimulating VEGF expression and
   angiogenesis^[133]39 (Supplementary Fig. [134]8). This suggests that
   the inability of the GDSC iGenSig model to predict patient response to
   sorafenib in the BATTLE trial may be attributed to the
   anti-angiogenesis activity of sorafenib, which cannot be modeled using
   in vitro cultured cell lines. This reflects the limitation of modeling
   patient tumor response based on in vitro cell line models.

   Moreover, it is interesting to note that the iGenSig scores for the
   BATTLE and SAKK 10/05 trials appear to accentuate different signature
   pathways. The iGenSig scores for the BATTLE trial accentuate the
   epithelial signature on tumors with high iGenSig scores, whereas the
   iGenSig scores for the SAKK 10/05 trial accentuate MYC target signature
   on tumors with low iGenSig scores (Fig. [135]4d). This could be due to
   the inability of the current iGenSig modeling method to factorize these
   signature pathways and model their interactions, or due to the
   confounding effect of different immunological contextures in these
   tumors. Future studies will be required to examine the effect of immune
   cell infiltrates on bulk sequencing-based modeling, as well as develop
   new iGenSig methods that can factorize the different pathway signatures
   as predictive pillars, and better model their predictive interactions
   on treatment outcomes.

   Our analyses of six clinical trial datasets suggest that the iGenSig
   models for single-drug treatment show better predictive values in
   patients treated with the respective monotherapy. The major challenges
   to apply these single-drug models to combination therapies include the
   therapeutic benefits derived from the other drugs and known resistance
   factors such as ER overexpression and advanced tumor stage III. These
   confounding variables cannot be modeled based on the GDSC cell line
   panel. First ER-positive breast cancer cells typically do not grow in
   vitro thus there are very limited ER-positive cell line models in the
   GDSC cell line panel. Second, stage III breast tumors are characterized
   by extensive lymph node involvement (≥4 nodes) or chest wall or skin
   invasion. Such regional metastasis may create a tumor microenvironment
   leading to drug resistance, which cannot be recreated when the cell
   lines are generated from primary tumors.

   The remaining issue to be addressed for iGenSig modeling is how to
   eliminate the effect of confounding genomic features resulting from the
   imbalanced distribution of confounding factors such as gender or
   prognostic factors that can impact patient outcomes such as metastasis,
   etc. While this issue may be less impactful in modeling cell line
   responses when a large number of cell line subjects are included, it
   could become more consequential when a smaller number of subjects are
   tested in the clinical trial. In this case, the genomic features
   associated with the confounding factor may be identified and excluded
   from the iGenSig model through multivariate statistics. In addition,
   the confounding clinical variables that affect prognosis such as local
   or distant metastasis should be accounted for via multivariate
   statistics or stratification methods when assessing the performance of
   iGenSig models in predicting patient survival outcomes. This could be
   particularly helpful for modeling the clinical trials testing
   combination drug therapies where the predictive powers of the iGenSig
   models derived from single-drug treatment are relatively weak due to
   the therapeutic benefits from combinatory drugs. Future studies will be
   required to further optimize the iGenSig methods for modeling clinical
   trial datasets and taking into consideration of these biological
   variables and confounding factors.

Methods

Data retrieval

   The drug response data, gene expression data, and mutation data are
   from the Genomics of Drug Sensitivity in Cancer Project (GDSC), and the
   Cancer Cell Line Encyclopedia (CCLE) as of September 2018. The GDSC and
   CCLE gene expression data were retrieved from ArrayExpress (E-MTAB-783)
   and NCBI GEO ([136]GSE36133), respectively and normalized using Robust
   Multi-Array Averaging (RMA)^[137]40. Drug sensitivity data, mutation
   data, and cell line annotations were downloaded from the GDSC
   ([138]http://www.cancerrxgene.org/downloads) or CCLE
   ([139]http://www.broadinstitute.org/ccle) websites. The newly released
   batch 2 drug sensitivity dataset are downloaded from the GDSC website
   as of May 2021. The TCGA Pan-cancer gene expression and mutation
   datasets were retrieved from the UCSC Xena browser
   ([140]https://xenabrowser.net/datapages/). The gene expression and
   mutation data for the PDX tumors were retrieved from the supplementary
   dataset of the original publication^[141]33. The clinical trial
   datasets are retrieved from Gene Expression Omnibus (GEO,
   [142]https://www.ncbi.nlm.nih.gov/geo) or the database of Genotypes and
   Phenotypes (dbGaP, [143]https://dbgap.ncbi.nlm.nih.gov/), and the
   detailed access information are provided in the Data Availability
   section. All the genomic datasets included in this study are summarized
   in Supplementary Table [144]1.

Extracting genome-wide expression and mutation features for cell line and
tumors

   Based on gene expression and somatic point mutation datasets, we
   extracted genome-wide differential gene expression (DGE) and mutation
   features and generated an integrated genomic feature file. For gene
   expression datasets, quantile normalizations were performed and genes
   with standard deviations of less than 20% percentile are filtered. We
   then calculated log2 transformed fold changes of the expression values
   compared to the trimmed mean of expression values (excluding the 10%
   largest and 10% smallest values). To eliminate zero values during log2
   transformation, we added 1 to the expression value across all cell
   lines or tumors. Based on the mean and standard deviation (SD) of fold
   changes, we assigned the cell lines or tumors into the following
   overlapping groups: “Up_Level1” group with the fold change above
   Mean + 1 × SD for a given gene; “Up_Level2” group with the fold change
   above Mean + 2 × SD; …, and “Up_Level6” group with the fold change
   above Mean + 6 × SD Likewise, “Down_Level1”, “Down_Level2”, … and
   “Down_Level6” grouped cell lines based on Mean − 1 × SD, Mean − 2 × SD,
   and Mean − 6 × SD. The 12 “Levels” were labeled as genotypic features
   for each given gene and the binary genomic features are compiled as a
   genomic matrix transposed (GMT) file format. Similarly, we extracted
   binary genomic features to represent point mutations. The mutation
   hotspots and nonsynonymous somatic mutations such as missense,
   nonsense, and frameshift are assigned as mutation features. Each
   recurrent mutation hotspot and each recurrently mutated gene were
   assigned as separate features.

Defining drug responses of cancer cell lines

   Drug responses of cancer cell lines are represented by the area under
   the dose-response curve (AUC) in GDSC or the area over the
   dose-response curve (Act Area) in CCLE^[145]11,[146]41. We first tested
   the skewness of the AUC measurements for each drug in the GDSC dataset.
   A negative skewness distribution indicates that the drug has high AUC
   measurements (lack of responses) in most of the cell lines, but low AUC
   measurements (sensitive responses) in a small subset of the cell lines,
   and a lower level of skewness indicates a higher level of outstanding
   responses. To ensure the drugs have sufficient outstanding responders
   for training and testing the algorithm, the GDSC drugs with negative
   skewness and more than 20 sensitive cell line subjects are included in
   our iGenSig cancer cell line modeling. We then defined sensitive drug
   responses of cell lines based on Act Areas using the standard waterfall
   method described in the CCLE study (implemented in the
   “define.response.AUC” and “define.response.ActArea” function of the
   iGenSig R package)^[147]11. The Act Area measurements for CCLE or GDSC
   cell lines for a given compound are sorted in ascending order to
   generate a waterfall distribution. The cut-off for defining sensitive
   subjects was determined as the maximal distance to a line drawn between
   the start and endpoints of the distribution. The cut-off for
   non-responders was determined as “median of Act Area - median absolute
   deviation (MAD).” The cell lines with Act Area above the sensitivity
   cut-off were labeled as drug-sensitive and below the resistance cut-off
   were labeled as drug-resistant. The cell lines with Act Areas between
   the cut-offs for drug sensitivity and resistance were labeled as
   intermediate.

Calculating feature weights and selecting significant genomic features

   To define the weight (
   [MATH: <msub><mrow><mi>ω</mi></mrow><mrow><mi>i</mi></mrow></msub>
   :MATH]
   ) of each genomic feature in predicting sensitive drug responses, we
   leveraged the weighted Kolmogorov–Smirnov (WKS) statistics^[148]10 to
   test the enrichment of the feature-positive cell line in the cell line
   panel sorted in descending order based on Act Area (Fig. [149]1). The
   enrichment score (ES) for each genomic feature is calculated using the
   same algorithm as that implemented in Gene set enrichment analysis
   (GSEA)^[150]10. To prevent bias, we excluded the genomic features
   defining fewer than 5 cell lines during the calculation of GenSig
   scores. Likewise, we calculated the weights for each genomic feature in
   predicting resistant drug responses based on the cell line panel sorted
   by AUC in descending order. To prevent overfitting, for a given cell
   line x in the training set, cell line x is excluded from calculating
   the ES scores for the genomic features associated with cell line x. We
   assessed the significance of the observed ES by comparing that to the
   random ES scores calculated by random features with the same numbers of
   positive cell lines. Repeat this step until 2000 random enrichment
   scores were calculated, then the normalized enrichment score (NES) was
   calculated by:
   [MATH: <mi mathvariant="normal">NES</mi><mo>=</mo><mi
   mathvariant="normal">ES</mi><mo>/</mo><mi
   mathvariant="normal">mean</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="normal">ES</mi></mrow><mrow><mi
   mathvariant="normal">random</mi></mrow></msub></mrow><mo>)</mo></mrow>
   :MATH]
   1

   The p values were determined based on the chance of random ES scores to
   be above the observed ES score for feature i, and the false discovery
   rate (FDR) were calculated using the R package “qvalue” for multiple
   comparisons. In this study, we used a universal FDR q value cutoff of
   0.1 to select significant genomic features for calculating iGenSig
   scores. This parameter can be tuned for different drugs to further
   refine the model as the signal-to-noise levels of these predictive
   genomic features could be different for different drugs. Furthermore,
   we observed that some of the genes have both upregulation and
   downregulation features ranked as significant for predicting the
   sensitive drug response. We thus filtered the genes that have both
   upregulated features with FDR <0.1, and downregulated features with FDR
   <0.3, and the genes that have both downregulated features with FDR
   <0.1, and upregulated features with FDR <0.3. On the other hand, some
   of the genes have only level-1 DGE features selected as significant
   based on FDR <0.1, but none of their corresponding high levels of DGE
   features have an FDR more than 0.3 even if they define more than ten
   cell lines. These genomic features represent noises and are thus
   filtered as well.

The algorithms for penalizing feature redundancy and methods for iGenSig
modeling

   To prevent the inflation of iGenSig scores from feature redundancy, we
   leveraged the TCGA Pan-Cancer RNAseq and exome datasets to assess the
   co-occurrence between genomic features associated with each cell line
   and generated the cosine similarity matrix of genomic features based on
   Otsuka–Ochiai coefficient between these features (K[ij]). We then
   performed clustering of the cosine similarity matrix based on Ward’s
   method (D2) using the R module “hClust”. The correlated feature groups
   are then determined based on an adaptive dynamic cluster detection
   method^[151]42, using the parameters: dynamic.method = “hybrid”,
   cutTree.depth = 2, and minClusterSize = 40. We then introduced a
   penalization factor (ε) which is calculated for each genomic feature i
   based on the similarity indices of the colinear genomic features
   associated with a given cell line [x] and of the same cluster as a
   feature i:
   [MATH:
   <msub><mrow><mi>ε</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><m
   under><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>∈</mo><msub><mrow><mi
   mathvariant="normal">Cluster</mi></mrow><mrow><mi>i</mi></mrow></msub><
   /mrow></munder><msub><mrow><mi>K</mi></mrow><mrow><mi>i</mi><mi>j</mi><
   /mrow></msub> :MATH]
   2

   Where K[ij is] the Otsuka–Ochiai coefficient between feature i and a
   given genomic feature j from the same cluster group as a feature i
   associated with cell line [x.] To eliminate the cumulative effect of
   small overlaps between genomic features, the Otsuka–Ochiai coefficients
   were adjusted to 0 if K[ij] < 0.1. Here ε[i] is an estimator of
   redundancy among the genomic features of the same cluster group
   associated with cell line [x]. The penalization factor ranges from 1
   (all genotypes are completely different from each other) to n (all
   genotypes are the same). We then penalized the weight ω[i] using ε[i],
   resulting in effective weight (EW):
   [MATH: <msub><mrow><mi
   mathvariant="normal">EW</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=<
   /mo><mfrac><mrow><msub><mrow><mi>ω</mi></mrow><mrow><mi>i</mi></mrow></
   msub></mrow><mrow><msub><mrow><mi>ε</mi></mrow><mrow><mi>i</mi></mrow><
   /msub></mrow></mfrac> :MATH]
   3

   The trimmed mean of ε[i] (trim = 0.3) was then used to calculate the
   effective feature number (EFN):
   [MATH: <msub><mrow><mi
   mathvariant="normal">EFN</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=
   </mo><mfrac><mrow><mi>n</mi></mrow><mrow><msub><mrow><mover
   accent="true"><mrow><mi>ε</mi></mrow><mo>¯</mo></mover></mrow><mrow><mi
   >T</mi></mrow></msub></mrow></mfrac> :MATH]
   4

   Finally, the iGenSig score of the given cell line[x] is computed as:
   [MATH: <msub><mrow><mi>i</mi><mi
   mathvariant="normal">GenSig</mi></mrow><mrow><mfenced
   open="∣"><mrow><msub><mrow><mi mathvariant="normal">Cell</mi><mspace
   width="0.16em"></mspace><mi
   mathvariant="normal">Line</mi></mrow><mrow><mi>x</mi></mrow></msub></mr
   ow></mfenced></mrow></msub><mo>=</mo><mfrac><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>n</mi></mrow></msubsup><msub><mrow><mi
   mathvariant="normal">EW</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow
   ><mrow><msub><mrow><mi
   mathvariant="normal">EFN</mi></mrow><mrow><mi>i</mi></mrow></msub></mro
   w></mfrac><mo>=</mo><mfrac><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>n</mi></mrow></msubsup><msub><mrow><mi>I</mi></mrow><mrow><mr
   ow><mo>{</mo><mrow><mi>i</mi><mi
   mathvariant="normal">ϵ</mi><mi>x</mi></mrow><mo>}</mo></mrow></mrow></m
   sub><mfrac><mrow><msub><mrow><mi>ω</mi></mrow><mrow><mi>i</mi></mrow></
   msub></mrow><mrow><msub><mrow><mi>ε</mi></mrow><mrow><mi>i</mi></mrow><
   /msub></mrow></mfrac></mrow><mrow><mfrac><mrow><mi>n</mi></mrow><mrow><
   msub><mrow><mover
   accent="true"><mrow><mi>ε</mi></mrow><mo>¯</mo></mover></mrow><mrow><mi
   >T</mi></mrow></msub></mrow></mfrac></mrow></mfrac> :MATH]
   5

   The sensitive and resistant iGenSig scores are calculated separately
   based on the significant genomic features selected for predicting
   sensitive or resistant responses. The sensitive iGenSig scores are used
   in this study for assessing the performance of the iGenSig models in
   predicting sensitive cell lines and patient subjects. Thus, the iGenSig
   scores labeled in the figures refer to the sensitive iGenSig scores
   unless otherwise noted.

   When applying the iGenSig models to the validation datasets, the
   weights of the significant genomic features (q < 0.1) calculated based
   on GDSC datasets will be used for calculating the iGenSig scores based
   on the presence of these features in a patient or cell line subject in
   the validation dataset. The weights of these features associated with
   that subject were then penalized based on the feature redundancy levels
   assessed using the TCGA Pan-cancer dataset (for CCLE dataset) or cancer
   type-specific TCGA dataset (for a clinical trial dataset of specific
   tumor type), and the iGenSig scores were calculated using the same
   algorithm. Thus, here only the weights from the GDSC dataset are used
   for modeling the validation datasets.

Benchmarking the performance of the iGenSig algorithm

   To benchmark the performance of the iGenSig algorithm in determining
   drug sensitivity, we randomly selected 20% of GDSC cell lines treated
   by a specific drug as an internal test set. We assigned the rest of the
   80% cell lines as a train set and performed this randomized sampling
   five times. The distributions of drug-sensitive and resistant cell
   lines were required to be balanced between the train and test set in
   each sampling. The CCLE dataset was used as an external validation set
   of our predictive models to assess their applicability to an
   independent dataset. The area under the ROC curve (AUROC) of the
   iGenSig scores was calculated based on the binary response of the cell
   lines determined based on the sensitive cutoff discussed above, and the
   optimal cut-points of iGenSig scores are determined using the R module
   “coords” of the “pROC” package. The cell line subjects were divided
   into sensitive cell lines and other cell lines that include both
   intermediate and resistant cell lines, and the sensitive iGenSig scores
   are used when assessing the predictive values of the iGenSig models.

   To test if the iGenSig predictions rely on the genomic features of the
   primary drug targets, we removed the drug target gene features for
   Erlotinib, Lapatinib, or Nilotinib from GDSC and CCLE genomic feature
   sets. We then built the iGenSig models based on the genomic features
   devoid of drug targets and assessed their performance on the internal
   test set (20% of GDSC cell lines) or external validation set (100% of
   CCLE cell lines). To examine if excluding the hematologic cancer cell
   lines from the GDSC training dataset can improve the prediction
   performances of iGenSig models on the drug sensitivity of CCLE solid
   cancer cell lines, we removed leukemia, lymphoma, and myeloma cell
   lines from the GDSC dataset when performing the modeling for the shared
   14 drugs, and then applied the models to CCLE solid cancer cell lines
   (Supplementary Fig. [152]2b).

Meta-analyses of clinical trial and prospective clinical study datasets

   To examine the predictive values of the iGenSig models on patient
   subjects, we compiled six clinical trial or prospective clinical study
   datasets. The detailed information of the clinical datasets used in
   this study including enrollment details, treatment arms, available
   endpoints, available genomic datasets, and source publications are
   summarized in Supplementary Table [153]1. The iGenSig model for the
   specific drug tested on a given treatment arm of the trial are
   developed based on the GDSC dataset, and then applied to the genomic
   features of the clinical trial datasets. These include the models for
   Lapatinib (Drug ID: 119), Erlotinib (Drug ID: 1), Sorafenib (Drug ID:
   30), 5-Fluorouracil (Drug ID: 179), Paclitaxel (Drug ID: 1080), and
   Epirubicin (Drug ID: 1511). The uses of clinical endpoints are
   dependent on the clinical information provided by the authors of the
   original publications. Overall survival (OS) is the preferred endpoint
   of choice^[154]43, followed by pathologic complete response (pCR). For
   the CALGB40601 trial, pCR in the breast and axilla is used in our
   analysis, as no tumor residuals in both breast and lymph nodes can best
   discriminate patient outcomes^[155]44. Other endpoints will be used in
   the analysis if OS and pCR are not available.

Implementation of machine learning and deep learning methods

   All machine learning and deep learning methods used in this study are
   implemented in R. For ridge regression and support vector regression,
   we performed the modeling directly based on the original
   high-dimensional binary genomic features. For ridge regression, we
   implemented using the R package “glmnet”. The Gaussian family is used
   for drug sensitivity measurements and the Binary family is used for
   binary sensitivity classifications. For tuning the ridge regression
   model, the best “lambda.1se” was obtained from tenfold
   cross-validation. Support vector regression is implemented using the
   “svm” function of the “e1071” package using default parameters based on
   either drug sensitivity measurements or binomial sensitivity labels.
   For AI-based methods, we applied the deep learning method
   autoencoder^[156]45 to perform unsupervised representation learning for
   dimensionality reduction and machine learning prediction algorithms for
   supervised learning of therapeutic responses using the low dimensional
   features generated by autoencoder, as previously
   reported^[157]31,[158]32. The models are developed based on 80% of GDSC
   datasets (five permutated training sets). The autoencoder model was
   built with three hidden layers with the unit sizes in each layer
   designed in accordance with a previous report^[159]31. We then applied
   the unsupervised representation of the genomic correlates to supervised
   learning methods including elastic net, artificial neural network,
   Random Forest (RF), and support vector machine (SVM) for prediction
   modeling. Elastic net is a regression method that combines lasso and
   ridge regularization with the two hyperparameters, alpha and lambda.
   Alpha is a mixing parameter to define the relative weight of the lasso
   and ridge penalization terms and lambda determines the amount of
   shrinkage^[160]46. We identified alpha with the best tuning and
   optimized for predictive performance over a range of lambdas.
   Regression was performed using the glmnet R package (ver. 4.0.2). We
   implemented an RF regression model using randomForest R package
   (ver.4.6.14). We specified 1,000 trees to grow and ensure every object
   gets predicted multiple times. We used SVM with the linear kernel
   method, “svmLinear2”, provided by the caret R package (ver. 6.0.86). We
   specified tuneLength = 10 in the tuning parameter grid and accuracy
   metric. For all modeling methods, the models were developed using the
   same genome-wide gene expression and mutation features we compiled, and
   we used the same training and external validation sets of cell lines
   and patient subjects as in iGenSig modeling. One model is developed for
   each drug based on each permutation set, which are then applied to the
   CCLE and clinical trial datasets. To match the genomic features, the
   genomic features are set to zero if they do not present in the
   validation sets as in the iGenSig modeling.

Pathway enrichment analysis for integral genomic signature

   To identify the pathways characteristic of the integral genomic
   signature for Erlotinib resistance modeled from the GDSC dataset, we
   first extracted the genes involved in the iGenSig signature and then
   classified these genes into positive contributing genes and negative
   contributing genes. The positive contributing genes are defined as
   upregulated genes or genes with hotspot mutations. The negative
   contributing genes are defined as downregulated genes or mutated genes
   without mutation hotspots. The pathways enriched in the positive or
   negative contributing genes for predicting Erlotinib or 5-FU sensitive
   responses are analyzed by the Concept Signature Enrichment Analysis
   (CSEA) developed in our previous study^[161]9 using the Hallmark
   pathway gene sets from MSigdb ([162]http://www.gsea-msigdb.org). The
   resulting top pathways are disambiguated via correcting the crosstalk
   effects between pathways, to reveal independent pathway
   modules^[163]47. A p value <0.01 is used as a cutoff for
   disambiguation. The functional associations between the significant
   pathways are then assessed using our CSEA method that computes the
   enrichment of each pathway x on the human gene list sorted descendingly
   based on the universal concept signature scores of each pathway
   y^[164]9. The pathway network was visualized using the “igraph” R
   package (ver. 1.2.4.2).

Statistical analysis

   The correlation between the predictive powers of the iGenSig models
   (AUROCs) with the total number of significant predictive genomic
   features (square root transformed) for each drug was determined using
   Spearman’s correlation coefficient. For survival analysis, the P values
   are calculated based on log-rank tests and a data-driven cut point of
   high iGenSig scores was determined using the R-package
   “maxstat”^[165]48. The concordance is used as a measure of
   goodness-of-fit for the predictive models in Coxph survival regression,
   which defines the probability that the predictive scores goes in the
   same direction as the survival data. To estimate the percent of outcome
   variations that can be explained by different factors, we fitted a
   logistic regression model and calculated the Pseudo-R^2 for different
   predictive factors and clinical variables such as iGenSig scores,
   clinical stage, tumor grade, etc. For interaction analysis, multiple
   logistic regression models are generated for pair-wise interactions of
   the iGenSig model with each of the confounding factors, and the
   multiple models are compared with the simple logistic regression model
   via the Chi-Square test implemented in the “anova” function. P values
   of the box or violin plots were analyzed by paired or unpaired
   Student’s t-tests. Two-sided tests are used when both directions are to
   be tested, such as when comparing the different modeling methods.
   One-sided tests are used to determine if there is a difference between
   groups in a specific direction such as that defined by the iGenSig
   scores.

Reporting summary

   Further information on research design is available in the [166]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [167]Supplementary Material^ (4.4MB, pdf)
   [168]Supplementary data 1^ (25.3KB, xlsx)
   [169]Reporting Summary^ (1.8MB, pdf)

Acknowledgements