Abstract

   Lapatinib and trastuzumab (Herceptin) are targeted therapies designed
   for patients with HER2+ breast tumors. Although these therapies
   improved survival rates of patients with this tumor type, not all the
   patients harboring HER2 amplification respond to these drugs. The
   NeoALTTO clinical trial was designed to test whether a higher response
   rate can be achieved by combining lapatinib and trastuzumab. Although
   the combination therapy showed almost double the response rate compared
   to the monotherapies, 40% of the patients did not respond to the
   treatment. In this study, we sought to identify biomarkers of drug
   response in HER2+ breast cancer patients based on the gene expression
   profiles of their tumors. We show that univariate gene expression-based
   biomarkers are significant but weak predictors of drug response. We
   further show that pathway activities, estimated from gene expression
   patterns quantified using the recent transcriptional similarity
   coefficient (TSC) between the tumor samples, yield high predictive
   value for therapy response (concordance index >0.8, p < 0.05).
   Moreover, machine learning models, built using multiple algorithms
   including logistic regression, naive Bayes, random forest, k-nearest
   neighbor, and support vector machine, for predicting drug response in
   the NeoALTTO clinical trial, resulted in lower performance compared to
   our pathway-based approach. Our results indicate that transcriptional
   similarity of biological pathways can be used to predict lapatinib and
   trastuzumab response in HER2+ breast cancer.

   Keywords: breast cancer, human epidermal growth factor receptor 2,
   lapatinib, trastuzumab, transcriptional similarity coefficient,
   estrogen receptor

Introduction

   Unsupervised clustering of breast tumor samples based on
   high-throughput expression profiles enabled the identification of HER2+
   breast cancer subtype ([27]Carey et al., 2006; [28]Wirapati et al.,
   2008; [29]Onitilo et al., 2009). Breast cancer survival differed by
   subtype (p < 0.001), with shortest survival among HER2+ and basal-like
   subtypes. To treat HER2+ tumors, trastuzumab and lapatinib were
   designed to target the EGFR/ERBB2 pathway and yielded a 30% rate of
   clinical response in the NeoALTTO clinical trial for patients with
   HER2+ breast tumors ([30]Baselga et al., 2012). This led to their
   adoption as standard-of-care therapies for HER2+ breast cancer patients
   ([31]Baselga et al., 2012). To increase the response rate of HER2+
   tumors, the biomedical research community sought to design combination
   therapies. Concurrent treatment with both drugs achieved a response
   rate of almost 60%, demonstrating higher efficacy of the combination
   than of lapatinib or trastuzumab as monotherapies ([32]Baselga et
   al., 2012). However, there is a need to identify the non-responders to
   further improve the rate of treatment response in HER2+ breast cancer
   patients.

   Stratification of responders and non-responders to lapatinib,
   trastuzumab, and their combination therapies was conducted using either
   gene-based or pathway (or gene set)-based approaches. In gene-based
   approaches, mutation, amplification, or expression of individual genes
   with known association with HER2+ breast tumor biology was investigated
   as potential biomarkers of response to these therapies ([33]Gomez et
   al., 2007; [34]Bianchini et al., 2011; [35]Dave et al., 2011; [36]Loibl
   et al., 2014; [37]Schneeweiss et al., 2014; [38]Vici et al., 2014;
   [39]Menyhárt et al., 2015). Although several of these gene-based
   studies were successful, they could not capture the whole picture of
   resistance mechanisms to targeted therapies in HER2+ breast cancer
   patients due to the complexity of HER2+ tumor biology ([40]Nahta, 2012;
   [41]de Melo Gagliato et al., 2016; [42]Zhang et al., 2017). Hence,
   pathway (or gene set) approaches were developed to identify the
   association of multiple genes or pathways with therapy response in
   HER2+ breast cancer patients.

   The association of biological pathways with treatment response in
   HER2+ breast cancer patients has been investigated through pathway
   enrichment analysis of individual genes identified either (1) by
   their higher activity in responders than in non-responders, or vice
   versa ([43]Wu et al., 2012; [44]Boulbes et al., 2015; [45]Nam et
   al., 2015), or (2) as individual or multigene biomarkers of response
   ([46]Harris et al., 2007; [47]Willis et al., 2018). Here we present an
   alternative approach where pathways are used directly as features for
   predicting treatment response in HER2+ breast cancer patients.

   In this study, we used our recent Similarity Identification in Gene
   Expression (SIGN) approach ([48]Madani Tonekaboni et al., 2019) as a
   classifier relying on expression patterns in biological pathways for
   the classification of patient tumor samples to predict the response of
   cancer patients in each arm of the NeoALTTO clinical trial. We show
   that the transcriptional similarity coefficient (TSC), computed by
   comparing each patient tumor sample to responders versus
   non-responders, can be used to identify new pathway-based biomarkers of
   drug response in HER2+ breast cancer patients.

Materials and Methods

   The overall design of our study is illustrated in [49]Figure 1. In
   brief, we used the similarity of patterns of gene expression in
   biological pathways from patients responding to lapatinib, trastuzumab,
   and their combination to predict therapy response using our SIGN
   methodology ([50]Madani Tonekaboni et al., 2019). In this framework, a
   leave-one-out cross-validation was used to assess the performance of
   each biomarker in predicting the response of cancer patients in each
   treatment category ([51]Figure 1). The data and detailed methods are
   described below.

FIGURE 1.

   Design of the study regarding identification of responders to
   lapatinib, trastuzumab, and their combination therapy in the NeoALTTO
   clinical trial. BP, MF, and CC stand for the Gene Ontology (GO) terms
   for biological processes, molecular functions, and cellular components,
   respectively.

Gene Expression Profiles of Tumor Samples

   Raw RNA-seq data of tumor samples in the NeoALTTO clinical trial were
   quantified with Kallisto ([54]Bray et al., 2016) in the Toil pipeline
   ([55]Vivian et al., 2017) using the GENCODE version 23 (ALL version)
   transcriptome annotation. Transcript-level abundances were summarized
   to the gene level using the approach described in [56]Soneson et al.
   (2015).
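
   As a minimal sketch of the gene-level summarization step, the snippet
   below sums transcript-level abundances (for example, TPM values) over
   the transcripts of each gene. The function name, the identifiers, and
   the use of a plain sum are illustrative assumptions; the details of
   the approach of Soneson et al. (2015) used in the study are not
   reproduced here.

```python
from collections import defaultdict


def summarize_to_gene(transcript_abundance, tx2gene):
    """Sum transcript-level abundances into gene-level abundances.

    transcript_abundance: dict mapping transcript ID -> abundance (e.g. TPM)
    tx2gene: dict mapping transcript ID -> gene ID (hypothetical mapping,
             e.g. derived from a transcriptome annotation)
    """
    gene_abundance = defaultdict(float)
    for transcript_id, abundance in transcript_abundance.items():
        gene_abundance[tx2gene[transcript_id]] += abundance
    return dict(gene_abundance)


# Toy example with made-up identifiers.
example = summarize_to_gene(
    {"tx1": 1.5, "tx2": 2.5, "tx3": 4.0},
    {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"},
)
```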

Clinical Definition of Responders Versus Non-responders

   Responders and non-responders in the NeoALTTO clinical trial were
   defined by pathological complete response (pCR) ([57]Baselga et al.,
   2012): any patient without a recorded pCR was regarded as a
   non-responder. A pathological complete response is
   defined as no invasive cancer in the breast or only non-invasive in
   situ cancer in the breast specimen. Surgical breast and axillary node
   resection specimens were evaluated for pathologic tumor response
   according to the National Surgical Adjuvant Breast and Bowel Project
   (NSABP) guidelines^[58]1.

Unsupervised Clustering

   Similarities between samples within each arm of the NeoALTTO clinical
   trial were computed using Spearman’s rank-order correlation
   ([59]Wissler, 1905). Hierarchical clustering was then performed on
   the resulting similarity matrix using Euclidean distance and Ward’s
   minimum variance method ([60]Murtagh and Legendre, 2014).
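
   The clustering step can be sketched as follows, assuming a
   genes-by-samples expression matrix and using NumPy and SciPy in place
   of whatever software was used in the study. The function name and the
   choice of cutting the tree into a fixed number of clusters are
   illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import spearmanr


def cluster_samples(expr, n_clusters=2):
    """Cluster samples (columns of expr) from their Spearman similarity matrix.

    expr: genes x samples array with at least three samples.
    Returns one integer cluster label per sample.
    """
    expr = np.asarray(expr, dtype=float)
    # spearmanr treats columns as variables, so this is samples x samples.
    rho, _ = spearmanr(expr)
    # Each row of the similarity matrix is one sample's similarity profile;
    # Ward linkage is applied to Euclidean distances between these rows.
    tree = linkage(rho, method="ward", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data: three samples with increasing and three with decreasing profiles.
toy = np.array([
    [1, 2, 1, 5, 9, 6],
    [2, 3, 3, 4, 7, 5],
    [3, 4, 5, 3, 5, 4],
    [4, 5, 7, 2, 3, 3],
    [5, 6, 9, 1, 1, 2],
])
labels = cluster_samples(toy, n_clusters=2)
```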

Univariate Biomarker Discovery Using Genes

   The concordance index (C-index) between the expression of each gene
   and the binarized vector of drug response was calculated as the
   prediction performance of that gene as a univariate biomarker. The
   significance of each C-index was assessed using a permutation test:
   the observed response classes were randomly permuted, and the C-index
   between the expression of the gene and the permuted classes was
   recalculated. The fraction of permutations in which the C-index
   obtained with the real observed classes was lower than the C-index
   obtained with the permuted classes was taken as the significance (or
   FDR) of the C-index for that gene.
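
   The permutation procedure described above can be sketched as below.
   This is a simplified illustration, not the code used in the study:
   the function names, the number of permutations, the fixed seed, and
   the tie handling in the small C-index helper are all assumptions.

```python
import random


def c_index(values, classes):
    """C-index of a continuous predictor against binary classes (0/1).

    Counts, over all pairs made of one class-1 and one class-0 sample, how
    often the class-1 sample has the higher value; ties count as 0.5.
    """
    concordant, pairs = 0.0, 0
    for v1, c1 in zip(values, classes):
        if c1 != 1:
            continue
        for v0, c0 in zip(values, classes):
            if c0 != 0:
                continue
            pairs += 1
            if v1 > v0:
                concordant += 1.0
            elif v1 == v0:
                concordant += 0.5
    if pairs == 0:
        raise ValueError("both classes must be present")
    return concordant / pairs


def permutation_significance(expression, response, n_perm=1000, seed=0):
    """Fraction of label permutations whose C-index exceeds the observed one."""
    rng = random.Random(seed)
    observed = c_index(expression, response)
    permuted = list(response)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)
        if c_index(expression, permuted) > observed:
            exceed += 1
    return observed, exceed / n_perm


# Toy gene whose expression separates the two classes perfectly.
observed, significance = permutation_significance(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [0] * 5 + [1] * 5, n_perm=200
)
```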

Concordance Index

   We used the concordance index (C-index) to quantify the predictive
   value of our drug response predictors. The C-index estimates the
   probability that, for a pair of randomly chosen comparable samples, the
   sample with the higher predicted value will experience an event before
   the other sample or belongs to a higher binary class ([61]Harrell et
   al., 1982). We used the implementation of the concordance index
   available in the survcomp R package (version 1.34.0) ([62]Schröder et
   al., 2011).
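
   For a binary outcome, the C-index reduces to the proportion of
   responder/non-responder pairs in which the responder receives the
   higher predicted value. The sketch below illustrates that special
   case in plain Python; it is not the survcomp implementation, and the
   function name and the convention of counting ties as 0.5 are
   assumptions.

```python
def binary_c_index(predicted, outcome):
    """C-index for a binary outcome (1 = responder, 0 = non-responder).

    Only pairs with different outcomes are comparable. A pair is concordant
    when the responder has the higher predicted value; ties count as 0.5.
    """
    responders = [p for p, y in zip(predicted, outcome) if y == 1]
    non_responders = [p for p, y in zip(predicted, outcome) if y == 0]
    if not responders or not non_responders:
        raise ValueError("both outcome classes must be present")
    score = 0.0
    for r in responders:
        for n in non_responders:
            if r > n:
                score += 1.0
            elif r == n:
                score += 0.5
    return score / (len(responders) * len(non_responders))


perfect = binary_c_index([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
chance_like = binary_c_index([0.9, 0.2, 0.6, 0.4], [1, 1, 0, 0])
```

   A value of 1 means every comparable pair is ordered correctly, 0.5
   corresponds to a predictor with no discriminative value, and values
   below 0.5 indicate an inverted predictor.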

Transcriptional Similarity Coefficient

   The transcriptional similarity coefficient (TSC) between each sample
   and the responders and non-responders was computed using the TSC
   function in the SIGN R package (version 0.1.0) ([63]Madani Tonekaboni
   et al., 2019). Let P be the matrix of expression of genes within a
   pathway for a set of biological samples where rows are genes and
   columns are samples. Then the TSC is defined as follows:
   \[
   \mathrm{TSC}(P_1, P_2) =
   \frac{\sum_{i} (P_{10} \times P_{20})_{ii}}
        {\sqrt{\sqrt{\sum_{ij} (P_{10})_{ij}^{2}} \,
               \sqrt{\sum_{ij} (P_{20})_{ij}^{2}}}}
   \]

   where P_1 and P_2 represent the matrices of gene expression of a given
   pathway in two sets of samples (populations 1 and 2), i is the row
   index (i.e., gene index) within each matrix, j is the column index
   (i.e., sample index) within each matrix, and P_{m0} (either P_{10} or
   P_{20}) is given by
   \[
   P_{m0} = P_m \times P_m' - \mathrm{Diagonal}(P_m \times P_m')
   \]

   where m is either 1 for population 1 or 2 for population 2. Subtracting
   the diagonal elements in the above equation was initially proposed to
   the bioinformatics community for analyzing genomics data ([64]Smilde et
   al., 2009). This term ensures that the identified similarities
   do not depend on the number of samples compared between the datasets.

   The TSC captures the similarity of the pathway expression pattern
   between two samples and/or sample sets and lies in the range [−1,1].
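
   A small NumPy sketch of a TSC-style similarity is given below. It is
   not the SIGN package implementation. It removes the diagonal of each
   gene-by-gene cross-product matrix and, as an assumption, normalizes
   by the product of the two Frobenius norms, as in the modified RV
   coefficient of Smilde et al. (2009); this choice keeps the value in
   [−1, 1]. Function names are illustrative.

```python
import numpy as np


def off_diagonal_gram(P):
    """Return P P' with its diagonal set to zero (P is genes x samples)."""
    P = np.asarray(P, dtype=float)
    gram = P @ P.T
    np.fill_diagonal(gram, 0.0)
    return gram


def tsc(P1, P2):
    """Similarity of two genes x samples matrices sharing the same gene rows."""
    A = off_diagonal_gram(P1)
    B = off_diagonal_gram(P2)
    # Assumed normalization: product of Frobenius norms (modified RV style).
    denom = np.linalg.norm(A) * np.linalg.norm(B)
    if denom == 0.0:
        return 0.0
    # For symmetric A and B, sum(A * B) equals trace(A @ B).
    return float(np.sum(A * B) / denom)


# Two toy populations over the same three genes, with opposite patterns
# for the second and third gene.
pop1 = np.array([[1.0, 2.0], [1.0, 2.0], [-1.0, -2.0]])
pop2 = np.array([[1.0, 2.0], [-1.0, -2.0], [1.0, 2.0]])
same = tsc(pop1, pop1)
different = tsc(pop1, pop2)
```

   Because the cross-product matrices are gene-by-gene, the two
   populations may contain different numbers of samples, including a
   single sample compared with a group.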

Identifying Responders Using TSC

   The TSC for each pathway was computed between one sample and the
   remaining samples, divided into two groups of responders and
   non-responders ([65]Baselga et al., 2012). GO terms in level C5 with 10
   to 30 genes were used in this study to quantify the similarity between
   samples based on their gene expression pattern ([66]Madani Tonekaboni
   et al., 2019). We limited the number of genes in GO terms to exclude
   large GO terms (at the top of the GO term hierarchy) that are parents
   of the GO terms in our study (at the bottom of the GO hierarchy). If
   the TSC between the sample and the responders was higher than that
   between the sample and the non-responders, the sample was classified
   as a responder, and vice versa. This process was repeated for every
   sample in each arm of the trial. The method’s performance for
   predicting the response of cancer patients was assessed using the
   concordance index.
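
   The decision rule for a single pathway can be sketched as follows:
   the difference between the TSC of a sample with the responders and
   its TSC with the non-responders (the delta TSC) is computed, and its
   sign gives the predicted class. The helper below is an illustrative
   approximation rather than the SIGN package code; the normalization
   inside `tsc` and all function names are assumptions.

```python
import numpy as np


def tsc(P1, P2):
    """TSC-style similarity of two genes x samples matrices (sketch)."""
    grams = []
    for P in (P1, P2):
        gram = np.asarray(P, dtype=float) @ np.asarray(P, dtype=float).T
        np.fill_diagonal(gram, 0.0)  # drop the diagonal terms
        grams.append(gram)
    denom = np.linalg.norm(grams[0]) * np.linalg.norm(grams[1])
    return float(np.sum(grams[0] * grams[1]) / denom) if denom > 0 else 0.0


def delta_tsc(sample, responders, non_responders):
    """TSC(sample, responders) minus TSC(sample, non-responders)."""
    column = np.asarray(sample, dtype=float).reshape(-1, 1)
    return tsc(column, responders) - tsc(column, non_responders)


def classify(sample, responders, non_responders):
    """Label a sample by the group it is more similar to."""
    if delta_tsc(sample, responders, non_responders) > 0:
        return "responder"
    return "non-responder"


# Toy pathway with three genes; each column is one patient.
responders = np.array([[1.0, 2.0], [1.0, 2.0], [-1.0, -2.0]])
non_responders = np.array([[1.0, 2.0], [-1.0, -2.0], [1.0, 2.0]])
call_a = classify([3.0, 3.0, -3.0], responders, non_responders)
call_b = classify([1.0, -1.0, 1.0], responders, non_responders)
```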

Cross-Validation in Predictive Models

   Each model was validated using leave-one-out cross-validation. In this
   setting, a target sample was put aside, and the rest of the samples
   were used for the prediction of drug response in the target sample. The
   TSC of each pathway between the set-aside sample and five randomly
   selected samples from the responders and non-responders was
   calculated. Then the median of the TSCs of all the pathways was
   calculated to assess whether the sample had a higher similarity to
   responders or non-responders. This process was repeated 100 times for
   each sample, and the majority vote over the 100 repetitions was taken
   as the predicted class of the sample (responder or non-responder).
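
   The cross-validation loop can be sketched as below. Several details
   are assumptions made for illustration: five samples are drawn from
   each group (or all remaining samples when fewer are available),
   pathways are given as lists of gene row indices, the normalization
   inside `tsc` follows the modified RV coefficient, and all names are
   hypothetical. It is not the code used in the study.

```python
import numpy as np


def tsc(P1, P2):
    """TSC-style similarity of two genes x samples matrices (sketch)."""
    grams = []
    for P in (P1, P2):
        gram = np.asarray(P, dtype=float) @ np.asarray(P, dtype=float).T
        np.fill_diagonal(gram, 0.0)  # drop the diagonal terms
        grams.append(gram)
    denom = np.linalg.norm(grams[0]) * np.linalg.norm(grams[1])
    return float(np.sum(grams[0] * grams[1]) / denom) if denom > 0 else 0.0


def loocv_predict(expr, is_responder, pathways, k=5, n_rep=100, seed=0):
    """Leave-one-out prediction with repeated subsampling and majority vote.

    expr: genes x samples array; is_responder: one boolean per sample;
    pathways: list of lists of gene row indices.
    """
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)
    is_responder = np.asarray(is_responder, dtype=bool)
    n_samples = expr.shape[1]
    predictions = np.zeros(n_samples, dtype=bool)
    for j in range(n_samples):
        others = np.arange(n_samples) != j  # set the target sample aside
        resp = np.flatnonzero(others & is_responder)
        nonresp = np.flatnonzero(others & ~is_responder)
        votes = 0
        for _ in range(n_rep):
            r = rng.choice(resp, size=min(k, resp.size), replace=False)
            n = rng.choice(nonresp, size=min(k, nonresp.size), replace=False)
            to_resp = [tsc(expr[np.ix_(g, [j])], expr[np.ix_(g, r)])
                       for g in pathways]
            to_nonresp = [tsc(expr[np.ix_(g, [j])], expr[np.ix_(g, n)])
                          for g in pathways]
            # Median over pathways decides this repetition's vote.
            votes += int(np.median(to_resp) > np.median(to_nonresp))
        predictions[j] = votes > n_rep / 2
    return predictions


# Toy data: one three-gene pathway, four responders, four non-responders.
toy_expr = np.array([
    [1, 2, 3, 4, 1, 2, 3, 4],
    [1, 2, 3, 4, -1, -2, -3, -4],
    [-1, -2, -3, -4, 1, 2, 3, 4],
], dtype=float)
toy_labels = [True] * 4 + [False] * 4
toy_pred = loocv_predict(toy_expr, toy_labels, [[0, 1, 2]], k=5, n_rep=11)
```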

Results

   We leveraged the gene expression and clinical information of HER2+
   breast cancer patients in the NeoALTTO clinical trial to identify
   biomarkers of drug response. The NeoALTTO clinical trial was a phase
   III randomized clinical trial designed to assess the efficacy of the
   anti-HER2 monoclonal antibody trastuzumab, the tyrosine kinase
   inhibitor lapatinib, and their combination in
   HER2-overexpressing breast cancer patients. The response (pCR) rate was
   significantly higher in the group given lapatinib and trastuzumab
   (51.3%) than in the group given trastuzumab alone (29.5%; p < 0.05).
   However, no significant difference in pCR was observed between the
   lapatinib (24.7%) and trastuzumab groups (p = 0.34).

   We computed correlations between tumor samples based on their gene
   expression profiles in the three arms of the clinical trial, defined
   by treatment type: trastuzumab alone, lapatinib alone, and their
   combination ([67]Figure 1). Unsupervised clustering of the samples
   could not stratify the patients by response, defined by pathological
   complete response ([68]Figure 2A).

FIGURE 2.

   Identifying responders to lapatinib, trastuzumab, and their combination
   therapy in the NeoALTTO clinical trial using genes as univariate
   predictors of response. (A) Clustering of samples based on their
   similarity, defined as the Spearman correlation between gene expression
   profiles of the samples. (B) Top 10 genes as univariate biomarkers of
   drug response in ER+ and ER– cohorts within each arm of the NeoALTTO
   clinical trial.

   We further computed the C-index of genes as univariate biomarkers of
   drug response in each arm of the NeoALTTO trial. Because ER is known
   to be one of the main drivers of breast cancer development and
   progression ([71]Fuqua, 1997), we stratified our
   analyses by ER status. The top predictors of response yielded a
   C-index of 0.68 ([72]Figure 2B), while the C-index of ERBB2 as a
   univariate biomarker of response did not exceed 0.59 in any arm
   (the full list can be found in the [73]Supplementary Material). The low
   performance of univariate modeling could be due to the high correlation
   between patient tumor samples, as more than 90% of the tumor sample
   pairs had a Pearson correlation of more than 0.9 between their gene
   expression profiles.

   We recently showed the high performance of a new method called SIGN in
   predicting the survival rate of breast cancer patients under different
   therapeutic regimens ([74]Madani Tonekaboni et al., 2019). We sought to
   use SIGN to predict the drug response of patients in each arm of the
   NeoALTTO trial. We used the C-indices of the pathways, computed between
   the TSC of each pathway and the drug response in each arm of the trial,
   to cluster the arms. The trastuzumab-alone and combination therapy arms
   clustered more closely with each other than with the lapatinib-alone
   arm, although the difference was not
   significant (p > 0.05) ([75]Figure 3A). Moreover, the pathway
   biomarkers of ER− and ER+ patient tumors showed low commonality
   (absolute Spearman correlation <0.08), revealing differences in the
   mechanism of response associated with the ER status of the patient
   tumors ([76]Figure 3B). Top pathway biomarkers for patients with the same
   treatment regimen and ER status had C-indices of more than 0.8 except
   for ER− patients under trastuzumab alone therapy ([77]Figure 3C).

FIGURE 3.

   Identifying responders to lapatinib, trastuzumab, and their combination
   therapy in the NeoALTTO clinical trial using the transcriptional
   similarity coefficient (TSC) of pathways. (A) Concordance indices of
   delta TSCs of GO terms, comparing each sample with responders and
   non-responders, in predicting the response of patients to lapatinib,
   trastuzumab, and their combination. (B) Clustering of groups of
   patients based on concordance indices of delta TSC of GO terms (A). (C)
   Top pathways as predictors of lapatinib, trastuzumab, and their
   combination in ER+ and ER– tumor samples in the NeoALTTO clinical
   trial.

   Although the biological function of the identified pathways as
   biomarkers of drug response requires experimental validation, we found
   some evidence of their biological relevance. For example, among the top
   identified biomarkers of drug response, there are REGULATION OF
   INTERFERON GAMMA BIOSYNTHETIC PROCESS and NEGATIVE T-CELL SELECTION for
   ER− and ER+ cancer patients under lapatinib treatment, respectively.
   These are in agreement with previous literature on the importance of
   immune signaling in lapatinib response in cancer patients ([80]Griguolo
   et al., 2019).

Comparison With Other Machine Learning Models

   We compared the top seven biomarkers identified in each arm of the
   NeoALTTO clinical trial for patients with ER− or ER+ status with the
   performance of 35 machine learning models ([81]Figure 4). These models
   were built using five different machine learning algorithms, including
   logistic regression, k-nearest-neighbor (k-NN), naive Bayes, random
   forest, and support vector machine (SVM), and seven different feature
   selection approaches ([82]Figure 4). SIGN-based biomarkers outperformed
   all 35 models in all treatment categories. We used the same
   leave-one-out cross-validation strategy as used for SIGN to compare the
   performance of these models.

FIGURE 4.

   Comparison of performance of top seven biomarkers of drug response
   using Similarity Identification in Gene Expression (SIGN) and 35
   machine learning models built combining five machine learning methods
   and seven different feature selection approaches.

Discussion

   We propose SIGN as a new approach to identify biomarkers of drug
   response in other subtypes of breast cancer or other tumor types. We
   showed the utility of SIGN in predicting the response of HER2+ breast
   cancer patients to lapatinib, trastuzumab, and their combination
   therapies using transcription patterns within biological pathways. Our
   results further emphasize the information gained upon using genes
   within biological pathways instead of individual markers of drug
   response. Furthermore, they suggest the transcriptional similarity
   coefficient (TSC) as a new measure of similarity between tumor samples
   for predicting their response to drugs. SIGN-based
   biomarkers outperformed 35 different machine learning models in
   predicting drug response in each treatment category. Moreover, the SIGN
   approach provides us with highly interpretable pathway-based biomarkers
   of drug response. Although SIGN showed promising performance for
   predicting response to lapatinib, trastuzumab, and their combination in
   HER2+ breast cancer patients, this approach needs further validation to
   ensure its generalizability in new clinical datasets. Upon having
   access to further clinical data of HER2+ patients in each one of these
   treatment categories, our findings in this study can be further
   assessed and validated.

Data Availability Statement

   The datasets generated for this study can be found in the
   [85]ClinicalTrials.gov Identifier: [86]NCT00553358.

Author Contributions

   SM led the project and performed the computational analysis of the work
   under supervision of BH-K. GB and SM collected and curated the data.
   All authors contributed to the article and approved the submitted
   version.

Conflict of Interest

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Acknowledgments