Abstract
Lapatinib and trastuzumab (Herceptin) are targeted therapies designed
for patients with HER2+ breast tumors. Although these therapies
improved survival rates of patients with this tumor type, not all the
patients harboring HER2 amplification respond to these drugs. The
NeoALTTO clinical trial was designed to test whether a higher response
rate can be achieved by combining lapatinib and trastuzumab. Although
the combination therapy showed almost double the response rate compared
to the monotherapies, 40% of the patients did not respond to the
treatment. In this study, we sought to identify biomarkers of drug
response in HER2+ breast cancer patients using the gene expression
profiles of their tumors. We show that univariate gene expression-based
biomarkers are significant but weak predictors of drug response. We
further show that pathway activities, estimated from gene expression
patterns quantified using the recent transcriptional similarity
coefficient (TSC) between the tumor samples, yield high predictive
value for therapy response (concordance index >0.8, p < 0.05).
Moreover, machine learning models for predicting drug response in the
NeoALTTO clinical trial, built using logistic regression, naive Bayes,
random forest, k-nearest neighbor, and support vector machine
algorithms, performed worse than our pathway-based approach. Our
results indicate that transcriptional
similarity of biological pathways can be used to predict lapatinib and
trastuzumab response in HER2+ breast cancer.
Keywords: breast cancer, human epidermal growth factor receptor 2,
lapatinib, trastuzumab, transcriptional similarity coefficient,
estrogen receptor
Introduction
Unsupervised clustering of breast tumor samples based on
high-throughput expression profiles enabled the identification of the
HER2+ breast cancer subtype ([27]Carey et al., 2006; [28]Wirapati et al.,
2008; [29]Onitilo et al., 2009). Breast cancer survival differed by
subtype (p < 0.001), with the shortest survival among the HER2+ and
basal-like subtypes. To treat HER2+ tumors, trastuzumab and lapatinib
have been
designed to target the EGFR/ERBB2 pathway and yielded a 30% rate of
clinical response in the NeoALTTO clinical trial for patients with
HER2+ breast tumors ([30]Baselga et al., 2012). This led to their
adoption as standard-of-care therapies for HER2+ breast cancer patients
([31]Baselga et al., 2012). To increase the response rate of HER2+
tumors, the biomedical research community sought to design combination
therapies. Concurrent treatment with both drugs achieved a response rate
of almost 60%, demonstrating higher efficacy than either lapatinib or
trastuzumab as a monotherapy ([32]Baselga et al., 2012). However, there
is still a need to identify the non-responders to
further improve the rate of treatment response in HER2+ breast cancer
patients.
Stratification of responders and non-responders to lapatinib,
trastuzumab, and their combination therapies was conducted using either
gene-based or pathway (or gene set)-based approaches. In gene-based
approaches, mutation, amplification, or expression of individual genes
with known association with HER2+ breast tumor biology was investigated
as potential biomarkers of response to these therapies ([33]Gomez et
al., 2007; [34]Bianchini et al., 2011; [35]Dave et al., 2011; [36]Loibl
et al., 2014; [37]Schneeweiss et al., 2014; [38]Vici et al., 2014;
[39]Menyhárt et al., 2015). Although these gene-based studies were
successful, they could not capture the full picture of
resistance mechanisms to targeted therapies in HER2+ breast cancer
patients due to the complexity of HER2+ tumor biology ([40]Nahta, 2012;
[41]de Melo Gagliato et al., 2016; [42]Zhang et al., 2017). Hence,
pathway (or gene set) approaches were developed to identify the
association of multiple genes or pathways with therapy response in HER2+
breast cancer patients.
The association of biological pathways with treatment response in HER2+
breast cancer patients has been studied through pathway enrichment
analysis of individual genes identified either (1) by their higher
activity in one response group than in the other
([43]Wu et al., 2012; [44]Boulbes et al., 2015; [45]Nam et
al., 2015) or (2) as individual or multigene biomarkers of response
([46]Harris et al., 2007; [47]Willis et al., 2018). Here we present an
alternative approach where pathways are used directly as features for
predicting treatment response in HER2+ breast cancer patients.
In this study, we used our recent Similarity Identification in Gene
Expression (SIGN) approach ([48]Madani Tonekaboni et al., 2019) as a
classifier relying on expression patterns in biological pathways for
the classification of patient tumor samples to predict the response of
cancer patients in each arm of the NeoALTTO clinical trial. We show
that the transcriptional similarity coefficient (TSC), computed by
comparing each patient tumor sample with responders and with
non-responders, can be used to identify new pathway-based biomarkers of
drug response in HER2+ breast cancer patients.
Materials and Methods
The overall design of our study is illustrated in [49]Figure 1. In
brief, we used the similarity of patterns of gene expression in
biological pathways from patients responding to lapatinib, trastuzumab,
and their combination to predict therapy response using our SIGN
methodology ([50]Madani Tonekaboni et al., 2019). In this framework, a
leave-one-out cross-validation was used to assess the performance of
each biomarker in predicting the response of cancer patients in each
treatment category ([51]Figure 1). The data and detailed methods are
described below.
FIGURE 1.
Design of the study regarding identification of responders to
lapatinib, trastuzumab, and their combination therapy in the NeoALTTO
clinical trial. BP, MF, and CC stand for the Gene Ontology (GO) terms
for biological processes, molecular functions, and cellular components,
respectively.
Gene Expression Profiles of Tumor Samples
RNA-seq raw data of tumor samples in the NeoALTTO clinical trial were
quantified with Kallisto ([54]Bray et al., 2016) in the Toil pipeline
([55]Vivian et al., 2017) using the GENCODE version 23 (ALL version)
transcriptome annotation. Transcript-level abundances were summarized to
the gene level using the approach described in [56]Soneson et al.
(2015).
Clinical Definition of Responders Versus Non-responders
Responders and non-responders in the NeoALTTO clinical trial were
determined based on pathological complete response (pCR)
([57]Baselga et al., 2012). Any patient without a recorded pCR was
regarded as a non-responder. A pathological complete response is
defined as no invasive cancer in the breast or only non-invasive in
situ cancer in the breast specimen. Surgical breast and axillary node
resection specimens were evaluated for pathologic tumor response
according to the National Surgical Adjuvant Breast and Bowel Project
(NSABP) guidelines^[58]1.
Unsupervised Clustering
Similarities between samples within each arm of the NeoALTTO clinical
trial were computed using Spearman’s rank-order correlation
([59]Wissler, 1905). Hierarchical clustering was then applied to the
similarity matrix between the samples using Euclidean distance and
Ward’s minimum variance method ([60]Murtagh and Legendre, 2014).
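The clustering step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes a genes-by-samples NumPy array, and the function name `cluster_samples` and the choice of cutting the tree into a fixed number of clusters are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import rankdata


def cluster_samples(expr, n_clusters=2):
    """Cluster samples from a genes x samples expression matrix.

    Returns one integer cluster label per sample (column of expr).
    """
    expr = np.asarray(expr, dtype=float)
    # Spearman correlation between samples: rank the genes within each
    # sample, then take the Pearson correlation of the ranks.
    ranks = rankdata(expr, axis=0)
    similarity = np.corrcoef(ranks, rowvar=False)  # samples x samples
    # Each row of the similarity matrix is treated as the feature vector of
    # a sample; Ward linkage is built on Euclidean distances between rows.
    tree = linkage(similarity, method="ward", metric="euclidean")
    # Assumption: cut the tree into a fixed number of clusters.
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Comparing the returned labels with the observed pCR status is one way to check whether unsupervised structure in the expression data tracks response.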
Univariate Biomarker Discovery Using Genes
The concordance index (C-index) between the expression of each gene and
the binarized vector of drug response was calculated as the prediction
performance of that gene as a univariate biomarker. The significance of
each C-index was calculated using a permutation test. The observed
response classes were randomly permuted, and the C-index between the
expression of each gene and the permuted classes of the tumor samples
was calculated. The fraction of permutations in which the C-index
obtained with the real observed classes was lower than the C-index
obtained with the permuted classes was taken as the
significance (or FDR) of the C-index identified for that gene.
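The permutation procedure above can be sketched in a few lines. This is an illustrative sketch rather than the survcomp-based implementation used in the study: the function names are hypothetical, the C-index is computed directly for a binary response, and tied values are assumed to count as one half.

```python
import random


def c_index(scores, labels):
    """C-index of a score against a binary response (1 = responder)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    # Fraction of responder/non-responder pairs ranked correctly;
    # ties count as one half (an assumption of this sketch).
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))


def permutation_significance(scores, labels, n_perm=1000, seed=0):
    """Return the observed C-index and its permutation-based significance."""
    rng = random.Random(seed)
    observed = c_index(scores, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute the response classes
        # Count permutations whose C-index is higher than the observed one,
        # following the strict comparison described in the text.
        if c_index(scores, shuffled) > observed:
            exceed += 1
    return observed, exceed / n_perm
```

A gene whose expression separates the two classes perfectly gets a C-index of 1 and a significance of 0 under this rule, since no permutation can exceed it.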
Concordance Index
We used the concordance index (C-index) to quantify the predictive
value of our drug response predictors. The C-index estimates the
probability that, for a pair of randomly chosen comparable samples, the
sample with the higher predicted value will experience an event before
the other sample or belongs to a higher binary class ([61]Harrell et
al., 1982). We used the implementation of the concordance index
available in the survcomp R package (version 1.34.0) ([62]Schröder et
al., 2011).
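For the binary response used here, the C-index can be written explicitly. The expression below is a sketch under the assumption that tied predictions count as one half, a common convention that may differ from the tie handling in survcomp. The symbols $x_i$ (predicted value), $y_i$ (response class), and $n_1$ and $n_0$ (numbers of responders and non-responders) are introduced here for illustration only.

```latex
C = \frac{1}{n_1 n_0}
    \sum_{i:\, y_i = 1} \; \sum_{j:\, y_j = 0}
    \left[ \mathbf{1}(x_i > x_j) + \tfrac{1}{2}\, \mathbf{1}(x_i = x_j) \right]
```

In this form, C = 0.5 corresponds to a random ranking and C = 1 to perfect separation of responders from non-responders; for a binary outcome it coincides with the area under the ROC curve.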
Transcriptional Similarity Coefficient
The transcriptional similarity coefficient (TSC) between each sample
and the responders and non-responders was computed using the TSC
function in the SIGN R package (version 0.1.0) ([63]Madani Tonekaboni
et al., 2019). Let P be the matrix of expression of genes within a
pathway for a set of biological samples where rows are genes and
columns are samples. Then the TSC is defined as follows:
$$
\mathrm{TSC}(P_1, P_2) = \frac{\sum_{i} (P_{10} \times P_{20})_{ii}}{\sqrt{\sum_{ij} (P_{10})_{ij}^{2} \, \sum_{ij} (P_{20})_{ij}^{2}}}
$$
where $P_1$ and $P_2$ represent the matrix of gene expressions of a given
pathway in two sets of samples (populations 1 and 2), i is the row
index (i.e., gene index) within each matrix, j is the column index
(i.e., sample index) within each matrix, and $P_{m0}$ (either $P_{10}$ or
$P_{20}$) is defined as
$$
P_{m0} = P_m \times P_m' - \mathrm{Diagonal}(P_m \times P_m')
$$
where m is either 1 for population 1 or 2 for population 2. Subtracting
the diagonal elements in the above equation was initially proposed to
the bioinformatics community for analyzing genomics data ([64]Smilde et
al., 2009). This correction ensures that the identified similarities
do not depend on the number of samples compared between the datasets.
The TSC captures the similarity of the pathway expression pattern
between two samples and/or sample sets and lies in the range [−1, 1].
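The definition above translates directly into a few matrix operations. The following NumPy sketch is an independent illustration, not the code of the SIGN R package; the function names are hypothetical, and both inputs are assumed to be genes-by-samples arrays covering the same genes in the same order.

```python
import numpy as np


def zero_diagonal_gram(P):
    """Return P x P' with its diagonal set to zero (P is genes x samples)."""
    G = P @ P.T  # genes x genes
    np.fill_diagonal(G, 0.0)
    return G


def tsc(P1, P2):
    """Transcriptional similarity coefficient between two sample sets."""
    P1 = np.asarray(P1, dtype=float)
    P2 = np.asarray(P2, dtype=float)
    if P1.shape[0] != P2.shape[0]:
        raise ValueError("both matrices must contain the same genes (rows)")
    G1, G2 = zero_diagonal_gram(P1), zero_diagonal_gram(P2)
    # Both matrices are symmetric, so the trace of G1 @ G2 equals the sum of
    # their element-wise product.
    numerator = np.sum(G1 * G2)
    denominator = np.sqrt(np.sum(G1 ** 2) * np.sum(G2 ** 2))
    if denominator == 0:
        raise ValueError("TSC is undefined when an off-diagonal block is all zero")
    return float(numerator / denominator)
```

Because both inputs are reduced to genes-by-genes matrices, the two sample sets may contain different numbers of samples, including a single sample compared against a group.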
Identifying Responders Using TSC
The TSC for each pathway was computed between one sample and the
remaining samples, divided into two groups of responders and
non-responders ([65]Baselga et al., 2012). GO terms in level C5 with 10
to 30 genes were used in this study to identify the similarity between
samples based on their gene expression pattern ([66]Madani Tonekaboni
et al., 2019). We limited the number of genes in GO terms to exclude
large GO terms (at the top of the GO term hierarchy) that are parents
of the GO terms in our study (at the bottom of the GO hierarchy). If
the TSC between a sample and the responders was higher than that
between the sample and the non-responders, the sample was classified as
a responder, and vice
versa. This process was repeated for every sample in each arm of
the trial. The method’s performance for predicting the response of
cancer patients was assessed using the concordance index.
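The decision rule for a single pathway can be written compactly. In this sketch, $P_s$, $P_R$, and $P_N$ are notation assumed here for the expression matrices of the pathway in the held-out sample, the responders, and the non-responders, respectively:

```latex
\Delta\mathrm{TSC}(s) = \mathrm{TSC}(P_s, P_R) - \mathrm{TSC}(P_s, P_N),
\qquad
\hat{y}_s =
\begin{cases}
\text{responder}, & \Delta\mathrm{TSC}(s) > 0 \\
\text{non-responder}, & \text{otherwise}
\end{cases}
```

This difference appears to correspond to the delta TSC referred to in the caption of Figure 3, and it is the per-pathway quantity whose agreement with the observed response is summarized by the concordance index.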
Cross-Validation in Predictive Models
Each model was validated using leave-one-out cross-validation. In this
setting, a target sample was put aside, and the rest of the samples
were used for the prediction of drug response in the target sample. For
each pathway, the TSC was calculated between the set-aside sample and
five samples randomly selected from the responders and from the
non-responders. The median of the TSCs of all the pathways was then
calculated to assess whether the sample had a higher similarity to
responders or non-responders. This process was repeated 100 times for
each sample, and the majority vote over the 100 repetitions was taken as
the predicted class of the sample (responder or non-responder).
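The cross-validation loop can be sketched as below. This is one reading of the procedure, not the study's code: it assumes that five reference samples are drawn from each group, that the medians of the responder and non-responder TSCs across pathways are compared directly, and that ties go to the non-responder class. The names `predict_loocv`, `n_ref`, and `n_repeats` are hypothetical.

```python
import numpy as np


def tsc(P1, P2):
    """TSC between two genes x samples matrices sharing the same genes."""
    G1, G2 = P1 @ P1.T, P2 @ P2.T
    np.fill_diagonal(G1, 0.0)
    np.fill_diagonal(G2, 0.0)
    return np.sum(G1 * G2) / np.sqrt(np.sum(G1 ** 2) * np.sum(G2 ** 2))


def predict_loocv(expr, is_responder, pathways, n_ref=5, n_repeats=100, seed=0):
    """Leave-one-out prediction of response from pathway-level TSCs.

    expr: genes x samples array; is_responder: one boolean per sample;
    pathways: list of integer arrays indexing the rows of expr.
    """
    rng = np.random.default_rng(seed)
    expr = np.asarray(expr, dtype=float)
    is_responder = np.asarray(is_responder, dtype=bool)
    n_samples = expr.shape[1]
    predictions = np.zeros(n_samples, dtype=bool)
    for target in range(n_samples):
        # Set the target sample aside; the rest form the reference pools.
        others = np.delete(np.arange(n_samples), target)
        resp = others[is_responder[others]]
        nonresp = others[~is_responder[others]]
        if resp.size == 0 or nonresp.size == 0:
            raise ValueError("each class needs at least one reference sample")
        votes = 0
        for _ in range(n_repeats):
            ref_r = rng.choice(resp, size=min(n_ref, resp.size), replace=False)
            ref_n = rng.choice(nonresp, size=min(n_ref, nonresp.size), replace=False)
            tsc_r = [tsc(expr[np.ix_(g, [target])], expr[np.ix_(g, ref_r)]) for g in pathways]
            tsc_n = [tsc(expr[np.ix_(g, [target])], expr[np.ix_(g, ref_n)]) for g in pathways]
            # Vote "responder" when the median TSC across pathways is higher
            # for the responder references than for the non-responders.
            votes += int(np.median(tsc_r) > np.median(tsc_n))
        # Majority vote over the repetitions; ties fall to non-responder.
        predictions[target] = votes > n_repeats / 2
    return predictions
```

Resampling a small, equal number of references from each group keeps the comparison balanced when the numbers of responders and non-responders differ.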
Results
We leveraged the gene expression and clinical information of HER2+
breast cancer patients in the NeoALTTO clinical trial to identify
biomarkers of drug response. The NeoALTTO clinical trial was a phase
three randomized clinical trial designed to assess the efficacy of
anti-HER2 monoclonal antibody trastuzumab, the tyrosine kinase
inhibitor lapatinib, and their combination therapies on
HER2-overexpressing breast cancer patients. The response (pCR) rate was
significantly higher in the group given lapatinib and trastuzumab
(51.3%) than in the group given trastuzumab alone (29.5%; p < 0.05).
However, no significant difference in pCR between the lapatinib (24.7%)
and the trastuzumab (p = 0.34) groups was observed.
We computed correlations between tumor samples based on their gene
expression profiles in the three arms of the clinical trial, defined by
treatment type: trastuzumab alone, lapatinib alone, and
their combination ([67]Figure 1). Unsupervised clustering
of the samples could not stratify the patients by their response,
defined by pathological complete response ([68]Figure 2A).
FIGURE 2.
Identifying responders to lapatinib, trastuzumab, and their combination
therapy in the NeoALTTO clinical trial using genes as univariate
predictors of response. (A) Clustering of samples based on their
similarity, defined as the Spearman correlation between gene expression
profiles of the samples. (B) Top 10 genes as univariate biomarkers of
drug response in ER+ and ER– cohorts within each arm of the NeoALTTO
clinical trial.
We further computed the C-index of genes as univariate biomarkers of
drug response in each arm of the NeoALTTO trial. Because ER is known to
be one of the main drivers of breast cancer
development and progression ([71]Fuqua, 1997), we stratified our
analyses by ER status. The top predictors of response yielded a
C-index of 0.68 ([72]Figure 2B), while the C-index of ERBB2 as a
univariate biomarker of response did not exceed 0.59 in any arm
(the full list can be found in the [73]Supplementary Material). The low
performance of univariate modeling could be due to the high correlation
between patient tumor samples: more than 90% of the tumor sample pairs
had a Pearson correlation above 0.9 between their gene expression
profiles.
We recently showed the high performance of a new method called SIGN in
predicting the survival rate of breast cancer patients under different
therapeutic regimens ([74]Madani Tonekaboni et al., 2019). We sought to
use SIGN to predict the drug response of patients in each arm of the
NeoALTTO trial. We used the C-indices of the pathways, computed between
the TSC of each pathway and the drug response in each arm of the trial,
to cluster the arms. The trastuzumab-alone and combination therapy arms
clustered more closely with each other than with the lapatinib-alone
arm, although the difference was not
significant (p > 0.05) ([75]Figure 3A). Moreover, the pathway
biomarkers of ER− and ER+ patient tumors showed little overlap
(absolute Spearman correlation <0.08), suggesting that the mechanism of
response differs with the ER status of the patient tumors
([76]Figure 3B). The top pathway biomarkers for patients with the same
treatment regimen and ER status had C-indices of more than 0.8 except
for ER− patients under trastuzumab alone therapy ([77]Figure 3C).
FIGURE 3.
Identifying responders to lapatinib, trastuzumab, and their combination
therapy in the NeoALTTO clinical trial using the transcriptional
similarity coefficient (TSC) of pathways. (A) Concordance indices of
delta TSCs of GO terms, comparing each sample with responders and
non-responders, in predicting the response of patients to lapatinib,
trastuzumab, and their combination. (B) Clustering of groups of
patients based on the concordance indices of delta TSCs of GO terms shown in (A). (C)
Top pathways as predictors of lapatinib, trastuzumab, and their
combination in ER+ and ER– tumor samples in the NeoALTTO clinical
trial.
Although the biological function of the identified pathways as
biomarkers of drug response requires experimental validation, we found
some evidence of their biological relevance. For example, among the top
identified biomarkers of drug response, there are REGULATION OF
INTERFERON GAMMA BIOSYNTHETIC PROCESS and NEGATIVE T-CELL SELECTION for
ER− and ER+ cancer patients under lapatinib treatment, respectively.
These are in agreement with previous literature on the importance of
immune signaling in lapatinib response in cancer patients ([80]Griguolo
et al., 2019).
Comparison With Other Machine Learning Models
We compared the performance of the top seven biomarkers identified in
each arm of the NeoALTTO clinical trial for patients with ER− or ER+
status with that of 35 machine learning models ([81]Figure 4). These
models were built by combining five machine learning algorithms
(logistic regression, k-nearest-neighbor (k-NN), naive Bayes, random
forest, and support vector machine (SVM)) with seven feature
selection approaches ([82]Figure 4). We used the same
leave-one-out cross-validation strategy as used for SIGN to evaluate
these models. SIGN-based biomarkers outperformed all 35 models in all
treatment categories.
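To illustrate how one such baseline can be evaluated under leave-one-out cross-validation, the sketch below pairs k-NN with a simple feature filter. It is not one of the 35 models of the study: the seven feature selection approaches are not specified here, so the filter (absolute difference in class means), the function name `loocv_knn`, and its parameters are assumptions made for illustration.

```python
import numpy as np


def loocv_knn(expr, labels, n_features=20, k=3):
    """Leave-one-out predictions from k-NN with in-fold feature selection.

    expr: samples x genes array; labels: 0/1 response per sample.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_samples = expr.shape[0]
    predictions = np.empty(n_samples, dtype=int)
    for target in range(n_samples):
        train = np.delete(np.arange(n_samples), target)
        X, y = expr[train], labels[train]
        # Hypothetical filter: rank genes by the absolute difference in mean
        # expression between classes, using training samples only so that the
        # held-out sample never influences which genes are kept.
        gap = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
        keep = np.argsort(gap)[-n_features:]
        # Euclidean distance from the held-out sample to each training sample.
        dist = np.linalg.norm(X[:, keep] - expr[target, keep], axis=1)
        neighbours = y[np.argsort(dist)[:k]]
        # Majority vote among the k nearest samples (an odd k avoids ties).
        predictions[target] = int(2 * neighbours.sum() > k)
    return predictions
```

Selecting features inside each fold, rather than once on the full dataset, keeps the held-out sample out of model building and makes the comparison with the SIGN cross-validation like for like.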
FIGURE 4.
Comparison of performance of top seven biomarkers of drug response
using Similarity Identification in Gene Expression (SIGN) and 35
machine learning models built combining five machine learning methods
and seven different feature selection approaches.
Discussion
We propose SIGN as a new approach to identify biomarkers of drug
response, one that could also be applied to other subtypes of breast
cancer or other tumor types. We
showed the utility of SIGN in predicting the response of HER2+ breast
cancer patients to lapatinib, trastuzumab, and their combination
therapies using transcription patterns within biological pathways. Our
results further emphasize the information gained by using genes
within biological pathways instead of individual markers of drug
response. Furthermore, they suggest the transcriptional similarity
coefficient (TSC) as a new measure of similarity between tumor samples
for predicting their drug response. SIGN-based
biomarkers outperformed 35 different machine learning models in
predicting drug response in each treatment category. Moreover, the SIGN
approach provides us with highly interpretable pathway-based biomarkers
of drug response. Although SIGN showed promising performance for
predicting response to lapatinib, trastuzumab, and their combination in
HER2+ breast cancer patients, this approach needs further validation to
ensure its generalizability to new clinical datasets. As further
clinical data from HER2+ patients in each of these
treatment categories become available, our findings can be further
assessed and validated.
Data Availability Statement
The datasets generated for this study can be found under the
[85]ClinicalTrials.gov identifier [86]NCT00553358.
Author Contributions
SM led the project and performed the computational analysis of the work
under supervision of BH-K. GB and SM collected and curated the data.
All authors contributed to the article and approved the submitted
version.
Conflict of Interest
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Acknowledgments