Graphical abstract
graphic file with name ga1.jpg
[39]Open in a new tab
Keywords: SNP, Drug response, Breast cancer, Prognosis
Abstract
Detecting SNPs associated with drug efficacy or toxicity is helpful to
facilitate personalized medicine. Previous studies usually find SNPs
associated with clinical outcome only in patients received a specific
treatment. However, without information from patients without drug
treatment, it is possible that the detected SNPs are associated with
patients’ clinical outcome even without drug treatment. Here we aimed
to detect drug response SNPs based on data from patients with and
without drug treatment through combing the cox proportional-hazards
model and pairwise Kaplan-Meier survival analysis. A pipeline named
Detection of Drug Response SNPs (DDRS) was built and applied to TCGA
breast cancer data including 363 patients with doxorubicin treatment
and 321 patients without any drug treatment. We identified 548
doxorubicin associated SNPs. Drug response score derived from these
SNPs were associated with drug-resistant level (indicated by IC[50]) of
breast cancer cell lines. Enrichment analyses showed that these SNPs
were enriched in active epigenetic regulation markers (e.g., H3K27ac).
Compared with random genes, the cis-eQTL genes of these SNPs had a
shorter protein–protein interaction distance to doxorubicin associated
genes. In addition, linear discriminant analysis showed that the eQTL
gene expression levels could be used to predict clinical outcome for
patients with doxorubicin treatment (AUC = 0.738). Specifically, we
identified rs2817101 as a drug response SNP for doxorubicin treatment.
Higher expression level of its cis-eQTL gene GSTA1 is associated with
poorer survival. This approach can also be applied to identify new drug
associated SNPs in other cancers.
1. Introduction
Cancer is a group of diseases involving uncontrolled growth and spread
of abnormal cells. Cancer incidence and mortality are rapidly growing.
Much work is ongoing to achieve better treatment of this group of
diseases. However, as an extremely heterogeneous condition, it is
estimated that any particular class of drugs is ineffective in about
75% of cancer patients [40][1]. Therefore, how to realize personalized
medicine (i.e., selecting optional therapy according to patients’
personal profile) remains a key challenge.
It is reported that genetics account for 20%–95% of variability in drug
disposition and effects [41][2]. Single nucleotide polymorphisms (SNPs)
account for about 80% of the overall genomic heterogeneity [42][3].
Moreover, a number of pharmacogenomics studies have demonstrated that
SNPs could influence the efficacy and side effects of drugs [43][4].
For anti-cancer drugs (e.g., irinotecan, mercaptopurine, 5-flurouracil,
and tamoxifen), several SNPs have been reported to be associated with
drug efficacy or toxicity [44][5], [45][6], [46][7]. These early
studies generally focused on SNPs within pre-specified genes of
interest, which might miss other potentially significant polymorphisms.
Recently, with the advances in high throughput technologies,
genome-wide association study (GWAS) provides a hypothesis-free
approach to identify novel SNPs that are responsible for drug response.
For example, Khan et al. [47][8] reported that the common SNP rs8113308
mapped to 19q13.41 was associated with reduced survival among endocrine
treated breast cancer patients. Cairns et al. [48][9] identified a SNP
in CSMD1 associated with breast cancer–free interval in a phase III
randomized trial of anastrozole versus exemestane. Generally, these
studies usually tested the association between SNPs and drug responses
(e.g., recurrence-free survival) only in patients received a specific
treatment. However, without information from patients without drug
treatment, this design cannot discriminate drug response SNPs from
clinical outcome associated SNPs. That’s, it is possible that the
detected SNPs are associated with patients’ clinical outcome even
without drug treatment.
In this study, we aimed to find drug response SNPs based on data from
patients with and without drug treatment. A pipeline named Detection of
Drug Response SNPs (DDRS) was built through combing the cox
proportional-hazards model and pairwise Kaplan-Meier survival analysis.
Briefly, first, we fit a Cox’s proportional hazards model including
three variates, including SNP, drug treatment status, and an
interaction term between SNP and drug treatment in data including
patients with and without drug treatment. SNPs with significant drug
treatment and interaction term were remained. Second, Kaplan-Meier (KM)
survival analysis were performed in patients with different genotypes
to further generate the final set of SNPs. SNPs with significant KM
results in patients without drug treatment were removed. To test the
performance of DDRS, data of breast cancer patients from The Cancer
Genome Project (TCGA) project were analyzed.
2. Methods
2.1. Pipeline of DDRS
The outline of DDRS is shown in [49]Fig. 1. The input data of the study
population included the genotype and patient’s survival information.
Specifically, patients without drug treatment were also included. We
used the patients’ survival status to indicate drug response since
overall survival is believed as the primary end point to evaluate the
outcome of any drug [50][10]. Overall survival has also been used as
the endpoint to find drug response and toxicity loci in previous
studies [51][11], [52][12]. First, a Cox’s proportional hazards model
was built to detect significant SNP-drug interactions (P < 0.05 for
[MATH: β3 :MATH]
).
[MATH: ht,X=h0texpβ1×drug+β2×<
mi>SNP+β3×<
mi>drug×SNP :MATH]
(1)
Fig. 1.
[53]Fig. 1
[54]Open in a new tab
An overview of the approach. Step 1: We performed SNP × drug
interaction analysis in all patients (including patients with drug
treatment and patients without any drug treatment). SNPs with
significant drug term (P < 0.05 for
[MATH: β1 :MATH]
) and significant interaction term (P < 0.05 for
[MATH: β3 :MATH]
) were remained. Step 2: For SNPs obtained from the first step, we
performed Kaplan-Meier (KM) analysis in subjects with different
genotypes to select SNPs associated with drug response in patients with
drug treatment. KM analysis in patients without any drug treatment was
also performed to remove the SNPs associated with survival but this
association was not related to drug treatment. Step 3: We performed
univariate cox proportional hazards analysis for each drug response
SNPs get from the first two steps. The coefficients for all SNPs was
used to calculate drug response score (DRS) in another population to
test the performance of drug response prediction.
We only focused on drugs with significant effect on survival (P < 0.05
for
[MATH: β1 :MATH]
) and significant interaction terms (P < 0.05 for
[MATH: β3 :MATH]
). Second, we performed KM survival analysis in patients with and
without drug treatment separately to further select the SNPs related to
drug response. For KM analysis, only subgroups with more than 20
patients (at least 5 patients with dead events) were remained for
analysis. SNPs with FDR-adjusted P < 0.05 in the KM analysis of
patients with drug treatment were selected. In addition, SNPs with
P < 0.05 in the KM analysis of patients without drug treatment were
removed to rule out the associations unrelated to drug treatment.
Finally, to test the generalization ability of the selected SNP sets in
other populations, we calculated the correlation between each SNP and
survival in one population:
[MATH: ht,X=h0texp(β×SNP) :MATH]
(2)
and constructed a drug response score (DRS) for each individual in
another dataset as follows:
[MATH: S=∑βi
×Xi :MATH]
(3)
where
[MATH: βi :MATH]
represents the coefficient of SNPi in Eq. [55](2) and
[MATH: Xi :MATH]
represents the tested allele’s copies of SNPi in Eq. [56](2). The
correlation between S and drug response were then calculated to test
whether the selected SNP sets could be used for survival prediction.
2.2. Genotype and clinical data collection processing
Clinical information, drug treatment information, germline DNA
genotypes for TCGA breast cancer samples (N = 1096) were obtained from
the Genomic Data Commons Data Portal
([57]https://cancergenome.nih.gov/, GDC portal). The genotyping
platform for all patients was the Affymetrix 6.0 array. Genotypes with
score < 0.1 are considered to be highly confident (Broad institute,
BIRDSUITE software) and 928,706 SNPs were retained in the study.
2.3. DRS and drug-resistant levels in cell lines
We constructed DRSs for 45 cancer cell lines (Table S1) to test whether
DRS can be used to predict drug response. Gene expression and drug
response data for these cell lines were obtained from the Genomics of
Drug Sensitivity in Cancer (GDSC) database [58][13]. We used IC[50]
(half maximal inhibitory concentration) for cell apoptosis to indicate
the drug-resistant level. Corresponding genotype data for all cell
lines were obtained from the Gene Expression Omnibus (GEO) database
([59]GSE34211 and [60]GSE41308). We used R package crlmm [61][14] for
genotype calling.
2.4. Functional annotation of the selected SNPs
We annotated the epigenetic regulatory features for SNPs using data
from the following resources: 1) we downloaded the 8 types of ChIP-seq
data from the GEO database ([62]GSE85158 [63][15]) for 11 breast cancer
cells. Sequencing reads were mapped to the human genome reference
(hg19) using Bowtie2 [64][16] with default settings. MACS2 [65][17] was
used to call peaks (-g hs -q 0.05 -n --keep-dup all). 2) we downloaded
the predicted enhancer region for BRCA from CistromeCancer [66][18].
The genome coordinates were converted from hg38 to hg19 using liftover
[67][19]. Next, we performed 10,000 permutation tests by using random
SNP sets generated by SNPsnap [68][20] with default settings. To
calculate the enrichment p-value, we simply count the number of sets
with annotated SNPs as or more extreme than our selected SNP set, and
divide that number by the total number (10,000). The enrichment score
(ES) was calculated as follows:
[MATH: ES=∑i=1<
/mrow>10,000x0xi/10,000
mrow> :MATH]
where
[MATH: x0 :MATH]
is the number of annotated SNPs from our selected SNP set, and
[MATH: xi :MATH]
is the number of annotated SNPs in the ith random SNP set.
2.5. eQTL analysis
Gene expression data normalized using RSEM [69][21] and segmented copy
number variation data were downloaded from Genome Data Analysis Center
database (GDAC, [70]http://gdac.broadinstitute.org/). Absolute gene
copy numbers were calculated from segmented copy number files by
ABSOLUTE [71][22], and then used as covariate to adjust the gene
expression data [72][23]. To remove the effect of population structure
on gene expression, we used smartpca in the EIGENSOFT [73][24] program
to perform principal component (PC) analysis, and selected the top 10
PCs from genome-wide genotype data as covariates. To remove the hidden
batch effects and other potential confounders in the gene expression
data, we also used the Probabilistic Estimation of Expression Residuals
[74][25] (PEER) method to select the first 15 PEER factors as
covariates. Age, pathologic stage and race were also used as
covariates. EQTL analysis was performed using matrix eQTL [75][26].
SNPs with false discovery rates (FDR) < 0.05 were defined as eQTL
genes. Cis-eQTL genes were defined if the SNP was located within 1 Mb
from the gene transcriptional start site (TSS).
2.6. Extraction of drug target genes and protein–protein interactions (PPIs)
We collected doxorubicin targets genes from the DrugBank database
[76][27] and Therapeutic Target Database [77][28]. We also collected
doxorubicin related enzymes, carriers and transporters from DrugBank.
Genes directly interacted with doxorubicin (confidence score over 0.7)
were collected from STITCH database [78][29]. To get PPIs, we
downloaded the human interactome published by Cheng et al. [79][30],
which collected 243,603 PPIs of 16,677 unique proteins. Next, we
performed 10,000 permutation tests by randomly chose same number of
genes from 16,677 proteins. To calculate the permutation p-value, we
simply compared the mean shortest PPI distances from eQTL genes to
doxorubicin and from randomly chosen genes to doxorubicin. We count the
number of randomly chosen gene sets with shorter mean PPI distances
than our eQTL gene set, and divide that number by the total permutation
number (10,000) [80][30].
2.7. Doxorubicin resistance prediction model
We constructed multigene classifiers using these eQTL genes and Linear
Discriminant Analysis (LDA) in [81]GSE20194, which has 230 patients
received 6 months of preoperative chemotherapy including doxorubicin.
182 patients were categorized as residual invasive cancer (RD) and 48
patients were categorized as pathological complete response (pCR, no
residual invasive cancer in the breast or lymph nodes). We performed
1000 times repeated 1-fold cross-validation. The classifier performance
on the validation data were assessed by using the area under the
receiver operating characteristic curve (ROC-AUC).
2.8. Pathway enrichment and pathway activity inference
The eQTL genes were ranked according to the product of the β from eQTL
analysis and the β from Cox analysis (i.e., β[cox] * β[eQTL]). For
genes with more than one significant SNPs, the results with the maximum
absolute value was remained. Pathway enrichment analysis for these
genes with pre-ranked values was then performed using GSEA [82][31].
Positive pathways enriched at P < 0.05 were recognized as
doxorubicin-resistant/risk pathways and negative pathways enriched at
P < 0.05 were recognized as doxorubicin-sensitive/protective pathways.
The pathways’ activity score (PAS) were calculated with diffrank
[83][32].
3. Result
3.1. Identification of doxorubicin response SNPs in breast cancer
We applied this approach to TCGA breast cancer data to detect
doxorubicin response SNPs. 363 patients with doxorubicin treatment and
321 patients without any drug treatment was included. Detailed
characteristics of the patients are provided in [84]Table 1. Using the
cox’s proportional hazards model, we identified 8020 SNPs that might be
interacting with doxorubicin treatment. Next, KM analysis in patients
with doxorubicin treatment showed that subjects with different
genotypes of 619 SNPs had significant different survival status (FDR
P < 0.05). In addition, using KM analysis in patients without any drug
treatment, we further ruled out the SNP-survival associations not
related to drug treatment and 71 SNPs were removed. Therefore, after
analyzing with the DDRS pipeline, a total of 548 doxorubicin associated
SNPs were identified with the detailed information summarized in
Supplementary Table S2.
Table 1.
Pre-treatment characteristics of the patients.
Patient with doxorubicin treatment Patients without any drug treatment
All patients
Number 363 321 684
Age
<=50 173 (48%) 71 (22%) 244 (36%)
>50 189 (52%) 240 (77%) 429 (64%)
Mean (SD) 52 (10) 62 (14) 56 (13)
Unknown 1 10 11
Nodal status
Positive 238 (66%) 152 (48%) 390 (57%)
Negative 122 (33%) 158 (49%) 280 (41%)
Unknown 3 (1%) 11 (3%) 14 (2%)
T stage
1 73 (21%) 79 (25%) 152 (22%)
2 233 (64%) 177 (55%) 410 (60%)
3 52 (14%) 41 (13%) 93 (14%)
4 5 (1%) 22 (7%) 27 (4%)
Unknown 0 (0%) 2 (0%) 2 (0%)
Pathologic stage
1 36 (10%) 59 (18%) 95 (14%)
2 213 (59%) 174 (54%) 387 (57%)
3 108 (30%) 67 (21%) 175 (26%)
4 2 (0%) 9 (3%) 11 (1%)
Unknown 4 (1%) 12 (4%) 16 (2%)
ER Status
Positive 244 (67%) 225 (70%) 469 (69%)
Negative 106 (29%) 79 (25%) 185 (27%)
Unknown 13 (4%) 17 (5%) 30 (4%)
PR Status
Positive 209 (58%) 188 (59%) 397 (58%)
Negative 139 (38%) 115 (36%) 254 (37%)
Unknown 15 (4%) 18 (5%) 33 (5%)
HER2 Status
Positive 45 (12%) 61 (19%) 106 (15%)
Negative 207 (57%) 144 (45%) 351 (51%)
Unknown 111 (31%) 116 (36%) 227 (34%)
[85]Open in a new tab
3.2. DRS derived from selected SNP are associated with drug resistant level
We calculated the DRS in the rest 269 TCGA breast cancer patients with
other drug treatment. As shown in [86]Fig. 2A, compared with other
patients, the basal like patients had the highest DRS. We also
calculated doxorubicin-associated DRS in breast cancer cell lines
([87]Fig. 2B). These cells were divided into doxorubicin-resistant
group (IC[50] higher than median value) and doxorubicin-sensitive group
(IC[50] lower than median value). The DRS in the doxorubicin-resistant
group was significantly higher than the doxorubicin-sensitive group
(P = 0.044).
Fig. 2.
[88]Fig. 2
[89]Open in a new tab
A: Boxplot of DRS of patients in different molecular subtypes. Patients
were stratified into groups of Luminal A, Luminal B, HER2 and Basal
like by the ER, PR and HER2 markers. B: Boxplot of DRS between
doxorubicin-resistant cells (High IC[50]) and doxorubicin-sensitive
(Low IC[50]) cells. C: Enrichment analysis heatmap plot of those
significant SNPs’ epigenetic annotation. Enrichment score (ES) are
color-coded from light to dark.
3.3. The selected SNPs are enriched in active regulatory epigenetic markers
According to the genomic region annotation results, over 94% SNPs are
located in the intergenic or intronic region. We further examined
whether these doxorubicin-associated SNPs were associated with active
epigenetic regulation using ChIP-seq data from 11 breast cancer cell
lines (Table S1, [90]Fig. 2C). The results showed that 34% SNPs are
located in active epigenetic mark regions of at least one breast cancer
cell lines. Further analysis showed that these SNPs were significantly
enriched in histone epigenetic regions associated with active enhancer
(H3K27ac [91][33], H3K4me1 [92][34] and H4K8ac [93][35]), active gene
transcription (H3K36me3 [94][36], H3K9ac [95][33] and H2BK120ub1
[96][37]) ([97]Fig. 2C) and also in predicted BRCA enhancer regions
(P = 0.0008, 10,000 permutations; 1.46-fold).
3.4. The eQTL gene expression levels could be used to predict clinical
outcome
We identified 958 cis-eQTL genes for these 548 doxorubicin associated
SNPs. Compared with random selected genes, these genes had
significantly shorter mean PPI distances to doxorubicin target genes,
doxorubicin related enzymes/carriers/transporters, and
doxorubicin-interacting genes ([98]Fig. 3A, P < 0.001). Next, we
estimated these genes’ doxorubicin-response predictive power. We used
all the genes’ expression as features to train a LDA classifier to
distinguish patients from RD and pCR. The ROC-AUC was 0.738 (CI:
0.736–0.741) ([99]Fig. 3B).
Fig. 3.
[100]Fig. 3
[101]Open in a new tab
A: Violin plot of mean shortest PPI distances to doxorubicin target.
Red bars represent the mean PPI distance of the cis-eQTL genes to
doxorubicin target genes, doxorubicin related enzymes, doxorubicin
related enzymes/carriers/transporters, and doxorubicin-interacting
genes. Blue bars represent the mean shortest PPI distance of 10,000
groups of randomly selected genes. B: ROC curve for the predictive
performance of the LDA genomic pCR predictor with these eQTL genes as
features. C: Volcano plot of univariate cox proportional hazards
results of eQTL genes. Horizontal axis showed the univariate cox
proportional hazards confidences and vertical axis showed the negative
log of the P values. Doxorubicin-resistant genes are shown on upper and
doxorubicin-sensitive genes are shown on below. D: Volcano plot of
univariate cox proportional hazards results of PAS of enriched
pathways. Horizontal axis shows the univariate cox proportional hazards
confidences and vertical axis showed the negative log of the P values.
Positive pathways are shown on upper and negative pathways are shown on
below. E: Univariate cox proportional hazards confidences of 6 positive
pathways had significant risk PAS and 7 negative pathways had
significant protective PAS. (For interpretation of the references to