Graphical abstract

   graphic file with name ga1.jpg
   [39]Open in a new tab

   Keywords: SNP, Drug response, Breast cancer, Prognosis

Abstract

   Detecting SNPs associated with drug efficacy or toxicity is helpful to
   facilitate personalized medicine. Previous studies usually find SNPs
   associated with clinical outcome only in patients received a specific
   treatment. However, without information from patients without drug
   treatment, it is possible that the detected SNPs are associated with
   patients’ clinical outcome even without drug treatment. Here we aimed
   to detect drug response SNPs based on data from patients with and
   without drug treatment through combing the cox proportional-hazards
   model and pairwise Kaplan-Meier survival analysis. A pipeline named
   Detection of Drug Response SNPs (DDRS) was built and applied to TCGA
   breast cancer data including 363 patients with doxorubicin treatment
   and 321 patients without any drug treatment. We identified 548
   doxorubicin associated SNPs. Drug response score derived from these
   SNPs were associated with drug-resistant level (indicated by IC[50]) of
   breast cancer cell lines. Enrichment analyses showed that these SNPs
   were enriched in active epigenetic regulation markers (e.g., H3K27ac).
   Compared with random genes, the cis-eQTL genes of these SNPs had a
   shorter protein–protein interaction distance to doxorubicin associated
   genes. In addition, linear discriminant analysis showed that the eQTL
   gene expression levels could be used to predict clinical outcome for
   patients with doxorubicin treatment (AUC = 0.738). Specifically, we
   identified rs2817101 as a drug response SNP for doxorubicin treatment.
   Higher expression level of its cis-eQTL gene GSTA1 is associated with
   poorer survival. This approach can also be applied to identify new drug
   associated SNPs in other cancers.

1. Introduction

   Cancer is a group of diseases involving uncontrolled growth and spread
   of abnormal cells. Cancer incidence and mortality are rapidly growing.
   Much work is ongoing to achieve better treatment of this group of
   diseases. However, as an extremely heterogeneous condition, it is
   estimated that any particular class of drugs is ineffective in about
   75% of cancer patients [40][1]. Therefore, how to realize personalized
   medicine (i.e., selecting optional therapy according to patients’
   personal profile) remains a key challenge.

   It is reported that genetics account for 20%–95% of variability in drug
   disposition and effects [41][2]. Single nucleotide polymorphisms (SNPs)
   account for about 80% of the overall genomic heterogeneity [42][3].
   Moreover, a number of pharmacogenomics studies have demonstrated that
   SNPs could influence the efficacy and side effects of drugs [43][4].
   For anti-cancer drugs (e.g., irinotecan, mercaptopurine, 5-flurouracil,
   and tamoxifen), several SNPs have been reported to be associated with
   drug efficacy or toxicity [44][5], [45][6], [46][7]. These early
   studies generally focused on SNPs within pre-specified genes of
   interest, which might miss other potentially significant polymorphisms.
   Recently, with the advances in high throughput technologies,
   genome-wide association study (GWAS) provides a hypothesis-free
   approach to identify novel SNPs that are responsible for drug response.
   For example, Khan et al. [47][8] reported that the common SNP rs8113308
   mapped to 19q13.41 was associated with reduced survival among endocrine
   treated breast cancer patients. Cairns et al. [48][9] identified a SNP
   in CSMD1 associated with breast cancer–free interval in a phase III
   randomized trial of anastrozole versus exemestane. Generally, these
   studies usually tested the association between SNPs and drug responses
   (e.g., recurrence-free survival) only in patients received a specific
   treatment. However, without information from patients without drug
   treatment, this design cannot discriminate drug response SNPs from
   clinical outcome associated SNPs. That’s, it is possible that the
   detected SNPs are associated with patients’ clinical outcome even
   without drug treatment.

   In this study, we aimed to find drug response SNPs based on data from
   patients with and without drug treatment. A pipeline named Detection of
   Drug Response SNPs (DDRS) was built through combing the cox
   proportional-hazards model and pairwise Kaplan-Meier survival analysis.
   Briefly, first, we fit a Cox’s proportional hazards model including
   three variates, including SNP, drug treatment status, and an
   interaction term between SNP and drug treatment in data including
   patients with and without drug treatment. SNPs with significant drug
   treatment and interaction term were remained. Second, Kaplan-Meier (KM)
   survival analysis were performed in patients with different genotypes
   to further generate the final set of SNPs. SNPs with significant KM
   results in patients without drug treatment were removed. To test the
   performance of DDRS, data of breast cancer patients from The Cancer
   Genome Project (TCGA) project were analyzed.

2. Methods

2.1. Pipeline of DDRS

   The outline of DDRS is shown in [49]Fig. 1. The input data of the study
   population included the genotype and patient’s survival information.
   Specifically, patients without drug treatment were also included. We
   used the patients’ survival status to indicate drug response since
   overall survival is believed as the primary end point to evaluate the
   outcome of any drug [50][10]. Overall survival has also been used as
   the endpoint to find drug response and toxicity loci in previous
   studies [51][11], [52][12]. First, a Cox’s proportional hazards model
   was built to detect significant SNP-drug interactions (P < 0.05 for
   [MATH: <msub><mi>β</mi><mn>3</mn></msub> :MATH]
   ).
   [MATH: <mrow><mi>h</mi><mfenced close=")"
   open="("><mrow><mi>t</mi><mo>,</mo><mi>X</mi></mrow></mfenced><mo
   linebreak="goodbreak">=</mo><msub><mi>h</mi><mn>0</mn></msub><mfenced
   close=")" open="("><mrow><mi>t</mi></mrow></mfenced><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mfenced close=")"
   open="("><mrow><msub><mi>β</mi><mn>1</mn></msub><mo>×</mo><mi>d</mi><mi
   >r</mi><mi>u</mi><mi>g</mi><mo
   linebreak="badbreak">+</mo><msub><mi>β</mi><mn>2</mn></msub><mo>×</mo><
   mi>S</mi><mi>N</mi><mi>P</mi><mo
   linebreak="badbreak">+</mo><msub><mi>β</mi><mn>3</mn></msub><mo>×</mo><
   mi>d</mi><mi>r</mi><mi>u</mi><mi>g</mi><mo>×</mo><mi>S</mi><mi>N</mi><m
   i>P</mi></mrow></mfenced></mrow> :MATH]
   (1)

Fig. 1.

   [53]Fig. 1
   [54]Open in a new tab

   An overview of the approach. Step 1: We performed SNP × drug
   interaction analysis in all patients (including patients with drug
   treatment and patients without any drug treatment). SNPs with
   significant drug term (P < 0.05 for
   [MATH: <mrow><msub><mi>β</mi><mn>1</mn></msub></mrow> :MATH]
   ) and significant interaction term (P < 0.05 for
   [MATH: <mrow><msub><mi>β</mi><mn>3</mn></msub></mrow> :MATH]
   ) were remained. Step 2: For SNPs obtained from the first step, we
   performed Kaplan-Meier (KM) analysis in subjects with different
   genotypes to select SNPs associated with drug response in patients with
   drug treatment. KM analysis in patients without any drug treatment was
   also performed to remove the SNPs associated with survival but this
   association was not related to drug treatment. Step 3: We performed
   univariate cox proportional hazards analysis for each drug response
   SNPs get from the first two steps. The coefficients for all SNPs was
   used to calculate drug response score (DRS) in another population to
   test the performance of drug response prediction.

   We only focused on drugs with significant effect on survival (P < 0.05
   for
   [MATH: <mrow><msub><mi>β</mi><mn>1</mn></msub></mrow> :MATH]
   ) and significant interaction terms (P < 0.05 for
   [MATH: <msub><mi>β</mi><mn>3</mn></msub> :MATH]
   ). Second, we performed KM survival analysis in patients with and
   without drug treatment separately to further select the SNPs related to
   drug response. For KM analysis, only subgroups with more than 20
   patients (at least 5 patients with dead events) were remained for
   analysis. SNPs with FDR-adjusted P < 0.05 in the KM analysis of
   patients with drug treatment were selected. In addition, SNPs with
   P < 0.05 in the KM analysis of patients without drug treatment were
   removed to rule out the associations unrelated to drug treatment.
   Finally, to test the generalization ability of the selected SNP sets in
   other populations, we calculated the correlation between each SNP and
   survival in one population:
   [MATH: <mrow><mi>h</mi><mfenced close=")"
   open="("><mrow><mi>t</mi><mo>,</mo><mi>X</mi></mrow></mfenced><mo
   linebreak="goodbreak">=</mo><msub><mi>h</mi><mn>0</mn></msub><mfenced
   close=")" open="("><mrow><mi>t</mi></mrow></mfenced><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">x</mi><mi
   mathvariant="normal">p</mi><mrow><mo
   stretchy="false">(</mo><mi>β</mi><mo>×</mo><mi>S</mi><mi>N</mi><mi>P</m
   i><mo stretchy="false">)</mo></mrow></mrow> :MATH]
   (2)

   and constructed a drug response score (DRS) for each individual in
   another dataset as follows:
   [MATH: <mrow><mi>S</mi><mo
   linebreak="goodbreak">=</mo><mo>∑</mo><msub><mi>β</mi><mi>i</mi></msub>
   <mo>×</mo><msub><mi>X</mi><mi>i</mi></msub></mrow> :MATH]
   (3)

   where
   [MATH: <msub><mi>β</mi><mi>i</mi></msub> :MATH]
   represents the coefficient of SNPi in Eq. [55](2) and
   [MATH: <msub><mi>X</mi><mi>i</mi></msub> :MATH]
   represents the tested allele’s copies of SNPi in Eq. [56](2). The
   correlation between S and drug response were then calculated to test
   whether the selected SNP sets could be used for survival prediction.

2.2. Genotype and clinical data collection processing

   Clinical information, drug treatment information, germline DNA
   genotypes for TCGA breast cancer samples (N = 1096) were obtained from
   the Genomic Data Commons Data Portal
   ([57]https://cancergenome.nih.gov/, GDC portal). The genotyping
   platform for all patients was the Affymetrix 6.0 array. Genotypes with
   score < 0.1 are considered to be highly confident (Broad institute,
   BIRDSUITE software) and 928,706 SNPs were retained in the study.

2.3. DRS and drug-resistant levels in cell lines

   We constructed DRSs for 45 cancer cell lines (Table S1) to test whether
   DRS can be used to predict drug response. Gene expression and drug
   response data for these cell lines were obtained from the Genomics of
   Drug Sensitivity in Cancer (GDSC) database [58][13]. We used IC[50]
   (half maximal inhibitory concentration) for cell apoptosis to indicate
   the drug-resistant level. Corresponding genotype data for all cell
   lines were obtained from the Gene Expression Omnibus (GEO) database
   ([59]GSE34211 and [60]GSE41308). We used R package crlmm [61][14] for
   genotype calling.

2.4. Functional annotation of the selected SNPs

   We annotated the epigenetic regulatory features for SNPs using data
   from the following resources: 1) we downloaded the 8 types of ChIP-seq
   data from the GEO database ([62]GSE85158 [63][15]) for 11 breast cancer
   cells. Sequencing reads were mapped to the human genome reference
   (hg19) using Bowtie2 [64][16] with default settings. MACS2 [65][17] was
   used to call peaks (-g hs -q 0.05 -n --keep-dup all). 2) we downloaded
   the predicted enhancer region for BRCA from CistromeCancer [66][18].
   The genome coordinates were converted from hg38 to hg19 using liftover
   [67][19]. Next, we performed 10,000 permutation tests by using random
   SNP sets generated by SNPsnap [68][20] with default settings. To
   calculate the enrichment p-value, we simply count the number of sets
   with annotated SNPs as or more extreme than our selected SNP set, and
   divide that number by the total number (10,000). The enrichment score
   (ES) was calculated as follows:
   [MATH: <mrow><mi mathvariant="normal">E</mi><mi
   mathvariant="normal">S</mi><mo linebreak="goodbreak">=</mo><mfenced
   close=")"
   open="("><mrow><msubsup><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn><
   /mrow><mrow><mn>10</mn><mo>,</mo><mn>000</mn></mrow></msubsup><mfrac><m
   sub><mi>x</mi><mn>0</mn></msub><msub><mi>x</mi><mi>i</mi></msub></mfrac
   ></mrow></mfenced><mo
   stretchy="false">/</mo><mrow><mn>10</mn><mo>,</mo><mn>000</mn></mrow></
   mrow> :MATH]

   where
   [MATH: <msub><mi>x</mi><mn>0</mn></msub> :MATH]
   is the number of annotated SNPs from our selected SNP set, and
   [MATH: <msub><mi>x</mi><mi>i</mi></msub> :MATH]
   is the number of annotated SNPs in the ith random SNP set.

2.5. eQTL analysis

   Gene expression data normalized using RSEM [69][21] and segmented copy
   number variation data were downloaded from Genome Data Analysis Center
   database (GDAC, [70]http://gdac.broadinstitute.org/). Absolute gene
   copy numbers were calculated from segmented copy number files by
   ABSOLUTE [71][22], and then used as covariate to adjust the gene
   expression data [72][23]. To remove the effect of population structure
   on gene expression, we used smartpca in the EIGENSOFT [73][24] program
   to perform principal component (PC) analysis, and selected the top 10
   PCs from genome-wide genotype data as covariates. To remove the hidden
   batch effects and other potential confounders in the gene expression
   data, we also used the Probabilistic Estimation of Expression Residuals
   [74][25] (PEER) method to select the first 15 PEER factors as
   covariates. Age, pathologic stage and race were also used as
   covariates. EQTL analysis was performed using matrix eQTL [75][26].
   SNPs with false discovery rates (FDR) < 0.05 were defined as eQTL
   genes. Cis-eQTL genes were defined if the SNP was located within 1 Mb
   from the gene transcriptional start site (TSS).

2.6. Extraction of drug target genes and protein–protein interactions (PPIs)

   We collected doxorubicin targets genes from the DrugBank database
   [76][27] and Therapeutic Target Database [77][28]. We also collected
   doxorubicin related enzymes, carriers and transporters from DrugBank.
   Genes directly interacted with doxorubicin (confidence score over 0.7)
   were collected from STITCH database [78][29]. To get PPIs, we
   downloaded the human interactome published by Cheng et al. [79][30],
   which collected 243,603 PPIs of 16,677 unique proteins. Next, we
   performed 10,000 permutation tests by randomly chose same number of
   genes from 16,677 proteins. To calculate the permutation p-value, we
   simply compared the mean shortest PPI distances from eQTL genes to
   doxorubicin and from randomly chosen genes to doxorubicin. We count the
   number of randomly chosen gene sets with shorter mean PPI distances
   than our eQTL gene set, and divide that number by the total permutation
   number (10,000) [80][30].

2.7. Doxorubicin resistance prediction model

   We constructed multigene classifiers using these eQTL genes and Linear
   Discriminant Analysis (LDA) in [81]GSE20194, which has 230 patients
   received 6 months of preoperative chemotherapy including doxorubicin.
   182 patients were categorized as residual invasive cancer (RD) and 48
   patients were categorized as pathological complete response (pCR, no
   residual invasive cancer in the breast or lymph nodes). We performed
   1000 times repeated 1-fold cross-validation. The classifier performance
   on the validation data were assessed by using the area under the
   receiver operating characteristic curve (ROC-AUC).

2.8. Pathway enrichment and pathway activity inference

   The eQTL genes were ranked according to the product of the β from eQTL
   analysis and the β from Cox analysis (i.e., β[cox] * β[eQTL]). For
   genes with more than one significant SNPs, the results with the maximum
   absolute value was remained. Pathway enrichment analysis for these
   genes with pre-ranked values was then performed using GSEA [82][31].
   Positive pathways enriched at P < 0.05 were recognized as
   doxorubicin-resistant/risk pathways and negative pathways enriched at
   P < 0.05 were recognized as doxorubicin-sensitive/protective pathways.
   The pathways’ activity score (PAS) were calculated with diffrank
   [83][32].

3. Result

3.1. Identification of doxorubicin response SNPs in breast cancer

   We applied this approach to TCGA breast cancer data to detect
   doxorubicin response SNPs. 363 patients with doxorubicin treatment and
   321 patients without any drug treatment was included. Detailed
   characteristics of the patients are provided in [84]Table 1. Using the
   cox’s proportional hazards model, we identified 8020 SNPs that might be
   interacting with doxorubicin treatment. Next, KM analysis in patients
   with doxorubicin treatment showed that subjects with different
   genotypes of 619 SNPs had significant different survival status (FDR
   P < 0.05). In addition, using KM analysis in patients without any drug
   treatment, we further ruled out the SNP-survival associations not
   related to drug treatment and 71 SNPs were removed. Therefore, after
   analyzing with the DDRS pipeline, a total of 548 doxorubicin associated
   SNPs were identified with the detailed information summarized in
   Supplementary Table S2.

Table 1.

   Pre-treatment characteristics of the patients.
   Patient with doxorubicin treatment Patients without any drug treatment
   All patients
   Number 363 321 684
   Age
   <=50 173 (48%) 71 (22%) 244 (36%)
   >50 189 (52%) 240 (77%) 429 (64%)
   Mean (SD) 52 (10) 62 (14) 56 (13)
   Unknown 1 10 11
     
   Nodal status
   Positive 238 (66%) 152 (48%) 390 (57%)
   Negative 122 (33%) 158 (49%) 280 (41%)
   Unknown 3 (1%) 11 (3%) 14 (2%)
     
   T stage
   1 73 (21%) 79 (25%) 152 (22%)
   2 233 (64%) 177 (55%) 410 (60%)
   3 52 (14%) 41 (13%) 93 (14%)
   4 5 (1%) 22 (7%) 27 (4%)
   Unknown 0 (0%) 2 (0%) 2 (0%)
     
   Pathologic stage
   1 36 (10%) 59 (18%) 95 (14%)
   2 213 (59%) 174 (54%) 387 (57%)
   3 108 (30%) 67 (21%) 175 (26%)
   4 2 (0%) 9 (3%) 11 (1%)
   Unknown 4 (1%) 12 (4%) 16 (2%)
     
   ER Status
   Positive 244 (67%) 225 (70%) 469 (69%)
   Negative 106 (29%) 79 (25%) 185 (27%)
   Unknown 13 (4%) 17 (5%) 30 (4%)
     
   PR Status
   Positive 209 (58%) 188 (59%) 397 (58%)
   Negative 139 (38%) 115 (36%) 254 (37%)
   Unknown 15 (4%) 18 (5%) 33 (5%)
     
   HER2 Status
   Positive 45 (12%) 61 (19%) 106 (15%)
   Negative 207 (57%) 144 (45%) 351 (51%)
   Unknown 111 (31%) 116 (36%) 227 (34%)
   [85]Open in a new tab

3.2. DRS derived from selected SNP are associated with drug resistant level

   We calculated the DRS in the rest 269 TCGA breast cancer patients with
   other drug treatment. As shown in [86]Fig. 2A, compared with other
   patients, the basal like patients had the highest DRS. We also
   calculated doxorubicin-associated DRS in breast cancer cell lines
   ([87]Fig. 2B). These cells were divided into doxorubicin-resistant
   group (IC[50] higher than median value) and doxorubicin-sensitive group
   (IC[50] lower than median value). The DRS in the doxorubicin-resistant
   group was significantly higher than the doxorubicin-sensitive group
   (P = 0.044).

Fig. 2.

   [88]Fig. 2
   [89]Open in a new tab

   A: Boxplot of DRS of patients in different molecular subtypes. Patients
   were stratified into groups of Luminal A, Luminal B, HER2 and Basal
   like by the ER, PR and HER2 markers. B: Boxplot of DRS between
   doxorubicin-resistant cells (High IC[50]) and doxorubicin-sensitive
   (Low IC[50]) cells. C: Enrichment analysis heatmap plot of those
   significant SNPs’ epigenetic annotation. Enrichment score (ES) are
   color-coded from light to dark.

3.3. The selected SNPs are enriched in active regulatory epigenetic markers

   According to the genomic region annotation results, over 94% SNPs are
   located in the intergenic or intronic region. We further examined
   whether these doxorubicin-associated SNPs were associated with active
   epigenetic regulation using ChIP-seq data from 11 breast cancer cell
   lines (Table S1, [90]Fig. 2C). The results showed that 34% SNPs are
   located in active epigenetic mark regions of at least one breast cancer
   cell lines. Further analysis showed that these SNPs were significantly
   enriched in histone epigenetic regions associated with active enhancer
   (H3K27ac [91][33], H3K4me1 [92][34] and H4K8ac [93][35]), active gene
   transcription (H3K36me3 [94][36], H3K9ac [95][33] and H2BK120ub1
   [96][37]) ([97]Fig. 2C) and also in predicted BRCA enhancer regions
   (P = 0.0008, 10,000 permutations; 1.46-fold).

3.4. The eQTL gene expression levels could be used to predict clinical
outcome

   We identified 958 cis-eQTL genes for these 548 doxorubicin associated
   SNPs. Compared with random selected genes, these genes had
   significantly shorter mean PPI distances to doxorubicin target genes,
   doxorubicin related enzymes/carriers/transporters, and
   doxorubicin-interacting genes ([98]Fig. 3A, P < 0.001). Next, we
   estimated these genes’ doxorubicin-response predictive power. We used
   all the genes’ expression as features to train a LDA classifier to
   distinguish patients from RD and pCR. The ROC-AUC was 0.738 (CI:
   0.736–0.741) ([99]Fig. 3B).

Fig. 3.

   [100]Fig. 3
   [101]Open in a new tab

   A: Violin plot of mean shortest PPI distances to doxorubicin target.
   Red bars represent the mean PPI distance of the cis-eQTL genes to
   doxorubicin target genes, doxorubicin related enzymes, doxorubicin
   related enzymes/carriers/transporters, and doxorubicin-interacting
   genes. Blue bars represent the mean shortest PPI distance of 10,000
   groups of randomly selected genes. B: ROC curve for the predictive
   performance of the LDA genomic pCR predictor with these eQTL genes as
   features. C: Volcano plot of univariate cox proportional hazards
   results of eQTL genes. Horizontal axis showed the univariate cox
   proportional hazards confidences and vertical axis showed the negative
   log of the P values. Doxorubicin-resistant genes are shown on upper and
   doxorubicin-sensitive genes are shown on below. D: Volcano plot of
   univariate cox proportional hazards results of PAS of enriched
   pathways. Horizontal axis shows the univariate cox proportional hazards
   confidences and vertical axis showed the negative log of the P values.
   Positive pathways are shown on upper and negative pathways are shown on
   below. E: Univariate cox proportional hazards confidences of 6 positive
   pathways had significant risk PAS and 7 negative pathways had
   significant protective PAS. (For interpretation of the references to