Graphical abstract graphic file with name ga1.jpg [39]Open in a new tab Keywords: SNP, Drug response, Breast cancer, Prognosis Abstract Detecting SNPs associated with drug efficacy or toxicity is helpful to facilitate personalized medicine. Previous studies usually find SNPs associated with clinical outcome only in patients received a specific treatment. However, without information from patients without drug treatment, it is possible that the detected SNPs are associated with patients’ clinical outcome even without drug treatment. Here we aimed to detect drug response SNPs based on data from patients with and without drug treatment through combing the cox proportional-hazards model and pairwise Kaplan-Meier survival analysis. A pipeline named Detection of Drug Response SNPs (DDRS) was built and applied to TCGA breast cancer data including 363 patients with doxorubicin treatment and 321 patients without any drug treatment. We identified 548 doxorubicin associated SNPs. Drug response score derived from these SNPs were associated with drug-resistant level (indicated by IC[50]) of breast cancer cell lines. Enrichment analyses showed that these SNPs were enriched in active epigenetic regulation markers (e.g., H3K27ac). Compared with random genes, the cis-eQTL genes of these SNPs had a shorter protein–protein interaction distance to doxorubicin associated genes. In addition, linear discriminant analysis showed that the eQTL gene expression levels could be used to predict clinical outcome for patients with doxorubicin treatment (AUC = 0.738). Specifically, we identified rs2817101 as a drug response SNP for doxorubicin treatment. Higher expression level of its cis-eQTL gene GSTA1 is associated with poorer survival. This approach can also be applied to identify new drug associated SNPs in other cancers. 1. Introduction Cancer is a group of diseases involving uncontrolled growth and spread of abnormal cells. Cancer incidence and mortality are rapidly growing. Much work is ongoing to achieve better treatment of this group of diseases. However, as an extremely heterogeneous condition, it is estimated that any particular class of drugs is ineffective in about 75% of cancer patients [40][1]. Therefore, how to realize personalized medicine (i.e., selecting optional therapy according to patients’ personal profile) remains a key challenge. It is reported that genetics account for 20%–95% of variability in drug disposition and effects [41][2]. Single nucleotide polymorphisms (SNPs) account for about 80% of the overall genomic heterogeneity [42][3]. Moreover, a number of pharmacogenomics studies have demonstrated that SNPs could influence the efficacy and side effects of drugs [43][4]. For anti-cancer drugs (e.g., irinotecan, mercaptopurine, 5-flurouracil, and tamoxifen), several SNPs have been reported to be associated with drug efficacy or toxicity [44][5], [45][6], [46][7]. These early studies generally focused on SNPs within pre-specified genes of interest, which might miss other potentially significant polymorphisms. Recently, with the advances in high throughput technologies, genome-wide association study (GWAS) provides a hypothesis-free approach to identify novel SNPs that are responsible for drug response. For example, Khan et al. [47][8] reported that the common SNP rs8113308 mapped to 19q13.41 was associated with reduced survival among endocrine treated breast cancer patients. Cairns et al. [48][9] identified a SNP in CSMD1 associated with breast cancer–free interval in a phase III randomized trial of anastrozole versus exemestane. Generally, these studies usually tested the association between SNPs and drug responses (e.g., recurrence-free survival) only in patients received a specific treatment. However, without information from patients without drug treatment, this design cannot discriminate drug response SNPs from clinical outcome associated SNPs. That’s, it is possible that the detected SNPs are associated with patients’ clinical outcome even without drug treatment. In this study, we aimed to find drug response SNPs based on data from patients with and without drug treatment. A pipeline named Detection of Drug Response SNPs (DDRS) was built through combing the cox proportional-hazards model and pairwise Kaplan-Meier survival analysis. Briefly, first, we fit a Cox’s proportional hazards model including three variates, including SNP, drug treatment status, and an interaction term between SNP and drug treatment in data including patients with and without drug treatment. SNPs with significant drug treatment and interaction term were remained. Second, Kaplan-Meier (KM) survival analysis were performed in patients with different genotypes to further generate the final set of SNPs. SNPs with significant KM results in patients without drug treatment were removed. To test the performance of DDRS, data of breast cancer patients from The Cancer Genome Project (TCGA) project were analyzed. 2. Methods 2.1. Pipeline of DDRS The outline of DDRS is shown in [49]Fig. 1. The input data of the study population included the genotype and patient’s survival information. Specifically, patients without drug treatment were also included. We used the patients’ survival status to indicate drug response since overall survival is believed as the primary end point to evaluate the outcome of any drug [50][10]. Overall survival has also been used as the endpoint to find drug response and toxicity loci in previous studies [51][11], [52][12]. First, a Cox’s proportional hazards model was built to detect significant SNP-drug interactions (P < 0.05 for [MATH: β3 :MATH] ). [MATH: ht,X=h0texpβ1×drug+β2×< mi>SNP+β3×< mi>drug×SNP :MATH] (1) Fig. 1. [53]Fig. 1 [54]Open in a new tab An overview of the approach. Step 1: We performed SNP × drug interaction analysis in all patients (including patients with drug treatment and patients without any drug treatment). SNPs with significant drug term (P < 0.05 for [MATH: β1 :MATH] ) and significant interaction term (P < 0.05 for [MATH: β3 :MATH] ) were remained. Step 2: For SNPs obtained from the first step, we performed Kaplan-Meier (KM) analysis in subjects with different genotypes to select SNPs associated with drug response in patients with drug treatment. KM analysis in patients without any drug treatment was also performed to remove the SNPs associated with survival but this association was not related to drug treatment. Step 3: We performed univariate cox proportional hazards analysis for each drug response SNPs get from the first two steps. The coefficients for all SNPs was used to calculate drug response score (DRS) in another population to test the performance of drug response prediction. We only focused on drugs with significant effect on survival (P < 0.05 for [MATH: β1 :MATH] ) and significant interaction terms (P < 0.05 for [MATH: β3 :MATH] ). Second, we performed KM survival analysis in patients with and without drug treatment separately to further select the SNPs related to drug response. For KM analysis, only subgroups with more than 20 patients (at least 5 patients with dead events) were remained for analysis. SNPs with FDR-adjusted P < 0.05 in the KM analysis of patients with drug treatment were selected. In addition, SNPs with P < 0.05 in the KM analysis of patients without drug treatment were removed to rule out the associations unrelated to drug treatment. Finally, to test the generalization ability of the selected SNP sets in other populations, we calculated the correlation between each SNP and survival in one population: [MATH: ht,X=h0texp(β×SNP) :MATH] (2) and constructed a drug response score (DRS) for each individual in another dataset as follows: [MATH: S=βi ×Xi :MATH] (3) where [MATH: βi :MATH] represents the coefficient of SNPi in Eq. [55](2) and [MATH: Xi :MATH] represents the tested allele’s copies of SNPi in Eq. [56](2). The correlation between S and drug response were then calculated to test whether the selected SNP sets could be used for survival prediction. 2.2. Genotype and clinical data collection processing Clinical information, drug treatment information, germline DNA genotypes for TCGA breast cancer samples (N = 1096) were obtained from the Genomic Data Commons Data Portal ([57]https://cancergenome.nih.gov/, GDC portal). The genotyping platform for all patients was the Affymetrix 6.0 array. Genotypes with score < 0.1 are considered to be highly confident (Broad institute, BIRDSUITE software) and 928,706 SNPs were retained in the study. 2.3. DRS and drug-resistant levels in cell lines We constructed DRSs for 45 cancer cell lines (Table S1) to test whether DRS can be used to predict drug response. Gene expression and drug response data for these cell lines were obtained from the Genomics of Drug Sensitivity in Cancer (GDSC) database [58][13]. We used IC[50] (half maximal inhibitory concentration) for cell apoptosis to indicate the drug-resistant level. Corresponding genotype data for all cell lines were obtained from the Gene Expression Omnibus (GEO) database ([59]GSE34211 and [60]GSE41308). We used R package crlmm [61][14] for genotype calling. 2.4. Functional annotation of the selected SNPs We annotated the epigenetic regulatory features for SNPs using data from the following resources: 1) we downloaded the 8 types of ChIP-seq data from the GEO database ([62]GSE85158 [63][15]) for 11 breast cancer cells. Sequencing reads were mapped to the human genome reference (hg19) using Bowtie2 [64][16] with default settings. MACS2 [65][17] was used to call peaks (-g hs -q 0.05 -n --keep-dup all). 2) we downloaded the predicted enhancer region for BRCA from CistromeCancer [66][18]. The genome coordinates were converted from hg38 to hg19 using liftover [67][19]. Next, we performed 10,000 permutation tests by using random SNP sets generated by SNPsnap [68][20] with default settings. To calculate the enrichment p-value, we simply count the number of sets with annotated SNPs as or more extreme than our selected SNP set, and divide that number by the total number (10,000). The enrichment score (ES) was calculated as follows: [MATH: ES=i=1< /mrow>10,000x0xi/10,000 :MATH] where [MATH: x0 :MATH] is the number of annotated SNPs from our selected SNP set, and [MATH: xi :MATH] is the number of annotated SNPs in the ith random SNP set. 2.5. eQTL analysis Gene expression data normalized using RSEM [69][21] and segmented copy number variation data were downloaded from Genome Data Analysis Center database (GDAC, [70]http://gdac.broadinstitute.org/). Absolute gene copy numbers were calculated from segmented copy number files by ABSOLUTE [71][22], and then used as covariate to adjust the gene expression data [72][23]. To remove the effect of population structure on gene expression, we used smartpca in the EIGENSOFT [73][24] program to perform principal component (PC) analysis, and selected the top 10 PCs from genome-wide genotype data as covariates. To remove the hidden batch effects and other potential confounders in the gene expression data, we also used the Probabilistic Estimation of Expression Residuals [74][25] (PEER) method to select the first 15 PEER factors as covariates. Age, pathologic stage and race were also used as covariates. EQTL analysis was performed using matrix eQTL [75][26]. SNPs with false discovery rates (FDR) < 0.05 were defined as eQTL genes. Cis-eQTL genes were defined if the SNP was located within 1 Mb from the gene transcriptional start site (TSS). 2.6. Extraction of drug target genes and protein–protein interactions (PPIs) We collected doxorubicin targets genes from the DrugBank database [76][27] and Therapeutic Target Database [77][28]. We also collected doxorubicin related enzymes, carriers and transporters from DrugBank. Genes directly interacted with doxorubicin (confidence score over 0.7) were collected from STITCH database [78][29]. To get PPIs, we downloaded the human interactome published by Cheng et al. [79][30], which collected 243,603 PPIs of 16,677 unique proteins. Next, we performed 10,000 permutation tests by randomly chose same number of genes from 16,677 proteins. To calculate the permutation p-value, we simply compared the mean shortest PPI distances from eQTL genes to doxorubicin and from randomly chosen genes to doxorubicin. We count the number of randomly chosen gene sets with shorter mean PPI distances than our eQTL gene set, and divide that number by the total permutation number (10,000) [80][30]. 2.7. Doxorubicin resistance prediction model We constructed multigene classifiers using these eQTL genes and Linear Discriminant Analysis (LDA) in [81]GSE20194, which has 230 patients received 6 months of preoperative chemotherapy including doxorubicin. 182 patients were categorized as residual invasive cancer (RD) and 48 patients were categorized as pathological complete response (pCR, no residual invasive cancer in the breast or lymph nodes). We performed 1000 times repeated 1-fold cross-validation. The classifier performance on the validation data were assessed by using the area under the receiver operating characteristic curve (ROC-AUC). 2.8. Pathway enrichment and pathway activity inference The eQTL genes were ranked according to the product of the β from eQTL analysis and the β from Cox analysis (i.e., β[cox] * β[eQTL]). For genes with more than one significant SNPs, the results with the maximum absolute value was remained. Pathway enrichment analysis for these genes with pre-ranked values was then performed using GSEA [82][31]. Positive pathways enriched at P < 0.05 were recognized as doxorubicin-resistant/risk pathways and negative pathways enriched at P < 0.05 were recognized as doxorubicin-sensitive/protective pathways. The pathways’ activity score (PAS) were calculated with diffrank [83][32]. 3. Result 3.1. Identification of doxorubicin response SNPs in breast cancer We applied this approach to TCGA breast cancer data to detect doxorubicin response SNPs. 363 patients with doxorubicin treatment and 321 patients without any drug treatment was included. Detailed characteristics of the patients are provided in [84]Table 1. Using the cox’s proportional hazards model, we identified 8020 SNPs that might be interacting with doxorubicin treatment. Next, KM analysis in patients with doxorubicin treatment showed that subjects with different genotypes of 619 SNPs had significant different survival status (FDR P < 0.05). In addition, using KM analysis in patients without any drug treatment, we further ruled out the SNP-survival associations not related to drug treatment and 71 SNPs were removed. Therefore, after analyzing with the DDRS pipeline, a total of 548 doxorubicin associated SNPs were identified with the detailed information summarized in Supplementary Table S2. Table 1. Pre-treatment characteristics of the patients. Patient with doxorubicin treatment Patients without any drug treatment All patients Number 363 321 684 Age <=50 173 (48%) 71 (22%) 244 (36%) >50 189 (52%) 240 (77%) 429 (64%) Mean (SD) 52 (10) 62 (14) 56 (13) Unknown 1 10 11 

 Nodal status Positive 238 (66%) 152 (48%) 390 (57%) Negative 122 (33%) 158 (49%) 280 (41%) Unknown 3 (1%) 11 (3%) 14 (2%) 

 T stage 1 73 (21%) 79 (25%) 152 (22%) 2 233 (64%) 177 (55%) 410 (60%) 3 52 (14%) 41 (13%) 93 (14%) 4 5 (1%) 22 (7%) 27 (4%) Unknown 0 (0%) 2 (0%) 2 (0%) 

 Pathologic stage 1 36 (10%) 59 (18%) 95 (14%) 2 213 (59%) 174 (54%) 387 (57%) 3 108 (30%) 67 (21%) 175 (26%) 4 2 (0%) 9 (3%) 11 (1%) Unknown 4 (1%) 12 (4%) 16 (2%) 

 ER Status Positive 244 (67%) 225 (70%) 469 (69%) Negative 106 (29%) 79 (25%) 185 (27%) Unknown 13 (4%) 17 (5%) 30 (4%) 

 PR Status Positive 209 (58%) 188 (59%) 397 (58%) Negative 139 (38%) 115 (36%) 254 (37%) Unknown 15 (4%) 18 (5%) 33 (5%) 

 HER2 Status Positive 45 (12%) 61 (19%) 106 (15%) Negative 207 (57%) 144 (45%) 351 (51%) Unknown 111 (31%) 116 (36%) 227 (34%) [85]Open in a new tab 3.2. DRS derived from selected SNP are associated with drug resistant level We calculated the DRS in the rest 269 TCGA breast cancer patients with other drug treatment. As shown in [86]Fig. 2A, compared with other patients, the basal like patients had the highest DRS. We also calculated doxorubicin-associated DRS in breast cancer cell lines ([87]Fig. 2B). These cells were divided into doxorubicin-resistant group (IC[50] higher than median value) and doxorubicin-sensitive group (IC[50] lower than median value). The DRS in the doxorubicin-resistant group was significantly higher than the doxorubicin-sensitive group (P = 0.044). Fig. 2. [88]Fig. 2 [89]Open in a new tab A: Boxplot of DRS of patients in different molecular subtypes. Patients were stratified into groups of Luminal A, Luminal B, HER2 and Basal like by the ER, PR and HER2 markers. B: Boxplot of DRS between doxorubicin-resistant cells (High IC[50]) and doxorubicin-sensitive (Low IC[50]) cells. C: Enrichment analysis heatmap plot of those significant SNPs’ epigenetic annotation. Enrichment score (ES) are color-coded from light to dark. 3.3. The selected SNPs are enriched in active regulatory epigenetic markers According to the genomic region annotation results, over 94% SNPs are located in the intergenic or intronic region. We further examined whether these doxorubicin-associated SNPs were associated with active epigenetic regulation using ChIP-seq data from 11 breast cancer cell lines (Table S1, [90]Fig. 2C). The results showed that 34% SNPs are located in active epigenetic mark regions of at least one breast cancer cell lines. Further analysis showed that these SNPs were significantly enriched in histone epigenetic regions associated with active enhancer (H3K27ac [91][33], H3K4me1 [92][34] and H4K8ac [93][35]), active gene transcription (H3K36me3 [94][36], H3K9ac [95][33] and H2BK120ub1 [96][37]) ([97]Fig. 2C) and also in predicted BRCA enhancer regions (P = 0.0008, 10,000 permutations; 1.46-fold). 3.4. The eQTL gene expression levels could be used to predict clinical outcome We identified 958 cis-eQTL genes for these 548 doxorubicin associated SNPs. Compared with random selected genes, these genes had significantly shorter mean PPI distances to doxorubicin target genes, doxorubicin related enzymes/carriers/transporters, and doxorubicin-interacting genes ([98]Fig. 3A, P < 0.001). Next, we estimated these genes’ doxorubicin-response predictive power. We used all the genes’ expression as features to train a LDA classifier to distinguish patients from RD and pCR. The ROC-AUC was 0.738 (CI: 0.736–0.741) ([99]Fig. 3B). Fig. 3. [100]Fig. 3 [101]Open in a new tab A: Violin plot of mean shortest PPI distances to doxorubicin target. Red bars represent the mean PPI distance of the cis-eQTL genes to doxorubicin target genes, doxorubicin related enzymes, doxorubicin related enzymes/carriers/transporters, and doxorubicin-interacting genes. Blue bars represent the mean shortest PPI distance of 10,000 groups of randomly selected genes. B: ROC curve for the predictive performance of the LDA genomic pCR predictor with these eQTL genes as features. C: Volcano plot of univariate cox proportional hazards results of eQTL genes. Horizontal axis showed the univariate cox proportional hazards confidences and vertical axis showed the negative log of the P values. Doxorubicin-resistant genes are shown on upper and doxorubicin-sensitive genes are shown on below. D: Volcano plot of univariate cox proportional hazards results of PAS of enriched pathways. Horizontal axis shows the univariate cox proportional hazards confidences and vertical axis showed the negative log of the P values. Positive pathways are shown on upper and negative pathways are shown on below. E: Univariate cox proportional hazards confidences of 6 positive pathways had significant risk PAS and 7 negative pathways had significant protective PAS. (For interpretation of the references to