Abstract Background Prioritization of breast cancer patients based on the risk of resistance to tamoxifen plays a significant role in personalized therapeutic planning and improving disease course and outcomes. Methods In this work, we demonstrate that a genome-wide pathway-centric computational framework elucidates molecular pathways as markers of tamoxifen resistance in ER+ breast cancer patients. In particular, we associated activity levels of molecular pathways with a wide spectrum of response to tamoxifen, which defined markers of tamoxifen resistance in patients with ER+ breast cancer. Findings We identified five biological pathways as markers of tamoxifen failure and demonstrated their ability to predict the risk of tamoxifen resistance in two independent patient cohorts (Test cohort1: log-rank p-value = 0.02, adjusted HR = 3.11; Test cohort2: log-rank p-value = 0.01, adjusted HR = 4.24). We have shown that these pathways are not markers of aggressiveness and outperform known markers of tamoxifen response. Furthermore, for adoption into clinic, we derived a list of pathway read-out genes and their associated scoring system, which assigns a risk of tamoxifen resistance for new incoming patients. Interpretation We propose that the identified pathways and their read-out genes can be utilized to prioritize patients who would benefit from tamoxifen treatment and patients at risk of tamoxifen resistance that should be offered alternative regimens. Funding This work was supported by the Rutgers SHP Dean's research grant, Rutgers start-up funds, Libyan Ministry of Higher Education and Scientific Research, and Katrina Kehlet Graduate Award from The NJ Chapter of the Healthcare Information Management Systems Society. Keywords: Therapeutic resistance, Pathway markers, Tamoxifen, Breast cancer __________________________________________________________________ Research in context. Evidence before this study Treatment resistance plays a central role in disease management and outcomes, especially for patients with oncologic malignancies. Several groups have studied response to tamoxifen in ER+ breast cancer, yet the identification of markers that accurately predict tamoxifen resistance remains limited. Added value of this study In this work, we derived a computational framework to predict treatment resistance to tamoxifen in ER+ breast cancer patients based on behavior of their molecular pathways, defined from changes in mRNA expression profiles. Our analysis identified five molecular pathways and their corresponding read-out genes that successfully predict risk to fail tamoxifen, as validated in two independent patient cohorts. These pathways have not been previously reported as associated with risk of tamoxifen resistance and provide an accurate estimate of response to tamoxifen, outperforming previously known resistance signatures. Implications of all the available evidence Our study derived a marker panel to identify patients at risk of primary tamoxifen failure. Together with other studies, our work builds a foundation for personalized therapeutic planning for patients with ER+ breast cancer patients. Alt-text: Unlabelled box 1. Introduction Despite recent advances in diagnosis, classification, and therapeutic management, breast cancer (BC) remains one of the leading causes of cancer-related death in women worldwide [[29][1], [30][2], [31][3]]. Nearly 70% of all diagnosed cases of breast tumors are estrogen receptor-positive (ER+) [[32]4,[33]5], making treatments that have anti-estrogen effects in the breast cells, such as tamoxifen, the standard-of-care for patients with ER+ breast cancers [[34]4,[35][6], [36][7], [37][8], [38][9]]. Despite the significant success of tamoxifen administration, nearly 30% of treated patients develop therapeutic resistance, ultimately leading to metastasis and lethality [[39]1,[40]10]. Therefore, prioritization of patients based on the risk of resistance to tamoxifen before treatment administration could play a significant role in personalize therapeutic planning for patients with ER+ breast cancer and builds a foundation to improve disease course and outcomes. Tamoxifen is a selective estrogen receptor modulator (SERM) and has agonist or antagonist activity depending on the tissue type [[41]11]. In the breast cells, tamoxifen directly binds to the ER, blocking estrogen from attaching to the receptor and thus inhibiting the activity of estrogen-regulated genes and causing the repression of estrogenic effects [[42]4,[43]5,[44]12,[45]13]. However, the emergence of alternative mechanisms of estrogenic stimulation has been shown to cause resistance to tamoxifen. For example, some studies have demonstrated that ER+ breast cancers that overexpress HER2 and EGFR can activate the components of downstream signaling pathways which then stimulate both ER and estrogen receptor co-activator AIB1, and thus induce the estrogen agonistic activity of tamoxifen in breast cancer cells [[46]14,[47]15]. Another study noticed that the increased expression of HER2 signaling can also downregulate progesterone receptor (PR) levels in the ER+ breast tumors, where losing the PR expression serves as a biomarker of hyperactive growth factor signaling, leading to another possible mechanism of tamoxifen resistance [[48]16]. Despite the emerging role of HER2 in tamoxifen resistance, it only accounts for 10% of ER+ breast cancers [[49]12,[50]17], suggesting more complex resistance mechanisms in these cases and presenting a central clinical problem for patients with ER+ breast cancer [[51]4,[52]5,[53]10,[54]12]. In recent years, several groups have developed gene expression signatures of tamoxifen response for ER+ patients, including 10 gene-signature by Men et al. [[55]18], 21 gene-signature by Paik et al. [[56]19] (known as Oncotype DX), and 2 gene-signature by Ma et al. [[57]20] While these signatures provide substantial advances to our understanding of individual genes involved in resistance, they do not yet capture the complex interplay between biological mechanisms that governs tamoxifen resistance. Here, we propose a pathway-centric computational framework to elucidate tamoxifen resistance and demonstrate that it outperforms known gene-based approaches. Advantages of our pathway-based approach lies in (i) its ability to identify a tightly connected cooperative group of genes unified by the same function [[58][21], [59][22], [60][23]]; (ii) studying molecular pathways, rather than individual genes, produces more reliable read-out outputs as they are less susceptible to experimental noise [[61]24]; (iii) pathway-level view enhances our understanding of the biological mechanisms related to disease and treatment response [[62][25], [63][26], [64][27], [65][28]]; and finally (iv) looking at alterations in biological pathways enhances the likelihood of identifying potential therapeutic targets to preclude or overcome resistance. In this work, we have established a systematic pathway-centric computational framework to elucidate molecular pathways as markers of tamoxifen resistance in ER+ breast cancer patients, which we call pathER. Through the analysis of pathway activity in each ER+ patient and their association with response to tamoxifen (Training cohort, n = 53), we identified five biological pathways as markers of tamoxifen resistance: Retrograde Neurotrophin Signalling, Loss of NLP from Mitotic Centrosomes, RNA Polymerase III Transcription Initiation from Type 2 Promoter, EIF2 pathway, and Valine, Leucine and Isoleucine Biosynthesis. We have demonstrated the ability of the identified five candidate pathways to predict the risk of tamoxifen resistance in two independent patient cohorts [[66]29] (Test cohort 1, n = 66: log-rank p-value = 0.02, accuracy of leave one out cross-validation (LOOCV) = 85.8%; Test cohort 2, n = 77: log-rank p-value = 0.01, accuracy of LOOCV = 82.5%) and their independence from known covariates, such as age, tumor grade, tumor size, lymph node status, and PR status, as the absence of PR in ER+ tumor can be an indicator of HER2 activation and an aggressive phenotype [[67]16] (Test cohort 1, adjusted hazard ratio = 3.11; Test cohort 2, adjusted hazard ratio = 4.24). Furthermore, we performed stratified Kaplan-Meier survival analyses on the PR+ and PR- patients as well as patients with Ki-67 low and Ki-67 high status and demonstrated the five candidate pathways can predict risk of resistance to tamoxifen in each PR and Ki-67 group with high accuracy. Importantly, as a negative control, we have demonstrated that the identified five candidate pathways did not classify patients simply based on the disease aggressiveness (log-rank p-value = 0.7, hazard ratio = 1.246) and that in fact pathways associated with disease aggressiveness do not overlap with the five candidate pathways. We have compared our method to other computational techniques to tackle treatment response, including Epsi et al. [[68]28] (which utilized extreme-responder analysis, using tails of the treatment response distribution to define a treatment response signature), Zhong et al. [[69]30] (which used Support Vector Machine approach as a base), Yu et al. [[70]31] (which uses random forest approach as a base), and mRNA data alone (without considering molecular pathways) and demonstrated that our method outperforms these techniques in predicting risk of resistance to tamoxifen. Furthermore, we have compared our pathway signature to other known signatures of tamoxifen response [71][18], [72][19], [73][20] and have shown the superiority of our pathway-based approach (adjusted hazard ratio = 3.11, hazard p-value = 0.0278). Finally, to enhance clinical applicability of our finding, we derived five read-out genes (each of which reflects activity changes in a corresponding pathway) and defined their treatment failure scoring system, indicating risk of developing tamoxifen resistance in two patient cohorts (Test cohort 1, adjusted hazard ratio = 3.1; Test cohort 2, adjusted hazard ratio = 6.95). Thus, we propose that the identified five candidate pathways and their read-out genes can potentially be used to prioritize patients who would benefit from tamoxifen treatment as their first-line therapy, and to identify patients at risk of tamoxifen resistance who should be offered an alternative regimen plan. 2. Materials and Methods 2.1. Patient cohorts utilized for study analysis All gene expression datasets of patients with ER+ breast cancer were obtained from publicly available GEO data repository [[74]32] from multi-institutional multi-PI comprehensive Loi et al. [[75]29] study [76]GSE6532 ([77]Supplementary Fig. 1, [78]Supplementary Table 1): (i) KIT-[79]GSE6532 utilized as a Training cohort; (ii) GUYT-[80]GSE6532, utilized as Test cohort 1; (iii) OXFT-[81]GSE6532, utilized as a Test cohort 2; and (iv) KIU-[82]GSE6532, utilized as a negative control cohort. Training cohort contains patient profiles of primary ER+ breast tumors (n = 57), archived at the Uppsala University Hospital (Uppsala, Sweden), profiled on Affymetrix Human Genome U133A array and Affymetrix Human Genome U133B array. Test cohort 1 contains patient profiles of primary tumors from patients with ER+ breast cancer (n = 70), archived at the Guy's Hospital (London, United Kingdom), profiled on Affymetrix Human Genome U133 Plus 2.0 Array. Test cohort 2 contains patient profiles of primary ER+ breast tumors (n = 77), archived at the John Radcliffe Hospital (Oxford, United Kingdom), profiled on Affymetrix Human Genome U133A, B array. Negative control cohort consists of not-treated patients with ER+ primary breast tumors (n = 51), profiled on Affymetrix Human Genome U133A, B array. All primary tumors samples in Training and Test cohorts were collected through surgery, diagnosed between 1980 and 1995 and received tamoxifen-only treatment for 5 years post-diagnosis as their adjuvant treatment. 2.2. Data normalization and filtering For each gene expression microarray dataset, a matrix of RMA (Robust Microarray Analysis) normalized signal intensity values was used [[83]29]. Using the most updated annotation file from GEO and the latest Affymetrix annotation files from Thermo Fisher database [[84]33], each probe set ID was annotated to gene ID; thereafter, probe IDs that annotated to different gene IDs or did not annotate to any gene ID were excluded. When multiple probe set IDs were mapped to the same gene, probes with the highest coefficient of variation (CV) were selected [[85]34, [86]35]. 2.3. Breast cancer molecular subtypes Gene expression classifier of the breast cancer subtypes (PAM50) was applied to assign breast cancer patients to one of the intrinsic molecular subtypes: luminal A, luminal B, HER2-enriched, triple-negative/basal-like, and normal-like [[87]36,[88]37]. The subtype classification of each patient was determined based on the closeness between the average expression profile of 50 genes in each subtype centroid and the corresponding gene expression pattern of patient tumor, where the distances were measured utilizing Spearman's rank correlation [[89]36]. We utilized genefu package in R, intrinsic.cluster.predict function with pam50 [[90]38] to assign subtype membership and eliminate samples with HER2-enriched, triple-negative/basal-like, and normal-like subtypes (i.e., non ER+). 2.4. Single-sample gene set enrichment analysis (ssGSEA) To estimate pathway enrichment in each ER+ patient, we performed single-sample (i.e., single-patient) pathway enrichment analysis, where standardized gene expression profile for each patient was used as reference and genes from each biological pathway were used as a query in an unweighted (i.e., each gene had the same weight) single-sample gene set enrichment analysis (ssGSEA) [[91]39,[92]40]. For such analysis gene expression values for each gene were transformed into standardized scores (i.e., z-scores) in order to bring the expression level into a common scale across all samples [[93]41,[94]42]. Z-score for each gene in each sample was computed by subtracting the average intensity of this gene across samples from the intensity of this gene in each sample and dividing it by the gene's standard deviation (SD), where mean and standard deviation were estimated for each gene across all samples [[95]41]. In this way, after such z-scoring, each gene's mean is standardized to 0 and standard deviation to 1. Ranked list of z-scores across all genes for a given sample then defines a single-sample (i.e., single-patient) signature, utilized as a reference for pathway enrichment analysis. To acquire a comprehensive list of pathways, we utilized MSigDB C2 pathway database [[96]43] which includes curated selections of 833 pathways obtained from human gene sets using Reactome [[97]44], KEGG [[98]45], and BioCarta databases. Reactome database is a curated resource that describes fundamental biological processes including cell signaling, metabolism, regulatory, and human diseases with particular focus on signaling and metabolic pathways derived from a wide list of biomedical experiments and literature references [[99]44,[100][46],