Abstract The aim of this study was to perform comparative analysis of multiple public datasets of gene expression in order to identify common genes as potential prognostic biomarkers. Additionally, the study sought to identify biological processes and pathways that are most significantly associated with early distant metastases (<5 years) in women with estrogen receptor-positive (ER+) breast tumors. Datasets from three published studies were selected for in silico analysis of gene expression profiles of ER+ breast cancer, using time to distant metastasis as the clinical endpoint. A subset of 44 differently expressed genes (DEGs) was found common to all three studies and characterized by mitotic checkpoint genes and pathways that regulate mitotic spindle and chromosome dynamics. DEG promoter regions were enriched with NFY binding sites. Analysis of miRNA target sites identified significant enrichment of miR-192, miR-193B, and miR-16-1 targets. Aberrant mitotic regulation could drive increased genomic instability leading to a progression towards an early onset metastatic phenotype. The relative importance of mitotic instability may reflect the clinical utility of mitotic poisons in metastatic breast cancer, including poisons such as the taxanes, epothilones, and vinca alkaloids. Keywords: estrogen receptor alpha-positive, mitotic checkpoint signaling, mitotic regulation network, microRNA targets, early distant metastasis Introduction As the second-leading cause of cancer-related death in women, breast cancer is a serious public health problem. The American Cancer Society reported over 228,000 new cases of breast cancer in 2012 and nearly 40,000 breast cancer related deaths in women per year in the United States.[35]^1 A significant majority (∼70%) of these women have a tumor that is estrogen receptor alpha-positive (ER+).[36]^2 While we have made significant progress in our understanding of primary breast tumors, we lack sufficient knowledge and understanding of distant metastasis.[37]^3 Gene expression analysis and other molecular profiling tools are revolutionizing our understanding of all cancers.[38]^4 Multiple groups have used these technologies to construct specific gene signatures that aim to identify patients with ER+ breast cancer and thus who are at risk for developing resistance to endocrine or anti-hormonal therapy.[39]^5^–[40]^8 However, the accuracy, sensitivity, and specificity of these signatures have been difficult to validate in follow-up studies.[41]^9 Tumor sample and patient population heterogeneity could certainly be confounding factors, but arguably the main complication is the unique challenge of analyzing large-scale, high-dimensional genomic and/or proteomic metadata in a biologically meaningful fashion.[42]^10 Most importantly, beyond the identification of classifiers, the field lacks a clear path along which to move or progress. For these profiles and signatures to be truly useful to the thousands of women diagnosed with ER+ breast cancer every year, we must also use them to understand cancer biology at the functional (mechanistic) and phenotype levels and to inform the design of novel therapeutic agents. There are several reasons for revisiting published studies of gene expression profiling in cancer and for conducting additional comparative analyses. Global profiling of gene expression provides comprehensive information on genome-wide transcriptome alterations in tumor samples and any individual study rarely utilizes all available measurements. This rich resource of publicly available data is simply not being fully utilized for downstream analysis. It is also increasingly evident that the specific gene expression signatures reported in many published studies are not reproducible.[43]^10 However, the underlying biological processes, pathways, and functional categories by which these genes are annotated appear to be much more common; therefore, downstream systems biology analysis is becoming increasingly more important. Fortunately, computational tools and functional annotation resources are constantly being improved and updated, providing new opportunities for deeper analysis and biological interpretation of expression profiling results. Over the past three years we have developed a Georgetown Database of Cancer (G-DOC) where many public and clinically relevant molecular profiling studies are being stored. The database has a major focus on breast cancer.[44]^11 G-DOC is a publicly available Web platform that enables translational cancer research by integrating patient characteristics and clinical outcome data with a variety of high-throughput research data, all in a unified environment. G-DOC includes a broad collection of bioinformatics and systems biology tools for the analysis and visualization of four major “omics” types: DNA, mRNA, microRNA, and metabolites. By establishing a standard uniform data processing pipeline and robust quality control of the data, we have accumulated more than 20 breast cancer studies that are processed in a similar manner, allowing for cross study comparisons and further analysis. Currently, G-DOC contains data from more than 3600 breast cancer cases and 1700 gastrointestinal cancer cases. The three studies presented here are also available as part of the G-DOC collection. Given the prevalence (∼70% of cases diagnosed annually),[45]^2 unique biology, and long-term recurrence profile of ER+ breast cancer, we focused our efforts on the development of prognostic biomarkers of early distant metastases (<5 years) in women with this type of breast cancer. After breast-conserving surgery and appropriate radio and systemic therapy (endocrine and/or conventional chemotherapy), the risk of distant metastasis in ER+ breast cancer patients peaks between 2 and 3 years post-diagnosis.[46]^12 Risk gradually decreases after this point but reaches a plateau at 5 years; no further reduction in risk is observed, even at 10 years post-diagnosis. This is in stark contrast to women with ER− breast tumors; these women show a much higher risk of distant metastasis at 2 years post-diagnosis, but from 5 to 10 years post-diagnosis (and presumably, beyond) are actually less likely to present with distant metastasis when compared to their counterparts that have ER+ breast cancer. Mortality rates for ER+ and ER− breast cancer are broadly consistent with this trend; risk of death for women with ER− breast cancer becomes lower than that for women with ER+ breast cancer at 6 years post-diagnosis.[47]^12 Given these data, we proposed two hypotheses. Firstly, we hypothesize that primary tumors from women with ER+ breast cancer who go on to develop early distant metastasis are characterized by molecular functions and biological processes distinct from those in women with ER+ breast cancer who do not. Secondly, we hypothesize that key components of these signaling networks are amenable to established or emerging targeted therapeutic agents. Our discovery of a common network of overexpressed and highly interconnected genes controlling the entire sequence of mitotic events in the three studies we examined of patients with recurrence within 5 years of initial diagnosis provides insight into potential large scale deregulation of the mitotic machinery that is associated with, and might even facilitate, a progression toward early metastasis. Methods Study selection Datasets from three previously published studies (described below) were selected for in silico identification of a gene expression signature for ER+ breast cancer, early distant metastasis. Patient groups were defined as either (a) ER+ breast cancer patients with no documented distant metastasis at >5 years (Group 1, no met) or (b) ER+ breast cancer patients with documented distant metastasis at ≤5 years (Group 2, met). All data were derived from surgical specimens of ER+ primary breast tumors arrayed on Affymetrix U133A GeneChips. Patient groups consisted of the Loi dataset (two studies), the Sotiriou dataset, and the Desmedt dataset. Additionally, the Schmidt dataset acted as a Validation Set. The Loi datasets consisted of two studies.[48]^8 Set A consisted of 255 patients with early-stage (I,II) ER+ and/or progesterone receptor positive (PR+) breast cancer that received either Tamoxifen or no systemic adjuvant treatment. Set B consisted of 355 patients with early-stage (I,II) ER+ breast cancer that received either Tamoxifen or no systemic adjuvant treatment. Raw data were obtained from GEO (accession number [49]GSE6532). Patients in both studies/datasets had a mix of lymph node-negative and -positive disease. The Sotiriou dataset[50]^13 dataset consisted of samples from 189 patients with primary operable invasive breast cancer. The training set KJX64 contained data from 64 women with ER+ primary breast cancer who received either Tamoxifen or no systemic adjuvant treatment; the validation set KJ125 contained data from 125 ER+ and ER− breast tumor samples. Only ER+ patients were used in our analysis. No patient in the KJ125 validation set received any adjuvant systemic therapy, and patients in both sets had a mix of lymph node-negative and -positive disease. Raw data were obtained from GEO (accession number [51]GSE2990). The Desmedt dataset[52]^14 consisted of 198 samples from women with lymph node-negative ER+ and ER− breast cancer who received no systemic therapy; only ER+ patients were used in our analysis. Raw data were obtained from GEO (accession number [53]GSE7390). Upon completion of our comparative analysis of 3 breast cancer studies we identified an additional published breast cancer dataset that was used as a “validation” dataset (Schmidt dataset).[54]^15 The main purpose of this additional analysis was to check if other studies based on distant metastasis outcome contain similar patterns of gene expression and if similar functional groups and pathways could be identified as most significantly affected in this additional dataset. This additional dataset consisted of 200 samples from women with lymph node-negative ER+ breast cancer who received no systemic therapy. Raw data were obtained from GEO (accession number [55]GSE11121). Data processing and gene selection A data processing pipeline was established for uniform processing, normalization, and in silico analysis of raw gene expression data. Data normalization and quality control scripts were written in R with tools from the open source software package Bioconductor. Data processing for each dataset was conducted using this standard pipeline, including background correction, normalization with RMA (Robust Multichip Average),[56]^16 and Median Polish[57]^17 followed by robust quality control. Group comparison was performed using LIMMA[58]^18 with multiple testing correction based on Benjamini and Hochberg false discovery rate estimates (Bioconductor).[59]^19 From each dataset, we identified differentially expressed genes with a greater than 1.0 fold change and adjusted P-values P ≤ 0.05. Systems biology analysis These gene sets were further analyzed using several systems biology tools in order to obtain biologically relevant functional annotations. The systems biology analysis pipeline included a combination of original methods developed by our team, open source software, and proprietary software. For each study, we conducted functional profiling using Gene Ontology Enrichment, Pathway Analysis Enrichment using multiple pathway databases, Gene Set Enrichment Analysis, and Regulatory Subnetwork Enrichment (Pathway Studio 9), as well as upstream and downstream common regulators analysis (Pathway Studio 9). These methods were also applied to a list of DEGs common to all three studies. This common subset of DEGs was further analyzed using network analysis based on known protein-protein interaction databases (Reactome,[60]^20 STRING[61]^21), as well as Ingenuity Knowledgebase and Ingenuity Pathways Analysis (IPA) 8.6 (Ingenuity Systems, [62]www.ingenuity.com). Parameter settings for all of the enrichment analyses—including GO ontology enrichment, pathway enrichment, and subnetwork enrichment—were the same with P-values generated based on Fisher’s exact test and using a P-value threshold of 0.05 (P ≤ 0.05). Statistical analysis Statistical analysis of breast cancer clinical data was performed using Prism 5.0c for Mac (GraphPad Software, San Diego, CA). Survival functions for all 3 studies were estimated by Kaplan-Meier analysis and compared using the Mantel-Cox log rank test. For the Loi and Sotiriou datasets, differences in the distribution of lymph node (LN) status (LN+ vs. LN−) and adjuvant therapy (Tamoxifen treatment vs. no systemic therapy) between Groups 1 and 2 were assessed by Fisher’s exact test or χ^2 analysis; this analysis was not performed on the Desmedt dataset because all patients were untreated and LN−. Differences in tumor size and age between Groups 1 and 2 were determined by two-tailed, unpaired t-test (parametric) or Mann-Whitney rank sum test (non-parametric), as appropriate. For all three studies, P ≤ 0.05 was considered statistically significant. Results Comparative analysis of three breast cancer studies To identify differentially expressed genes that are characteristic of early distant metastasis in ER+ breast cancer, we selected three publicly available datasets with sufficient length of follow-up to perform a meaningful comparison between those patients with documented distant metastasis within 5 years and those with no distant metastasis within 5 years. The Loi, Sotiriou, and Desmedt studies[63]^8^,[64]^13^,[65]^14 fulfill this criterion and show no significant difference in overall distant metastasis (DM)-free survival proportions ([66]Additional file 1). To identify potential sources of bias between patient groups, we examined four clinical parameters that could serve as independent poor prognostic factors for distant metastasis, independent of gene expression signature(s): primary tumor size, age, lymph node status, and adjuvant therapy.[67]^22 Age, nodal status, and the distribution of Tamoxifen-treated and untreated patients are equivalent (data not shown), but primary tumor size was significantly larger in Group 2 patients (distant metastasis within 5 years) than in Group 1 patients (no distant metastasis after 5 years) for all three studies ([68]Additional file 2). Enrichment analysis of differentially expressed genes indicates common mitosis-related processes affected in all 3 studies We next identified statistically significant, differentially expressed genes (DEGs; fold change > 1.0 and P ≤ 0.05) for each of the three studies ([69]Additional file 3). For each set of DEGs, functional profiling using Gene Ontology Enrichment was performed. Strikingly, many of the top-ranked functional categories from each study are related to specific mitotic events and mechanisms. For example, in the Loi study these categories included kinetochore and spindle assembly processes, checkpoint regulation, spindle organization, positive regulation of exit from mitosis, and other categories related to mitotic events. We then used Gene Set Enrichment Analysis (GSEA) to query a large collection of databases including Reactome, Biocarta, KEGG, and Pathway Commons; we identified additional functional categories relevant to cell cycle checkpoints. In particular, GSEA results for the Loi study include the role of Ran in mitotic spindle regulation and genes involved in prometaphase ([70]Table 1). When compared, these three studies had the same top-ranked GO categories statistically significantly enriched for cellular processes associated with mitosis ([71]Fig. 1). More detailed analyses were performed on a subset of 44 DEGs (39 up-regulated, 5 down-regulated) common to all three studies (intersection gene set, [72]Table 2). The expression pattern of this intersection gene set (IGS) was consistent across all 3 studies with almost all 44 genes showing over-expression in early metastasis group ([73]Fig. 2). These 44 genes were functionally annotated using Ingenuity Knowledge Base ([74]Additional file 4). For almost all of the genes, we found that their annotations include associations with mitotic checkpoints and/or specific mitotic events. However, since these genes exhibit similar patterns of co-expression in all three studies, it is of considerable interest to analyze these genes as a group to better understand the underlying subnetwork structure of this gene set and their collective involvement in regulation of specific mitotic events. Table 1. Gene set enrichment analysis results for Loi studies. Gene set name Description # genes in overlap P-value REACTOME_CELL_CYCLE_MITOTIC Genes involved in cell cycle, mitotic 56 4.45 e^−6 REACTOME_MITOTIC_M_M_G1_PHASES Genes involved in mitotic M-M/G1 phases 33 2.57 e^−5 REACTOME_MITOTIC_PROMETAPHASE Genes involved in mitotic prometaphase 22 7.84 e^−5 REACTOME_G2_M_CHECKPOINTS Genes involved in G2/M checkpoints 13 1.95 e^−4 REACTOME_E2F_MEDIATED_REGULATION_OF_ DN_DNA_REPLICATION Genes involved in E2F mediated regulation of DNA replication 11 2.32 e^−4 KEGG_CELL_CYCLE Cell cycle 26 3.25 e^−4 REACTOME_E2F_TRANSCRIPTIONAL_TARGETS_ AS_AT_G1_S Genes involved in E2F transcriptional targets at G1/S 8 8.71 e^−4 REACTOME_G1_S_TRANSITION Genes involved in G1/S transition 21 9.9 e^−4 REACTOME_CELL_CYCLE_CHECKPOINTS Genes involved in cell cycle checkpoints 22 1.14 e^−3 BIOCARTA_RANMS_PATHWAY Role of Ran in mitotic spindle regulation 5 1.64 e^−3 [75]Open in a new tab Figure 1. [76]Figure 1. [77]Open in a new tab Results of GO enrichment analysis showing top-ranked biological processes enriched with DEGs for the three studies. Table 2. List of 44 genes common to all three studies with expression change indicated up/down. Loi_Desmedt_Sotiriou Expression AURKA Up BRCA1 Up BUB1B Up CCNB1 Up CCNB2 Up CDC25C Up CDCA3 Up CDKN3 Up CENPA Up CSE1L Up DIXDC1 Down DSCC1 Up DUSP4 Down ESPL1 Up FANCI Up GINS2 Up GNG12 Down HMMR Up HNRNPAB Up KIF15 Up KIF18A Up KIF18B Up KIF4A Up KIFC1 Up LRRC59 Up MCM2 Up MELK Up NEK2 Up PDLIM4 Down PLK1 Up PRC1 Up PTTG1 Up RACGAP1 Up SPAG5 Up TIMELESS Up TMEM106C Up TMPO Up TOP2A Up TPX2 Up TRIP13 Up TROAP Up UBE2C Up UBE2S Up ZFP36L2 Down [78]Open in a new tab Figure 2. [79]Figure 2. [80]Open in a new tab Pathway diagrams of intersection DEGs for each of the three studies. Notes: Color of the nodes corresponds to a level of expression as fold change. Red—overexpression; Blue—downregulation, Grey nodes represent other interacting proteins with no significant change. Pathway and subnetwork enrichment analysis using multiple systems biology tools We took several different approaches to both better understand the functional commonalities between members of the IGS and to determine which gene changes are correlated with specific molecular events. The first was to conduct a comprehensive functional annotation and profiling of this cluster of differentially expressed genes by applying enrichment analysis similar to the analysis used for each dataset. We have extended this type of analysis to a variety of experimentally derived functional categories of genes/proteins such as GO biological processes and a common collection of canonical signaling and regulatory pathways. Using GO enrichment analysis, we found a large number of mitosis related processes that were significantly enriched with this 44 gene IGS. The top 20 categories were all related to specific mitotic events/processes and are shown in [81]Table 3. Additional enrichment analysis of IGS was conducted using two independent commercial knowledge bases developed by Ariadne Genomics and Ingenuity Systems Inc. The results obtained with both knowledge bases were consistent with each other and similar to previous GO enrichment analysis. Due to the difference in content of these two knowledge bases, some pathways are present in only one of the databases. For example, the Ingenuity analysis showed that a top ranked pathway was Polo-like kinase 1 (PLK1) in regulation of mitosis ([82]Fig. 3), while it is not present in the Pathway Studio pathway collection. Overall, we found multiple top ranked categories related to biological function and regulatory pathways of mitotic checkpoints using both Ariadne and Ingenuity databases. Table 3. Top 20 biological processes as a result of GO enrichment analysis for 44 DEGs from intersection set. Name Total entities Overlap Percent overlap Overlapping entities P-value Cell cycle 604 23 3 PLK1, PTTG1, BRCA1, CDC25C, BUB1B, CCNB2, AURKA, UBE2S, TPX2, SPAG5, UBE2C, KIFC1, MCM2, RACGAP1, PRC1, NEK2, FANCI, TIMELESS, DSCC1, CDCA3, CDKN3, DIXDC1, KIF18B 5.23E-29 Cell division 336 18 5 PLK1, PTTG1, CCNB1, CDC25C, BUB1B, CCNB2, AURKA, UBE2S, TPX2, SPAG5, UBE2C, KIFC1, RACGAP1, PRC1, NEK2, TIMELESS, CDCA3, KIF18B 4.96E-25 Mitosis 252 15 5 PLK1, PTTG1, CDC25C, BUB1B, CCNB2, AURKA, TPX2, SPAG5, UBE 2C, KIFC1, NEK2, TIMELESS, CDCA3, KIF15, KIF18B 1.72E-21 Mitotic cell cycle 316 14 4 PLK1, PTTG1, CCNB1, CDC25C, BUB1B, CCNB2, AURKA, UBE2C, MCM2, CENPA, NEK2, GINS2, KIF18A, KIF18B 2.98E-18 Anaphase-promoting complex-dependent proteasomal ubiquitin-dependent protein catabolic process 90 7 7 PLK1, PTTG1, CCNB1, BUB1B, AURKA, UBE2S, UBE2C 2.99E-11 Microtubule-based movement 121 6 4 KIFC1, RACGAP1, KIF4A, KIF18A, KIF15, KIF18B 1.31E-08 Spindle organization 22 4 18 BUB1B, AURKA, SPAG5, UBE2C 1.95E-08 Cell cycle checkpoint 141 6 4 CCNB1, CDC25C, BUB1B, CCNB2, UBE2C, MCM2 3.28E-08 Chromosome segregation 75 5 6 PTTG1, ESPL1, BRCA1, TOP2A, NEK2 5.26E-08 Phosphatidylinositol-mediated signaling 77 5 6 BUB1B, AURKA, SPAG5, UBE2C, TOP2A 6.00E-08 Mitotic prometaphase 90 5 5 PLK1, CCNB1, BUB1B, CENPA, KIF18A 1.32E-07 DNA replication 179 6 3 BRCA1, CDC25C, MCM2, TOP2A, GINS2, DSCC1 1.35E-07 M phase of mitotic cell cycle 96 5 5 PLK1, CDC25C, BUB1B, CENPA, KIF18A 1.82E-07 G2-M transition of mitotic cell cycle 116 5 4 PLK1, CCNB1, CDC25C, CCNB2, NEK2 4.69E-07 Mitotic sister chromatid segregation 14 3 21 ESPL1, KIFC1, NEK2 7.91E-07 Cytokinesis 60 4 6 PLK1, ESPL1, RACGAP1, PRC1 1.25E-06 Cell proliferation 429 7 1 PLK1, CDC25C, BUB1B, TPX2, ZFP36L2, CSE1L, KIF15 1.46E-06 Free ubiquitin chain polymerization 2 2 100 UBE2S, UBE2C 1.73E-06 Regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle 77 4 5 PLK1, CCNB1, BUB1B, UBE2C 3.42E-06 Homologous chromosome segregation 4 2 50 PTTG1, ESPL1 1.04E-05 [83]Open in a new tab Figure 3. [84]Figure 3. [85]Open in a new tab Results of pathway enrichment analysis (Ingenuity Pathway Analysis 8.5) for 44 intersection genes from the three studies. Note: Top ranked significantly enriched group shows four pathways related to Polo-like Kinase 1 regulation of mitotic events. Subnetwork enrichment analysis (pathway studio) The IGS showed very similar clusters of overexpressed genes downstream of a small number of key regulators of mitosis. For example, in each study PLK1 and CCNB1 (cyclin B1) nucleate a series of highly overlapping subnetworks of uniformly over-expressed genes ([86]Fig. 4). Analysis of subnetwork enrichment of gene sets associated with known cell processes (Pathway Studio) has shown similar results with all top ranked processes related to specific mitosis-related processes ([87]Additional file 5). For instance gene sets from the three top ranked cellular processes (spindle assembly, chromosome segregation, and genome instability) were significantly enriched with genes from the IGS ([88]Fig. 5). Figure 4. [89]Figure 4. [90]Open in a new tab Subnetwork enrichment analysis results for upstream regulators of expression of 44 intersection genes. Notes: Top two regulators are shown as nodes with downstream interactions indicated by edges. Colors of the nodes correspond to a level of expression as fold change. Red—overexpression. Color of the edges corresponds to a type of regulation of expression: Green—known positive regulation; Red—known negative regulation; Dark Grey—direct regulation with unknown effect; Light Grey—general regulation with unknown effect; Yellow—protein modification; Arrows indicate upregulation; Blunt ends indicate downregulation. Figure 5. [91]Figure 5. [92]Open in a new tab Subnetwork enrichment analysis of downstream cellular processes regulated by 44 genes from intersection list. Note: Top three enriched cellular processes with genes regulating these processes shown as nodes connected to cell processes by edges. Transcription factor regulation Additional analysis was performed to determine whether there are functional modules among these 44 genes that are sharing common transcription factor (TF) regulation. TF regulation was determined by applying transcription factor analysis in Ingenuity IPA and subnetwork analysis of common expression regulators (Pathway Studio, Ariadne). Several TFs were identified as having their target genes enriched with genes from the intersection gene set. The top 10 TFs ranked by significance of subnetwork enrichment include E2F4 as the top ranked regulator of expression, followed by FOXM1 ([93]Table 4). Additional promoter enrichment analysis performed by directly querying Molecular Signatures Database (MSigDB)[94]^23 indicated that promoter regions of 14 of the 44 genes contain at least one consensus binding site for the transcription factor NFY (P = 0.009); serum response factor (SRF) and E2F sites were also significantly enriched (P < 0.05). Table 4. Top 10 transcription factors ranked by significance of subnetwork enrichment among 44 genes. Expression Targets of neighbors Total # of overlap Percent overlap Gene set seed Overlapping entities P-value E2F4 68 6 8 E2F4 PLK1, CCNB1, BRCA1, CCNB2, MCM2, UBE2C 5.26E-08 FOXM1 133 7 5 FOXM1 PLK1, CCNB1, CDC25C, AURKA, CCNB2, NEK2, CENPA 1.26E-07 E2F1 243 8 3 E2F1 PTTG1, CCNB1, BRCA1, CDC25C, RACGAP1, MCM2, MELK, DUSP4 5.15E-07 BRCA1 83 5 5 BRCA1 PLK1, CCNB1, BRCA1, BUB1B, NEK2 5.04E-06 TP53 720 11 1 TP53 PTTG1, PLK1, CCNB1, BRCA1, CDC25C, HMMR, CCNB2, BUB1B, TOP2A, DUSP4, PRC1 5.40E-06 MYBL2 40 4 9 MYBL2 PLK1, CCNB1, TOP2A, UBE2C 6.75E-06 MASTL 2 2 66 MASTL CCNB1, BUB1B 2.71E-05 CCAAT factors 199 6 3 CCAAT factors PTTG1, PLK1, CCNB1, CDC25C, CCNB2, TOP2A 2.73E-05 E2F6 22 3 13 E2F6 CCNB1, BRCA1, CCNB2 4.41E-05 CDC5L 3 2 50 CDC5L PLK1, CDC25C 5.41E-05 [95]Open in a new tab microRNA target gene analysis Analysis was done to determine groups of genes from IGS that are collectively targeted by specific microRNAs. The IGS of 44 genes was analyzed using a similar subnetwork enrichment methodology applied to subnetworks of microRNA target genes. An intersection of target predictions by at least two of three prediction algorithms (TargetScan, Miranda and Pictar)—combined with a large set of experimentally validated targets—was used to identify subnetworks of co-regulated genes in PS Ariadne. A large number of additional experimentally validated targets were derived from Tarbase 6.0,[96]^24 uploaded into Pathway Studio, and included in the enrichment analysis together with computationally predicted targets. Several connected subnetworks of microRNAs and their targets were identified with 29 genes targeted by 12 microRNAs ([97]Fig. 6A), which represents a highly interconnected mitotic regulation network of miRNA and their targets. Importantly, a majority of microRNA-target interactions in this network were identified as experimentally validated according to Tarbase 6.0 with supporting evidence from published studies. The exceptions are for 3 interactions: miR-132 with BRCA1; miR-27B with NEK2; miR-98 with DUSP4. These 3 predicted interactions are shown as dotted lines in grey color on [98]Figure 6A. Noticeably, BRCA1 is targeted by most the microRNAs (eight microRNAs; all but one (miR-132) are experimentally validated); miR-192 targets the largest number of genes from IGS (11 genes, all interactions are experimentally validated according to Tarbase 6.0). Figure 6. [99]Figure 6. [100]Open in a new tab Subnetwork enrichment analysis of upstream transcriptional and post-transcriptional regulators. (A) 12 microRNAs (yellow nodes) were significantly enriched with targets (experimentally validated and/or predicted) among 44 genes (red nodes) from the intersection list. Only 3 interactions in this network are computationally predicted: miR-132 with BRCA1; miR-27B with NEK2; and miR-98 with DUSP4. These 3 predicted interactions are shown as dotted lines in grey color. All other interactions in this network are experientially validated (Tarbase 6.0) and shown as blue dotted lines. (B) Top three transcription factors regulating microRNA regulatory subnetwork (based on enrichment analysis of known transcription factors). Common transcription regulators of miRNAs and their targets Since microRNA and their target genes might have some common upstream regulators of expression, an additional round of enrichment analysis was performed to determine common transcription factors that could control expression of this subnetwork of microRNAs and their targets. This analysis was based on experimental evidence collected within the Resnet database (Pathway Studio). The top ranked TF was TP53, which has known and validated interactions with several genes and microRNAs from our regulatory module. The top three TFs appear to regulate a significant part of the miRNA-gene mitotic regulation network ([101]Fig. 6B, TFs are highlighted in blue, edges with arrows indicate transcriptional activation, edges with blunt ends indicate transcriptional inhibition). Protein interaction analysis We next sought to understand how the protein products of these IGS genes might physically and/or functionally interact, using Reactome and STRING knowledge bases. Consistent with other analyses, Reactome revealed significant enrichment in multiple categories related to mitotic signaling, genome/chromosome stability, and DNA replication and repair ([102]Additional file 6), with an even greater number of mitotic events enriched with genes from a 44 gene intersection set (75 categories out of 81). Using STRING we identified a large association network with highly interconnected nodes, including all but seven proteins encoded by members of the 44-gene IGS. The large fraction of associations between the proteins is based on evidence of co-expression or co-occurrence, while a smaller subnetwork of connections linking ∼25% of the genes in our intersection list is based on direct protein-protein interactions, including CCNB1, PLK1, AURKA, and UBE2C ([103]Fig. 7A). Using STRING functionality, we generated a protein co-expression matrix for the entire set of 44 genes ([104]Fig. 7B). Importantly, co-expression evidence collected in the STRING knowledgebase for homo sapiens shows that the majority of proteins are co-expressed with high confidence across multiple experimental settings and various human tissues. Co-expression matrices also show discrete clusters of co-expressed proteins implying several co-regulated, functional modules ([105]Fig. 7B). Figure 7. [106]Figure 7. [107]Open in a new tab (A) STRING-generated protein-protein association networks within the 44 gene intersection list. (B) STRING-generated protein co-expression map (Homo sapiens) for 44 gene intersection list. Downstream regulation network analysis Based on the enrichment analysis of common regulators of downstream cellular processes, it is evident that the 44 differentially expressed genes are involved with regulation of an unusually large number of specific biological events and processes that are intrinsic to mitosis. Cell progression through the consecutive stages of mitosis involves a sophisticated and redundant network of highly interactive genes. Currently, this elaborate network of regulatory interactions is not well understood and poorly annotated at the level of pathways. Using computational tools provided by Pathway Studio we combined our enrichment findings for individual cellular processes and attempted to construct a crude, granular level representation of mitotic pathways that include approximately 20 specific mitotic events. These events include the initiation of mitosis, multiple intermediate steps associated with mitotic spindle formation, and final steps of chromosome segregation and cytokinesis ([108]Fig. 8). Twenty-nine genes from the intersection gene set were found to be directly associated with at least two mitotic processes. We have conditionally grouped these genes in two categories: major mitotic regulators that are involved with more than five mitotic processes, and those with five or less. Interestingly, by overlaying our findings regarding enrichment of disease phenotypes (presented below), we found that 9 out of 10 metastasis-associated genes are in the major mitotic regulators category ([109]Fig. 8, nodes are highlighted with green color). Figure 8. [110]Figure 8. [111]Open in a new tab Pathway Studio analysis showing top 20 cellular processes all related to consecutive steps of progression through mitosis. Notes: Nodes representing genes from the 44 intersection gene list are grouped according to number of downstream processes they are associated with: outside circle of nodes includes all nodes with <5 downstream processes; group of nodes in the middle includes highly interactive nodes with >5 processes. Genes that are significantly associated with metastasis phenotype (based on published data) are highlighted with green color. To understand cellular processes correlated with disease phenotype, a subnetwork analysis was done for enrichment of the disease phenotype gene sets (Ingenuity, Ariadne). We found that all significantly enriched disease-related gene sets represent phenotypes associated with cancer in general; these are also associated with several specific types of cancer that includes breast cancer and colorectal cancer ([112]Fig. 9). Interestingly, the top three ranked categories were aneuploidy, polyploidy, and tetraploidy, a fact supported by a large number of published evidence that connects 20 out of 44 genes to ploidy related phenotypes. One of the significantly enriched disease categories was related to a phenotype of metastasis with 10 out of the 44 genes overlapping in this category ([113]Fig. 9, represented by green circles). All of these genes belong to highly interconnected hubs within the 44 gene network and all are directly involved with regulation of multiple mitotic events. Figure 9. [114]Figure 9. [115]Open in a new tab Enrichment analysis of disease/phenotype categories. Notes: Groups of genes (red nodes) from 44 gene intersection list were identified that are overrepresented among proteins associated with various disease phenotypes. Top 10 significantly overrepresented disease categories are shown as rectangular nodes (blue color). Genes that are significantly associated with metastasis phenotype (based on published data) are highlighted with green color. We have also examined the intersection DEG list for genes whose protein products are implicated in response to either conventional cytotoxic or novel targeted therapeutic agents. TOP2A (target of epirubicin),[116]^25 PLK1 (target of BI 2536),[117]^26^,[118]^27 and AURKA (target of MLN8054 and MLN8237)[119]^28^,[120]^29 are each overexpressed in primary tumors from patients who went on to develop distant metastasis within five years. Comparison with additional validation study An additional breast cancer dataset[121]^15 was used as a validation dataset to compare and confirm the results of the meta-analysis of the other three studies. Gene expression was compared between two groups of samples from this study. The first was a baseline group that included samples from ER+, node negative tumors without systemic treatment and without distant metastasis (DMFS time > 5 years, 134 samples). The second group, the comparison group, included ER+, node negative tumors without systemic treatment with distant metastasis (DMFS time < 5 years, 26 samples). A total of 685 genes were found to be differentially expressed ([122]Additional file 8). This set of DEGs was further analyzed for enrichment of GO categories, as well as subnetwork analysis. The list of top enriched GO categories and cellular processes (Pathway Studio) was almost identical to the top enriched categories for the other three studies and for the IGS ([123]Additional file 9). Comparison with the IGS of 44 genes showed significant overlap with 31 genes in common ([124]Additional file 10) with a similar pattern of expression—all but two genes were over expressed in the group with distant metastasis. Discussion The goal of this work was to apply bioinformatics approaches to publicly available gene expression data, with the end goal of identifying genes associated with early distant metastasis (<5 years) in ER+ breast cancer. Across three independent clinical studies, we found significant enrichment of multiple differentially expressed genes associated with many gene ontology categories, pathways, and biological processes that are associated with regulation of mitosis. A subset of 44 differentially expressed genes was common for all 3 studies. This intersection gene set is highly interactive and represents a significant part of the mitotic regulation network (MRN) of genes controlling all stages in the process of cell division. This process starts from G2/M entry checkpoint, chromosome condensation, centrosome and centriole duplication, mitotic spindle formation and organization, and includes exit from M-phase. Using GO enrichment analysis, we found over 20 different biological processes that were significantly enriched with genes from the MRN that were overexpressed in all three studies of breast cancer. However, we also observed that all top-ranked, enriched biological categories that are related to mitosis regulation were represented by additional genes that were not shared among the three studies. This is evident from the comparison of gene lists for each top functional category in all three studies ([125]Fig. 1). While each GO category was represented and enriched by DEGs in all three studies, the number of genes varied between studies. This finding supports similar, recently reported observations for several types of cancer, where some of the aberrantly expressed genes sets were different between different samples whilst all mapping to a small number of the same biological pathways.[126]^30 Therefore, such pathways and/or interacting gene networks could potentially serve as biomarkers of specific tumor phenotypes. In our meta-analysis we found the same top GO categories and pathways ranked by enrichment across each of three studies all related to regulation of mitotic check points and regulation. The observation of common top ranked pathways across all three studies was reinforced by the analysis of an additional validation study.[127]^15 All top GO categories were almost identical for the Schmidt study and each category was represented by some of the same (and also some different) genes from the same categories. While intersection with the 44 genes was significant (31 genes), additional functionally relevant genes were also found in many of the same GO groups. Similarly we found additional DEGs from the Schmidt study that were from the same top ranked pathways but were not common for all four studies. For example, four pathways involving Polo-like kinases were similarly significantly enriched in the Schmidt dataset, whereas some of the DEGs were different, albeit closely related, to the genes from the 44 IGS ([128]Additional file 11). The idea that mitotic checkpoint failure, deregulated mitosis, and/or impaired chromosome segregation is indicative of poor outcome in many cancers, including those of the breast, is not new. For example, Hu et al recently reported that metastatic breast cancer, in both mouse and human, is characterized by a conserved gene signature that—specifically in ER+ tumors—is linked to the mitotic spindle checkpoint.[129]^31 This has implications for anti-mitotic agents and spindle poisons such as vinca alkaloids and taxanes, which have yet to be proven to be 100% effective in preventing tumor formation.[130]^32 It has been shown that spindle poisons prevent cancer cells from forming spindle microtubules and cause them to go into apoptosis.[131]^33 It has long been known that mitosis-regulating genes promote normal cell division and ensure chromosomal and genomic stability. It was also believed that loss of function mutations or deletions are required to impair mitotic checkpoint regulation.[132]^34^,[133]^35 However, studies have indicated that very few mutations are found among mitotic checkpoint genes in most human tumors.[134]^36^–[135]^39 The most commonly reported pattern was that of overexpression in tumors, including many of the genes we report here.[136]^40^,[137]^41 Much evidence indicates that overexpression of these checkpoint genes in fact leads to impairment of regulatory controls and increased mitotic activity of tumor cells.[138]^41 Several of the highly interactive nodes of the mitotic regulation network reported here are over-expressed in breast cancer and implicated in generation of chromosomal and genomic instability. Overexpression of the Aurora-A kinase in mammalian cells leads to centrosome amplification, genetic instability, and transformation.[139]^42 Inhibition of Aurora-A has been shown to lead to increased cisplatin sensitivity.[140]^43 Similar observations have been reported on human PTTG1 (hPTTG1), the index mammalian securin. PTTG1 is a critical component of the spindle checkpoint controlling faithful chromatid separation and has also been identified as a proto-oncogene.[141]^44 hPTTG1 abundance[142]^45 or loss of function,[143]^46 both result in abnormal mitosis and chromosomal instability. The mitotic-spindle/hyaluronan-binding protein RHAMM, an essential component of normal mitotic machinery, is over-expressed in many human tumors. Intracellular RHAMM associates with BRCA1 and BARD1; this association attenuates the mitotic-spindle-promoting activity of RHAMM that might contribute to tumor progression by promoting genomic instability.[144]^47 Another important player, Polo-like kinase (PLK1), was consistently overexpressed in each of the three studies in our comparison and is known to be a key regulator of mitotic pathways such as centrosome cycle control ([145]Fig. 3). Overexpression of PLK1 is shown to be associated with increased chromosomal instability, while its inhibition prevents the growth of metastatic breast cancer cells in the brain.[146]^48 These findings also suggest a connection to the metastatic phenotype of breast tumor cells. Overall, our enrichment analysis has shown that 10 of 44 genes that were overexpressed in all three studies are experimentally associated with the metastasis phenotype; it was also shown that nine of these genes are major regulatory hubs of the mitosis regulatory network reported. Importantly, dysregulation of these hubs has been connected to abnormal mitosis and genomic instability. Missegregation of chromosomes is a hallmark of genomic instability,[147]^49 and chromosomal instability (CIN) as measured by single nucleotide polymorphism analysis of breast tumors is significantly associated with time to distant metastasis in ER+ disease.[148]^50 However, our discovery of a common network of highly interconnected genes controlling the entire sequence of mitotic events, consistently overexpressed in all three studies, provides insight into potential large scale deregulation of the mitotic machinery that is associated with, and might even facilitate, a progression toward early metastasis. Analysis of common upstream regulators of the mitotic regulatory network We also identified several points of transcriptional and post-transcriptional control of this mitotic regulation network. Transcriptional analysis The promoter regions of more than 30% of the genes in the IGS are enriched for NFY transcription factor binding sites. Six out of 44 have been reported in literature as experimentally validated targets of NFYA, NFYB, or NFYC ([149]Additional file 7). An association between breast cancer progression[150]^51 or metastasis[151]^52 and enrichment of binding sites for NFY has been previously reported. NFY can cooperate with ER (via SP1 sites) to regulate transcription of E2F1, which is also enriched in our DEG list. However, estrogen-stimulated activation of NFY requires non-genomic effects of ER, specifically through cAMP and PKA.[152]^53 Interestingly, there is documented evidence that connects poor response to Tamoxifen (the only systemic therapy received by some patients in this study) and non-genomic (ie, non-transcriptional), ER-dependent signaling, which often involves kinases such as Src, Mapk, and Akt (reviewed in literature).[153]^54 E2F sites were found to be significantly enriched in the promoter regions of the 44 genes in the IGS using MSigDB analysis, which is in concordance with the TF and subnetwork analyses using Ingenuity, and Pathway Studio, respectively. The main characterized function of the E2F family of transcription factors is the regulation of the cell cycle, as well as genes involved in DNA synthesis, repair, apoptosis, development, and differentiation. It is therefore understandable that E2F sites are enriched in the IGS of the three studies. Several members of the E2F family regulate expression of MYBL2, a transcription factor on our list that regulates the expression of genes involved in cancer progression. E2F4—which was the top ranked TF we found in our study ([154]Table 4)—has primarily been described as a repressor of both transcription and cell cycle progression,[155]^55^–[156]^57 but it can also be a transcriptional activator. Overexpression of E2F4 in colorectal cancer cells has been shown to contribute to hyperproliferation.[157]^58^–[158]^60 A more recent study suggested that low E2F4 gene expression and is predictive for survival of ER negative breast cancer patients.[159]^61 E2F transcription factor1 (E2F1), the third top ranked TF from our study, is responsible for the regulation of FOXM1 expression (the second top ranked TF here), which plays a key role in epirubicin resistance in cancer cells.[160]^62 Another study has shown that E2F1 expression was enhanced in Tamoxifen-resistant MCF-7 breast cancer cells.[161]^63 The FOXM1 transcription factor is a transcriptional activator involved in several biological processes, including cell proliferation and differentiation, cell cycle progression, DNA repair, angiogenesis, and apoptosis. Elevated FOXM1 expression is found in numerous cancers and studies have found FOXM1 to be one of the most commonly upregulated genes in human solid tumors,[162]^64 thus it is not surprising this was the second top ranked TF found in our analysis. Third and fourth on the list were Breast Cancer Type 1 susceptibility protein (BRCA1) and Tumor Protein p53 (TP53). BRCA1 is involved in DNA repair and tumor suppression through interaction with a large network of DNA repair proteins. Mutations in BRCA1 are found in 5%–50% of familial breast cancers and are often associated with increased risk. Studies suggest that TP53 mutations may precede loss of the BRCA1 wild-type allele.[163]^65 TP53 mutations are often found in BRCA1 mutant tumors and play a large role in tumor suppression. Mutant p53 genes are found in a number of cancers and cause the loss of tumor suppressor activity. Another TF on our list—cell division cycle 5-like protein (Cdc5L)—plays a key role in regulation of cell division. It is also thought to suppress p53-induced growth arrest.[164]^66 Several CCAAT factors were seen as top hits across the three studies ([165]Table 4). CCAAT-enhancer binding proteins are transcriptional activators involved in numerous biological processes, including cellular proliferation. A recent study has shown that CCAAT/enhancer binding protein delta CpG island methylation is associated with metastasis in breast cancer.[166]^67 Finally, one of the TFs on the list that is not a well know regulator is microtubule-associated serine/threonine kinase-like (MASTL), which is the human ortholog of the Greatwall kinase that has been shown to be most active in mitosis, facilitating mitotic entry, anaphase, and cytokinesis.[167]^68 microRNA analysis Post-transcriptional gene regulation by small, non-coding RNAs (microRNAs) are also of considerable interest due to their significance in numerous biological processes. Currently studies indicate that the primary mechanism of microRNA interaction with their mRNA targets is through translational inhibition, although there is some debate about the role of microRNAs affecting mRNA stability and promoting degradation.[168]^69 Twelve distinct microRNAs were found to have their target sets significantly enriched with 29 of the 44 genes, with miR-192 targets being the most over-represented. All but three of these microRNA to target interactions have been experimentally validated and previously reported (Tarbase 6.0). Several of these microRNAs and their targets are well characterized as being associated with breast cancer. Specifically, miR-192 has been shown to be one of the p53 inducible microRNAs that controls multiple cell cycle checkpoints and inhibits cell proliferation when overexpressed.[169]^70^,[170]^71 In addition to cell cycle regulation, it is interesting to note that p53-mediated upregulation of miR-192 prevents the epithelial-mesenchymal transition (an in vitro corollary of metastatic potential) in hepatocellular carcinoma through repression of the transcription factor ZEB2.[171]^72 Another important microRNA regulatory network was detected among the 44 genes that includes miR-193B and its experimentally validated targets (the second highest ranked microRNA with eight genes targeted). It has been recently shown that miR-193B targets estrogen receptor-α (ERα) and inhibits estrogen-induced growth of breast cancer cells.[172]^73^–[173]^75 Importantly, miR-193B is a known moderator of ER signaling. Also of note, reduced expression of miR-342 has been associated with tamoxifen-resistant breast tumors and breast cancer cell line MCF-7/HER2Δ16 and MCF-7 variants TAMR1 and LCC2.[174]^76 Two other microRNAs from this set, miR-212 and miR-132 (miR212/miR132 family), were shown to regulate epithelial-stromal interactions during development of the mammary gland.[175]^77 They were also reported as over-expressed in pancreatic adenocarcinoma and targeting retinoblastoma tumor suppressor Rb1.[176]^78 Therapeutic targets Protein products of three of the 44 genes are targets of specific therapeutic agents; each of these three is also implicated as prognostic or predictive biomarkers in ER+ breast cancer. Topoisomerase II alpha (TOPO2A, more commonly referred to as TOP2A) is a target of the anthracycline class of chemotherapeutic agents (doxorubicin/adriamycin, epirubicin, etc.) and a poor prognostic factor in ER+ breast cancer that is associated with a proliferative index (Ki67 labeling).[177]^79 However, the strength of TOPO2A amplification or overexpression as a predictive factor for beneficial response to anthracyclines remains unclear.[178]^80 PLK1 is more commonly expressed in ER− breast tumors[179]^81 and was recently identified in an siRNA library screen as a selective inhibitor of tumor-initiating cells in triple negative breast cancer cell lines.[180]^26 However, in women with ER+ breast cancer, upregulation of a 14-3-3ζ/YWHAZ-centered network predicts poor response to Tamoxifen and distant metastasis; PLK1 is a major component of this network.[181]^82 Finally, AURKA is a key component of SCMGENE, a 3-gene signature (ER, HER2, AURKA) that performs as well, or better, than the original gene signatures developed to classify basal, HER2+, luminal A, and luminal B breast cancer.[182]^83 Down regulation of AURKA can also reverse estrogen-mediated growth in breast cancer cells.[183]^84 Conclusion A meta-analysis of three independent ER+ breast cancer studies has shown a common pattern of expression for 44 genes involved with the regulation of most stages of mitosis, starting from G2/M transition, mitotic entry, specific processes of chromosome condensation, centrosome and centriole duplication, spindle assembly, chromosome attachment, alignment, and segregation. The pattern of expression of these common genes is remarkably similar in all three datasets, with almost all genes being overexpressed in patients with early metastasis. This cluster of genes encodes proteins that are highly interactive and known to be involved with multiple checkpoints regulating entrance and transition through major phases of mitosis. Extensive literature mining has shown that overexpression of a significant number of these proteins is associated with abnormal mitotic events that lead to abnormal cell division and consequently an increased rate of chromosomal and genomic instability. Importantly, our findings indicate that this gene cluster is differentially expressed in patients with early metastasis; therefore, we hypothesize that correlated overexpression of multiple key regulators of mitotic checkpoints could be indicative of early onset and increase of genomic instability and a progression towards a metastatic cell phenotype. Ten of the genes common to all three studies were already reported as associated with a metastatic phenotype. Few details are known with regard to specific mechanisms of how genomic instability contributes to such progression. However, our meta-analysis has narrowed down a list of potential players and their transcriptional and post-transcriptional regulators, thus providing additional evidence for further mechanistic studies and validation experiments. Acknowledgments