Abstract Patients with estrogen receptor-negative breast cancer generally have a worse prognosis than estrogen receptor-positive patients. Nevertheless, a significant proportion of the estrogen receptor-negative cases have favorable outcomes. Identifying patients with a good prognosis, however, remains difficult, as recent studies are quite limited. The identification of molecular biomarkers is needed to better stratify patients. The significantly mutated genes may be potentially used as biomarkers to identify the subtype and to predict outcomes. To identify the biomarkers of receptor-negative breast cancer among the significantly mutated genes, we developed a workflow to screen significantly mutated genes associated with the estrogen receptor in breast cancer by a gene coexpression module. The similarity matrix was calculated with distance correlation to obtain gene modules through a weighted gene coexpression network analysis. The modules highly associated with the estrogen receptor, called important modules, were enriched for breast cancer-related pathways or disease. To screen significantly mutated genes, a new gene list was obtained through the overlap of the important module genes and the significantly mutated genes. The genes on this list can be used as biomarkers to predict survival of estrogen receptor-negative breast cancer patients. Furthermore, we selected six hub significantly mutated genes in the gene list which were also able to separate these patients. Our method provides a new and alternative method for integrating somatic gene mutations and expression data for patient stratification of estrogen receptor-negative breast cancers. Keywords: breast cancer patient stratification, estrogen receptor-negative, distance correlation, significantly mutated gene, gene coexpression network 1. Introduction Breast cancer is a heterogeneous disease with many subtypes which exhibits significant differences in response to therapy and patient outcomes (Jonasson et al., [29]2019). Breast cancer has been known to be an endocrine-related cancer (Wu et al., [30]2020), and the majority of breast cancer subtypes are hormone-associated (DeSantis et al., [31]2017; Xu et al., [32]2019). The expression of the estrogen receptor (ER), progesterone receptor (PR), and human epithelial growth factor receptor 2 (HER2) as predictive and/or prognostic markers has been well established in multiple studies (Francis et al., [33]2019). Endocrine therapies that target the ER have long been the cornerstone of therapy approaches for the majority of breast cancer patients. However, 20–30% of breast tumors do not express ER and are not responsive to existing endocrine therapies (Ni et al., [34]2011). The prognosis of estrogen receptor-negative (ER^−) breast cancer is worse than estrogen receptor-positive (ER^+) breast cancer in most situations, but ER^− breast cancer patients do not always have a poor clinical outcome. Due to the lack of reliable biomarkers, it is impossible to identify ER^− tumors with a good prognosis (Teschendorff et al., [35]2007; Zhang et al., [36]2016). Several studies have revealed that different chromosomal and gene expression patterns are present in patients with different estrogen receptor statuses (Zhang et al., [37]2009; Fohlin et al., [38]2020). Thus, an accurate grouping of ER^− breast cancer into clinically relevant subtypes is of particular importance for therapeutic decision making. Cancer is often driven by the accumulation of genetic alterations. Until now, the somatic mutation landscapes and signatures of more than a dozen major cancer types have been reported. However, pinpointing the driver mutations and cancer genes from millions of available cancer somatic mutations remains a significant challenge (Cheng et al., [39]2016). In The Cancer Genome Atlas (TCGA) database, a phenomenon can be observed that the position and nature of somatic mutations can often be translated to changes of protein structures or functions of the genes. The affected gene is designated as a significantly mutated gene (SMG). SMGs are the result of splice-site change, nonsense, nonstop, or frame-shift mutations (Zhang et al., [40]2016). The prevalence of SMGs in almost all cancer types has allowed for postulation that they may be act potentially as biomarkers for subtyping and testing for use in cancer patient outcome predictions, or a starting point of clarifying the tumorigenesis process (Cancer Genome Atlas Network, [41]2012). Network approaches have provided the means to bridge the gap between individual genes and systems oncology (Ghazalpour et al., [42]2006). Weighted gene coexpression network analysis (WGCNA) is a systems biology method used to analyze gene expression profiling data which is widely used in bioinformatics (Zhang and Horvath, [43]2005). WGCNA can help researchers analyze the relationships between genes and pathogenic mechanisms. Instead of linking thousands of genes to the disease, this method focuses on the relationship between gene modules and disease traits (Huang et al., [44]2020). Many studies that constructed the coexpression networks in breast cancer used WGCNA. Coexpression networks were used to screen hub genes from the co-expression module using the relationship between genes and traits, together with the core position of genes in the module (Tang et al., [45]2019; Jia et al., [46]2020). A coexpression network analysis can also identify the prognostic lncRNAs (Liu et al., [47]2019; Li et al., [48]2020). However, these studies did not consider the information of genetic mutations in breast cancer. SMGs are not always concentrated in specific genomic loci, which suggests that instead of common genes, mutations may affect some pathways or gene interaction networks (Zhang et al., [49]2016). So, in this work, we propose a method to screen SMGs using gene coexpression networks to identify the SMGs that highly relate to ER_Status. We show the development of a workflow for screening SMGs associated with clinical data of the estrogen receptor in breast cancer by a gene coexpression module. The new gene list was designated as important-SMGs. The identified genes, which were used to stratify patients with different subtypes of cancers, were suggested to represent biomarkers. Our method provides a new alternative method for cancer patient stratification by integrating transcriptomic, variants, and clinic data. 2. Methods In this work, we propose a method for screening SMGs by a gene coexpression module associated with clinical data of breast cancer and the estrogen receptor; the workflow is summarized in [50]Figure 1. We calculated the similarity coexpression matrix by distance correlation for WGCNA to construct a gene coexpression network and to obtain the gene modules. Distance correlation has a perfect theoretical system and works for both linear and nonlinear dependence between any two vectors (Székely et al., [51]2007). WGCNA is a method used to identify clusters (modules) of highly correlated genes (Zhang and Horvath, [52]2005). We identified some important modules that were significantly associated with the measured clinical estrogen receptor data. SMGs were then selected from the TCGA tumor somatic mutation data and the important-SMGs were obtained through the overlap between the important module genes and the SMGs. Furthermore, we respectively chose the hub SMGs in the important modules and acquired six genes which can be used as the biomarkers for stratification and prediction of patient survival of ER^− breast cancer. Figure 1. [53]Figure 1 [54]Open in a new tab Workflow of identifying new biomarkers using transcriptomic and variants data. 2.1. Datasets The TCGA datasets used in this study can be found in the [55]Data Portal for TCGA-Breast Cancer (Weinstein et al., [56]2013), For the construction of the gene coexpression and the SMGs selection, we used the TCGA dataset. The gene expression profile was measured experimentally using the Illumina HiSeq 2000 RNA Sequencing platform with log[2](x+1) transformed RSEM normalized count (Cancer Genome Atlas Network, [57]2012). The samples were screened based on RNA-seq data and clinical data, after which we selected genes with a variable coefficient of more than 0.2 and a mean >1. Ultimately, we obtained 5,076 genes. The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset from the [58]cBioportal website (Cerami et al., [59]2012) contains cDNA microarray performed on the Illumina HT-12 platform (Curtis et al., [60]2012; Pereira et al., [61]2016). The details of data normalization can be found in Margolin et al. ([62]2013). For validation, both datasets containing gene expression data and matching survival time (months) were used for survival analysis. Samples in the METABRIC were screened based on the clinical data (contain ER_Status, Days, Vital_Status). The sample numbers used in the two datasets are shown in [63]Table 1. Table 1. Sample numbers in two datsets. Dataset Total SMGs ER^+ ER^− Deceased/Living (ER^−) TCGA 637 383 499 133 23/110 ≈ 0.209 METABRIC 1,888 – 1,435 424 240/184 ≈ 1.304 [64]Open in a new tab There are more samples in METABRIC and longer clinical follow-up time. 2.2. Distance Correlation In 2007, distance correlation was proposed by Szekely, Rizzo, and Bakirov in the paper titled Measuring and Testing Dependence by Correlation of Distances published in the Annals of Statistics (Székely et al., [65]2007). In this work, the similarity coexpression matrix was calculated with distance correlation for WGCNA to perform a gene coexpression network analysis. Unlike the Pearson correlation, distance correlation works for both linear and nonlinear dependence between two gene expression profiles. However, distance correlation is still a relatively expensive computation. The time complexity of distance correlation was O(n^2). Distance correlation was calculated using the [66]energy package in R (see the references in the manual for more