Abstract

   Patients with estrogen receptor-negative breast cancer generally have a
   worse prognosis than estrogen receptor-positive patients. Nevertheless,
   a significant proportion of the estrogen receptor-negative cases have
   favorable outcomes. Identifying patients with a good prognosis,
   however, remains difficult, as recent studies are quite limited. The
   identification of molecular biomarkers is needed to better stratify
   patients. The significantly mutated genes may be potentially used as
   biomarkers to identify the subtype and to predict outcomes. To identify
   the biomarkers of receptor-negative breast cancer among the
   significantly mutated genes, we developed a workflow to screen
   significantly mutated genes associated with the estrogen receptor in
   breast cancer by a gene coexpression module. The similarity matrix was
   calculated with distance correlation to obtain gene modules through a
   weighted gene coexpression network analysis. The modules highly
   associated with the estrogen receptor, called important modules, were
   enriched for breast cancer-related pathways or disease. To screen
   significantly mutated genes, a new gene list was obtained through the
   overlap of the important module genes and the significantly mutated
   genes. The genes on this list can be used as biomarkers to predict
   survival of estrogen receptor-negative breast cancer patients.
   Furthermore, we selected six hub significantly mutated genes in the
   gene list which were also able to separate these patients. Our method
   provides a new and alternative method for integrating somatic gene
   mutations and expression data for patient stratification of estrogen
   receptor-negative breast cancers.

   Keywords: breast cancer patient stratification, estrogen
   receptor-negative, distance correlation, significantly mutated gene,
   gene coexpression network

1. Introduction

   Breast cancer is a heterogeneous disease with many subtypes which
   exhibits significant differences in response to therapy and patient
   outcomes (Jonasson et al., [29]2019). Breast cancer has been known to
   be an endocrine-related cancer (Wu et al., [30]2020), and the majority
   of breast cancer subtypes are hormone-associated (DeSantis et al.,
   [31]2017; Xu et al., [32]2019). The expression of the estrogen receptor
   (ER), progesterone receptor (PR), and human epithelial growth factor
   receptor 2 (HER2) as predictive and/or prognostic markers has been well
   established in multiple studies (Francis et al., [33]2019). Endocrine
   therapies that target the ER have long been the cornerstone of therapy
   approaches for the majority of breast cancer patients. However, 20–30%
   of breast tumors do not express ER and are not responsive to existing
   endocrine therapies (Ni et al., [34]2011). The prognosis of estrogen
   receptor-negative (ER^−) breast cancer is worse than estrogen
   receptor-positive (ER^+) breast cancer in most situations, but ER^−
   breast cancer patients do not always have a poor clinical outcome. Due
   to the lack of reliable biomarkers, it is impossible to identify ER^−
   tumors with a good prognosis (Teschendorff et al., [35]2007; Zhang et
   al., [36]2016). Several studies have revealed that different
   chromosomal and gene expression patterns are present in patients with
   different estrogen receptor statuses (Zhang et al., [37]2009; Fohlin et
   al., [38]2020). Thus, an accurate grouping of ER^− breast cancer into
   clinically relevant subtypes is of particular importance for
   therapeutic decision making.

   Cancer is often driven by the accumulation of genetic alterations.
   Until now, the somatic mutation landscapes and signatures of more than
   a dozen major cancer types have been reported. However, pinpointing the
   driver mutations and cancer genes from millions of available cancer
   somatic mutations remains a significant challenge (Cheng et al.,
   [39]2016). In The Cancer Genome Atlas (TCGA) database, a phenomenon can
   be observed that the position and nature of somatic mutations can often
   be translated to changes of protein structures or functions of the
   genes. The affected gene is designated as a significantly mutated gene
   (SMG). SMGs are the result of splice-site change, nonsense, nonstop, or
   frame-shift mutations (Zhang et al., [40]2016). The prevalence of SMGs
   in almost all cancer types has allowed for postulation that they may be
   act potentially as biomarkers for subtyping and testing for use in
   cancer patient outcome predictions, or a starting point of clarifying
   the tumorigenesis process (Cancer Genome Atlas Network, [41]2012).

   Network approaches have provided the means to bridge the gap between
   individual genes and systems oncology (Ghazalpour et al., [42]2006).
   Weighted gene coexpression network analysis (WGCNA) is a systems
   biology method used to analyze gene expression profiling data which is
   widely used in bioinformatics (Zhang and Horvath, [43]2005). WGCNA can
   help researchers analyze the relationships between genes and pathogenic
   mechanisms. Instead of linking thousands of genes to the disease, this
   method focuses on the relationship between gene modules and disease
   traits (Huang et al., [44]2020). Many studies that constructed the
   coexpression networks in breast cancer used WGCNA. Coexpression
   networks were used to screen hub genes from the co-expression module
   using the relationship between genes and traits, together with the core
   position of genes in the module (Tang et al., [45]2019; Jia et al.,
   [46]2020). A coexpression network analysis can also identify the
   prognostic lncRNAs (Liu et al., [47]2019; Li et al., [48]2020).
   However, these studies did not consider the information of genetic
   mutations in breast cancer.

   SMGs are not always concentrated in specific genomic loci, which
   suggests that instead of common genes, mutations may affect some
   pathways or gene interaction networks (Zhang et al., [49]2016). So, in
   this work, we propose a method to screen SMGs using gene coexpression
   networks to identify the SMGs that highly relate to ER_Status. We show
   the development of a workflow for screening SMGs associated with
   clinical data of the estrogen receptor in breast cancer by a gene
   coexpression module. The new gene list was designated as
   important-SMGs. The identified genes, which were used to stratify
   patients with different subtypes of cancers, were suggested to
   represent biomarkers. Our method provides a new alternative method for
   cancer patient stratification by integrating transcriptomic, variants,
   and clinic data.

2. Methods

   In this work, we propose a method for screening SMGs by a gene
   coexpression module associated with clinical data of breast cancer and
   the estrogen receptor; the workflow is summarized in [50]Figure 1. We
   calculated the similarity coexpression matrix by distance correlation
   for WGCNA to construct a gene coexpression network and to obtain the
   gene modules. Distance correlation has a perfect theoretical system and
   works for both linear and nonlinear dependence between any two vectors
   (Székely et al., [51]2007). WGCNA is a method used to identify clusters
   (modules) of highly correlated genes (Zhang and Horvath, [52]2005). We
   identified some important modules that were significantly associated
   with the measured clinical estrogen receptor data. SMGs were then
   selected from the TCGA tumor somatic mutation data and the
   important-SMGs were obtained through the overlap between the important
   module genes and the SMGs. Furthermore, we respectively chose the hub
   SMGs in the important modules and acquired six genes which can be used
   as the biomarkers for stratification and prediction of patient survival
   of ER^− breast cancer.

Figure 1.

   [53]Figure 1
   [54]Open in a new tab

   Workflow of identifying new biomarkers using transcriptomic and
   variants data.

2.1. Datasets

   The TCGA datasets used in this study can be found in the [55]Data
   Portal for TCGA-Breast Cancer (Weinstein et al., [56]2013), For the
   construction of the gene coexpression and the SMGs selection, we used
   the TCGA dataset. The gene expression profile was measured
   experimentally using the Illumina HiSeq 2000 RNA Sequencing platform
   with log[2](x+1) transformed RSEM normalized count (Cancer Genome Atlas
   Network, [57]2012). The samples were screened based on RNA-seq data and
   clinical data, after which we selected genes with a variable
   coefficient of more than 0.2 and a mean >1. Ultimately, we obtained
   5,076 genes.

   The Molecular Taxonomy of Breast Cancer International Consortium
   (METABRIC) dataset from the [58]cBioportal website (Cerami et al.,
   [59]2012) contains cDNA microarray performed on the Illumina HT-12
   platform (Curtis et al., [60]2012; Pereira et al., [61]2016). The
   details of data normalization can be found in Margolin et al.
   ([62]2013). For validation, both datasets containing gene expression
   data and matching survival time (months) were used for survival
   analysis. Samples in the METABRIC were screened based on the clinical
   data (contain ER_Status, Days, Vital_Status). The sample numbers used
   in the two datasets are shown in [63]Table 1.

Table 1.

   Sample numbers in two datsets.
   Dataset  Total SMGs ER^+  ER^− Deceased/Living (ER^−)
   TCGA      637  383   499  133      23/110 ≈ 0.209
   METABRIC 1,888  –   1,435 424     240/184 ≈ 1.304
   [64]Open in a new tab

   There are more samples in METABRIC and longer clinical follow-up time.

2.2. Distance Correlation

   In 2007, distance correlation was proposed by Szekely, Rizzo, and
   Bakirov in the paper titled Measuring and Testing Dependence by
   Correlation of Distances published in the Annals of Statistics (Székely
   et al., [65]2007). In this work, the similarity coexpression matrix was
   calculated with distance correlation for WGCNA to perform a gene
   coexpression network analysis. Unlike the Pearson correlation, distance
   correlation works for both linear and nonlinear dependence between two
   gene expression profiles. However, distance correlation is still a
   relatively expensive computation. The time complexity of distance
   correlation was O(n^2). Distance correlation was calculated using the
   [66]energy package in R (see the references in the manual for more