Abstract

   Breast cancer is a disease with high heterogeneity. Cancer is not
   usually caused by a single gene, but by multiple genes and their
   interactions with others and surroundings. Estimating breast
   cancer-specific gene–gene interaction networks is critical to elucidate
   the mechanisms of breast cancer from a biological network perspective.
   In this study, sample-specific gene–gene interaction networks of breast
   cancer samples were established by using a sample-specific network
   analysis method based on gene expression profiles. Then, gene–gene
   interaction networks and pathways related to breast cancer and its
   subtypes and stages were further identified. The similarity and
   difference among these subtype-related (and stage-related) networks and
   pathways were studied, which showed highly specific for subtype
   Basal-like and Stages IV and V. Finally, gene pairwise interactions
   associated with breast cancer prognosis were identified by a Cox
   proportional hazards regression model, and a risk prediction model
   based on the gene pairs was established, which also performed very well
   on an independent validation data set. This work will help us to better
   understand the mechanism underlying the occurrence of breast cancer
   from the sample-specific network perspective.

   Keywords: Breast cancer, Gene interaction network,
   Stages/subtypes-related networks, Cox proportional hazards regression
   model, Sample-specific networks

Introduction

   According to the latest data from the survey of the International
   Agency for Research on Cancer (IARC) in 2018, the incidence of breast
   cancer is 24.2% among women worldwide, ranking first in female cancers
   ([32]Bray et al., 2018). At present, the incidence of breast cancer is
   the highest, and its mortality ranks fourth in China. Breast cancer has
   strong heterogeneity. Based on the TNM staging system, breast cancer
   can be divided into Stages I, II, III, IV and V. There are many
   clinical types of breast cancer according to pathological
   classification and molecular classification. The pathological
   classification generally divides breast cancer into invasive and
   non-invasive breast cancer. The gold standard for the molecular typing
   of breast cancer is PAM50 molecular typing based on the expression
   profile of 50 genes, which classifies breast cancer into the
   Normal-like, LuminalA, LuminalB, Basal-like, and Her2 subtypes
   ([33]Perou et al., 2000).

   The molecular typing of breast cancer has important reference value for
   clinical treatment of breast cancer. However, molecular typing requires
   transcriptome sequencing which is difficult to promote clinically.
   Currently, the diagnosis of breast cancer classification is mainly
   through immunohistochemistry (IHC), namely, diagnosis by the expression
   of four markers, ER (oestrogen receptor), PR (progestin receptor), HER2
   gene (human epidermal growth factor receptor 2) and Ki-67 protein
   (proliferating cell nuclear antigen). ER and PR are important
   indicators for endocrine therapy and prognosis evaluation in breast
   cancer. Studies have shown that their expressions are positively
   correlated with total survival, treatment failure time, endocrine
   therapy response time, and recurrence time ([34]Hammond et al., 2010;
   [35]Fitzgibbons et al., 2010). In 2009, Cheang used GEP (gene
   expression analysis) to determine 14% as the threshold of Ki-67, which
   could be used to divide patients into two groups with good and bad
   prognoses ([36]Cheang et al., 2009). In 2011, the St. Gallen
   International Expert Consensus agreed to include Ki-67 as an important
   standard for molecular typing, which is the key to distinguishing the
   Luminal A and Luminal B subtypes ([37]Goldhirsch et al., 2011). In the
   growth and metastasis of breast cancer, HER2 is one of the most
   important factors, and its status can be used to predict the effect of
   drug treatment for breast cancer. Early detection and diagnosis and
   timely treatment are of great significance to improve the survival rate
   of breast cancer patients.

   The etiology of breast cancer is still not clear, and there are many
   related factors, such as individual differences and a lack of effective
   treatments. With the development of biomedicine, personalized medicine
   is becoming the direction of breast cancer treatment in the future. At
   present, the medical plan can only be formulated through the study of
   single gene expression and mutation information. However, this
   information cannot fully reflect the personalized interaction and
   regulation among genes. Because onset and progression of cancer are
   often caused by the disruption of important biological networks such as
   cell cycle and apoptosis, but not a single gene. Indeed, there is a new
   and cutting-edge field of medical research, called network medicine,
   whose basic idea is that human diseases are rarely caused by single
   molecular determinant, but more likely influenced by a network of
   interacting molecular determinants with the propensity to cluster
   together in the human interactome ([38]Barabási, Gulbahce & Loscalzo,
   2011; [39]Conte et al., 2019; [40]Fiscon et al., 2018). Gene–gene
   interaction networks can reveal the interaction relations and
   regulatory mechanisms among genes, and they have the irreplaceable
   function of the single-gene monitoring of information (such as
   expression and mutation) in many aspects ([41]Liu et al., 2016).
   Therefore, the mechanism of the occurrence and development of breast
   cancer can be explored through changes in the interactions between
   genes. In this article, we constructed sample-specific networks of
   breast cancer samples by calculating the correlation coefficient of
   protein-coding gene pairs to explore the gene–gene interaction networks
   related to breast cancer stages and subtypes (see [42]Fig. 1).

Figure 1. An integrative framework identifying breast cancer-related
gene–gene interaction networks.

   [43]Figure 1
   [44]Open in a new tab

   (A) Construction of sample-specific networks based on gene expression
   data. A reference network can be established based on the expression
   profiles of n reference samples by calculating the correlation
   coefficients PCC[n] of gene pairs. Then, adding a new cancer sample
   sd[x] into the reference samples, a perturbed network is established by
   calculating the new correlation PCC[n +] [1] of the n + 1 samples.
   Because of sample sd[x], the perturbed network is different from the
   reference network, and the difference ∆PCC[n] (PCC[n + 1] − PCC[n]) of
   each edge in the background network constitutes the differential
   network. Then, the significance of each edge can be quantified by a
   statistical Z-test. The sample-specific network for sample sd[x] is
   composed by those edges with significant ∆PCC[n]. (B) The framework to
   identify the breast cancer-related gene–gene interaction network based
   on gene expression. Using the sample-specific network analysis method,
   m cancer sample-specific networks were constructed. Then, these
   constructed sample-specific networks were analyzed to identify breast
   cancer-related networks, stage-related networks and subtype-related
   networks, as well as gene-interaction biomarkers associated with the
   prognosis of breast cancer. Moreover, pathway enrichment analysis based
   on KEGG pathways and survival analysis based on the LASSO regression
   model were performed.

   The survival time of different patients with breast cancer is
   significantly different. At present, the 5-year survival rate of breast
   cancer patients in China has reached 83.2%. However, the 5-year
   survival rate of advanced cancer patients and Basal-like cancer
   patients are significantly lower, so it is necessary to study the
   biomarkers that affect the prognosis of breast cancer. In 2009,
   [45]Parker et al. (2009) established a single-gene level survival
   analysis model to improve the prognosis of breast cancer and predict
   the efficacy of chemotherapy. However, the robustness of the gene-based
   model is not very high. Thus, this paper aims to establish a more
   stable prognostic analysis model of breast cancer patients through
   gene–gene interactions. We used the differential correlation
   coefficients to model the prognosis of breast cancer. Lasso regression
   is suitable for data analysis and model construction with many
   independent variables but a limited sample size ([46]Zhang et al.,
   2015). In this study, we used a Lasso regression model to effectively
   reduce the dimensionality of large gene pairs and then identified the
   gene interactions related to the prognosis of breast cancer. Finally, a
   multivariate Cox proportional hazards regression analysis based on the
   gene interactions was carried out to predict the survival of patients
   with breast cancer (see [47]Fig. 1B). A prognosis model was established
   and it also performed very well on an independent validation data.

Materials and Methods

Datasets

   In this article, the RNA sequencing (RNA-seq) data of 290 normal breast
   tissues was downloaded from the GTEx database
   ([48]https://gtexportal.org/home/), and the RNA-seq data of 1,093
   breast cancer samples was downloaded from the TCGA database
   ([49]https://portal.gdc.cancer.gov/). A human protein–protein
   interaction network was from the STRING database version 11.0
   ([50]https://string-db.org/), and gene sets of all available186 KEGG
   pathways were downloaded from the GSEA/MSigDB database
   ([51]http://software.broadinstitute.org/gsea/msigdb). In addition, the
   clinical information of the breast cancer patients was downloaded from
   the TCGA database, including TNM stage, prognosis survival time and
   other information. The 290 normal breast tissues were used as reference
   samples. The gene expression data sets of normal and cancer samples
   were both converted to the TPM form and contain 18,006 genes in total.
   The independent validation data of the prognosis model was from the
   [52]GSE3494 set in GEO Datasets, which contains 251 expression profiles
   of breast tumors by array.

Construction of sample-specific networks

   In this study, gene–gene interactions with high confidence
   (comprehensive score >0.9) were selected from the STRING database,
   which include regulatory, physical and co-expression protein–protein
   interaction networks. Furthermore, the above gene–gene interactions
   with both genes in one of the 186 KEGG pathways were used as the
   background network (or template network), which contained 3,257 genes
   in total. The sample-specific network method aimed to calculate the
   difference of the gene co-expression when the single cancer sample was
   added to a bunch of normal samples. In short, the sample-specific
   networks to be constructed are actually networks with significant
   perturbation edges of gene co-expression.

   In the following analysis, sample-specific networks for breast cancer
   samples were constructed based on gene expression profiles by using the
   method introduced in reference ([53]Liu et al., 2016) (see [54]Fig.
   1A). First, using the gene expression data of n reference samples,
   namely all the normal breast tissues data, the reference network can be
   constructed by calculating the correlation coefficient PCC[n] (the
   Pearson correlation coefficient (PCC)) of the gene pairs connected in
   the background network. The weights of the edges in the reference
   network are the PCC of the corresponding gene pairs. Then, the
   expression data of a single breast cancer sample was added to the
   reference samples, and the perturbed network of the single sample was
   constructed by calculating the new correlation coefficient PCC[n + 1]
   of the gene pairs in the background network. For the single breast
   cancer sample, the differential correlation coefficients of each edge
   between the perturbed network and the reference network were calculated
   as:
   [MATH: <mi mathvariant="normal">Δ</mi><msub><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi></msub><mo>=</mo><msub><mro
   w><mi mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mrow><mi>n</mi><mo>+</mo><mn>1</mn><
   /mrow></msub><mo>−</mo><msub><mrow><mi mathvariant="normal">P</mi><mi
   mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi></msub> :MATH]
   , which called differential network for the sample. In reference
   [55]Liu et al. (2016) have proved that
   [MATH: <mi mathvariant="normal">Δ</mi><msub><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi></msub> :MATH]
   follows a normal distribution with a mean value of 0 and a variance of
   [MATH: <mstyle displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><mn>1</mn><mo>−</mo><msubsup><mrow><
   mi mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi><mn>2</mn></msubsup></mrow>
   <mrow><mi>n</mi><mo>−</mo><mn>1</mn></mrow></mfrac></mrow></mstyle>
   :MATH]
   when n is large enough. The significance level of each
   [MATH: <mi mathvariant="normal">Δ</mi><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow> :MATH]
   was determined by the Z-test. The statistical Z-value is calculated as
   follows with the null hypothesis that
   [MATH: <mi mathvariant="normal">Δ</mi><msub><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi></msub> :MATH]
   is equal to 0:
   [MATH: <mi>Z</mi><mo>=</mo><mstyle displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><mi
   mathvariant="normal">Δ</mi><msub><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi></msub></mrow><mrow><mrow><
   mo>(</mo><mrow><mn>1</mn><mo>−</mo><msubsup><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow><mi>n</mi><mn>2</mn></msubsup></mrow>
   <mo>)</mo></mrow><mrow><mo>/</mo></mrow><mrow><mo>(</mo><mrow><mi>n</mi
   ><mo>−</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow></mfrac></mrow></ms
   tyle> :MATH]

   Then, we can obtain the P-value for each gene pair from the Z-value.
   Gene pairs (or edges) were considered statistically significant if
   their P-values < 0.01. All significant edges constitute the
   sample-specific network. Thus, adding the expression data of 1,093
   breast cancer samples to the reference samples one at a time, we
   finally constructed 1,093 sample-specific networks.

Identification of stage/subtype-related gene–gene interaction networks

   Only the gene pairs that are perturbed significantly in the most breast
   cancer samples are considered to be related to breast cancer. Then, the
   edges that are perturbed significantly in more than 90% of the samples
   by the binomial right-sided test (P-value < 0.05) constitute a
   gene–gene interaction network related to breast cancer. Specifically,
   we firstly divided the above breast cancer samples into different
   stages or subtypes based on TNM staging and PAM50 subtype system,
   secondly selected the perturbed significantly edges in more than 90%
   samples of different stages or subtypes, and then the stage/subtype
   gene–gene interaction networks were constructed.

   A slight change in the expression of high-degree genes in the network
   may cause disturbances of the entire network. Thus, these genes with
   high degree are considered to be the key genes for the onset and
   development of breast cancer. We selected genes with degrees >5 in the
   identified breast cancer-related network for the subsequent enrichment
   analysis. Furthermore, we also identified the key genes related to each
   TNM stage and PAM50 subtype with the same method. Here, because of the
   small number of Stage V samples, Stage V was combined with Stage IV.

Pathway enrichment analysis

   For the pathway enrichment analysis, we used the hypergeometric test as
   follows:
   [MATH:
   <mi>p</mi><mrow><mo>(</mo><mrow><mi>m</mi><mo>,</mo><mi>M</mi><mo>,</mo
   ><mi>N</mi><mo>,</mo><mi>n</mi></mrow><mo>)</mo></mrow><mo>=</mo><mn>1<
   /mn><mo>−</mo><munderover><mrow><mo
   movablelimits="false">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>0</mn>
   </mrow><mrow><mi>m</mi><mo>−</mo><mn>1</mn></mrow></munderover><mstyle
   displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><mrow><mo>(</mo><mrow><mtable
   rowspacing="4pt"
   columnspacing="1em"><mtr><mtd><mi>M</mi></mtd></mtr><mtr><mtd><mi>i</mi
   ></mtd></mtr></mtable></mrow><mo>)</mo></mrow><mrow><mo>(</mo><mrow><mt
   able rowspacing="4pt"
   columnspacing="1em"><mtr><mtd><mrow><mi>N</mi><mo>−</mo><mi>M</mi></mro
   w></mtd></mtr><mtr><mtd><mrow><mi>n</mi><mo>−</mo><mi>i</mi></mrow></mt
   d></mtr></mtable></mrow><mo>)</mo></mrow></mrow><mrow><mrow><mo>(</mo><
   mrow><mtable rowspacing="4pt"
   columnspacing="1em"><mtr><mtd><mi>N</mi></mtd></mtr><mtr><mtd><mi>n</mi
   ></mtd></mtr></mtable></mrow><mo>)</mo></mrow></mrow></mfrac></mrow></m
   style> :MATH]

   where N is the total number of genes in the background network, M
   represents the number of key genes related to breast cancer (or a stage
   or subtype of breast cancer), n accounts for the number of genes in a
   pathway, and m represents the number of genes that both in the pathway
   and in key genes related to breast cancer (or a stage or subtype of
   breast cancer). Then, the pathway with P-value < 0.05 was considered as
   significantly enriched in the breast cancer (or a stage or subtype of
   breast cancer) samples. Otherwise, we regarded that the pathway is not
   enriched in the corresponding group.

Survival analysis by the Cox regression model

   Different from the usual survival analysis based on gene expression,
   the perturbation of gene co-expression
   [MATH: <mi mathvariant="normal">Δ</mi><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow> :MATH]
   (i.e., gene pairs or edges) was used to survival analysis. According to
   the clinical data of patients with breast cancer, we utilized the
   “survival” package and “survminer” package in R/Bioconductor to
   establish a univariate Cox proportional hazards regression model by
   setting patients’ survival conditions (survival time and survival
   status) as the dependent variables and the
   [MATH: <mi mathvariant="normal">Δ</mi><mrow><mi
   mathvariant="normal">P</mi><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">C</mi></mrow> :MATH]
   of gene pairs in the differential network for each breast cancer
   samples as the covariates. Gene pairs with P-values < 0.05 were
   considered to be related to the prognosis of breast cancer ([56]Cheng,
   2018).

   A large number of covariates may cause overfitting in establishing a
   multivariate Cox proportional hazards regression model; thus, using the
   least absolute shrinkage and selection operator (LASSO), we further
   selected the key gene pairs from these significant ones obtained by the
   univariate Cox proportional hazards analysis. LASSO is a common method
   used in high-dimensional data regression, which can select
   prognosis-related gene pairs of breast cancer by shrinking regression
   coefficients. The tuning parameter (λ) with the smallest mean-square
   error was selected by four-fold cross-validation to establish an
   optimal LASSO regression model. Then, the coefficients of most gene
   pairs reduced to zero, and a small number of gene pairs with nonzero
   coefficients were considered to be closely correlated with the
   prognosis of breast cancer.

   LASSO Cox analysis was performed by using the “glmnet” package in R.
   Then, the risk score for each sample was calculated by the LASSO Cox
   regression model. According to the median risk score, breast cancer
   patients were divided into two groups (a high-risk group and a low-risk
   group). In addition, 234 breast tumors with relapse free survival
   information in the validation data set were analyzed by using the above
   sample-specific network method, and risk scores were calculated by the
   Cox regression model based on 1,093 samples in TCGA. Then the
   validation samples were also divided into two groups in the same way.
   Finally, the corresponding Kaplan–Meier survival curves were plotted by
   using the packages “survminer” and “survival” in R.

Results

Breast cancer-related gene–gene interaction networks

   The background network consisted of 46,916 edges and 3,237 genes. In
   addition, 2,190 gene pairs were identified as significantly related to
   breast cancer, which constituted the gene–gene interaction network
   related to breast cancer (including 915 genes in total). We use the
   Cytoscape software to visualize the breast cancer-related network (see
   [57]Fig. 2).

Figure 2. Gene–gene interaction networks related to breast cancer.

   [58]Figure 2
   [59]Open in a new tab

   Nodes in these networks stand for genes, and the size of the nodes
   corresponds to the degree of the genes in the network. The purple nodes
   represent the genes with degrees ≥15, and the blue ones are the genes
   with degrees <15.

   Genes with degrees >5 in the breast cancer-related gene–gene
   interaction network (198 in total which are shown in [60]Table S1).
   Among them, some genes with higher degrees (>20) have been shown to be
   related to breast cancer. For example, CCNB1 has strong power to
   predict the survival of breast cancer patients with the phenotype of ER
   positive ([61]Ding et al., 2014). The overexpression of GRB2 has been
   demonstrated to be significantly associated with the occurrence and
   poor prognosis of breast cancer ([62]Zhang et al., 2016). PCNA has been
   proven to be a marker of proliferation in the diagnosis of breast
   cancer ([63]Juríková et al., 2016), SF3B4 has been shown to be a tumor
   suppressor, and somatic inactivating mutations occasionally occur in
   breast cancer ([64]Denu & Burkard, 2017). UBE2C may promote the
   development of breast cancer ([65]Mo et al., 2017). High Cdc20 and
   securin immune expression are associated with extremely poor outcomes
   in breast cancer patients ([66]Karra et al., 2014), and overexpression
   of RPL17 affects breast cancer-associated brain metastases ([67]Yuan,
   Wang & Cheng, 2018). MAD2L1 may have great effect on breast cancer
   progression, and its expression might help to predict breast cancer
   prognosis ([68]Wang et al., 2015). The high expression of TRA2B is
   closely related to the cancer cell survival and therapeutic sensitivity
   of breast cancer ([69]Best et al., 2013). GTF2H4 has been identified to
   be related to the survival risk of breast cancer ([70]Ge et al., 2019).

Stage-related gene–gene interaction networks

   The results of the four stage-related gene–gene interaction networks
   are shown in [71]Figs. S1A–[72]S1D. And the top 10 genes with the
   highest degrees in these four networks are displayed in [73]Figs.
   S1E–[74]S1H. There are obvious similarities and differences among the
   four stage-related gene interaction networks. There are 81 key genes
   shared by all stages (see [75]Table S3), among which RPL17, CCNB1, and
   SF3B4 are genes that are highly (with degrees >25) related to breast
   cancer. Stage I has 5 specific genes: PSMC5, SDHB, RPL11, SDHA, and
   RPL13. Stage II has 3 specific genes: STX6, CCNA2, and CDC25C. Stage
   III has 4 specific genes: NDUFA6, EPN1, SF3A3, and LSM7. Stage IV has
   the largest number of specific genes, with a total of 38, among which
   CDC42, LSM2, NDUFS6, and CDC25A are strongly associated with it. And
   these stage-specific key genes are shown in [76]Table S4.

Subtype-related gene–gene interaction networks

   The results of the four subtype-related gene interaction networks are
   shown in [77]Figs. S2A–[78]S2D. And the top 10 genes with the highest
   degrees in these four networks are displayed in [79]Figs. S2E–[80]S2H.
   The four subtype-related networks share similar and different
   characteristics. There are 34 key genes shared by the four subtypes
   (see [81]Table S7). Among them, RPL17 and CCNB1 have higher degrees.
   The Luminal A subtype has 11 specific genes, including RPL23A, RPL10,
   and PRPF6, which are greatly related to it with higher degrees. The
   Luminal B subtype has 17 specific genes, including COX6C, EGFR, and
   CLTC, which are related to it with higher degrees. The Her2 subtype has
   3 specific genes, NDUFA6, CCR8, and CASP3. The Basal-like subtype has
   the largest number of specific genes, 17 in total, including LSM2,
   DDX5, SF3A3, and MAGOH, with higher degrees. And these subtype-specific
   key genes are shown in [82]Table S8.

Pathways enriched in breast cancer patients

   There were 41 pathways (see [83]Table S2) enriched in the breast cancer
   samples according to the pathway enrichment analysis, including some
   immune-related pathways, such as the Toll-like receptor signaling
   pathway, antigen processing and presentation, complement and
   coagulation cascades, the RIG-I-like receptor signaling pathway, and
   the cytosolic DNA-sensing pathway. Some important signal transduction
   and signal molecular interaction pathways were also included, such as
   the MAPK signaling pathway, Wnt signaling pathway, cytokine-cytokine
   receptor interaction, and ECM–receptor interaction pathways. Breast
   cancer is closely related to endocrine disorders ([84]Sakoda et al.,
   2008), two endocrine-related pathways, adipocytokine signaling pathway,
   and PPAR signaling pathway, have also been identified as being related
   to breast cancer. In addition, some metabolic pathways, especially
   lipid metabolism pathways, have also been identified as being
   associated with breast cancer ([85]Merdad et al., 2015), such as the
   steroid hormone biosynthesis, arachidonic acid metabolism, arginine and
   proline metabolism pathway, and glycerolipid metabolism. Additionally,
   pathways in cancer was also enriched. The enrichment results are shown
   in [86]Fig. 3A.

Figure 3. Pathways enriched in breast cancer, as well as different stages and
subtypes of it.

   [87]Figure 3
   [88]Open in a new tab

   (A) KEGG pathways enriched in breast cancer samples, ranked by
   −log10(p). (B) Overlap and difference of the enriched pathways in the
   four breast cancer stages. There are 11 commonly enriched pathways in
   the four stages. The number of Stage IV-specific pathways was 18. (C)
   Overlap and difference of the enriched pathways in the four PAM50
   subtypes. There are four commonly enriched pathways in the four PAM50
   subtypes. The number of Basal-like specific pathways is 9.

   Most of these pathways have been documented to be related to breast
   cancer. For example, the dysregulation of the steroid hormone
   biosynthesis pathway may affect steroid hormone levels and may thus be
   related to the susceptibility to breast cancer ([89]Sakoda et al.,
   2008). The PPAR signaling pathway may play an important role in the
   neoadjuvant chemotherapy response of breast cancer ([90]Chen et al.,
   2012). Mounting preclinical evidence supports targeting the MAPK
   signaling pathway in the triple negative breast cancer (TNBC)
   ([91]Giltnane & Balko, 2014). AMPK activators inhibit breast cancer
   cell proliferation by inhibiting DVL3-promoted Wnt/β-catenin signaling
   pathway activity ([92]Zou et al., 2017). Toll-like receptors may play
   dual roles in human cancers ([93]Khademalhosseini & Arababadi, 2019).
   The co-activation of the Hedgehog and Wnt signaling pathways is a poor
   prognostic marker in TNBC ([94]Bhateja et al., 2019). Prl-3 is closely
   related to cell migration and invasion in TNBC ([95]Gari et al., 2016).
   The YHD inhibition of 4T1 breast tumor growth may be related to the
   negative regulation of the JAK/STAT3 pathway by repressing the
   expression of IL-6 and TGF-β ([96]Mao, Feng & Gong, 2018).

Stage-related pathways

   The overlapping of pathways enriched in the four TNM stages are shown
   in [97]Fig. 3B. The proportion of enriched pathways shared by the four
   stages (see [98]Table S5) is relatively high, including the Wnt
   signaling pathway, MAPK signaling pathway, regulation of actin
   cytoskeleton, calcium signaling pathway, pathways in cancer, and cell
   adhesion molecules, which have been shown to have a high correlation
   with breast cancer ([99]Giltnane & Balko, 2014; [100]Zou et al., 2017;
   [101]Kazazian et al., 2017; [102]Woltmann et al., 2014; [103]Saadatmand
   et al., 2013). The pathways enriched in different stages are slightly
   different, especially Stage IV of breast cancer, which has 18 specific
   enriched pathways, among which the PPAR signaling pathway, ECM-receptor
   interaction, tight junction, TGF-beta signaling pathway, NOD-like
   receptor signaling pathway, and other signaling pathways are mostly
   related to the metastases of breast cancer ([104]Chen et al., 2012;
   [105]Bao et al., 2019; [106]Yang et al., 2019; [107]Tang et al., 2017;
   [108]Peng et al., 2016).

   As we expected, Stage IV was specifically enriched the most pathways
   (18 in total, see [109]Table S6) different from other stages. This
   result is probably because Stage IV breast cancer patients are the most
   serious, and their cancer cells are likely to have deteriorated and
   metastasized. Therefore, the disruption of the biological system
   balance of breast cancer patients at this stage is larger than that of
   other stages. Thus, the specific enriched pathways of Stage IV are
   correspondingly more.

Subtype-related pathways

   The overlap and difference of the enriched pathways in the four PAM50
   subtypes are shown in [110]Fig. 3C. There are slight differences in the
   subtype-related pathways. There are 4 enriched pathways shared by the
   four subtypes (see [111]Table S9) including the cytokine–cytokine
   receptor interaction. As a special subtype of breast cancer, the
   Basal-like subtype (or TNBC) is characterized by high histological
   differentiation, a high risk of metastasis, a high recurrence rate, and
   a low survival rate. Probably due to the higher risk of Basal-like
   subtype, there are 9 specific pathways enriched in it, including the
   leukocyte transendothelial migration and chemokine signaling pathway.
   The subtype-specific enriched pathways are shown in [112]Table S10.

Prognosis-related gene pairs

   A total of 5,652 gene pairs significantly related to the survival and
   prognosis of breast cancer were found by the univariate Cox
   proportional hazards model. In addition, 272 gene pairs were further
   identified by Lasso regression (see [113]Fig. S3). A multivariate Cox
   proportional hazards regression model with these gene pairs as
   independent variables was constructed as follows: Score = 206.3 ∗ (ENO,
   PGK2) + 35.9 ∗ (EN0, PKLR) + 4.1 ∗ (EBP, HSD17B7) + 5.5 ∗ (CYP1B,
   HSD17B1) − 3.4 ∗ (NDUFB2, NDUFB4) − 0.6 ∗ (ATP6V1A, ATP6V1B1) + ….

   The risk scores of the 1,093 breast cancer patients in TCGA were
   calculated by this model. The median of the risk scores divided all
   patients into two groups. The corresponding Kaplan–Meier survival curve
   is shown in [114]Fig. 4A. Of note, survival analysis indicates that
   overall survival probability of patients with high risk scores is
   significantly lower than that with low risk scores (P-value < 0.0001).

Figure 4. Kaplan–Meier survival analysis.

   [115]Figure 4
   [116]Open in a new tab

   (A) Kaplan–Meier survival plots for two different groups of breast
   cancer patients in TCGA. The X axis is survival days. The Y axis is
   overall survival rate. (B) Kaplan–Meier survival plots for two
   different groups of breast tumors in the independent validation data
   set. The X axis is relapse free survival time (days). The Y axis is
   relapse free survival rate.

   In addition, the risk scores of the 234 breast tumors in the validation
   data set were also calculated by the above model with 264 gene pairs (8
   gene pairs were omitted since these genes were not included in the
   expression profile of the validation data). In the same way, there are
   two groups with different scores. The relapse free survival
   probabilities of the two groups are significantly different (P-value <
   0.0001), and the relapse free survival status of tumors in the low
   score group are all “alive” (see [117]Fig. 4B). This result indicates
   that the prognosis model based on gene pairs can well predict the
   survival time of breast tumors in the independent validation data set.

Discussion

   At present, research on cancer pathology is limited to gene expression
   and mutation information. However, the model of one gene to one disease
   is no longer suitable for the study of complex diseases. In fact, genes
   do not exist in isolation but participate in some complex biological
   networks, such as gene–gene interaction networks. Gene mutations or
   surroundings changes often affect the balance of gene interaction
   networks and the perturbation of the networks then affect the onset and
   development of complex diseases. Studies have shown that some genetic
   elements of breast cancer are related to nearby gene expression, such
   as some repetitive DNA in ER+/HER2-breast cancer and transposable
   elements ([118]Yandım & Karakülah, 2019; [119]Karakülah et al., 2019).
   Therefore, network analysis can provide a more comprehensive and
   systematic point of view, to better understand the human disease onset
   and development mechanism.

   Based on personalized medicine, Precision Medicine is a new medical
   concept and medical model, which needs to grasp the specific
   characteristic of different cancer samples accurately. The analysis of
   the biological network disturbance for each cancer patient conforms to
   the concept of precision medicine. In addition, the personalized
   medical treatment of breast cancer is in a relatively slow development
   stage.

   In this paper, sample-specific networks of breast cancer samples were
   established to explore the gene–gene interaction networks related to
   the TNM stages and PAM50 subtypes of breast cancer. Then, the pathways
   related to breast cancer were identified by hypergeometric test.
   Through the same method, we also obtained the stage-related pathways
   and subtype-related pathways. Finally, the edge biomarkers (gene pairs)
   that are closely related to the prognosis of breast cancer were
   determined by using the LASSO regression model, and then a more stable
   prognostic analysis model was established by using these biomarkers.
   Our results indicate that the prognosis model has the robust and strong
   generalization capability, and it can be used in different gene
   expression data sets.

   Many studies have shown that network-based methods are more robust and
   effective than single-gene-based methods, such as SWIM and WGCNA
   ([120]Paci et al., 2017; [121]Langfelder & Horvath, 2008). SWIM is a
   tool able to extract from complex correlation networks the so-called
   “switch genes” that could be associated to the transition from
   physiological to a pathological condition. The WGCNA method plans to
   exploit the correlation patterns among genes. The advantages of
   network-based methods have been well documented and accepted in the
   analysis of noisy high-throughput data. Different from the usual
   network-based method, we made better use of a prior background network
   to explore the sample-specific networks. And the sample-specific
   networks are actually networks with significant perturbation edges of
   gene co-expression in our study, which is really very different from
   WGCNA. This study helps us to better understand the heterogeneity and
   mechanism of breast cancer from an individual-level perspective.
   Precision medicine advocates the development of individualized
   treatment according to the unique features of patients. Therefore,
   identifying the unique pathogeny embedded in each patient is important
   to develop a treatment strategy for each patient. Our sample-specific
   network analysis of breast cancer will promote the development of
   precision medicine.

Conclusions

   In this article, the sample-specific network of each breast cancer
   sample was constructed based on network analysis, and further breast
   cancer (subtype/stage)-related gene–gene interaction networks were
   identified. The edge biomarkers (gene pairs) related to the prognosis
   of breast cancer were also identified and a risk prediction model was
   established based on these edge biomarkers finally.

   This study develops an individualized network analysis for each patient
   which would promote a new train of thought and method for the precision
   medicine. This whole process of sample-specific network analysis using
   co-expression can also be used to analyze other cancers. However, the
   co-expression perturbation which used to construct sample-specific
   network, does not roundly measure the changes of gene interactions. So,
   we will consider further designing a method which can characterize the
   perturbation of gene interactions comprehensively. In addition, how to
   obtain subtype-specific networks (or stage-specific networks) from
   sample-specific networks based on network structure is still a problem
   worth considering.

Supplemental Information

   Supplemental Information 1. Gene-gene interaction networks related to
   breast cancer TNM stages. Nodes in these networks stand for genes, and
   the size of the nodes corresponds to the degree of the genes in the
   network. The purple nodes represent the genes with degrees ≥ 15, and
   the b.

   <!--[if !supportLists]-->(A-<!--[endif]-->D) Gene-gene interaction
   networks associated with Stage I, II, III, and IV respectively. (E-H)
   The bar charts of top 10 genes with the highest degrees in gene–gene
   interaction networks related to Stage I, II, III, and IV. The Y axis is
   gene, and the X axis is the gene degree.
   [122]Click here for additional data file.^ (9.2MB, png)
   DOI: 10.7717/peerj.9161/supp-1
   Supplemental Information 2. Gene-gene interaction networks related to
   breast cancer PAM50 subtypes. Nodes in these networks stand for genes,
   and the size of the nodes corresponds to the degree of the genes in the
   network. The purple nodes represent the genes with degrees ≥ 15, and t.

   <!--[if !supportLists]-->(A–D) Gene-gene interaction networks
   associated with LumA, LumB, Her2, and Basal-like subtypes respectively.
   (E–H) The bar charts of top 10 genes with the highest degrees in
   gene–gene interaction networks related to LumA, LumB, Her2, and
   Basal-like subtypes. The Y axis is gene, and the X axis is the gene
   degree.
   [123]Click here for additional data file.^ (8.7MB, png)
   DOI: 10.7717/peerj.9161/supp-2
   Supplemental Information 3. Establishment of the LASSO regression
   model.

   (A) Four-fold cross-validation for tuning parameter (λ) selection in
   the LASSO model. (B) LASSO coefficient profiles of 272 gene
   interactions.
   [124]Click here for additional data file.^ (186.9KB, png)
   DOI: 10.7717/peerj.9161/supp-3
   Supplemental Information 4. Breast cancer related key genes and
   pathways.

   Table S1: Key genes of BRCA (the genes with degrees > 5 in the breast
   cancer-related gene-gene interaction network).

   Table S2: Pathways enriched in BRCA (the pathways with P-value <0.05 by
   the enrichment analysis in all breast cancer samples).
   [125]Click here for additional data file.^ (15.1KB, xlsx)
   DOI: 10.7717/peerj.9161/supp-4
   Supplemental Information 5. Key genes and enriched pathways related to
   the four TNM stages of breast cancer.

   Table S3: Key genes shared by four stages.

   Table S4: Stage-specific key genes.

   Table S5: Enriched pathways shared by four stages.

   Table S6: Stage-specific enriched pathways.
   [126]Click here for additional data file.^ (14.6KB, xlsx)
   DOI: 10.7717/peerj.9161/supp-5
   Supplemental Information 6. Key genes and enriched pathways related to
   the PAM50 subtypes of breast cancer.

   Table S7: Key genes shared by four subtypes.

   Table S8: Subtype-specific key genes.

   Table S9: Enriched pathways shared by four subtypes.

   Table S10: Subtype-specific enriched pathways.
   [127]Click here for additional data file.^ (13.5KB, xlsx)
   DOI: 10.7717/peerj.9161/supp-6

Acknowledgments