Abstract
Breast cancer is a disease with high heterogeneity. Cancer is not
usually caused by a single gene, but by multiple genes and their
interactions with others and surroundings. Estimating breast
cancer-specific gene–gene interaction networks is critical to elucidate
the mechanisms of breast cancer from a biological network perspective.
In this study, sample-specific gene–gene interaction networks of breast
cancer samples were established by using a sample-specific network
analysis method based on gene expression profiles. Then, gene–gene
interaction networks and pathways related to breast cancer and its
subtypes and stages were further identified. The similarity and
difference among these subtype-related (and stage-related) networks and
pathways were studied, which showed highly specific for subtype
Basal-like and Stages IV and V. Finally, gene pairwise interactions
associated with breast cancer prognosis were identified by a Cox
proportional hazards regression model, and a risk prediction model
based on the gene pairs was established, which also performed very well
on an independent validation data set. This work will help us to better
understand the mechanism underlying the occurrence of breast cancer
from the sample-specific network perspective.
Keywords: Breast cancer, Gene interaction network,
Stages/subtypes-related networks, Cox proportional hazards regression
model, Sample-specific networks
Introduction
According to the latest data from the survey of the International
Agency for Research on Cancer (IARC) in 2018, the incidence of breast
cancer is 24.2% among women worldwide, ranking first in female cancers
([32]Bray et al., 2018). At present, the incidence of breast cancer is
the highest, and its mortality ranks fourth in China. Breast cancer has
strong heterogeneity. Based on the TNM staging system, breast cancer
can be divided into Stages I, II, III, IV and V. There are many
clinical types of breast cancer according to pathological
classification and molecular classification. The pathological
classification generally divides breast cancer into invasive and
non-invasive breast cancer. The gold standard for the molecular typing
of breast cancer is PAM50 molecular typing based on the expression
profile of 50 genes, which classifies breast cancer into the
Normal-like, LuminalA, LuminalB, Basal-like, and Her2 subtypes
([33]Perou et al., 2000).
The molecular typing of breast cancer has important reference value for
clinical treatment of breast cancer. However, molecular typing requires
transcriptome sequencing which is difficult to promote clinically.
Currently, the diagnosis of breast cancer classification is mainly
through immunohistochemistry (IHC), namely, diagnosis by the expression
of four markers, ER (oestrogen receptor), PR (progestin receptor), HER2
gene (human epidermal growth factor receptor 2) and Ki-67 protein
(proliferating cell nuclear antigen). ER and PR are important
indicators for endocrine therapy and prognosis evaluation in breast
cancer. Studies have shown that their expressions are positively
correlated with total survival, treatment failure time, endocrine
therapy response time, and recurrence time ([34]Hammond et al., 2010;
[35]Fitzgibbons et al., 2010). In 2009, Cheang used GEP (gene
expression analysis) to determine 14% as the threshold of Ki-67, which
could be used to divide patients into two groups with good and bad
prognoses ([36]Cheang et al., 2009). In 2011, the St. Gallen
International Expert Consensus agreed to include Ki-67 as an important
standard for molecular typing, which is the key to distinguishing the
Luminal A and Luminal B subtypes ([37]Goldhirsch et al., 2011). In the
growth and metastasis of breast cancer, HER2 is one of the most
important factors, and its status can be used to predict the effect of
drug treatment for breast cancer. Early detection and diagnosis and
timely treatment are of great significance to improve the survival rate
of breast cancer patients.
The etiology of breast cancer is still not clear, and there are many
related factors, such as individual differences and a lack of effective
treatments. With the development of biomedicine, personalized medicine
is becoming the direction of breast cancer treatment in the future. At
present, the medical plan can only be formulated through the study of
single gene expression and mutation information. However, this
information cannot fully reflect the personalized interaction and
regulation among genes. Because onset and progression of cancer are
often caused by the disruption of important biological networks such as
cell cycle and apoptosis, but not a single gene. Indeed, there is a new
and cutting-edge field of medical research, called network medicine,
whose basic idea is that human diseases are rarely caused by single
molecular determinant, but more likely influenced by a network of
interacting molecular determinants with the propensity to cluster
together in the human interactome ([38]Barabási, Gulbahce & Loscalzo,
2011; [39]Conte et al., 2019; [40]Fiscon et al., 2018). Gene–gene
interaction networks can reveal the interaction relations and
regulatory mechanisms among genes, and they have the irreplaceable
function of the single-gene monitoring of information (such as
expression and mutation) in many aspects ([41]Liu et al., 2016).
Therefore, the mechanism of the occurrence and development of breast
cancer can be explored through changes in the interactions between
genes. In this article, we constructed sample-specific networks of
breast cancer samples by calculating the correlation coefficient of
protein-coding gene pairs to explore the gene–gene interaction networks
related to breast cancer stages and subtypes (see [42]Fig. 1).
Figure 1. An integrative framework identifying breast cancer-related
gene–gene interaction networks.
[43]Figure 1
[44]Open in a new tab
(A) Construction of sample-specific networks based on gene expression
data. A reference network can be established based on the expression
profiles of n reference samples by calculating the correlation
coefficients PCC[n] of gene pairs. Then, adding a new cancer sample
sd[x] into the reference samples, a perturbed network is established by
calculating the new correlation PCC[n +] [1] of the n + 1 samples.
Because of sample sd[x], the perturbed network is different from the
reference network, and the difference ∆PCC[n] (PCC[n + 1] − PCC[n]) of
each edge in the background network constitutes the differential
network. Then, the significance of each edge can be quantified by a
statistical Z-test. The sample-specific network for sample sd[x] is
composed by those edges with significant ∆PCC[n]. (B) The framework to
identify the breast cancer-related gene–gene interaction network based
on gene expression. Using the sample-specific network analysis method,
m cancer sample-specific networks were constructed. Then, these
constructed sample-specific networks were analyzed to identify breast
cancer-related networks, stage-related networks and subtype-related
networks, as well as gene-interaction biomarkers associated with the
prognosis of breast cancer. Moreover, pathway enrichment analysis based
on KEGG pathways and survival analysis based on the LASSO regression
model were performed.
The survival time of different patients with breast cancer is
significantly different. At present, the 5-year survival rate of breast
cancer patients in China has reached 83.2%. However, the 5-year
survival rate of advanced cancer patients and Basal-like cancer
patients are significantly lower, so it is necessary to study the
biomarkers that affect the prognosis of breast cancer. In 2009,
[45]Parker et al. (2009) established a single-gene level survival
analysis model to improve the prognosis of breast cancer and predict
the efficacy of chemotherapy. However, the robustness of the gene-based
model is not very high. Thus, this paper aims to establish a more
stable prognostic analysis model of breast cancer patients through
gene–gene interactions. We used the differential correlation
coefficients to model the prognosis of breast cancer. Lasso regression
is suitable for data analysis and model construction with many
independent variables but a limited sample size ([46]Zhang et al.,
2015). In this study, we used a Lasso regression model to effectively
reduce the dimensionality of large gene pairs and then identified the
gene interactions related to the prognosis of breast cancer. Finally, a
multivariate Cox proportional hazards regression analysis based on the
gene interactions was carried out to predict the survival of patients
with breast cancer (see [47]Fig. 1B). A prognosis model was established
and it also performed very well on an independent validation data.
Materials and Methods
Datasets
In this article, the RNA sequencing (RNA-seq) data of 290 normal breast
tissues was downloaded from the GTEx database
([48]https://gtexportal.org/home/), and the RNA-seq data of 1,093
breast cancer samples was downloaded from the TCGA database
([49]https://portal.gdc.cancer.gov/). A human protein–protein
interaction network was from the STRING database version 11.0
([50]https://string-db.org/), and gene sets of all available186 KEGG
pathways were downloaded from the GSEA/MSigDB database
([51]http://software.broadinstitute.org/gsea/msigdb). In addition, the
clinical information of the breast cancer patients was downloaded from
the TCGA database, including TNM stage, prognosis survival time and
other information. The 290 normal breast tissues were used as reference
samples. The gene expression data sets of normal and cancer samples
were both converted to the TPM form and contain 18,006 genes in total.
The independent validation data of the prognosis model was from the
[52]GSE3494 set in GEO Datasets, which contains 251 expression profiles
of breast tumors by array.
Construction of sample-specific networks
In this study, gene–gene interactions with high confidence
(comprehensive score >0.9) were selected from the STRING database,
which include regulatory, physical and co-expression protein–protein
interaction networks. Furthermore, the above gene–gene interactions
with both genes in one of the 186 KEGG pathways were used as the
background network (or template network), which contained 3,257 genes
in total. The sample-specific network method aimed to calculate the
difference of the gene co-expression when the single cancer sample was
added to a bunch of normal samples. In short, the sample-specific
networks to be constructed are actually networks with significant
perturbation edges of gene co-expression.
In the following analysis, sample-specific networks for breast cancer
samples were constructed based on gene expression profiles by using the
method introduced in reference ([53]Liu et al., 2016) (see [54]Fig.
1A). First, using the gene expression data of n reference samples,
namely all the normal breast tissues data, the reference network can be
constructed by calculating the correlation coefficient PCC[n] (the
Pearson correlation coefficient (PCC)) of the gene pairs connected in
the background network. The weights of the edges in the reference
network are the PCC of the corresponding gene pairs. Then, the
expression data of a single breast cancer sample was added to the
reference samples, and the perturbed network of the single sample was
constructed by calculating the new correlation coefficient PCC[n + 1]
of the gene pairs in the background network. For the single breast
cancer sample, the differential correlation coefficients of each edge
between the perturbed network and the reference network were calculated
as:
[MATH: ΔPCCn=PCCn+1<
/mrow>−PCCn :MATH]
, which called differential network for the sample. In reference
[55]Liu et al. (2016) have proved that
[MATH: ΔPCCn :MATH]
follows a normal distribution with a mean value of 0 and a variance of
[MATH: 1−<
mi mathvariant="normal">PCCn2
n−1
:MATH]
when n is large enough. The significance level of each
[MATH: ΔPCC :MATH]
was determined by the Z-test. The statistical Z-value is calculated as
follows with the null hypothesis that
[MATH: ΔPCCn :MATH]
is equal to 0:
[MATH: Z=ΔPCCn<
mo>(1−PCCn2
)/(n−1) :MATH]
Then, we can obtain the P-value for each gene pair from the Z-value.
Gene pairs (or edges) were considered statistically significant if
their P-values < 0.01. All significant edges constitute the
sample-specific network. Thus, adding the expression data of 1,093
breast cancer samples to the reference samples one at a time, we
finally constructed 1,093 sample-specific networks.
Identification of stage/subtype-related gene–gene interaction networks
Only the gene pairs that are perturbed significantly in the most breast
cancer samples are considered to be related to breast cancer. Then, the
edges that are perturbed significantly in more than 90% of the samples
by the binomial right-sided test (P-value < 0.05) constitute a
gene–gene interaction network related to breast cancer. Specifically,
we firstly divided the above breast cancer samples into different
stages or subtypes based on TNM staging and PAM50 subtype system,
secondly selected the perturbed significantly edges in more than 90%
samples of different stages or subtypes, and then the stage/subtype
gene–gene interaction networks were constructed.
A slight change in the expression of high-degree genes in the network
may cause disturbances of the entire network. Thus, these genes with
high degree are considered to be the key genes for the onset and
development of breast cancer. We selected genes with degrees >5 in the
identified breast cancer-related network for the subsequent enrichment
analysis. Furthermore, we also identified the key genes related to each
TNM stage and PAM50 subtype with the same method. Here, because of the
small number of Stage V samples, Stage V was combined with Stage IV.
Pathway enrichment analysis
For the pathway enrichment analysis, we used the hypergeometric test as
follows:
[MATH:
p(m,M,N,n)=1<
/mn>−∑i=0
m−1(Mi)(N−Mn−i)(<
mrow>Nn) :MATH]
where N is the total number of genes in the background network, M
represents the number of key genes related to breast cancer (or a stage
or subtype of breast cancer), n accounts for the number of genes in a
pathway, and m represents the number of genes that both in the pathway
and in key genes related to breast cancer (or a stage or subtype of
breast cancer). Then, the pathway with P-value < 0.05 was considered as
significantly enriched in the breast cancer (or a stage or subtype of
breast cancer) samples. Otherwise, we regarded that the pathway is not
enriched in the corresponding group.
Survival analysis by the Cox regression model
Different from the usual survival analysis based on gene expression,
the perturbation of gene co-expression
[MATH: ΔPCC :MATH]
(i.e., gene pairs or edges) was used to survival analysis. According to
the clinical data of patients with breast cancer, we utilized the
“survival” package and “survminer” package in R/Bioconductor to
establish a univariate Cox proportional hazards regression model by
setting patients’ survival conditions (survival time and survival
status) as the dependent variables and the
[MATH: ΔPCC :MATH]
of gene pairs in the differential network for each breast cancer
samples as the covariates. Gene pairs with P-values < 0.05 were
considered to be related to the prognosis of breast cancer ([56]Cheng,
2018).
A large number of covariates may cause overfitting in establishing a
multivariate Cox proportional hazards regression model; thus, using the
least absolute shrinkage and selection operator (LASSO), we further
selected the key gene pairs from these significant ones obtained by the
univariate Cox proportional hazards analysis. LASSO is a common method
used in high-dimensional data regression, which can select
prognosis-related gene pairs of breast cancer by shrinking regression
coefficients. The tuning parameter (λ) with the smallest mean-square
error was selected by four-fold cross-validation to establish an
optimal LASSO regression model. Then, the coefficients of most gene
pairs reduced to zero, and a small number of gene pairs with nonzero
coefficients were considered to be closely correlated with the
prognosis of breast cancer.
LASSO Cox analysis was performed by using the “glmnet” package in R.
Then, the risk score for each sample was calculated by the LASSO Cox
regression model. According to the median risk score, breast cancer
patients were divided into two groups (a high-risk group and a low-risk
group). In addition, 234 breast tumors with relapse free survival
information in the validation data set were analyzed by using the above
sample-specific network method, and risk scores were calculated by the
Cox regression model based on 1,093 samples in TCGA. Then the
validation samples were also divided into two groups in the same way.
Finally, the corresponding Kaplan–Meier survival curves were plotted by
using the packages “survminer” and “survival” in R.
Results
Breast cancer-related gene–gene interaction networks
The background network consisted of 46,916 edges and 3,237 genes. In
addition, 2,190 gene pairs were identified as significantly related to
breast cancer, which constituted the gene–gene interaction network
related to breast cancer (including 915 genes in total). We use the
Cytoscape software to visualize the breast cancer-related network (see
[57]Fig. 2).
Figure 2. Gene–gene interaction networks related to breast cancer.
[58]Figure 2
[59]Open in a new tab
Nodes in these networks stand for genes, and the size of the nodes
corresponds to the degree of the genes in the network. The purple nodes
represent the genes with degrees ≥15, and the blue ones are the genes
with degrees <15.
Genes with degrees >5 in the breast cancer-related gene–gene
interaction network (198 in total which are shown in [60]Table S1).
Among them, some genes with higher degrees (>20) have been shown to be
related to breast cancer. For example, CCNB1 has strong power to
predict the survival of breast cancer patients with the phenotype of ER
positive ([61]Ding et al., 2014). The overexpression of GRB2 has been
demonstrated to be significantly associated with the occurrence and
poor prognosis of breast cancer ([62]Zhang et al., 2016). PCNA has been
proven to be a marker of proliferation in the diagnosis of breast
cancer ([63]Juríková et al., 2016), SF3B4 has been shown to be a tumor
suppressor, and somatic inactivating mutations occasionally occur in
breast cancer ([64]Denu & Burkard, 2017). UBE2C may promote the
development of breast cancer ([65]Mo et al., 2017). High Cdc20 and
securin immune expression are associated with extremely poor outcomes
in breast cancer patients ([66]Karra et al., 2014), and overexpression
of RPL17 affects breast cancer-associated brain metastases ([67]Yuan,
Wang & Cheng, 2018). MAD2L1 may have great effect on breast cancer
progression, and its expression might help to predict breast cancer
prognosis ([68]Wang et al., 2015). The high expression of TRA2B is
closely related to the cancer cell survival and therapeutic sensitivity
of breast cancer ([69]Best et al., 2013). GTF2H4 has been identified to
be related to the survival risk of breast cancer ([70]Ge et al., 2019).
Stage-related gene–gene interaction networks
The results of the four stage-related gene–gene interaction networks
are shown in [71]Figs. S1A–[72]S1D. And the top 10 genes with the
highest degrees in these four networks are displayed in [73]Figs.
S1E–[74]S1H. There are obvious similarities and differences among the
four stage-related gene interaction networks. There are 81 key genes
shared by all stages (see [75]Table S3), among which RPL17, CCNB1, and
SF3B4 are genes that are highly (with degrees >25) related to breast
cancer. Stage I has 5 specific genes: PSMC5, SDHB, RPL11, SDHA, and
RPL13. Stage II has 3 specific genes: STX6, CCNA2, and CDC25C. Stage
III has 4 specific genes: NDUFA6, EPN1, SF3A3, and LSM7. Stage IV has
the largest number of specific genes, with a total of 38, among which
CDC42, LSM2, NDUFS6, and CDC25A are strongly associated with it. And
these stage-specific key genes are shown in [76]Table S4.
Subtype-related gene–gene interaction networks
The results of the four subtype-related gene interaction networks are
shown in [77]Figs. S2A–[78]S2D. And the top 10 genes with the highest
degrees in these four networks are displayed in [79]Figs. S2E–[80]S2H.
The four subtype-related networks share similar and different
characteristics. There are 34 key genes shared by the four subtypes
(see [81]Table S7). Among them, RPL17 and CCNB1 have higher degrees.
The Luminal A subtype has 11 specific genes, including RPL23A, RPL10,
and PRPF6, which are greatly related to it with higher degrees. The
Luminal B subtype has 17 specific genes, including COX6C, EGFR, and
CLTC, which are related to it with higher degrees. The Her2 subtype has
3 specific genes, NDUFA6, CCR8, and CASP3. The Basal-like subtype has
the largest number of specific genes, 17 in total, including LSM2,
DDX5, SF3A3, and MAGOH, with higher degrees. And these subtype-specific
key genes are shown in [82]Table S8.
Pathways enriched in breast cancer patients
There were 41 pathways (see [83]Table S2) enriched in the breast cancer
samples according to the pathway enrichment analysis, including some
immune-related pathways, such as the Toll-like receptor signaling
pathway, antigen processing and presentation, complement and
coagulation cascades, the RIG-I-like receptor signaling pathway, and
the cytosolic DNA-sensing pathway. Some important signal transduction
and signal molecular interaction pathways were also included, such as
the MAPK signaling pathway, Wnt signaling pathway, cytokine-cytokine
receptor interaction, and ECM–receptor interaction pathways. Breast
cancer is closely related to endocrine disorders ([84]Sakoda et al.,
2008), two endocrine-related pathways, adipocytokine signaling pathway,
and PPAR signaling pathway, have also been identified as being related
to breast cancer. In addition, some metabolic pathways, especially
lipid metabolism pathways, have also been identified as being
associated with breast cancer ([85]Merdad et al., 2015), such as the
steroid hormone biosynthesis, arachidonic acid metabolism, arginine and
proline metabolism pathway, and glycerolipid metabolism. Additionally,
pathways in cancer was also enriched. The enrichment results are shown
in [86]Fig. 3A.
Figure 3. Pathways enriched in breast cancer, as well as different stages and
subtypes of it.
[87]Figure 3
[88]Open in a new tab
(A) KEGG pathways enriched in breast cancer samples, ranked by
−log10(p). (B) Overlap and difference of the enriched pathways in the
four breast cancer stages. There are 11 commonly enriched pathways in
the four stages. The number of Stage IV-specific pathways was 18. (C)
Overlap and difference of the enriched pathways in the four PAM50
subtypes. There are four commonly enriched pathways in the four PAM50
subtypes. The number of Basal-like specific pathways is 9.
Most of these pathways have been documented to be related to breast
cancer. For example, the dysregulation of the steroid hormone
biosynthesis pathway may affect steroid hormone levels and may thus be
related to the susceptibility to breast cancer ([89]Sakoda et al.,
2008). The PPAR signaling pathway may play an important role in the
neoadjuvant chemotherapy response of breast cancer ([90]Chen et al.,
2012). Mounting preclinical evidence supports targeting the MAPK
signaling pathway in the triple negative breast cancer (TNBC)
([91]Giltnane & Balko, 2014). AMPK activators inhibit breast cancer
cell proliferation by inhibiting DVL3-promoted Wnt/β-catenin signaling
pathway activity ([92]Zou et al., 2017). Toll-like receptors may play
dual roles in human cancers ([93]Khademalhosseini & Arababadi, 2019).
The co-activation of the Hedgehog and Wnt signaling pathways is a poor
prognostic marker in TNBC ([94]Bhateja et al., 2019). Prl-3 is closely
related to cell migration and invasion in TNBC ([95]Gari et al., 2016).
The YHD inhibition of 4T1 breast tumor growth may be related to the
negative regulation of the JAK/STAT3 pathway by repressing the
expression of IL-6 and TGF-β ([96]Mao, Feng & Gong, 2018).
Stage-related pathways
The overlapping of pathways enriched in the four TNM stages are shown
in [97]Fig. 3B. The proportion of enriched pathways shared by the four
stages (see [98]Table S5) is relatively high, including the Wnt
signaling pathway, MAPK signaling pathway, regulation of actin
cytoskeleton, calcium signaling pathway, pathways in cancer, and cell
adhesion molecules, which have been shown to have a high correlation
with breast cancer ([99]Giltnane & Balko, 2014; [100]Zou et al., 2017;
[101]Kazazian et al., 2017; [102]Woltmann et al., 2014; [103]Saadatmand
et al., 2013). The pathways enriched in different stages are slightly
different, especially Stage IV of breast cancer, which has 18 specific
enriched pathways, among which the PPAR signaling pathway, ECM-receptor
interaction, tight junction, TGF-beta signaling pathway, NOD-like
receptor signaling pathway, and other signaling pathways are mostly
related to the metastases of breast cancer ([104]Chen et al., 2012;
[105]Bao et al., 2019; [106]Yang et al., 2019; [107]Tang et al., 2017;
[108]Peng et al., 2016).
As we expected, Stage IV was specifically enriched the most pathways
(18 in total, see [109]Table S6) different from other stages. This
result is probably because Stage IV breast cancer patients are the most
serious, and their cancer cells are likely to have deteriorated and
metastasized. Therefore, the disruption of the biological system
balance of breast cancer patients at this stage is larger than that of
other stages. Thus, the specific enriched pathways of Stage IV are
correspondingly more.
Subtype-related pathways
The overlap and difference of the enriched pathways in the four PAM50
subtypes are shown in [110]Fig. 3C. There are slight differences in the
subtype-related pathways. There are 4 enriched pathways shared by the
four subtypes (see [111]Table S9) including the cytokine–cytokine
receptor interaction. As a special subtype of breast cancer, the
Basal-like subtype (or TNBC) is characterized by high histological
differentiation, a high risk of metastasis, a high recurrence rate, and
a low survival rate. Probably due to the higher risk of Basal-like
subtype, there are 9 specific pathways enriched in it, including the
leukocyte transendothelial migration and chemokine signaling pathway.
The subtype-specific enriched pathways are shown in [112]Table S10.
Prognosis-related gene pairs
A total of 5,652 gene pairs significantly related to the survival and
prognosis of breast cancer were found by the univariate Cox
proportional hazards model. In addition, 272 gene pairs were further
identified by Lasso regression (see [113]Fig. S3). A multivariate Cox
proportional hazards regression model with these gene pairs as
independent variables was constructed as follows: Score = 206.3 ∗ (ENO,
PGK2) + 35.9 ∗ (EN0, PKLR) + 4.1 ∗ (EBP, HSD17B7) + 5.5 ∗ (CYP1B,
HSD17B1) − 3.4 ∗ (NDUFB2, NDUFB4) − 0.6 ∗ (ATP6V1A, ATP6V1B1) + ….
The risk scores of the 1,093 breast cancer patients in TCGA were
calculated by this model. The median of the risk scores divided all
patients into two groups. The corresponding Kaplan–Meier survival curve
is shown in [114]Fig. 4A. Of note, survival analysis indicates that
overall survival probability of patients with high risk scores is
significantly lower than that with low risk scores (P-value < 0.0001).
Figure 4. Kaplan–Meier survival analysis.
[115]Figure 4
[116]Open in a new tab
(A) Kaplan–Meier survival plots for two different groups of breast
cancer patients in TCGA. The X axis is survival days. The Y axis is
overall survival rate. (B) Kaplan–Meier survival plots for two
different groups of breast tumors in the independent validation data
set. The X axis is relapse free survival time (days). The Y axis is
relapse free survival rate.
In addition, the risk scores of the 234 breast tumors in the validation
data set were also calculated by the above model with 264 gene pairs (8
gene pairs were omitted since these genes were not included in the
expression profile of the validation data). In the same way, there are
two groups with different scores. The relapse free survival
probabilities of the two groups are significantly different (P-value <
0.0001), and the relapse free survival status of tumors in the low
score group are all “alive” (see [117]Fig. 4B). This result indicates
that the prognosis model based on gene pairs can well predict the
survival time of breast tumors in the independent validation data set.
Discussion
At present, research on cancer pathology is limited to gene expression
and mutation information. However, the model of one gene to one disease
is no longer suitable for the study of complex diseases. In fact, genes
do not exist in isolation but participate in some complex biological
networks, such as gene–gene interaction networks. Gene mutations or
surroundings changes often affect the balance of gene interaction
networks and the perturbation of the networks then affect the onset and
development of complex diseases. Studies have shown that some genetic
elements of breast cancer are related to nearby gene expression, such
as some repetitive DNA in ER+/HER2-breast cancer and transposable
elements ([118]Yandım & Karakülah, 2019; [119]Karakülah et al., 2019).
Therefore, network analysis can provide a more comprehensive and
systematic point of view, to better understand the human disease onset
and development mechanism.
Based on personalized medicine, Precision Medicine is a new medical
concept and medical model, which needs to grasp the specific
characteristic of different cancer samples accurately. The analysis of
the biological network disturbance for each cancer patient conforms to
the concept of precision medicine. In addition, the personalized
medical treatment of breast cancer is in a relatively slow development
stage.
In this paper, sample-specific networks of breast cancer samples were
established to explore the gene–gene interaction networks related to
the TNM stages and PAM50 subtypes of breast cancer. Then, the pathways
related to breast cancer were identified by hypergeometric test.
Through the same method, we also obtained the stage-related pathways
and subtype-related pathways. Finally, the edge biomarkers (gene pairs)
that are closely related to the prognosis of breast cancer were
determined by using the LASSO regression model, and then a more stable
prognostic analysis model was established by using these biomarkers.
Our results indicate that the prognosis model has the robust and strong
generalization capability, and it can be used in different gene
expression data sets.
Many studies have shown that network-based methods are more robust and
effective than single-gene-based methods, such as SWIM and WGCNA
([120]Paci et al., 2017; [121]Langfelder & Horvath, 2008). SWIM is a
tool able to extract from complex correlation networks the so-called
“switch genes” that could be associated to the transition from
physiological to a pathological condition. The WGCNA method plans to
exploit the correlation patterns among genes. The advantages of
network-based methods have been well documented and accepted in the
analysis of noisy high-throughput data. Different from the usual
network-based method, we made better use of a prior background network
to explore the sample-specific networks. And the sample-specific
networks are actually networks with significant perturbation edges of
gene co-expression in our study, which is really very different from
WGCNA. This study helps us to better understand the heterogeneity and
mechanism of breast cancer from an individual-level perspective.
Precision medicine advocates the development of individualized
treatment according to the unique features of patients. Therefore,
identifying the unique pathogeny embedded in each patient is important
to develop a treatment strategy for each patient. Our sample-specific
network analysis of breast cancer will promote the development of
precision medicine.
Conclusions
In this article, the sample-specific network of each breast cancer
sample was constructed based on network analysis, and further breast
cancer (subtype/stage)-related gene–gene interaction networks were
identified. The edge biomarkers (gene pairs) related to the prognosis
of breast cancer were also identified and a risk prediction model was
established based on these edge biomarkers finally.
This study develops an individualized network analysis for each patient
which would promote a new train of thought and method for the precision
medicine. This whole process of sample-specific network analysis using
co-expression can also be used to analyze other cancers. However, the
co-expression perturbation which used to construct sample-specific
network, does not roundly measure the changes of gene interactions. So,
we will consider further designing a method which can characterize the
perturbation of gene interactions comprehensively. In addition, how to
obtain subtype-specific networks (or stage-specific networks) from
sample-specific networks based on network structure is still a problem
worth considering.
Supplemental Information
Supplemental Information 1. Gene-gene interaction networks related to
breast cancer TNM stages. Nodes in these networks stand for genes, and
the size of the nodes corresponds to the degree of the genes in the
network. The purple nodes represent the genes with degrees ≥ 15, and
the b.
(A-D) Gene-gene interaction
networks associated with Stage I, II, III, and IV respectively. (E-H)
The bar charts of top 10 genes with the highest degrees in gene–gene
interaction networks related to Stage I, II, III, and IV. The Y axis is
gene, and the X axis is the gene degree.
[122]Click here for additional data file.^ (9.2MB, png)
DOI: 10.7717/peerj.9161/supp-1
Supplemental Information 2. Gene-gene interaction networks related to
breast cancer PAM50 subtypes. Nodes in these networks stand for genes,
and the size of the nodes corresponds to the degree of the genes in the
network. The purple nodes represent the genes with degrees ≥ 15, and t.
(A–D) Gene-gene interaction networks
associated with LumA, LumB, Her2, and Basal-like subtypes respectively.
(E–H) The bar charts of top 10 genes with the highest degrees in
gene–gene interaction networks related to LumA, LumB, Her2, and
Basal-like subtypes. The Y axis is gene, and the X axis is the gene
degree.
[123]Click here for additional data file.^ (8.7MB, png)
DOI: 10.7717/peerj.9161/supp-2
Supplemental Information 3. Establishment of the LASSO regression
model.
(A) Four-fold cross-validation for tuning parameter (λ) selection in
the LASSO model. (B) LASSO coefficient profiles of 272 gene
interactions.
[124]Click here for additional data file.^ (186.9KB, png)
DOI: 10.7717/peerj.9161/supp-3
Supplemental Information 4. Breast cancer related key genes and
pathways.
Table S1: Key genes of BRCA (the genes with degrees > 5 in the breast
cancer-related gene-gene interaction network).
Table S2: Pathways enriched in BRCA (the pathways with P-value <0.05 by
the enrichment analysis in all breast cancer samples).
[125]Click here for additional data file.^ (15.1KB, xlsx)
DOI: 10.7717/peerj.9161/supp-4
Supplemental Information 5. Key genes and enriched pathways related to
the four TNM stages of breast cancer.
Table S3: Key genes shared by four stages.
Table S4: Stage-specific key genes.
Table S5: Enriched pathways shared by four stages.
Table S6: Stage-specific enriched pathways.
[126]Click here for additional data file.^ (14.6KB, xlsx)
DOI: 10.7717/peerj.9161/supp-5
Supplemental Information 6. Key genes and enriched pathways related to
the PAM50 subtypes of breast cancer.
Table S7: Key genes shared by four subtypes.
Table S8: Subtype-specific key genes.
Table S9: Enriched pathways shared by four subtypes.
Table S10: Subtype-specific enriched pathways.
[127]Click here for additional data file.^ (13.5KB, xlsx)
DOI: 10.7717/peerj.9161/supp-6
Acknowledgments