Abstract
Improved insight into the molecular mechanisms of triple negative
breast cancer (TNBC) is required to predict prognosis and develop a new
therapeutic strategy for targeted genes. The aim of this study is to
identify key genes which may affect the prognosis of TNBC patients by
bioinformatic analysis. In our study, the RNA sequencing (RNA-seq)
expression data of 116 breast cancer lacking ER, PR, and HER2
expression and 113 normal tissues were downloaded from The Cancer
Genome Atlas (TCGA). We screened out 147 differentially co-expressed
genes in TNBC compared to non-cancerous tissue samples by using
weighted gene co-expression network analysis (WGCNA) and differential
gene expression analysis. Then, Gene Ontology (GO) and Kyoto
Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses
were constructed, revealing that 147 genes were mainly enriched in
nuclear division, chromosomal region, ATPase activity, and cell cycle
signaling. After using Cytoscape software for protein-protein
interaction (PPI) network analysis and LASSO feature selection, a total
of fifteen key genes were identified. Among them, BUB1 and CENPF were
significantly correlated with the overall survival rate (OS) difference
of TNBC patients (p value < 0.05). In addition, BUB1, CCNA2, and PACC1
showed significant poor disease-free survival (DFS) in TNBC patients (p
value < 0.05), and may serve as candidate biomarkers in TNBC diagnosis.
Thus, our results collectively suggest that BUB1, CCNA2, and PACC1
genes could play important roles in the progression of TNBC and provide
attractive therapeutic targets.
Keywords: triple negative breast cancer, differentially co-expressed
genes, bioinformatics analysis, biomarkers
1. Introduction
Triple negative breast cancer (TNBC) is a subtype of breast cancer that
lacks expression of the estrogen receptor (ER), progesterone receptor
(PR), and human epidermal growth factor receptor 2 (HER2) following
immunohistochemical analysis [[28]1]. It is estimated that
approximately 200,000 cases are diagnosed with TNBC among women
worldwide, accounting for 15% of all breast cancers each year [[29]2].
Although there are many types of cancer treatments, including
surgeries, chemotherapies, and radiation therapies, approved for
patients with TNBC, it remains a significantly poorer prognosis with
lower survival rates than other types of breast cancer. Early detection
and treatments are crucial for improving cancer outcomes [[30]3,[31]4].
The rapid development of microarray and next-generation sequencing
technologies provides researchers with the ability to detect changes in
gene expression data of various cancer types [[32]5]. Weighted gene
co-expression network analysis (WGCNA), also known as weighted
correlation network analysis, is a systematic biology approach to
describe the correlation of gene expression between different samples.
This approach has been widely used to find highly relevant gene
clusters (modules) and identify candidate hub genes based on the
interconnectivity of gene modules and the association between gene
modules and clinical traits [[33]6]. LASSO is one of the common machine
learning methods. It can effectively select important feature values
with non-zero coefficients through regularization. Therefore, it is
widely used in the classification or feature selection of
high-dimensional data [[34]7,[35]8].
In our study, WGCNA integrated with differential gene expression
analysis was applied to analyze high-throughput sequencing data from
the TCGA database to identify differentially co-expressed genes between
TNBC and normal tissues samples, followed by Gene Ontology (GO) and
Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment
analyses. Moreover, a protein–protein interaction network was
established by STRING, and candidate key genes were identified by the
CytoHubba plug-in in Cytoscape and a LASSO regression model. Finally,
we verified the prognostic value of key genes based on data from
Kaplan–Meier (KM) Plotter databases [[36]9]. These findings may be
important in evaluating the malignant potential and clinical outcomes
of patients with TNBC.
2. Materials and Methods
2.1. Data Collection from TCGA
The workflow of this study is shown in [37]Figure 1. The whole
transcriptome sequencing data set of raw read counts and fragments per
kilobase per million (FPKM) values, and corresponding clinical data
([38]Supplementary Table S1) of all breast cancer patients were
downloaded from TCGA-BRCA dataset using R package TCGAbiolinks
(V2.14.1) [[39]10]. The raw read counts are used for the analysis of
differentially expressed genes, while FPKM data were used for WGCNA and
subsequent downstream analyses. We collected 1222 breast cancer
samples, including 1109 tumor and 113 normal tissue samples, from the
TCGA database. Based on the clinical information of the samples, 116
TNBC samples with the lack of expression in ER, PR, and HER2 and 113
normal breast tissue samples were retained in this study for further
analysis. The demographic information of the 229 samples is shown in
[40]Supplementary Table S1.
Figure 1.
[41]Figure 1
[42]Open in a new tab
Flowchart of this study.
2.2. Screening for Differentially Expressed Genes
Differential gene expression analysis provides a method to identify the
changes of gene expression between experimental and control groups. In
the present study, the R package DESeq2 (V1.26.0) [[43]11] was applied
to screen the differentially expressed genes (DEGs) in selected TNBC
compared with normal samples. The gene names corresponding to the
Ensemble ids were converted into a gene symbol via the Ensembl
database. For the repetitive gene symbols, only the one with the
highest sum of expression levels is kept. Before analyzing DEGs, DESeq2
will use the median of ratios method to perform normalization
preprocessing on the input raw read counts data. Subsequently, genes
whose total expression level is less than 1 in all samples are deleted.
The p-values of DEGs were corrected to false discovery rate (FDR)
following the Benjamini–Hochberg method [[44]12]. DEGs with FDR < 0.05
and | log2 fold change (FC)| > 1 were considered a statistically
significant difference. Finally, 9226 DEGs were visualized as a volcano
plot using R package ggplot2 (V3.3.0) [[45]13], and used further as
candidate genes for co-expression network construction.
2.3. Construction of Co-Expression Network and Identificaion of Key Modules
WGCNA is used to explore gene modules highly related to the external
traits of the samples. The weighted gene co-expression network of 9226
DEGs in our study was constructed through the R package WGCNA (V1.69)
[[46]6]. The correlation coefficients between genes were calculated
using the following formula:
[MATH:
aij=<
mo>|Sij|
mrow>β :MATH]
(
[MATH: aij :MATH]
: Adjacency between gene i and gene j,
[MATH: Sij :MATH]
: Pearson’s correlation between gene
[MATH: i :MATH]
and gene
[MATH: j :MATH]
, β: Soft threshold). Then, the adjacency matrix was converted into a
topological overlap matrix (TOM). In WGCNA, the linkage hierarchical
clustering was carried out for the genes dendrogram based on
dissimilarity measure (1-TOM), and the minimum size (gene group) was
set as 150 in order to classify the genes with similar expression
profiles into the same gene module.
To further determine the key modules in the co-expression network, the
relevance of module eigengene (ME) to clinical traits was calculated,
and the association of each gene with clinical significance was
measured by gene significance (GS). Module significance (MS) is
considered as the average GS for all genes in a module. The modules
that typically have the highest absolute MS ranking among all modules
were selected for further analysis.
2.4. Candidate Module Hub Genes Identification
After identifying significant interest of modules, GS and module
membership (MM) were calculated for each gene. In WGCNA, MM refers to
the correlation between genes and gene expression profiles. Hub genes
are a subset of highly interconnected genes (nodes) within key modules
of co-expression network, and are significantly associated with
biological functions. In our study, to further identify genes related
to TNBC, we selected genes with |GS| > 0.2 and |MM| > 0.8 as module hub
genes.
2.5. Functional and Pathway Enrichment Analyses
To understand the function of hub genes in the key modules, GO and KEGG
pathways were implemented using R package clusterProfiler (V3.14.3)
[[47]14]. The clusterProfiler uses Bioconductor GO.db (V3.14) and
KEGG.db (V3.2.4) to annotate genes with GO and KEGG terms. In GO
analysis, GO terms contain biological process (BP), cellular component
(CC), and molecular function (MF) [[48]15]. In addition to the function
annotation of the genes, KEGG is helpful to clarify the signaling
transduction pathway involved in hub genes [[49]16]. A p-value < 0.05
was set as the cut-off standard.
2.6. PPI Network Construction and Modules Selection
The online STRING database ([50]https://string-db.org/, V11.0)
(accessed on 18 December 2021) is used to predict and analyze the
functional association network between proteins in the organism
[[51]17]. In this study, the STRING database was applied for
constructing the PPI network of hub genes, and the PPI network with
functional association score ≥ 0.7 was visualized based on Cytoscape
([52]https://cytoscape.org/, V3.8.0) (accessed on 18 December 2021)
[[53]18].
Subsequently, the hub genes of the PPI network were screened by using
the CytoHubba (V0.1) plug-in in Cytoscape according to the maximum
clique centrality (MCC) score of each node. The MCC algorithm formula
is as follows:
[MATH:
MCC(v)=∑C∈S(v)(| C
|−1)! :MATH]
(
[MATH:
S(v) :MATH]
: The collection of Maximal Clique containing node
[MATH: v :MATH]
,
[MATH:
(| C
|−1)
! :MATH]
: is the product of all positive integers less than
[MATH:
| C |<
/mo> :MATH]
). The 10 genes with the highest MCC score are selected as the hub
genes. As previously reported, MCC algorithm can effectively find the
important nodes in the network, and the nodes with the top 10 MCC
scores have been proven to be effective [[54]19].
2.7. Feature Selection by LASSO Regression
LASSO regression analysis was performed using the R package glmnet
(V4.1) [[55]20], and LASSO regression analysis methods are often used
for feature selection or as a binary classifier. In this study, a
combination of candidate genes associated with TNBC was selected using
the features of LASSO feature selection. Cross-validation can
effectively improve the performance of the model, so we use 10-fold
cross-validation to train the model. Lambda 1 standard error
(lambda.1se) usually optimizes regularization, which keeps the error
and the minimum error within one standard deviation error [[56]21].
2.8. Verification of Prognostic Value of Hub Genes
To verify the prognostic value of hub genes and significant genes, OS
and DFS prognostic analyses were performed based on the expression and
clinical data from KM Plotter ([57]http://www.kmplot.com) (accessed on
18 December 2021) databases. KM Plotter used Cox proportional hazards
regression were used to perform prognostic analysis for each gene, and
the Benjamini–Hochberg method was used to correct for multiple
hypothesis testing [[58]9,[59]22]. We divided the patients into high
and low expression groups according to their average expression value
of hub genes, and the p-value of the prognostic analysis results was
corrected by FDR. The log-rank p-values < 0.05 were considered to be
statistically significant.
3. Results
3.1. Identification of DEGs in TNBC from TCGA
The DEGs analysis was performed on the RNA-seq data of 116 TNBC and 113
normal tissue samples from TCGA database. In this study, we screened
out a total of 9226 DEGs, including 5626 up-regulated and 3600
down-regulated genes, based on the screening criteria of FDR < 0.05 and
|log2 FC| > 1. The volcano plot of 9226 DEGs is shown in [60]Figure 2.
Figure 2.
[61]Figure 2
[62]Open in a new tab
Volcano plot of DEGs between 116 TNBC and 113 normal tissue samples.
3.2. Weighted Co-Expression Network Construction and Gene Modules
Identification
We utilized the WGCNA package to construct a co-expression network for
9226 DEGs, and those with similar expression patterns were grouped into
the same gene module through average linkage clustering. In choosing
the best thresholds, the network topology for soft-thresholding powers
from 1 to 20 was calculated by scale independence and mean connectivity
analysis of modules ([63]Figure 3a,b). As shown in [64]Figure 3c,d,
when the soft threshold β was selected as 5, the power for the
scale-free topology fit index reached 0.95.
Figure 3.
[65]Figure 3
[66]Open in a new tab
Determination of soft thresholding power in WGCNA. (a) Analysis of the
scale-free fit index for various soft thresholding powers β. (b)
Analysis of the mean connectivity for various soft thresholding powers.
(c) Histogram of connectivity distribution when β = 5. (d) Checking the
scale free topology when β = 5.
To ensure compliance with low correlation, the dissimilarity between
each module was evaluated. The modules with the dissimilarity < 0.2
were subsequently merged ([67]Figure 4a), resulting in a total of 14
modules in this study ([68]Figure 4b). The genes that could not ne
classified into any modules were collected in the gray module and were
not used in following analysis. Then, the correlations between modules
and clinical phenotypes (cancer and normal) were calculated and plotted
as shown in [69]Figure 5. The result shows that the blue module was
highly related to cancer (r = 0.87, p = 4 × 10^−71). Based on the
cut-off criteria (|MM| > 0.8 and |GS| > 0.2), 147 module hub genes were
selected in the blue module for further analysis.
Figure 4.
[70]Figure 4
[71]Open in a new tab
Module clustering results by WGCNA. (a) Dendrogram of dissimilarity
between modules. (b) Gene dendrogram based on dissimilar (1-TOM)
clustering. Each branch of the tree diagram represents a gene.
Figure 5.
[72]Figure 5
[73]Open in a new tab
Correlation between modules and clinical features. (Each line contains
the corresponding correlation and p-value.).
3.3. Module Genes Identification and Functional Enrichment Analysis
In order to investigate the biological function of 147 candidate module
genes, we performed GO and KEGG pathway enrichment analysis by using R
package clusterProfiler ([74]Figure 6a,b). Specifically, the BP group
genes were focused on nuclear division, organelle fission, cell cycle
checkpoint, and nuclear chromosome segregation. In addition, the CC
group genes were mainly related to midbody, spindle, chromosomal
region, and microtubule. Moreover, the MF group genes were mainly
enriched in protein serine/threonine kinase activity, histone kinase
activity, and cyclin-dependent protein kinase activity. Under the
threshold of p < 1 × 10^−4, the KEGG pathway analysis of 147 candidate
module genes showed that cell cycle, progesterone-mediated oocyte
maturation, and DNA replication were most significantly enriched.
Figure 6.
[75]Figure 6
[76]Open in a new tab
Functional enrichment analysis of 147 module genes. (a) GO analysis for
hub genes. (b) KEGG pathway analysis for hub genes. BP, biological
process; CC, cellular component; MF, molecular function.
3.4. PPI Network Construction and Hub Genes Identification
Among 147 candidate module genes, the PPI network was constructed with
145 nodes and 2205 edges using the STRING database, and visualized
using Cytoscape software ([77]Figure 7). Using the MCC algorithm of
CtyoHubba plugin in Cytoscape, the co-expression network of the top ten
highest-scored genes, including PDZ Binding Kinase (PBK), DNA
Topoisomerase II Alpha (TOP2A), Cell Division Cycle Associated 8
(CDCA8), Abnormal Spindle Microtubule Assembly (ASPM), Cyclin A2
(CCNA2), Kinesin Family Member 20A (KIF20A), BUB1 Mitotic Checkpoint
Serine/Threonine Kinase (BUB1), Aurora Kinase B (AURKB), Cyclin
Dependent Kinase 1 (CDK1), and Cyclin B2 (CCNB2), is shown in
[78]Figure 8 and selected as hub genes.
Figure 7.
[79]Figure 7
[80]Open in a new tab
Visualization of the PPI network by Cytoscape.
Figure 8.
[81]Figure 8
[82]Open in a new tab
The top 10 genes with the highest value of MCC scores in the PPI
network. (The closer the background color of the circle is to red, the
higher the MCC score of the gene in the network, and the red is the
highest score.).
3.5. Select Significant TNBC Genes Using LASSO Regression Model
This study adjusts for starting a variable selection with a large
number of features by validation followed by training. Therefore, 147
candidate model genes were used to fit the LASSO regression model, and
10-fold cross-validation was used to train the model. According to
lambda 1se, 0.03422616 was determined as an appropriate λ value
([83]Figure 9A). Finally, six non-zero coefficient TNBC genes (CDK1,
CENPF, MCM7, PACC1, TUBB, and UBE2C) were obtained ([84]Figure 9B),
which were then used in survival analysis to verify the prognostic
value.
Figure 9.
[85]Figure 9
[86]Open in a new tab
Feature selection was using the LASSO regression model through 10-fold
cross-validation and lambda 1se. (A) Generate coefficient distribution
plots for log(lambda) sequence. (B) LASSO non-zero coefficient 6
significant genes in TNBC.
3.6. Survival Analysis and Prognostic Value Verification of Key Genes
In order to verify the prognostic value of 10 hub genes and six
significant genes, KM Plotter was applied to 201 TNBC patients for OS
analysis, and 576 TNBC patients for DFS analysis. Among the 10 pivot
genes, the survival rate of BUB1 (FDR = 0.041) in the OS analysis was
significantly different between the high and low expression groups (p <
0.05, [87]Figure 10c). In the DFS analysis of BUB1 (FDR = 0.043), CCNA2
(FDR = 0.043), and CDCA8 (FDR = 0.043), there was a significant
difference in survival rate between the high and low expression groups
(p < 0.05, [88]Figure 11). On the other hand, in the OS analysis of six
important genes, the low expression of CENPF (FDR = 0.041) was
significantly correlated with the poor survival rate of TNBC patients
(p < 0.05, [89]Figure 12). The DFS analysis showed that the high
expression of PACC1 (FDR = 0.0152) was significantly correlated with
the poor prognosis of TNBC patients (p < 0.05, [90]Figure 13).
Figure 10.
[91]Figure 10
[92]Open in a new tab
Overall survival (OS) analysis of 10 hub genes in TNBC patients by KM
Plotter. (n = 201) (a) ASPM (b) AURKB (c) BUB1 (d) CCNA2 (e) CCNB2 (f)
CDCA8 (g) CDK1 (h) KIF20A (i) PBK (j) TOP2A.
Figure 11.
[93]Figure 11
[94]Open in a new tab
Overall survival (OS) analysis of 6 Significant genes in TNBC patients
by KM Plotter. (n = 201) (a) CDK1 (b) CENPF (c) MCM7 (d) PACC1 (e) TUBB
(f) UBE2C.
Figure 12.
[95]Figure 12
[96]Open in a new tab
Disease free survival (DFS) analysis of 10 hub genes in TNBC patients
by KM Plotter. (n = 534) (a) ASPM (b) AURKB (c) BUB1 (d) CCNA2 (e)
CCNB2 (f) CDCA8 (g) CDK1 (h) KIF20A (i) PBK (j) TOP2A.
Figure 13.
[97]Figure 13
[98]Open in a new tab
Disease free survival (DFS) analysis of 6 Significant genes in TNBC
patients by KM Plotter. (n = 534) (a) CDK1 (b) CENPF (c) MCM7 (d) PACC1
(e) TUBB (f) UBE2C.
4. Discussion
In this study, we used a comprehensive bioinformatics analysis to
identify 147 co-expressed genes that were differentially expressed
between TNBC and normal tissue samples in the TCGA-BRCA dataset. As
shown by the GO analysis results, 147 differentially co-expressed genes
are mainly enriched in nuclear division, chromosomal region, and ATPase
activity, while the KEGG pathway analysis is mainly enriched in the
cell cycle signaling. Moreover, according to the MCC scores using
CytoHubba plugin in Cytoscape, the top ten hub genes (PBK, TOP2A,
CDCA8, ASPM, CCNA2, KIF20A, BUB1, AURKB, CDK1, and CCNB2) related to
TNBC were screened out, and using LASSO regression feature selection
screened six significant TNBC genes (CDK1, CENPF, MCM7, PACC1, TUBB,
and UBE2C). We further evaluated the prognostic values of the ten
highest-scored genes and six significant genes in KM Plotter databases.
The OS analyses show that BUB1 and CENPF had a significantly poor
prognosis in TNBC patients, while BUB1, CCNA2, CDCA8, and PACC1 show a
significantly poor prognosis in TNBC patients in DFS analyses.
CENPF is reported to be a protein that interacts with microtubules and
participates in the development of the cell cycle [[99]23,[100]24].
High expression of CENPF has been observed in various cancers, such as
prostate cancer and breast cancer. In addition, the high expression of
CENPF may eventually induce bone metastasis of breast cancer cells
[[101]25]. Several evidences suggest that enhanced CENPF is a reliable
prognostic indicator of poor survival for breast cancer [[102]25],
prostate cancer [[103]26], and hepatocellular carcinoma [[104]27].
However, our research results show that lower expression of CENPF can
cause poor OS prognosis in TNBC. It would be necessary to perform
experiments to explore the mechanisms where CENPF is involved in
patients with poor survival outcomes.
BUB1, a mitotic checkpoint serine/threonine kinase, serves an important
role in the establishment of mitotic spindle checkpoint in breast
cancer cells [[105]28,[106]29,[107]30]. BUB1 expression is correlated
with unfavorable prognosis in patients with malignant tumors, including
breast [[108]31,[109]32] and liver cancers [[110]33]. Our study
suggested that a low level of BUB1 expression had reduced OS rates in
TNBC patients. Nevertheless, in the DFS prognosis analysis, higher BUB1
expression had a poor survival rate compared with it expressing a low
level. This would be worth investigating further.
Cell cycle regulator cyclin A2, also known as CCNA2, is a member of the
highly conserved cyclin family. It has been confirmed to be
up-regulated in a variety of cancers and participate in the progression
of various cancers [[111]34,[112]35,[113]36]. According to previous
reports that the high expression of CCNA2 can lead to the production of
TNBC cells, it has been identified as a potential candidate for the
treatment of TNBC [[114]37,[115]38]. As we found in our study, the high
expression of CCNA2 will promote better survival rates of DFS in
patients with TNBC, which is worthy of in-depth study.
CDCA8 is known as Borealin/Dasra B, which is a component of the
chromosomal passenger complex essential for transmission of the genome
in mitosis [[116]39]. CDCA8 plays an important role in several types of
cancer, and its overexpression may act as an oncogene
[[117]40,[118]41,[119]42]. Overexpression of CDCA8 in breast cancer was
reported to be associated with triple-negative phenotype and a shorter
overall survival [[120]43]. However, our study has shown that low CDCA8
expression results in shorter DFS survival. Further studied are
required to investigate whether CDCA8 is an indicator of good
prognosis.
The alias of PACC1 is called TMEM206, which is a member of the
transmembrane (TMEM) protein family. The TMEM protein family is a
potential player in the metastasis of cancer cells [[121]44]. It is
currently known that TMEM206 activates the AKT and ERK signaling
pathways to worsen the progression of colorectal cancer and promote the
proliferation, spread, and metastasis of colon cancer cells [[122]45].
TMEM206 cells are also up-regulated in liver cancer cells [[123]46],
but the mechanism of action between TMEM206 and cancer is still poorly
understood, and it has not been reported to be related to TNBC cells.
The results of this study show that in the DFS analysis of PACC1 in
TNBC patients, the high expression of PACC1 will lead to significant
adverse prognostic effects, which may be related to the high expression
of PACC1 inducing the proliferation and metastasis of cancer cells.
Therefore, PACC1 may be a novel potential therapeutic target for TNBC.
This study still has its limitations, the unavoidable issue of batch
effects, and the results may not apply to all TNBC patients. In
summary, this study found BUB1, CCNA2, and PACC1 genes as three poor
prognostic genes related to TNBC cells. These key genes may serve as
potential biomarkers for the treatment or early diagnosis of TNBC.
5. Conclusions
The combination of WGCNA and differential gene expression analysis was
used to construct a gene co-expression network, and the modular hub
genes essential to TNBC were determined. Our results show that these
genes are related to angiogenesis, cell proliferation, and cell cycle
progression. In addition, we also used survival analysis to verify the
results of MCC score ranking and LASSO feature selection, and found
that the most important central gene in OS is BUB1, and the most
important central genes in DFS are BUB1, CCNA2, and PACC1. However, the
molecular mechanisms of these three key genes in TNBC need to be
further studied through experiments, which will bring new insights to
the research of TNBC.
Acknowledgments