Abstract
Gastric cancer (GC) is one of the most common types of malignancy. Its
potential molecular mechanism has not been clarified. In this study, we
aimed to explore potential biomarkers and prognosis-related hub genes
associated with GC. The gene chip dataset [32]GSE79973 was downloaded
from the GEO datasets and limma package was used to identify the
differentially expressed genes (DEGs). A total of 1269 up-regulated and
330 down-regulated genes were identified. The protein-protein
interactions (PPI) network of DEGs was constructed by STRING V11
database, and 11 hub genes were selected through intersection of 11
topological analysis methods of CytoHubba in Cytoscape plug-in. All the
11 selected hub genes were found in the module with the highest score
from PPI network of all DEGs by the molecular complex detection (MCODE)
clustering algorithm. In order to explore the role of the 11 hub genes,
we performed GO function and KEGG pathway analysis for them and found
that the genes were enriched in a variety of functions and pathways
among which cellular senescence, cell cycle, viral carcinogenesis and
p53 signaling pathway were the most associated with GC. Kaplan-Meier
analysis revealed that 10 out of the 11 hub genes were related to the
overall survival of GC patients. Further, seven of the 11 selected hub
genes were verified significantly correlated with GC by uni- or
multivariable Cox model and LASSO regression analysis including C3,
CDK1, FN1, CCNB1, CDC20, BUB1B and MAD2L1. C3, CDK1, FN1, CCNB1, CDC20,
BUB1B and MAD2L1 may serve as potential prognostic biomarkers and
therapeutic targets for GC.
Subject terms: Computational biology and bioinformatics, Mathematics
and computing
Introduction
Gastric cancer (GC), a common heterogeneous disease, is one of the most
deadly malignancies worldwide, especially in East Asia^[33]1.
Previously, many patients missed the optimal diagnosis and treatment
time, which leads to tumor cell metastasis and progression to advanced
cancer. Currently, there is a lack of effective biomarkers for early
diagnosis. Comprehensive treatment and cancer surveillance have been
identified as one of the major obstacles to improve the prognosis of
gastric cancer^[34]2. Therefore, a deeper understanding of the
mechanisms involved in GC progression and identification of potential
biomarkers and targets for the diagnosis, prognosis and therapy of GC
are urgently needed.
Bioinformatics analysis methods, are powerful tools for identifying
potential biomarkers related to diagnosis and treatment, including the
analysis of gene interaction networks, gene annotation and microarray
expression profiles^[35]3. For example, Hao et al. explored 10 genes
(COL1A1, COL3A1, COL1A2, COL5A2, FN1, THBS1, COL5A1, SPARC, COL18A1 and
COL11A1) as potential biomarkers and therapeutic targets for GC,
through analysis data from the Gene Expression Omnibus (GEO)
database^[36]4. Furthermore, Zhu et al. found that CDK1 overexpression
was a prognostic factor for hepatocellular carcinoma (HCC), which makes
it a potential therapeutic target and biomarker for HCC diagnosis,
through analysis data from GEO and the Cancer Genome Atlas
(TCGA)^[37]5. Liao et al. identified two genes (SERPINE1 and SPARC) as
potential biomarkers and therapeutic targets for GC^[38]6. However, the
methods of hub gene selection in the above literatures was single and
the potential molecular mechanism of gastric cancer was still unclear,
which needs further exploration.
In this study, DEGs between human GC tissues and normal tissues were
identified using Limma based on GEO datasets. And a directed acyclic
graph was constructed with Bingo plug-in to view the overall enrichment
of DEGs. Next we constructed a PPI network of DEGs based on the STRING
V11 database and visualized it using Cytoscape software. And then 11
topological analysis methods were adopted to select the hub genes.
Important module in the network related to the hub genes were
abstracted by MCODE. In addition, in order to explore the role of the
11 selected hub genes in the pathogenesis of GC, we performed GO
function and KEGG pathway enrichment analysis for the genes. Finally,
Kaplan-Meier analysis was performed to evaluate the prognostic value of
these hub genes. And Cox model and LASSO regression analysis were used
to verify these hub genes further.
Results
Data preprocessing
It can be seen from the weight and residual symbol maps (Fig. [39]1a–b)
that all the points were evenly distributed. In the relative
logarithmic expression graph (Fig. [40]1c), all the samples were near
the zero point without outliers. For the RNA degradation diagram
(Fig. [41]1d), generally we need 5′-terminal lower than
3′-terminal^[42]7. Our results showed that the data was in good quality
and suitable for downstream analysis.
Figure 1.
[43]Figure 1
[44]Open in a new tab
Maps to check the quality of the data. (a) is a weight map, (b) is a
residual symbol map, (c) is a relative logarithmic expression and (d)
is an RNA degradation map. [The figures were created by R version 3.5.1
([45]https://www.r-project.org/)].
Differentially expressed genes and enrichment analysis
Though Limma method, a total of 1599 DEGs were identified in the
dataset [46]GSE79973, of which 1269 were up-regulated and 330 were
down-regulated (Fig. [47]2). Cytoscape’s plug-in Bingo generated a
directed acyclic graph (Fig. [48]3), in which branches represented
inclusion relationships. The range of functions defined by the arrow
direction from top to bottom was getting smaller and smaller, the
deeper color the higher degree of enrichment. The graph was divided
into three parts, representing BP, MF and CC. These genes were thought
to be involved in the regulation of RNA metabolic processes, positive
and negative regulation of phosphorous metabolic processes in the
adaptive immune response based on somatic cells, extracellular matrix
tissue and etc.
Figure 2.
[49]Figure 2
[50]Open in a new tab
Volcanic map of DEGs, red color represents DEGs in GC including
up-regulated and down-regulated genes. [The figure was created by Limma
([51]http://www.bioconductor.org/packages/release/bioc/html/limma.html)
].
Figure 3.
Figure 3
[52]Open in a new tab
Enrichment analysis of DEGs. The deeper color the higher degree of
enrichment. [The figure was created by Bingo plug-in in Cytoscape
([53]https://cytoscape.org/)].
PPI network and GC-associated clustering module construction
Based on 1599 DEGs, a PPI network was constructed where 1376 genes
formed the network with 14,394 edges (Fig. [54]4a). And hub genes were
identified by 11 topological analysis methods, where the top 20 genes
were selected for each method (Supplementary file [55]1), among which
C3, CDK1, AURKA, CDC20, CCNA2, AURKB, CCNB1, BUB1B, MAD2L1, UBE2C and
FN1 were found in the intersection of at least five methods and were
selected as GC related hub genes. We also obtained the clustering
module with the highest score from PPI network of all DEGs
(Fig. [56]4b) by MCODE algorithm. It was found that all the 11 hub
genes were contained in this module.
Figure 4.
[57]Figure 4
[58]Open in a new tab
PPI network and highest score clustering module. (a) PPI network,
yellow color indicates the hub genes, red nodes represent DEGs. (b) The
highest score clustering module generated by MCODE, yellow color
indicates the hub genes, red nodes represent DEGs. [The figures were
created by STRING V11.0 ([59]https://string-db.org/) and MCODE plug-in
in Cytoscape ([60]https://cytoscape.org/)].
GO and KEGG functional enrichment analysis for the selected hub genes
In order to explore the role of the 11 selected hub genes, we performed
GO function and KEGG pathway analysis for them. The results of GO
function enrichment indicated GO terms for 187 biological processes
(BP), 29 cell components (CC) and 7 molecular functions (MF), see
Supplementary file [61]2. Figure [62]5 showed the top 10 terms for BP,
CC and all the terms for MF and KEGG. Among these top terms, we found 6
terms associated with GC and some similar cancers. These terms were
mitotic spindle checkpoint, anaphase-promoting complex, cell cycle,
viral carcinogenesis, cellular senescence and p53 signaling pathway.
Kim et al. found that frequent mutations of human Mad2, but not Bub1,
in gastric cancers cause defective mitotic spindle checkpoint^[63]8.
Zhu et al. found that the miR-383 inhibited the cell cycle progression
of gastric cancer cells via targeting cyclin E2^[64]9. Uozaki et al.
studied that gastric cancer and viral carcinogenesis through epigenetic
mechanism^[65]10. Ji et al. found that microRNA miR-34 was a direct
target of p53, which played an anti-cancer role in the downstream of
p53 pathway^[66]11. By modulating the cellular senescence through
E2F/miR-106b-5p/p21 axis, Dong et al. found a novel mechanism by which
BRD4 regulated cancer cell proliferation and provided new insights into
using BET inhibitors as potential anticancer drugs^[67]12. Dai et al.
found that activation of anaphase-promoting complex by p53 induced a
state of dormancy in cancer cells against chemotherapeutic
stress^[68]13.
Figure 5.
Figure 5
[69]Open in a new tab
GO and KEGG functional enrichment analysis. (a) Top 10 terms in BP
category. (b) Top 10 terms in CC category. (c) All terms in MF
category. (d) All terms in KEGG pathway analysis. (The figures were
created by R version 3.5.1 based on the KEGG pathway database
[70]www.kegg.jp/kegg/kegg1.html).
Survival analysis
Using the Kaplan–Meier plotter database, the prognostic value of the 11
hub genes were evaluated in GC patients. All the 11 hub genes were
up-regulated (
[MATH:
log2foldchange(FC)>1 :MATH]
, Table [71]1). It was found that 10 out of the 11 hub genes including
C3, CDK1, AURKA, CCNA2, AURKB, CCNB1, BUB1B, MAD2L1, FN1 and UBE2C
(p-value < 0.05) had significant difference of overall survival between
high and low expression. The results showed that the survival rate of
high expression C3, AURKB, FN1 and UBE2C groups were significantly
lower than that of low expression groups, and the survival rate of low
expression CDK1, AURKA, CCNA2, CCNB1, BUB1B, MAD2L1 groups were
significantly lower than that of high expression (Fig. [72]6).
Table 1.
Regulation of 11 hub genes.
Gene symbol Log FC p value Adj p val
FN1 2.182008173 5.82E-−06 4.96E−05
UBE2C 2.1341329 0.000173126 0.000693925
CCNB1 1.736845192 0.000104015 0.000459134
MAD2L1 1.635046726 2.17E−05 0.000133896
CCNA2 1.582022494 4.42E−05 0.000231949
C3 1.561270443 0.006073294 0.013605247
CDK1 1.321178004 0.002097257 0.005543793
BUB1B 1.318731789 0.002695901 0.006858173
CDC20 1.301949207 0.005086582 0.011692519
AURKB 1.240762245 4.32E−05 0.000227121
AURKA 1.023628153 0.002870315 0.007224631
[73]Open in a new tab
Figure 6.
[74]Figure 6
[75]Open in a new tab
Survival plot of the significant genes by Kaplan Meier test. The
Kaplan–Meier test p-value < 0.05: (a) C3, (b) CDK1, (c) AURKA, (d)
CCNA2, (e) AURKB, (f) CCNB1, (g) BUB1B, (h) MAD2L1, (i) FN1 and (k)
UBE2C. The test p-value > 0.05: (j) CDC20. [The figures were created by
Kaplan–Meier
([76]https://kmplot.com/analysis/index.php?p=service&cancer=gastric)].
Verification based on another dataset
Based on dataset [77]GSE19826, Limma identified 333 DEGs. These DEGs
were imported into the STRING V11 database to obtain a TSV file of
protein interactions. And hub genes were identified by 11 topological
analysis methods, where the top 20 genes were selected for each method
(Supplementary file [78]3), among which FN1, CDK1, MMP9, CCNB1, AURKA,
UBE2C, AURKB, CCNA2, FOXM1, ITGB1, EZH2, RRM2, THBS1, CXCL8, CDC20,
COL1A1, BUB1B, MAD2L1, HIF1A and CDH2 were found in the intersection of
at least five methods, including 10 out of the 11 hub genes in the
above conclusion, except C3. Important clustering module related to the
11 hub genes were abstracted by CytoHubba which had 133 nodes and 1077
edges (Supplementary Fig. [79]1a). We obtained two clustering modules
with the highest score from PPI network of all DEGs (Supplementary
Fig. [80]1b,c) by MCODE algorithm. It was found that all the 11 hub
genes were contained in this module.
Cox analysis and LASSO regression analysis
The analysis results of correlation between DEGs expression and overall
survival (OS) as well as other clinical features investigated by Cox
analysis were shown in Table [81]2. The results suggested that six
clinical features including age, stage, grade, T, M and N and six hub
genes including C3, CDK1, FN1, CCNB1, CDC20 and BUB1B were revealed
significantly correlated with OS (p-value < 0.05) by univariate or
multivariate Cox analysis. The LASSO method established regression
model and continued to screen the 11 hub genes. By setting different
[MATH: λ :MATH]
, the path change graph of the regression coefficient was obtained
(Fig. [82]7a). The trend of each curve in the figure represented the
change of the regression coefficient path. It could be seen that the
regression coefficients were mostly compressed to zero, which showed
that the model had a good advantage in dimensionality reduction and
variable selection. Each point in Fig. [83]7b corresponded to a penalty
value, and the position of the vertical dashed line represented the
number of genes selected under the optimal model. It could be seen from
the figure that there were five genes under the optimal model,
including C3, CCNB1, CDC20, FN1 and MAD2L1. Therefore, seven of the 11
selected hub genes including C3, CDK1, FN1, CCNB1, CDC20, BUB1B and
MAD2L1 were verified through Cox analysis or LASSO regression analysis
which could be taken as independent prognostic biomarkers for GC.
Table 2.
Univariate and multivariate Cox analysis.
Variables Univariate Cox Multivariate Cox
HR HR.95L HR.95H p value Variables HR HR.95L HR.95H p value
Age 1.022212351 1.005811513 1.038880621 0.007764558 Age 1.03433333
1.016841941 1.052125601 0.000104766
Grade 1.420247616 1.023095183 1.971569532 0.036046698 Grade 1.567172974
1.103167545 2.226344622 0.012138576
Stage 1.653117193 1.348649804 2.026320283 1.30E-06 Stage 1.353901129
1.024986088 1.788364047 0.032862236
T 1.39801979 1.137483947 1.718230255 0.001451642 M 2.195910547
1.337728548 3.60463499 0.001866895
M 2.372585415 1.502749378 3.745908422 0.000208927 C3 1.001531756
1.00010612 1.002959424 0.035207018
N 1.387710481 1.195331802 1.611050903 1.68E-05 CDK1 0.958783737
0.928669867 0.989874107 0.009736787
C3 1.161967017 1.048765553 1.2873872 0.004099372 BUB1B 8.568665988
1.604562951 45.75827753 0.011965015
CDK1 1.129353067 1.012197799 1.260068291 0.029485595 FN1 1.100769547
1.00269362 1.20843852 0.043751763
FN1 1.125876038 1.008258757 1.25721383 0.035198627 CCNB1 1.478932638
1.056967554 2.069355618 0.022416923
BUB1B 1.122389719 1.00533721 1.25307078 0.03990889 CDC20 0.689337825
0.509422258 0.932795201 0.015918054
[84]Open in a new tab
Figure 7.
[85]Figure 7
[86]Open in a new tab
LASSO regression model analysis. (a) is the path change chart of
regression coefficient and (b) is the change curve of penalty term.
[The figure was created by R version 3.5.1
([87]https://www.r-project.org/)].
Discussion
The study of molecular genetics and signal transduction pathways are
helpful for further understanding of the pathogenesis and early
diagnosis of GC. Therefore, recognition of DEGs for GC based on
transcriptome microarray datasets may contribute to early diagnosis and
develop effective therapies.
In our study, a total of 1599 DEGs were identified in the dataset
[88]GSE79973, of which 1269 were up-regulated and 330 were
down-regulated. Based on these DEGs, a PPI network was constructed
where 1376 genes formed the network with 14,394 edges. And 11 hub genes
were selected through intersection of 11 topological analysis methods.
Clustering module related to these 11 genes were obtained by MCODE. In
order to explore the role of the 11 selected hub genes in the
pathogenesis of GC, we performed GO function and KEGG pathway
enrichment analysis for them and found that the genes were enriched in
a variety of functions and pathways. Kaplan–Meier analysis revealed
that 10 out of the 11 hub genes were related to the overall survival of
GC patients. Cox analysis and LASSO regression analysis showed that
seven of the 11 selected hub genes were significantly correlated with
GC.
The overall aim of this study was to identify the hub genes which may
serve as potential biomarkers for GC diagnosis and therapy, and to
further explore the potential mechanisms of GC by integrated profiling
analysis. In our study, seven of the 11 selected hub genes including
C3, CDK1, CCNB1, BUB1B, MAD2L1, FN1 and CDC20 were considered to be the
most likely independent prognostic biomarkers associated with GC. And
five of them were newly found including C3, CDC20, CCNB1, BUB1B and
MAD2L1 compared with previously published results using the same
dataset. Some relevant literatures suggested from biological point of
view that most of these found hub genes played important roles on GC.
Kitano et al. showed that the synthesis and secretion of C3 by all the
tested GC derived cell lines in response to TNF, suggested that C3 may
be secreted in the gastric wall as part of its normal physiology, or as
a result of tumour pathology and thereby participate in local immune or
inflammatory responses^[89]14. Lee et al. found that the high
expression of CDK1 in GC patients may imply a strong biological ability
of tumor invasion and CDK1 was the target gene of mir-490-5p.
Down-regulation of mir-490-5p and up-regulation of CDK1 can promote the
proliferation ability of GC cells and the transformation of G1/S
phase^[90]15. A study by Kidokoro et al. found that CDC20 was often
up-regulated in many types of tumors and significantly inhibited by
ectopic introduction of p53. Additionally, treatment of cancer cells
with siRNA against CDC20 can induce G2/M arrest and inhibit cell
growth^[91]16. CCNB1 knockdown by RNA interference was found to
significantly inhibited proliferation, migration and invasion of HCC
cells^[92]17. Hudler et al. found that the expression of BUB1B in GC
tissues was significantly higher than that in adjacent normal tissues,
nearly
[MATH: (8.875±1.08) :MATH]
times^[93]18. Frio et al. found that at least one BUB1B mutation can
result in autosomal recessively inherited susceptibility to
gastrointestinal cancer, as do mutations in MUTYH and the
mismatch-repair genes^[94]19. FN1 was an significant regulatory factor
promoting the development and formation of various cancer cells, such
as laryngeal, skin squamous carcinoma^[95]20,[96]21 and brain
glioblastoma^[97]22. Zhang et al. found that miR-200c can inhibit the
migration, proliferation and invasion of GC cells in vitro by directly
combining with FN1, which indicated that mir-200c and FN1 may be
potential biomarkers or therapeutic methods for GC^[98]23. A study by
Wang et al. confirmed the prognostic value of two key mitotic
checkpoint genes MAD2L1 and BUB1, which have been included in multiple
gene expression signatures for breast cancer prognosis. And they also
found that these genes were biologically relevant to breast cancer
progression, as suppression of their expression was associated with
reduced tumor cell growth, migration and invasion^[99]24.
This study provides important clues for exploring potential biomarkers
and targets for the diagnosis, prognosis and treatment of GC. In future
work, if condition permits, we hope to conduct some experiments to
verify the important relation between these hub genes and GC from
biological point of view.
Conclusion
Through 11 topological analysis methods, we identified 11 hub genes for
GC. We validated these hub genes through functional enrichment
analysis, the clustering module with the highest score, relevant
literatures, Kaplan–Meier analysis, Cox analysis and LASSO regression
analysis. The results suggested that seven of the 11 selected hub genes
including C3, CDK1, CCNB1, BUB1B, MAD2L1, FN1 and CDC20 may serve as
potential prognostic biomarkers and therapeutic targets for GC. These
results may provide a theoretical direction for future research with
regards to the molecular mechanisms of the progression of GC.
Materials and methods
Dataset
We collected the set of gene expression profiles of GC from the Gene
Expression Omnibus database
([100]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE79973).
This dataset includes 10 GC samples and 10 normal gastric samples. The
platform was [101]GPL570 (Affymetrix Human Genome U133 Plus 2.0). This
dataset was used by previous studies^[102]4,[103]6,[104]25–[105]27,
where the authors mainly clarified the emerging role of long non-coding
RNA (lncRNA) in cancer development, explored novel lncRNA candidates,
identified key candidate genes and circRNA, and explored the molecular
mechanism of GC through comprehensive analysis of mRNA and miRNA
expression profiles. Here we aimed to identify potential prognostic
biomarkers of GC based on 11 topological analysis methods and used the
MCODE method and survival analysis to verify these hub genes.
Data preprocessing
The fitPLM function in the affyPLM package was adopted to perform
regression calculation on the dataset. Then, weight map, residual
symbol map, relative logarithmic expression map and RNA degradation map
were used to check whether the data quality was suitable for downstream
analysis. Finally, the normal samples and gastric cancer samples were
processed by RMA method and the missing values were added by nearest
neighbor method (KNN, k-Nearest Neighbor).
Identification of DEGs and enrichment analysis
The lmFit and eBayes functions in the limma package were used to
identify the DEGs between GC and control samples. The threshold values
[MATH:
log2foldchange(FC)>1 :MATH]
and p-value < 0.05 indicated up-regulated DEGs, while
[MATH:
log2(FC)<-1
:MATH]
and p-value < 0.05 indicated down-regulated DEGs. we also used the
Cytoscape plug-in Bingo to examine the enrichment of all DEGs in the
biological processes of Biological Process (BP), Molecular Function
(MF) and Cellular Component (CC)^[106]28.
PPI network and GC-associated clustering module construction
We constructed a PPI network of DEGs using the STRING V11
database^[107]29. Based on the genes in the network, we searched for
hub genes through 11 topological analysis methods^[108]30. The selected
hub genes were verified by MCODE method.
GO function and KEGG pathway enrichment analysis
To identify the potential functions of the selected hub genes, we used
the clusterProfiler package in R to perform GO functions and KEGG
pathway analysis of these genes^[109]31. ClusterProfiler is an R
package of Bioconductor which can perform statistical analysis and
visualization of functional clustering of gene sets or gene clusters.
Survival analysis
The Kaplan–Meier plotter database ([110]http://kmplot.com), an online
tool to evaluate the prognostic values of genes in breast, ovarian,
lung and gastric cancer patients, was applied to analyze the
associations between the identified hub genes and overall
survival^[111]32. The hazard ratio (HR) and its 0.95 confidence
intervals were calculated. p < 0.05 was used to indicate a
statistically significant difference^[112]33.
Verification based on another dataset
In order to validate the efficiency of our method, another mRNA
expression dataset ([113]GSE19826) was downloaded from the GEO database
([114]https://www.ncbi.nlm.nih.gov/geo/), the mRNA profifiles were
based on the [115]GPL570 platform (Affymetrix Human Genome U113 Plus
2.0 Array). The [116]GSE19826 dataset included 15 paracancerous tissues
and 12 tumor tissues. The same method was carried out on this dataset.
Firstly, the Limma method was used to identify DEGs between GC tissues
and paracancerous tissues. Secondly, a PPI network of DEGs was
constructed based on the STRING V11 database. Based on the genes in the
network, we searched for hub genes through 11 topological analysis
methods in the CytoHubba. Then, the clustering module related to these
hub genes were constructed by CytoHubba and the selected hub genes were
verified by MCODE method.
Cox analysis and LASSO regression analysis
The TCGA data with corresponding clinical features of GC was downloaded
from TCGA database ([117]https://tcgadata.nci.nih.gov/tcga/) which
contained 375 tumor tissue samples and 32 paracancerous tissue samples.
Cox proportional hazard models of univariate and multivariate were used
to calculate 95% confidence interval (CI) and hazard ratio (HR) for the
DEGs and clinical features where survival package was used for
statistical analysis. Univariate Cox analysis model was used to compare
the relationship between clinical features and survival rates.
Multivariate Cox analysis model was used to evaluate how the genes
expression and the clinical factors affect overall survival (OS). p <
0.05 was set as the threshold. LASSO regression analysis was applied to
confirm the selected hub genes further. When solving the regression
model, LASSO fitted the model that contained
[MATH: λ :MATH]
prediction variables by constraining or punishing the coefficients and
compressing the coefficient estimation value to zero direction, so that
the model had a better interpretation effect. The screening of related
variables could be done through the cv.glmnet and glmne function in R
language.
Consent for publication
All the authors have consented for the publication.
Supplementary Information
[118]Supplementary Information 1.^ (24KB, xls)
[119]Supplementary Information 2.^ (32KB, xlsx)
[120]Supplementary Information 3.^ (423.3KB, pdf)
Acknowledgements