Abstract

   Background: Gastric cancer (GC) is one of the most prevalent cancers
   all over the world. The molecular mechanisms of GC remain unclear and
   not well understood. GC cases are majorly diagnosed at the late stage,
   resulting in a poor prognosis. Advances in molecular biology techniques
   allow us to get a better understanding of precise molecular mechanisms
   and enable us to identify the key genes in the carcinogenesis and
   progression of GC.

   Methods: The present study used datasets from the GEO database to
   screen differentially expressed genes (DEGs) between GC and normal
   gastric tissues. GO and KEGG enrichments were utilized to analyze the
   function of DEGs. The STRING database and Cytoscape software were
   applied to generate protein–protein network and find hub genes. The
   expression levels of hub genes were evaluated using data from the TCGA
   database. Survival analysis was conducted to evaluate the prognostic
   value of hub genes. The GEPIA database was involved to correlate key
   gene expressions with the pathological stage. Also, ROC curves were
   constructed to assess the diagnostic value of key genes.

   Results: A total of 607 DEGs were identified using three GEO datasets.
   GO analysis showed that the DEGs were mainly enriched in extracellular
   structure and matrix organization, collagen fibril organization,
   extracellular matrix (ECM), and integrin binding. KEGG enrichment was
   mainly enriched in protein digestion and absorption, ECM-receptor
   interaction, and focal adhesion. Fifteen genes were identified as hub
   genes, one of which was excluded for no significant expression between
   tumor and normal tissues. COL1A1, COL5A2, P4HA3, and SPARC showed high
   values in prognosis and diagnosis of GC.

   Conclusion: We suggest COL1A1, COL5A2, P4HA3, and SPARC as biomarkers
   for the diagnosis and prognosis of GC.

   Keywords: gastric cancer, bioinformatics analysis, microarray,
   differentially expressed genes, prognosis, diagnosis

Introduction

   According to data published in 2021, gastric cancer (GC), among all
   cancers, ranked fourth in cancer-related deaths ([34]Sung et al.,
   2021). Stomach adenocarcinoma (STAD), the most common histological
   type, accounts for more than 90% of GC ([35]Ajani et al., 2017).
   Although endoscopy or histological detection has developed a lot in
   recent years, the majority of GC patients are diagnosed at their late
   and advanced stage due to an insidious onset, resulting in high
   morbidity and mortality ([36]Chen et al., 2020; [37]Wang W. et al.,
   2020). However, advances in molecular biology techniques allow us to
   approach precise molecular mechanisms of carcinogenesis and enable us
   to find potential diagnostic and prognostic biomarkers for GC.

   Previous bioinformatic studies resulted in different biomarkers due to
   different screening criteria and different datasets from Gene
   Expression Omnibus (GEO) ([38]Sun et al., 2017; [39]Zheng H.-C. et al.,
   2017; [40]Shi and Zhang, 2019). In the present study, we identified
   DEGs based on three datasets from GEO, [41]GSE19826, [42]GSE54129, and
   [43]GSE118916. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway enrichment analysis were performed subsequently.
   Afterwards, we constructed the protein–protein interaction (PPI)
   network to identify hub genes using the STRING database and Cytoscape
   software. Then, we performed the survival analysis, including overall
   survival (OS), disease-free survival (DSS), and progress-free interval
   (PFI) to identify candidate genes. The expression of candidate genes
   and their correlation with the pathological stage were further analyzed
   along with the diagnostic value. A total of four genes were identified
   as potential biomarkers for GC in our study.

Materials and Methods

Microarray Datasets

   RNA-sequencing datasets containing gastric cancer tissue samples and
   normal tissue samples were obtained from the GEO database ([44]Barrett
   et al., 2013), ([45]https://www.ncbi.nlm.nih.gov/geo/) and three GEO
   datasets, including [46]GSE19826 ([47]Wang Q. et al., 2012),
   [48]GSE54129, and [49]GSE118916 ([50]Li et al., 2019), were downloaded
   for further analysis.

Identification of Differentially Expressed Genes

   The limma package (version: 3.40.2) of R software was used to identity
   the DEGs in three datasets. The adjusted p value was analyzed to
   correct for false positive results in GEO datasets. “Adjusted p < 0.05
   and fold change >1.5” were defined as the thresholds for the screening
   of the differential expression of mRNAs. Subsequently, the ggplot
   package (version: 3.3.3) of R software was used to make a Venn diagram
   to extract the common DEGs of the three datasets.

Enrichment Analysis of Differentially Expressed Genes

   The clusterProfiler package ([51]Yu et al., 2012) (version 3.14.3) of R
   software was used for enrichment analysis with the following ontology
   sources: GO biological processes (BPs), cellular components (CCs),
   molecular functions (MFs), and KEGG pathway. Adjusted p < 0.05 and q <
   0.2 were set as the critical standard for significant enrichment.

Analysis of Protein–Protein Interaction Network

   The PPI network of DEGs was generated using the search tool of the
   STRING database ([52]Szklarczyk et al., 2019) (version 11.5). The
   “Multiple Proteins by Names/Identifiers” tool was chosen in this study.
   The organism was set as “Homo Sapiens.” Required score was set as high
   confidence (0.700), and FDR stringency was set to medium (5%). The PPI
   network was exported for further analysis with the Cytoscape software
   ([53]Otasek et al., 2019) (version 3.8.2). The plugin MCODE
   ([54]Bandettini et al., 2012) (version 2.0.0, degree cutoff: 2, node
   score cutoff: 0.2, K-score: 2) was applied to identify the hub genes in
   the PPI network. The module with the highest degree was used in the
   following analysis.

The Expressions and Survival Analysis of Hub Genes

   The Cancer Genome Atlas (TCGA) project is an open database aiming to
   link cancer genomic data to patients’ clinicopathological information
   ([55]https://www.cancer.gov/tcga). Raw counts of RNA-sequencing data
   (level 3) were obtained from TCGA along with corresponding
   clinicopathological information ([56]Liu et al., 2018). TPM-formatted
   RNA-sequencing data of normal tissues from the Genotype-Tissue
   Expression Project (GTEx) were obtained from the University of
   California Santa Cruz ([57]Vivian et al., 2017)
   ([58]https://xenabrowser.net/datapages/). Tumor/normal differential
   expression analyses of hub genes were conducted using R software. We
   conducted the survival analysis, including the OS, DSS, and PFI, with
   the Xiantao Academic platform (survminer package of R software). DEGs
   related to the OS, DSS, and PFI were considered as our purpose genes
   and were involved in the following data analysis.

Correlation Analysis of Purpose Genes

   GEPIA ([59]Tang et al., 2017) ([60]http://gepia.cancer-pku.cn/) is a
   database that enable users to analyze the RNA-sequencing expression in
   various ways. We used GEPIA to correlate our purpose genes with the
   pathological stage. Correlation analysis among purpose genes were
   conducted using R software embedded in Xiantao Academic. Correlation
   among these purpose genes were visualized with a heat map generated by
   the ggplot package.

Statistical Analysis

   Xiantao Academic ([61]https://www.xiantao.love/products) is a platform
   embedded with R software and R packages for data analyzing. The major
   analysis was performed using Xiantao Academic in the present study.
   Chi-square test and the Wilcoxon rank sum test were utilized in the
   analysis depending on the data. Spearman correlation analysis was used
   in different expression of genes. In the analysis of the correlation of
   gene expression with pathological stage, the expression data are first
   log2 (TPM+1) transformed, and the method was one-way ANOVA, using
   pathological stage as a variable for calculating differential
   expression. p value <0.05 was regarded as statistically significant.

Results

Identification of Differentially Expressed Genes

   The present study involved three GEO datasets, [62]GSE19826,
   [63]GSE54129, and [64]GSE118916. [65]GSE19826 contained 12 pairs of
   samples from GC tumor and adjacent non-tumor tissues and three normal
   tissues. [66]GSE54129 contained 111 GC tumor tissue samples and 21
   normal tissue samples. [67]GSE118916 contained 15 pairs of Gastric
   cancer tumor and adjacent non-tumor (normal) tissues. There were 138 GC
   tumor tissue samples and 51 normal tissue samples in total involved in
   the present study. The flow chart is shown in [68]Figure 1. We
   identified 607 DEGs including 294 up-regulated genes and 313
   down-regulated genes in GC tissue samples ([69]Figure 2).

FIGURE 1.

   [70]FIGURE 1
   [71]Open in a new tab

   Flowchart diagram for bioinformatics analysis.

FIGURE 2.

   FIGURE 2
   [72]Open in a new tab

   The Venn diagram shows a total of 607 differentially expressed genes
   including 294 up-regulated genes and 313 down-regulated genes.

Functional Enrichment Analysis of Differentially Expressed Genes

   We conducted a functional enrichment analysis of DEGs using R software
   and R codes embedded in Xiantao platform. DEGs are enriched in 341
   terms of GO BP, including extracellular structure organization,
   extracellular matrix (ECM) organization, collagen fibril organization,
   bone development and connective tissue development, etc. DEGs were
   enriched in 43 terms of GO CC, including collagen-containing
   extracellular matrix, endoplasmic reticulum lumen, basement membrane,
   extracellular matrix component and collagen trimer, etc. DEGs were
   enriched in 35 terms of GO MF, including extracellular matrix
   structural constituent, extracellular matrix structural constituent
   conferring tensile strength, integrin binding, glycosaminoglycan
   binding, and platelet-derived growth factor binding ([73]Figure 3A;
   [74]Supplementary Table S1). DEGs were enriched in 10 terms of KEGG,
   including protein digestion and absorption, ECM-receptor interaction,
   Focal adhesion, human papillomavirus infection, beta-alanine
   metabolism, fatty acid degradation, gastric acid secretion, histidine
   metabolism, drug metabolism—cytochrome P450, and carbon metabolism
   ([75]Figure 3B; [76]Supplementary Table S1).

FIGURE 3.

   [77]FIGURE 3
   [78]Open in a new tab

   Functional analysis of DEGs. Top five GO terms enrichment in biological
   process (BP), cell composition (CC), and molecular function (MF) (A).
   KEGG enrichment of DGEs (B).

Protein–Protein Interaction Network to Identify Hub Genes

   A PPI network of 607 DEGs, containing a total of 317 nodes and 606
   edges, was generated using STRING, and an interaction score >0.7 was
   considered a high-confidence interaction relationship. We identified 15
   nodes and 81 edges with MCODE plugin. The module with the highest
   degree was used in the following analysis. The hub genes included
   COL1A1, COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A2, COL6A3,
   COL11A1, MMP2, P4HA3, PCOLCE, PLOD1, and SPARC ([79]Figure 4). Gene
   expression profiles of the 15 hub genes between GC tumor samples and
   normal samples are shown in [80]Figure 5. The expression of COL6A2
   showed no difference in tumor and normal tissues, so it was excluded in
   further analyses. The remaining 14 genes were considered as candidate
   genes for potential diagnostic and prognostic biomarkers.

FIGURE 4.

   FIGURE 4
   [81]Open in a new tab

   The PPI network of 15 hub genes selected with the MCODE plugin of
   Cytoscape.

FIGURE 5.

   [82]FIGURE 5
   [83]Open in a new tab

   Gene expression of 15 hub genes (COL1A1, COL1A2, COL3A1, COL4A1,
   COL4A2, COL5A1, COL5A2, COL6A2, COL6A3, COL11A1, MMP2, P4HA3, PCOLCE,
   PLOD1, and SPARC) based on TGCA and GTEx databases. ***p < 0.001; ns,
   not statistically significant.

Survival and Correlation Analysis

   We conducted Kaplan–Meier survival analysis with the candidate genes.
   Candidate genes related to OS, DSS, or PFI were considered as key
   genes. Among the 14 candidate genes, COL1A1 (HR = 1.41, p = 0.042),
   COL4A1 (HR = 1.45, p = 0.029), COL5A2 (HR = 1.54, p = 0.011), P4HA3 (HR
   = 1.57, p = 0.011), and SPARC (HR = 1.47, p = 0.022) were associated
   with the OS of STAD ([84]Figure 6). COL5A2 was associated with DSS (HR
   = 1.70, p = 0.015) ([85]Figure 7) and PFI (HR = 1.44, p = 0.043)
   ([86]Figure 8). Therefore, this study focused on the five key genes,
   COL1A1, COL4A1, COL5A2, P4HA3, and SPARC. Further analysis of the
   correlation between these key genes and the pathological stage of GC
   showed that COL1A1, COL5A2, P4HA3, and SPARC were significantly
   correlated to cancer pathological stages. However, COL4A1 showed no
   significance in the correlation analysis ([87]Figure 9). Therefore, we
   identify COL1A1, COL5A2, P4HA3, and SPARC as potential biomarkers for
   prognosis of GC.

FIGURE 6.

   [88]FIGURE 6
   [89]Open in a new tab

   Overall survival analysis of 14 candidate genes (COL1A1, COL1A2,
   COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A3, COL11A1, MMP2, P4HA3,
   PCOLCE, PLOD1, and SPARC). COL1A1, COL4A1, COL5A2, P4HA3, and SPARC (HR
   = 1.47, p = 0.022) were associated with OS.

FIGURE 7.

   [90]FIGURE 7
   [91]Open in a new tab

   Disease-specific survival analysis of 14 candidate genes (COL1A1,
   COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A3, COL11A1, MMP2,
   P4HA3, PCOLCE, PLOD1, and SPARC). COL5A2 was associated with DSS (HR =
   1.70, p = 0.015).

FIGURE 8.

   [92]FIGURE 8
   [93]Open in a new tab

   Progress free interval analysis of 14 candidate genes (COL1A1, COL1A2,
   COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A3, COL11A1, MMP2, P4HA3,
   PCOLCE, PLOD1, and SPARC). COL5A2 was associated with PFI (HR = 1.44, p
   = 0.043).

FIGURE 9.

   [94]FIGURE 9
   [95]Open in a new tab

   Correlation analysis between five key genes (COL1A1, COL4A1, COL5A2,
   P4HA3, and SPARC) and the pathological stage of GC shows they are
   potential prognostic markers.

Correlation Expression and Diagnostic Analysis

   We analyzed the correlation between these four genes on Xiantao
   Academic based on data from TCGA and found that all of these genes were
   highly correlated with each other. The r value ranged from 0.84 to 0.92
   (p < 0.01) ([96]Figure 10). We used a receiver operating characteristic
   (ROC) curve to assess the diagnostic value of the purpose genes using
   Xiantao Academic tools based on TCGA and GTEx samples. The area under
   curve (AUC) of COL1A1, COL5A2, P4HA3, and SPARC was 0.916, 0.802,
   0.874, and 0.895, respectively. The results, as shown previously,
   suggested that these four genes we selected could effectively
   distinguish GC samples with normal samples ([97]Figure 11). COL1A1,
   COL5A2, P4HA3, and SPARC could be biomarkers for the diagnosis and
   prognosis of GC.

FIGURE 10.

   FIGURE 10
   [98]Open in a new tab

   The expression of four genes (COL1A1, COL5A2, P4HA3, and SPARC) are
   correlated with each other in GC.

FIGURE 11.

   FIGURE 11
   [99]Open in a new tab

   ROC of four key genes (COL1A1, COL5A2, P4HA3, and SPARC) shows they are
   of high diagnostic value in GC.

Discussion

   GC is one of the most diagnosed cancers and has brought great burden to
   global health. Patients were likely to be diagnosed in their late stage
   due to the lack of specific clinical symptoms at an early stage. Thus,
   patients with GC have poor prognosis. It is urgent to identify relevant
   biomarkers that are valid for both diagnostic and prognostic
   evaluation. Bioinformatics analysis enables us to explore the genetic
   alterations in GC and has been proved to be a useful approach to
   identify new biomarkers in plenty of diseases. An initial objective of
   the project was to identify appropriate biomarkers of GC using
   bioinformatics analysis.

   In the current study, we identified 607 DEGs meeting the criteria. GO
   enrichment suggested those genes were significantly associated with
   extracellular structure and matrix organization, collagen fibril
   organization, and ECM and integrin binding. KEGG was mainly enriched in
   protein digestion and absorption, ECM-receptor interaction, and focal
   adhesion. In accordance with the present results, previous studies have
   reported that cancer-associated fibroblasts are essential in creating
   extracellular matrix structure and metabolism and account for the
   adaptive resistance to chemotherapy caused by immune reprogramming of
   the tumor microenvironment ([100]Quante et al., 2011; [101]Kalluri,
   2016). Extracellular matrix plays a significant part in the creation of
   tumor microenvironment and promotes malignancy ([102]Madsen and
   Sidenius, 2008; [103]Najafi et al., 2019; [104]Mohan et al., 2020;
   [105]Piersma et al., 2020; [106]Wang W. et al., 2020). Integrins
   coordinate ECM–cell and cell–cell interactions, signal transmission,
   gene expression, and cell function. The interaction between integrin
   and the cancer glycol microenvironment plays a significant part in
   regulating cancer progression ([107]Marsico et al., 2018).

   The results of this study showed that COL1A1, COL4A1, COL5A2, P4HA3,
   and SPARC were associated with the OS of GC. COL5A2 was associated with
   DSS and PFI. Further analysis of the correlation between these key
   genes and the pathological stages of GC showed that COL4A1 showed no
   significance in the correlation analysis to pathological stages.
   Therefore, we identify four genes as potential biomarkers of GC
   including COL1A1, COL5A2, P4HA3, and SPARC. Also, the diagnostic value
   of these genes was confirmed in the following analysis.

   COL1A1 is an important member of the type-I collagen family, the main
   fibrillar collagen and an essential structural component of the ECM
   ([108]Li J. et al., 2016). Many bioinformatic analyses identified
   COL1A1 as a biomarker of GC ([109]Wang W. et al., 2020; [110]Wang Y. et
   al., 2021; [111]Zhao et al., 2021). Abnormal expression of COL1A1 has
   been reported in several cancers, including hepatocellular carcinoma,
   ovarian cancer, and colorectal cancer, as well as in GC ([112]Li J. et
   al., 2016; [113]Zhang et al., 2018; [114]Ma et al., 2019; [115]Li et
   al., 2020). In vitro, enhanced expression of COL1A1 promotes the
   invasion and migration of GC cells, while knocking out COL1A1 inhibits
   the increase in cell metastasis ability ([116]Li et al., 2021). It
   plays an important role in promoting tumor cell proliferation,
   migration, invasion, epithelial–mesenchymal transformation (EMT), and
   chemotherapy resistance ([117]Armstrong et al., 2004; [118]Koenig et
   al., 2006; [119]Shintani et al., 2008; [120]Yang et al., 2014;
   [121]Zheng X. et al., 2017; [122]Yamazaki et al., 2018; [123]Shi et
   al., 2021). ROC analysis showed high diagnostic value of COL1A1 (AUC =
   0.916) based on 414 GC samples and 210 normal gastric tissues. This
   finding is consistent with that of [124]Zhao et al. (2021) (AUC =
   0.917) based on 375 GC samples and 32 normal samples ([125]Zhao et al.,
   2021). The diagnostic and prognostic values of COL1A1 were confirmed
   with extra data (more samples than others) from the present work.

   COL5A2 is a member of the type-V collagen family which is also a
   significant structural component of the ECM. COL5A2 was reported to
   promote proliferation and invasion in colon cancer and prostate cancer
   ([126]Ren et al., 2021; [127]Wang J. et al., 2021). Also, it has a
   strong correlation to the prognosis of renal cancer and gastric cancer
   ([128]Ding et al., 2021; [129]Tan et al., 2021). The overexpression of
   COL5A2 promoted the migration of GC cells in vitro and in vivo, and the
   knockdown of COL5A2 could significantly decrease the migration of cell
   ([130]Tan et al., 2021). A previous study had demonstrated that
   patients with higher COL5A2 levels were more likely to suffer from
   renal metastasis (AUC = 0.878). Among all those genes we identified as
   potential biomarkers, COL5A2 was the unique gene that was associated
   with the OS, DSS, and PFI of GC, which had not been reported in
   previous studies. The value of AUC in our current study is 0.802 based
   on data from TCGA and GTEx. Therefore, COL5A2 could serve as a novel
   biomarker of GC. Also, we would perform biological experiments to
   verify the result.

   Previous research showed that P4HA3 was up-regulated in head and neck
   squamous cell carcinoma (HNSCC) tissue, and it was demonstrated to
   promote HNSCC cell proliferation, invasion, and migration in vitro
   ([131]Wang T. et al., 2020). A recent study showed that the
   de-regulation of P4HA3 was associated with increased metastasis and
   poor prognosis of GC ([132]Song et al., 2018). In the present work, the
   value of AUC of P4HA3 is 0.875, which indicated high value of diagnosis
   and has not been reported in previous studies. The result suggests that
   P4HA3 is a potential biomarker of GC.

   SPARC is one of the first-known matricellular protein that modulates
   interactions between cells and the ECM. It has divergent actions due to
   different categories of tumors. It shows anti-tumor or tumor-promoting
   effects in different cancers ([133]Tai and Tang, 2008). What is
   surprising is that previous research results are inconsistent. As
   [134]Zhang et al. (2012) and [135]Zhang et al. (2014) reported, “SPARC
   expression is negatively correlated with the clinicopathological
   factors of gastric cancer and inhibits malignancy of gastric cancer
   cells,” and they confirmed the anti-tumor activity of SPARC in vivo and
   in vitro. The anti-tumor activity was also reported by [136]Wang L. et
   al. (2012) in a clinical trial involving 80 gastric cancer samples and
   30 normal samples. On the contrary, the tumor-promoting effect of SPARC
   was also reported in GC. Over expression of SPARC promoted GC
   progression, including serosal invasion, lymph node, and distant
   metastasis, and tended to poor prognosis of patients ([137]Zhao et al.,
   2010; [138]Sato et al., 2013; [139]Wang et al., 2014). Also, the
   invasion and proliferation ability was inhibited in SPARC knockdown
   MGC803 and HGC 27 gastric cancer cell lines, which demonstrated the
   tumor-promoting activity of SPARC. Increased expression of SPARC in
   this study corroborates these earlier findings ([140]Li Z. et al.,
   2016; [141]Liao et al., 2018; [142]Li et al., 2019). ROC analysis
   showed high diagnostic value of SPARC, and the value of AUC was 0.895
   in the current study. Biological experiments in different cell lines
   and clinical samples are necessary to verify the result.

   The correlation between these four genes was analyzed, and we found
   that all of these genes were highly correlated with each other, which
   enhanced their possibility as potential biomarkers of GC.

   In summary, previous studies have identified COL1A1 as a biomarker for
   GC diagnosis and prognosis. COL5A2, P4HA3, and SPARC were reported to
   be associated with poor prognosis (OS and DSS, but not PFI); however,
   the diagnostic value has not been recognized. In the present study, the
   prognosis values of COL1A1, COL5A2, P4HA3, and SPARC were confirmed.
   The ROC analysis showed that they could distinguish between GC samples
   and normal samples effectively. Thus, we suggest COL1A1, COL5A2, P4HA3,
   and SPARC as biomarkers for both diagnosis and prognosis of GC. Each of
   the biomarkers identified in the present work plays a significant role
   in the ECM, which highlights the importance of the tumor
   microenvironment in GC. Compared with similar studies, we suggested
   those genes as both diagnostic and prognostic biomarkers for GC.
   Nevertheless, the current results are all derived from bioinformatics
   analysis and are limited by the absence of confirmation. Due to
   different screening criteria, previous bioinformatics research produced
   different biomarkers. Many of the biomarkers have been verified, and
   the combination of those results might be more rigorous. Further
   clinical experiments are underway to verify their value in GC.

Acknowledgments