Abstract Background: Gastric cancer (GC) is one of the most prevalent cancers all over the world. The molecular mechanisms of GC remain unclear and not well understood. GC cases are majorly diagnosed at the late stage, resulting in a poor prognosis. Advances in molecular biology techniques allow us to get a better understanding of precise molecular mechanisms and enable us to identify the key genes in the carcinogenesis and progression of GC. Methods: The present study used datasets from the GEO database to screen differentially expressed genes (DEGs) between GC and normal gastric tissues. GO and KEGG enrichments were utilized to analyze the function of DEGs. The STRING database and Cytoscape software were applied to generate protein–protein network and find hub genes. The expression levels of hub genes were evaluated using data from the TCGA database. Survival analysis was conducted to evaluate the prognostic value of hub genes. The GEPIA database was involved to correlate key gene expressions with the pathological stage. Also, ROC curves were constructed to assess the diagnostic value of key genes. Results: A total of 607 DEGs were identified using three GEO datasets. GO analysis showed that the DEGs were mainly enriched in extracellular structure and matrix organization, collagen fibril organization, extracellular matrix (ECM), and integrin binding. KEGG enrichment was mainly enriched in protein digestion and absorption, ECM-receptor interaction, and focal adhesion. Fifteen genes were identified as hub genes, one of which was excluded for no significant expression between tumor and normal tissues. COL1A1, COL5A2, P4HA3, and SPARC showed high values in prognosis and diagnosis of GC. Conclusion: We suggest COL1A1, COL5A2, P4HA3, and SPARC as biomarkers for the diagnosis and prognosis of GC. Keywords: gastric cancer, bioinformatics analysis, microarray, differentially expressed genes, prognosis, diagnosis Introduction According to data published in 2021, gastric cancer (GC), among all cancers, ranked fourth in cancer-related deaths ([34]Sung et al., 2021). Stomach adenocarcinoma (STAD), the most common histological type, accounts for more than 90% of GC ([35]Ajani et al., 2017). Although endoscopy or histological detection has developed a lot in recent years, the majority of GC patients are diagnosed at their late and advanced stage due to an insidious onset, resulting in high morbidity and mortality ([36]Chen et al., 2020; [37]Wang W. et al., 2020). However, advances in molecular biology techniques allow us to approach precise molecular mechanisms of carcinogenesis and enable us to find potential diagnostic and prognostic biomarkers for GC. Previous bioinformatic studies resulted in different biomarkers due to different screening criteria and different datasets from Gene Expression Omnibus (GEO) ([38]Sun et al., 2017; [39]Zheng H.-C. et al., 2017; [40]Shi and Zhang, 2019). In the present study, we identified DEGs based on three datasets from GEO, [41]GSE19826, [42]GSE54129, and [43]GSE118916. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis were performed subsequently. Afterwards, we constructed the protein–protein interaction (PPI) network to identify hub genes using the STRING database and Cytoscape software. Then, we performed the survival analysis, including overall survival (OS), disease-free survival (DSS), and progress-free interval (PFI) to identify candidate genes. The expression of candidate genes and their correlation with the pathological stage were further analyzed along with the diagnostic value. A total of four genes were identified as potential biomarkers for GC in our study. Materials and Methods Microarray Datasets RNA-sequencing datasets containing gastric cancer tissue samples and normal tissue samples were obtained from the GEO database ([44]Barrett et al., 2013), ([45]https://www.ncbi.nlm.nih.gov/geo/) and three GEO datasets, including [46]GSE19826 ([47]Wang Q. et al., 2012), [48]GSE54129, and [49]GSE118916 ([50]Li et al., 2019), were downloaded for further analysis. Identification of Differentially Expressed Genes The limma package (version: 3.40.2) of R software was used to identity the DEGs in three datasets. The adjusted p value was analyzed to correct for false positive results in GEO datasets. “Adjusted p < 0.05 and fold change >1.5” were defined as the thresholds for the screening of the differential expression of mRNAs. Subsequently, the ggplot package (version: 3.3.3) of R software was used to make a Venn diagram to extract the common DEGs of the three datasets. Enrichment Analysis of Differentially Expressed Genes The clusterProfiler package ([51]Yu et al., 2012) (version 3.14.3) of R software was used for enrichment analysis with the following ontology sources: GO biological processes (BPs), cellular components (CCs), molecular functions (MFs), and KEGG pathway. Adjusted p < 0.05 and q < 0.2 were set as the critical standard for significant enrichment. Analysis of Protein–Protein Interaction Network The PPI network of DEGs was generated using the search tool of the STRING database ([52]Szklarczyk et al., 2019) (version 11.5). The “Multiple Proteins by Names/Identifiers” tool was chosen in this study. The organism was set as “Homo Sapiens.” Required score was set as high confidence (0.700), and FDR stringency was set to medium (5%). The PPI network was exported for further analysis with the Cytoscape software ([53]Otasek et al., 2019) (version 3.8.2). The plugin MCODE ([54]Bandettini et al., 2012) (version 2.0.0, degree cutoff: 2, node score cutoff: 0.2, K-score: 2) was applied to identify the hub genes in the PPI network. The module with the highest degree was used in the following analysis. The Expressions and Survival Analysis of Hub Genes The Cancer Genome Atlas (TCGA) project is an open database aiming to link cancer genomic data to patients’ clinicopathological information ([55]https://www.cancer.gov/tcga). Raw counts of RNA-sequencing data (level 3) were obtained from TCGA along with corresponding clinicopathological information ([56]Liu et al., 2018). TPM-formatted RNA-sequencing data of normal tissues from the Genotype-Tissue Expression Project (GTEx) were obtained from the University of California Santa Cruz ([57]Vivian et al., 2017) ([58]https://xenabrowser.net/datapages/). Tumor/normal differential expression analyses of hub genes were conducted using R software. We conducted the survival analysis, including the OS, DSS, and PFI, with the Xiantao Academic platform (survminer package of R software). DEGs related to the OS, DSS, and PFI were considered as our purpose genes and were involved in the following data analysis. Correlation Analysis of Purpose Genes GEPIA ([59]Tang et al., 2017) ([60]http://gepia.cancer-pku.cn/) is a database that enable users to analyze the RNA-sequencing expression in various ways. We used GEPIA to correlate our purpose genes with the pathological stage. Correlation analysis among purpose genes were conducted using R software embedded in Xiantao Academic. Correlation among these purpose genes were visualized with a heat map generated by the ggplot package. Statistical Analysis Xiantao Academic ([61]https://www.xiantao.love/products) is a platform embedded with R software and R packages for data analyzing. The major analysis was performed using Xiantao Academic in the present study. Chi-square test and the Wilcoxon rank sum test were utilized in the analysis depending on the data. Spearman correlation analysis was used in different expression of genes. In the analysis of the correlation of gene expression with pathological stage, the expression data are first log2 (TPM+1) transformed, and the method was one-way ANOVA, using pathological stage as a variable for calculating differential expression. p value <0.05 was regarded as statistically significant. Results Identification of Differentially Expressed Genes The present study involved three GEO datasets, [62]GSE19826, [63]GSE54129, and [64]GSE118916. [65]GSE19826 contained 12 pairs of samples from GC tumor and adjacent non-tumor tissues and three normal tissues. [66]GSE54129 contained 111 GC tumor tissue samples and 21 normal tissue samples. [67]GSE118916 contained 15 pairs of Gastric cancer tumor and adjacent non-tumor (normal) tissues. There were 138 GC tumor tissue samples and 51 normal tissue samples in total involved in the present study. The flow chart is shown in [68]Figure 1. We identified 607 DEGs including 294 up-regulated genes and 313 down-regulated genes in GC tissue samples ([69]Figure 2). FIGURE 1. [70]FIGURE 1 [71]Open in a new tab Flowchart diagram for bioinformatics analysis. FIGURE 2. FIGURE 2 [72]Open in a new tab The Venn diagram shows a total of 607 differentially expressed genes including 294 up-regulated genes and 313 down-regulated genes. Functional Enrichment Analysis of Differentially Expressed Genes We conducted a functional enrichment analysis of DEGs using R software and R codes embedded in Xiantao platform. DEGs are enriched in 341 terms of GO BP, including extracellular structure organization, extracellular matrix (ECM) organization, collagen fibril organization, bone development and connective tissue development, etc. DEGs were enriched in 43 terms of GO CC, including collagen-containing extracellular matrix, endoplasmic reticulum lumen, basement membrane, extracellular matrix component and collagen trimer, etc. DEGs were enriched in 35 terms of GO MF, including extracellular matrix structural constituent, extracellular matrix structural constituent conferring tensile strength, integrin binding, glycosaminoglycan binding, and platelet-derived growth factor binding ([73]Figure 3A; [74]Supplementary Table S1). DEGs were enriched in 10 terms of KEGG, including protein digestion and absorption, ECM-receptor interaction, Focal adhesion, human papillomavirus infection, beta-alanine metabolism, fatty acid degradation, gastric acid secretion, histidine metabolism, drug metabolism—cytochrome P450, and carbon metabolism ([75]Figure 3B; [76]Supplementary Table S1). FIGURE 3. [77]FIGURE 3 [78]Open in a new tab Functional analysis of DEGs. Top five GO terms enrichment in biological process (BP), cell composition (CC), and molecular function (MF) (A). KEGG enrichment of DGEs (B). Protein–Protein Interaction Network to Identify Hub Genes A PPI network of 607 DEGs, containing a total of 317 nodes and 606 edges, was generated using STRING, and an interaction score >0.7 was considered a high-confidence interaction relationship. We identified 15 nodes and 81 edges with MCODE plugin. The module with the highest degree was used in the following analysis. The hub genes included COL1A1, COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A2, COL6A3, COL11A1, MMP2, P4HA3, PCOLCE, PLOD1, and SPARC ([79]Figure 4). Gene expression profiles of the 15 hub genes between GC tumor samples and normal samples are shown in [80]Figure 5. The expression of COL6A2 showed no difference in tumor and normal tissues, so it was excluded in further analyses. The remaining 14 genes were considered as candidate genes for potential diagnostic and prognostic biomarkers. FIGURE 4. FIGURE 4 [81]Open in a new tab The PPI network of 15 hub genes selected with the MCODE plugin of Cytoscape. FIGURE 5. [82]FIGURE 5 [83]Open in a new tab Gene expression of 15 hub genes (COL1A1, COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A2, COL6A3, COL11A1, MMP2, P4HA3, PCOLCE, PLOD1, and SPARC) based on TGCA and GTEx databases. ***p < 0.001; ns, not statistically significant. Survival and Correlation Analysis We conducted Kaplan–Meier survival analysis with the candidate genes. Candidate genes related to OS, DSS, or PFI were considered as key genes. Among the 14 candidate genes, COL1A1 (HR = 1.41, p = 0.042), COL4A1 (HR = 1.45, p = 0.029), COL5A2 (HR = 1.54, p = 0.011), P4HA3 (HR = 1.57, p = 0.011), and SPARC (HR = 1.47, p = 0.022) were associated with the OS of STAD ([84]Figure 6). COL5A2 was associated with DSS (HR = 1.70, p = 0.015) ([85]Figure 7) and PFI (HR = 1.44, p = 0.043) ([86]Figure 8). Therefore, this study focused on the five key genes, COL1A1, COL4A1, COL5A2, P4HA3, and SPARC. Further analysis of the correlation between these key genes and the pathological stage of GC showed that COL1A1, COL5A2, P4HA3, and SPARC were significantly correlated to cancer pathological stages. However, COL4A1 showed no significance in the correlation analysis ([87]Figure 9). Therefore, we identify COL1A1, COL5A2, P4HA3, and SPARC as potential biomarkers for prognosis of GC. FIGURE 6. [88]FIGURE 6 [89]Open in a new tab Overall survival analysis of 14 candidate genes (COL1A1, COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A3, COL11A1, MMP2, P4HA3, PCOLCE, PLOD1, and SPARC). COL1A1, COL4A1, COL5A2, P4HA3, and SPARC (HR = 1.47, p = 0.022) were associated with OS. FIGURE 7. [90]FIGURE 7 [91]Open in a new tab Disease-specific survival analysis of 14 candidate genes (COL1A1, COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A3, COL11A1, MMP2, P4HA3, PCOLCE, PLOD1, and SPARC). COL5A2 was associated with DSS (HR = 1.70, p = 0.015). FIGURE 8. [92]FIGURE 8 [93]Open in a new tab Progress free interval analysis of 14 candidate genes (COL1A1, COL1A2, COL3A1, COL4A1, COL4A2, COL5A1, COL5A2, COL6A3, COL11A1, MMP2, P4HA3, PCOLCE, PLOD1, and SPARC). COL5A2 was associated with PFI (HR = 1.44, p = 0.043). FIGURE 9. [94]FIGURE 9 [95]Open in a new tab Correlation analysis between five key genes (COL1A1, COL4A1, COL5A2, P4HA3, and SPARC) and the pathological stage of GC shows they are potential prognostic markers. Correlation Expression and Diagnostic Analysis We analyzed the correlation between these four genes on Xiantao Academic based on data from TCGA and found that all of these genes were highly correlated with each other. The r value ranged from 0.84 to 0.92 (p < 0.01) ([96]Figure 10). We used a receiver operating characteristic (ROC) curve to assess the diagnostic value of the purpose genes using Xiantao Academic tools based on TCGA and GTEx samples. The area under curve (AUC) of COL1A1, COL5A2, P4HA3, and SPARC was 0.916, 0.802, 0.874, and 0.895, respectively. The results, as shown previously, suggested that these four genes we selected could effectively distinguish GC samples with normal samples ([97]Figure 11). COL1A1, COL5A2, P4HA3, and SPARC could be biomarkers for the diagnosis and prognosis of GC. FIGURE 10. FIGURE 10 [98]Open in a new tab The expression of four genes (COL1A1, COL5A2, P4HA3, and SPARC) are correlated with each other in GC. FIGURE 11. FIGURE 11 [99]Open in a new tab ROC of four key genes (COL1A1, COL5A2, P4HA3, and SPARC) shows they are of high diagnostic value in GC. Discussion GC is one of the most diagnosed cancers and has brought great burden to global health. Patients were likely to be diagnosed in their late stage due to the lack of specific clinical symptoms at an early stage. Thus, patients with GC have poor prognosis. It is urgent to identify relevant biomarkers that are valid for both diagnostic and prognostic evaluation. Bioinformatics analysis enables us to explore the genetic alterations in GC and has been proved to be a useful approach to identify new biomarkers in plenty of diseases. An initial objective of the project was to identify appropriate biomarkers of GC using bioinformatics analysis. In the current study, we identified 607 DEGs meeting the criteria. GO enrichment suggested those genes were significantly associated with extracellular structure and matrix organization, collagen fibril organization, and ECM and integrin binding. KEGG was mainly enriched in protein digestion and absorption, ECM-receptor interaction, and focal adhesion. In accordance with the present results, previous studies have reported that cancer-associated fibroblasts are essential in creating extracellular matrix structure and metabolism and account for the adaptive resistance to chemotherapy caused by immune reprogramming of the tumor microenvironment ([100]Quante et al., 2011; [101]Kalluri, 2016). Extracellular matrix plays a significant part in the creation of tumor microenvironment and promotes malignancy ([102]Madsen and Sidenius, 2008; [103]Najafi et al., 2019; [104]Mohan et al., 2020; [105]Piersma et al., 2020; [106]Wang W. et al., 2020). Integrins coordinate ECM–cell and cell–cell interactions, signal transmission, gene expression, and cell function. The interaction between integrin and the cancer glycol microenvironment plays a significant part in regulating cancer progression ([107]Marsico et al., 2018). The results of this study showed that COL1A1, COL4A1, COL5A2, P4HA3, and SPARC were associated with the OS of GC. COL5A2 was associated with DSS and PFI. Further analysis of the correlation between these key genes and the pathological stages of GC showed that COL4A1 showed no significance in the correlation analysis to pathological stages. Therefore, we identify four genes as potential biomarkers of GC including COL1A1, COL5A2, P4HA3, and SPARC. Also, the diagnostic value of these genes was confirmed in the following analysis. COL1A1 is an important member of the type-I collagen family, the main fibrillar collagen and an essential structural component of the ECM ([108]Li J. et al., 2016). Many bioinformatic analyses identified COL1A1 as a biomarker of GC ([109]Wang W. et al., 2020; [110]Wang Y. et al., 2021; [111]Zhao et al., 2021). Abnormal expression of COL1A1 has been reported in several cancers, including hepatocellular carcinoma, ovarian cancer, and colorectal cancer, as well as in GC ([112]Li J. et al., 2016; [113]Zhang et al., 2018; [114]Ma et al., 2019; [115]Li et al., 2020). In vitro, enhanced expression of COL1A1 promotes the invasion and migration of GC cells, while knocking out COL1A1 inhibits the increase in cell metastasis ability ([116]Li et al., 2021). It plays an important role in promoting tumor cell proliferation, migration, invasion, epithelial–mesenchymal transformation (EMT), and chemotherapy resistance ([117]Armstrong et al., 2004; [118]Koenig et al., 2006; [119]Shintani et al., 2008; [120]Yang et al., 2014; [121]Zheng X. et al., 2017; [122]Yamazaki et al., 2018; [123]Shi et al., 2021). ROC analysis showed high diagnostic value of COL1A1 (AUC = 0.916) based on 414 GC samples and 210 normal gastric tissues. This finding is consistent with that of [124]Zhao et al. (2021) (AUC = 0.917) based on 375 GC samples and 32 normal samples ([125]Zhao et al., 2021). The diagnostic and prognostic values of COL1A1 were confirmed with extra data (more samples than others) from the present work. COL5A2 is a member of the type-V collagen family which is also a significant structural component of the ECM. COL5A2 was reported to promote proliferation and invasion in colon cancer and prostate cancer ([126]Ren et al., 2021; [127]Wang J. et al., 2021). Also, it has a strong correlation to the prognosis of renal cancer and gastric cancer ([128]Ding et al., 2021; [129]Tan et al., 2021). The overexpression of COL5A2 promoted the migration of GC cells in vitro and in vivo, and the knockdown of COL5A2 could significantly decrease the migration of cell ([130]Tan et al., 2021). A previous study had demonstrated that patients with higher COL5A2 levels were more likely to suffer from renal metastasis (AUC = 0.878). Among all those genes we identified as potential biomarkers, COL5A2 was the unique gene that was associated with the OS, DSS, and PFI of GC, which had not been reported in previous studies. The value of AUC in our current study is 0.802 based on data from TCGA and GTEx. Therefore, COL5A2 could serve as a novel biomarker of GC. Also, we would perform biological experiments to verify the result. Previous research showed that P4HA3 was up-regulated in head and neck squamous cell carcinoma (HNSCC) tissue, and it was demonstrated to promote HNSCC cell proliferation, invasion, and migration in vitro ([131]Wang T. et al., 2020). A recent study showed that the de-regulation of P4HA3 was associated with increased metastasis and poor prognosis of GC ([132]Song et al., 2018). In the present work, the value of AUC of P4HA3 is 0.875, which indicated high value of diagnosis and has not been reported in previous studies. The result suggests that P4HA3 is a potential biomarker of GC. SPARC is one of the first-known matricellular protein that modulates interactions between cells and the ECM. It has divergent actions due to different categories of tumors. It shows anti-tumor or tumor-promoting effects in different cancers ([133]Tai and Tang, 2008). What is surprising is that previous research results are inconsistent. As [134]Zhang et al. (2012) and [135]Zhang et al. (2014) reported, “SPARC expression is negatively correlated with the clinicopathological factors of gastric cancer and inhibits malignancy of gastric cancer cells,” and they confirmed the anti-tumor activity of SPARC in vivo and in vitro. The anti-tumor activity was also reported by [136]Wang L. et al. (2012) in a clinical trial involving 80 gastric cancer samples and 30 normal samples. On the contrary, the tumor-promoting effect of SPARC was also reported in GC. Over expression of SPARC promoted GC progression, including serosal invasion, lymph node, and distant metastasis, and tended to poor prognosis of patients ([137]Zhao et al., 2010; [138]Sato et al., 2013; [139]Wang et al., 2014). Also, the invasion and proliferation ability was inhibited in SPARC knockdown MGC803 and HGC 27 gastric cancer cell lines, which demonstrated the tumor-promoting activity of SPARC. Increased expression of SPARC in this study corroborates these earlier findings ([140]Li Z. et al., 2016; [141]Liao et al., 2018; [142]Li et al., 2019). ROC analysis showed high diagnostic value of SPARC, and the value of AUC was 0.895 in the current study. Biological experiments in different cell lines and clinical samples are necessary to verify the result. The correlation between these four genes was analyzed, and we found that all of these genes were highly correlated with each other, which enhanced their possibility as potential biomarkers of GC. In summary, previous studies have identified COL1A1 as a biomarker for GC diagnosis and prognosis. COL5A2, P4HA3, and SPARC were reported to be associated with poor prognosis (OS and DSS, but not PFI); however, the diagnostic value has not been recognized. In the present study, the prognosis values of COL1A1, COL5A2, P4HA3, and SPARC were confirmed. The ROC analysis showed that they could distinguish between GC samples and normal samples effectively. Thus, we suggest COL1A1, COL5A2, P4HA3, and SPARC as biomarkers for both diagnosis and prognosis of GC. Each of the biomarkers identified in the present work plays a significant role in the ECM, which highlights the importance of the tumor microenvironment in GC. Compared with similar studies, we suggested those genes as both diagnostic and prognostic biomarkers for GC. Nevertheless, the current results are all derived from bioinformatics analysis and are limited by the absence of confirmation. Due to different screening criteria, previous bioinformatics research produced different biomarkers. Many of the biomarkers have been verified, and the combination of those results might be more rigorous. Further clinical experiments are underway to verify their value in GC. Acknowledgments