Abstract Background Hepatocellular carcinoma (HCC) is a type of primary liver tumor with poor prognosis and high mortality, and its molecular mechanism remains incompletely understood. This study aimed to use bioinformatics technology to identify differentially expressed genes (DEGs) in HCC pathogenesis, hoping to identify novel biomarkers or potential therapeutic targets for HCC research. Methods The bioinformatics analysis of our research mostly involved the following two datasets: Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). First, we screened DEGs based on the R packages (limma and edgeR). Using the DAVID database, the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses of DEGs were carried out. Next, the protein-protein interaction (PPI) network of the DEGs was built in the STRING database. Then, hub genes were screened through the cytoHubba plug-in, followed by verification using the GEPIA and Oncomine databases. We demonstrated differences in levels of the protein in hub genes using the Human Protein Atlas (HPA) database. Finally, the hub genes prognostic values were analyzed by the GEPIA database. Additionally, using the Comparative Toxicogenomics Database (CTD), we constructed the drug-gene interaction network. Results We ended up with 763 DEGs, including 247 upregulated and 516 downregulated DEGs, that were mainly enriched in the epoxygenase P450 pathway, oxidation-reduction process, and metabolism-related pathways. Through the constructed PPI network, it can be concluded that the P53 signaling pathway and the cell cycle are the most obvious in module analysis. From the PPI, we filtered out eight hub genes, and these genes were significantly upregulated in HCC samples, findings consistent with the expression validation results. Additionally, survival analysis showed that high level gene expression of CDC20, CDK1, MAD2L1, BUB1, BUB1B, CCNB1, and CCNA2 were connected with the poor overall survival of HCC patients. Toxicogenomics analysis showed that only topotecan, oxaliplatin, and azathioprine could reduce the gene expression levels of all seven hub genes. Conclusion The present study screened out the key genes and pathways that were related to HCC pathogenesis, which could provide new insight for the future molecularly targeted therapy and prognosis evaluation of HCC. Keywords: hepatocellular carcinoma, bioinformatics, differentially expressed genes, survival, biomarker, GEO, TCGA Introduction Accounting for 75-85% of all primary liver cancer, hepatocellular carcinoma (HCC) is the main histological classification of liver cancer, which is the fourth most frequent cause of cancer-related death globally ([31]Harris et al., 2019; [32]Yang J.D. et al., 2019). The liver is the second most common cancer-prone organ, after the lungs, as was shown by the recent cancer study in China ([33]Fu and Wang, 2018). On the whole, the estimated morbidity of HCC per 100,000 world standard population is 40.0 in males and 15.3 in females ([34]Zhu et al., 2016). Major risk factors for HCC include genetic predisposition, epigenetic variation, chronic hepatitis B infection, hepatitis C virus infection, smoking, obesity, aflatoxin exposure, and diabetes ([35]Puszyk et al., 2013; [36]Baecker et al., 2018). Transplantation is the most useful way to treat HCC; however, after the transplantation process, the tumor recurrence and metastasis rates are high ([37]Au and Chok, 2018; [38]Aufhauser et al., 2018). More than 70% of patients at advanced stage are not suitable for transplantation, whether due to the tumor burden or liver dysfunction ([39]Wang et al., 2019). Therefore, it is urgent to recognize new biomarkers that can act as molecular targets for therapy, and predictors of the prognosis of HCC. With the development of times and technological progress, microarray and high-throughput sequencing technologies have matured and become more reliable, and public databases are improving, such as the Gene Expression Omnibus (GEO)^[40]1 and the Cancer Genome Atlas (TCGA)^[41]2. The advancement of microarray ([42]Yang X. et al., 2018) and high throughput sequencing technologies ([43]Weinstein et al., 2013) has provided a highly efficient tools to explore key genetic or epigenetic changes in disease to identify biological markers that can be applied to disease diagnosis, therapy, and prognosis ([44]Weinstein et al., 2013; [45]Wang et al., 2018; [46]Yang X. et al., 2018; [47]Li et al., 2019). Additionally, the application of integrated bioinformatics methods in cancer research can solve the problem of different results due to errors caused by different technical platforms or small sample size, thus finding much valuable biological information ([48]Liu X. et al., 2018; [49]Deng et al., 2019; [50]Yan et al., 2019; [51]Yang K. et al., 2019). In this research, by analyzing and distinguishing genes in human HCC samples and normal hepatocyte samples using TCGA and GEO datasets, we firstly screened out differentially expressed genes (DEGs). Then, GO and KEGG pathway enrichment analyses were applied in the further exploration of the main biological functions, which regulated by the DEGs. After that, the final step is to utilize a protein–protein interaction (PPI) network, survival analyses and drug-gene interaction network analyses to ascertain crucial genes and pathways which affecting the pathogenic mechanism and prognosis of HCC patients. Materials and Methods Gene Expression Datasets The microarray gene expression dataset of [52]GSE121248, which comprises 70 hepatocellular carcinoma samples and 37 normal liver samples, was obtained from the GEO website and exploited as discovery dataset to identify DEGs. The included dataset met the following criteria: (1) dataset included human HCC samples and normal liver samples. (2) they contained at least ten samples. (3) dataset was obtained from the Affymetrix Human Genome U133 Plus 2.0 Array [HG-U133_Plus_2] microarray platform. The raw RNA sequencing data, which comprises 374 HCC samples and 50 normal liver tissue samples, was selected from the TCGA liver hepatocellular carcinoma (TCGA-LIHC) dataset and used as a validation dataset. Identification of DEGs We used the R language to analyze the original CEL files of the [53]GSE121248 dataset. The preprocessing procedures: using the “affy” R package to RMA background correction, Log2conversion, Quantile normalization, and Median polish algorithm summarization ([54]Bolstad et al., 2003; [55]Gautier et al., 2004). Using the bioconductor annotation package to convert microarray data probes into gene symbol. If multiple probes were mapped to a gene symbol, take the average value as the final expression value of the gene ([56]Zhang et al., 2018). Next, | log2fold change (FC)| > 1 and adjusted p value <0.05 were used to select the DEGs between tumor and normal tissues using the LIMMA package ([57]Ritchie et al., 2015; [58]Nagy et al., 2018). DEGs Validation Using the TCGA Dataset The DEGs from the [59]GSE121248 dataset were validated using the TCGA-LIHC dataset. The edgeR package of R software was applied to normalize and analyze the TCGA-LIHC dataset ([60]Robinson et al., 2010). | log2fold change (FC)| > 1 and p-value <0.05 were considered significant differences. The overlapping DEGs between [61]GSE121248 and TCGA-LIHC datasets were clustered using the pheatmap and were retained for further study. The overlapping DEGs were analyzed using VennDiagram and ggplot2 packages in R software to draw Venn diagrams and volcano plots, to visualize the identified DEGs ([62]Chen and Boutros, 2011). Functional Enrichment Analysis of Overlapping DEGs We used the Database for Annotation, Visualization and Integrated Discovery (DAVID version 6.8)^[63]3 to elucidate potential GO function [including biological processes (BP), molecular functions (MF), cellular components (CC)] and signaling pathways (KEGG) related to the overlapping DEGs ([64]Dennis et al., 2003; [65]Kanehisa et al., 2017). We used threshold p-value 0.05. Protein–Protein Interaction Network Construction and Module Analysis The Search Tool for the Retrieval of Interacting Genes (STRING version 11)^[66]4 database was one of the largest online databases of known protein-protein interactions covering the largest number of species ([67]Szklarczyk et al., 2017). The parameter of interactions was set with a confidence score >0.7. The confidence score refers to the strength of data support in terms of the thickness of the line. Confidence score >0.7 means high confidence. Overlapping DEGs were entered into Cytoscape software (version 3.7.2)^[68]5 to construct and analyze PPI network ([69]Shannon et al., 2003). Moreover, the Cytoscape plug-in MCODE was used to screen crucial clustering modules in the entire network ([70]Bader and Hogue, 2003). Identification of Hub Genes The Cytoscape plug-in CytoHubba was used to calculate the protein node degree ([71]Chin et al., 2014; [72]Cao et al., 2018). The top three methods [(Maximal Clique Centrality (MCC), Maximum Neighborhood Component (MNC), and Density of Maximum Neighborhood Component (DMNC)] were selected to provide the analyzed results. Each method displayed their top ten genes. A Venn diagram was generated to visualize common hub genes based on these three methods. Expression Analysis of Hub Genes in Multiple Databases The hub genes mRNA expression levels were finally validated in two databases, Gene Expression Profiling Interactive Analysis (GEPIA)^[73]6 ([74]Tang et al., 2017) and Oncomine. Oncomine (Version4.5)^[75]7 is an online database that has the comprehensive cancer mutation spectrum, gene expression data and related clinical information, which can be used to discover new biomarkers or new therapeutic targets ([76]Rhodes et al., 2004). In addition to detecting the mRNA expression levels of the hub genes, we also investigated the protein levels in HCC tissues and normal liver tissues using the human protein atlas database (HPA v19)^[77]8 ([78]Thul and Lindskog, 2018). Survival Analysis Gene Expression Profiling Interactive Analysis is a newly developed online database for cancer and normal gene expression profiling. In the current study, the overall survival of each hub gene was analyzed using LIHC dataset in the GEPIA database. The patients were divided into two groups (the high- and low-expression group) according to the median expression level of each hub gene. This division method could evaluate the difference in overall survival probability between these two groups. We were drawn the overall survival curves of each hub gene using the GEPIA database, with a p-value <0.05. Drug-Gene Interaction Network Analysis The Comparative Toxicogenomics Database (CTD)^[79]9, an online database providing information on the interactions between gene products and chemotherapeutic drugs, and their relationships to diseases) was used to construct the chemotherapeutic drug-gene interaction network ([80]Davis et al., 2019). The networks were visualized by Cytoscape software 3.7.2^[81]10. Results Identification of DEGs The gene expression dataset of [82]GSE121248, which contains 70 LIHC samples and 37 normal liver samples, was analyzed in the limma package using | logFC| > 1 and corrected p-value <0.05 of R software. In total, 1,518 DEGs (557 high expression genes and 961 low expression genes) were identified between HCC tissue samples and normal liver tissue samples. The volcano map and heatmap of all DEGs are shown in [83]Figures 1A,C. Additionally, compared with normal liver tissues in the TCGA-LIHC dataset, 2,898 DEGs were obtained in LIHC tissues, comprising 1,299 upregulated genes and 1,599 downregulated genes ([84]Figure 1B). Furthermore, 763 overlapping DEGs (247 high expression genes and 516 low expression genes) were identified between the [85]GSE121248 and TCGA-LIHC datasets using a Venn diagram ([86]Figure 1D). [87]Figure 1E shows clustering analysis results of the 763 overlapping DEGs based on the TCGA-LIHC dataset. FIGURE 1. [88]FIGURE 1 [89]Open in a new tab Identification of DEGs. (A,B) show the volcano maps of DEGs for (A) [90]GSE121248 dataset, (B) TCGA-LIHC dataset. (C) The heatmap of the top 50 DEGs in dataset [91]GSE121248. The green color and red color in the heatmap indicate low and high expression of DEGs. (D) Venn diagrams of the DEGs between the [92]GSE121248 dataset and the TCGA-LIHC dataset. (E) The heatmap of the top 100 overlapping DEGs according to the value of | logFC| in TCGA-LIHC dataset. The color in heatmaps from green to red shows the progression from down-regulation to up-regulation. Enrichment Analysis of Overlapping DEGs We conducted GO and KEGG pathway enrichment analysis to further elucidate potential biological functions associated with the 763 overlapping DEGs of HCC. The GO analysis results of the DEGs were classified into molecular functions, biological processes and cellular components. For molecular functions, the overlapping DEGs were mainly associated with oxidoreductase activity, monooxygenase activity, heme binding and oxygen binding ([93]Figure 2A). In the BP category, the epoxygenase P450 pathway, oxidation-reduction process, response to drug and cell division were enriched ([94]Figure 2B). In the CC category, they were enriched in extracellular regions, such as extracellular exosomes and the extracellular space ([95]Figure 2C). The pathway enrichment analysis results showed that overlapping DEGs mainly participated in multiple metabolism pathways, such as fatty acid degradation, glycine, serine and threonine metabolism, and tryptophan metabolism ([96]Figure 2D). FIGURE 2. [97]FIGURE 2 [98]Open in a new tab Enrichment analysis of the overlapping DEGs. (A–C) illustrate the GO enrichment analysis results: (A) molecular function, (B) biological process and (C) cellular components. (D) KEGG pathway enrichment analysis results. PPI Network Establishment and Module Analysis To further reveal the potential relationships between proteins encoded by DEGs, a PPI network was constructed using the STRING database. Network analysis of overlapping DEGs revealed 526 nodes and 4,173 edges in the PPI network. Additionally, we conducted module analysis using the MCODE plug-in to detect crucial clustering modules. In total, 29 clusters were obtained in MCODE, and the top three modules with the highest scores were selected as hub modules. Module 1 contained 63 nodes and 1,752 edges with the highest score of 56.516 and was mainly enriched in cell cycle, oocyte meiosis, P53 signaling pathway and progesterone-mediated oocyte maturation ([99]Figure 3A). Module 2 contained 17 nodes and 80 edges with a score of 10 and mainly participated in PPAR signaling pathway and glycerolipid metabolism ([100]Figure 3B). Module 3 comprised 28 nodes and 100 edges with a score of 7.407 and was mainly implicated in Chemical carcinogenesis, Peroxisome, Metabolic pathways and Drug metabolism cytochrome P450 ([101]Figure 3C). FIGURE 3. [102]FIGURE 3 [103]Open in a new tab Venn diagram and the top three clustering modules of PPI network. (A) Module 1 with an MCODE score of 56.5. The red nodes are the hub genes. (B) Module 2 obtained a score of 10.0 from MCODE. (C) Module 3 with an MCODE score of 7.4. Edges represent the protein-protein associations. The higher the module score, the more important the module is in the PPI network. (D) Venn diagrams of the hub genes between three methods (MNC, MCC, and DMNC). Hub Genes Selection From the PPI Network The Cytoscape plug-in cytoHubba including the top three algorithms (MCC, MNC, and DMNC) was applied to select hub genes, and the top 10 genes were selected by each of the three methods. The Venn diagram identified eight overlapping hub genes based on these three methods ([104]Figure 3D): cell division cycle protein 20 homolog (CDC20), cyclin-dependent kinase1 (CDK1), mitotic spindle assembly checkpoint protein MAD2A (MAD2L1), threonine-protein kinase BUB1 (BUB1), threonine-protein kinase BUB1 beta (BUB1B), mitotic-specific cyclin-B1 (CCNB1), mitotic-specific cyclin-B2 (CCNB2) and cyclin-A2 (CCNA2). These eight hub genes were used for further analysis. Validation of Hub Genes in Multiple Databases Oncomine and GEPIA were applied to validate the differentially expression levels of 8 hub genes between HCC tissues and normal liver tissues in HCC. These eight hub genes were all remarkably overexpressed in HCC samples ([105]Figure 4). Moreover, a summary of hub genes in multiple tumors indicated that hub genes were significantly overexpressed in HCC ([106]Figure 5). Furthermore, we also investigated the protein expression levels in HCC tissue samples and normal liver tissue samples using the human protein atlas database. Because the HPA dataset could not provide immunohistochemical information on BUB1 and BUB1B, we showed the results of the remaining six staining pairs in [107]Figure 6. The protein expression levels of hub genes were agreed with the mRNA expression results, and most genes were overexpressed in HCC tissue ([108]Figure 7). These findings indicate that the overexpression of these hub genes may play a critical role in HCC mechanism. FIGURE 4. [109]FIGURE 4 [110]Open in a new tab Validation of eight hub genes mRNA expression levels in HCC tissues vs. normal liver tissues using the GEPIA database (A–H). The red color represents the tumor samples and the gray color represents the normal liver samples. FIGURE 5. [111]FIGURE 5 [112]Open in a new tab An summary of mRNA expression results of 8 hub genes in multiple tumors using the Oncomine database. The numbers in colored cells show the quantities of datasets with high (red) or low (blue) mRNA expression of the hub genes. FIGURE 6. [113]FIGURE 6 [114]Open in a new tab The OS analysis of 8 hub genes in the HCC patients using the GEPIA database. The red curve is the high expression group and the blue curve is the low-expression group. p-value < 0.05. FIGURE 7. [115]FIGURE 7 [116]Open in a new tab Immunohistochemical staining analysis of hub genes (CCNA2, CCNB1, CCNB2, CDC20, CDK1, and MAD2L1) in HCC tissues and normal liver tissues. Survival Analysis We further used the GEPIA database to analyze the prognostic value of these 8 hub genes in HCC patients. The survival analysis of patients in the GEPIA database was based on the TCGA-LIHC data set. We used threshold p-value 0.05 and calculated the hazards ratio based on Cox PH Model ([117]Xu et al., 2020). The relatively higher expression of CDC20 (HR = 2.3; P = 3.8e-06), CDK1 (HR = 2; P = 0.00017), MAD2L1 (HR = 1.7; P = 0.0047), BUB1 (HR = 1.8; P = 0.001), BUBIB (HR = 1.7; P = 0.0028), CCNB1 (HR = 2; P = 0.00015), and CCNA2 (HR = 1.7; P = 0.0037) were associated with a poor prognosis in HCC patients, while only CCNB2 (HR = 1.4; P = 0.052) showed no statistical significance in the overall survival of patients ([118]Figure 6). Drug-Gene Interaction Network Analysis To investigate the potential information on the interactions between hub genes and cancer chemotherapeutics drugs, we used the CTD database to construct chemotherapeutics drug-gene interaction network. Various drugs could influence the mRNA expression level of seven hub genes, namely, CDC20, CDK1, MAD2L1, CCNA2, CCNB1, BUB1, and BUB1B ([119]Figure 8). However, only topotecan, oxaliplatin and azathioprine could reduce expression levels of all seven hub genes. FIGURE 8. [120]FIGURE 8 [121]Open in a new tab Drug-gene interactions network with chemotherapeutic drugs and seven hub genes was constructed using the CTD database. (A–G) shows the relationship between existing chemotherapeutic drugs and the expression levels of hub genes. (A) BUB1, (B) BUB1B, (C) CCNA2, (D) CCNB1, (E) CDC20, (F) MAD2L1, and (G) CDK1. The red and green arrows represent that the chemotherapy drugs will increase or decrease the expression of the hub genes. The number of arrows between hub genes and chemotherapy drugs indicates the number of references supported by previous studies.