Abstract

Background

   Hepatocellular carcinoma (HCC) is a type of primary liver tumor with
   poor prognosis and high mortality, and its molecular mechanism remains
   incompletely understood. This study aimed to use bioinformatics
   technology to identify differentially expressed genes (DEGs) in HCC
   pathogenesis, hoping to identify novel biomarkers or potential
   therapeutic targets for HCC research.

Methods

   The bioinformatics analysis of our research mostly involved the
   following two datasets: Gene Expression Omnibus (GEO) and The Cancer
   Genome Atlas (TCGA). First, we screened DEGs based on the R packages
   (limma and edgeR). Using the DAVID database, the Gene Ontology (GO) and
   Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses of
   DEGs were carried out. Next, the protein-protein interaction (PPI)
   network of the DEGs was built in the STRING database. Then, hub genes
   were screened through the cytoHubba plug-in, followed by verification
   using the GEPIA and Oncomine databases. We demonstrated differences in
   levels of the protein in hub genes using the Human Protein Atlas (HPA)
   database. Finally, the hub genes prognostic values were analyzed by the
   GEPIA database. Additionally, using the Comparative Toxicogenomics
   Database (CTD), we constructed the drug-gene interaction network.

Results

   We ended up with 763 DEGs, including 247 upregulated and 516
   downregulated DEGs, that were mainly enriched in the epoxygenase P450
   pathway, oxidation-reduction process, and metabolism-related pathways.
   Through the constructed PPI network, it can be concluded that the P53
   signaling pathway and the cell cycle are the most obvious in module
   analysis. From the PPI, we filtered out eight hub genes, and these
   genes were significantly upregulated in HCC samples, findings
   consistent with the expression validation results. Additionally,
   survival analysis showed that high level gene expression of CDC20,
   CDK1, MAD2L1, BUB1, BUB1B, CCNB1, and CCNA2 were connected with the
   poor overall survival of HCC patients. Toxicogenomics analysis showed
   that only topotecan, oxaliplatin, and azathioprine could reduce the
   gene expression levels of all seven hub genes.

Conclusion

   The present study screened out the key genes and pathways that were
   related to HCC pathogenesis, which could provide new insight for the
   future molecularly targeted therapy and prognosis evaluation of HCC.

   Keywords: hepatocellular carcinoma, bioinformatics, differentially
   expressed genes, survival, biomarker, GEO, TCGA

Introduction

   Accounting for 75-85% of all primary liver cancer, hepatocellular
   carcinoma (HCC) is the main histological classification of liver
   cancer, which is the fourth most frequent cause of cancer-related death
   globally ([31]Harris et al., 2019; [32]Yang J.D. et al., 2019). The
   liver is the second most common cancer-prone organ, after the lungs, as
   was shown by the recent cancer study in China ([33]Fu and Wang, 2018).
   On the whole, the estimated morbidity of HCC per 100,000 world standard
   population is 40.0 in males and 15.3 in females ([34]Zhu et al., 2016).
   Major risk factors for HCC include genetic predisposition, epigenetic
   variation, chronic hepatitis B infection, hepatitis C virus infection,
   smoking, obesity, aflatoxin exposure, and diabetes ([35]Puszyk et al.,
   2013; [36]Baecker et al., 2018). Transplantation is the most useful way
   to treat HCC; however, after the transplantation process, the tumor
   recurrence and metastasis rates are high ([37]Au and Chok, 2018;
   [38]Aufhauser et al., 2018). More than 70% of patients at advanced
   stage are not suitable for transplantation, whether due to the tumor
   burden or liver dysfunction ([39]Wang et al., 2019). Therefore, it is
   urgent to recognize new biomarkers that can act as molecular targets
   for therapy, and predictors of the prognosis of HCC. With the
   development of times and technological progress, microarray and
   high-throughput sequencing technologies have matured and become more
   reliable, and public databases are improving, such as the Gene
   Expression Omnibus (GEO)^[40]1 and the Cancer Genome Atlas
   (TCGA)^[41]2. The advancement of microarray ([42]Yang X. et al., 2018)
   and high throughput sequencing technologies ([43]Weinstein et al.,
   2013) has provided a highly efficient tools to explore key genetic or
   epigenetic changes in disease to identify biological markers that can
   be applied to disease diagnosis, therapy, and prognosis ([44]Weinstein
   et al., 2013; [45]Wang et al., 2018; [46]Yang X. et al., 2018; [47]Li
   et al., 2019). Additionally, the application of integrated
   bioinformatics methods in cancer research can solve the problem of
   different results due to errors caused by different technical platforms
   or small sample size, thus finding much valuable biological information
   ([48]Liu X. et al., 2018; [49]Deng et al., 2019; [50]Yan et al., 2019;
   [51]Yang K. et al., 2019).

   In this research, by analyzing and distinguishing genes in human HCC
   samples and normal hepatocyte samples using TCGA and GEO datasets, we
   firstly screened out differentially expressed genes (DEGs). Then, GO
   and KEGG pathway enrichment analyses were applied in the further
   exploration of the main biological functions, which regulated by the
   DEGs. After that, the final step is to utilize a protein–protein
   interaction (PPI) network, survival analyses and drug-gene interaction
   network analyses to ascertain crucial genes and pathways which
   affecting the pathogenic mechanism and prognosis of HCC patients.

Materials and Methods

Gene Expression Datasets

   The microarray gene expression dataset of [52]GSE121248, which
   comprises 70 hepatocellular carcinoma samples and 37 normal liver
   samples, was obtained from the GEO website and exploited as discovery
   dataset to identify DEGs. The included dataset met the following
   criteria: (1) dataset included human HCC samples and normal liver
   samples. (2) they contained at least ten samples. (3) dataset was
   obtained from the Affymetrix Human Genome U133 Plus 2.0 Array
   [HG-U133_Plus_2] microarray platform. The raw RNA sequencing data,
   which comprises 374 HCC samples and 50 normal liver tissue samples, was
   selected from the TCGA liver hepatocellular carcinoma (TCGA-LIHC)
   dataset and used as a validation dataset.

Identification of DEGs

   We used the R language to analyze the original CEL files of the
   [53]GSE121248 dataset.

   The preprocessing procedures: using the “affy” R package to RMA
   background correction, Log2conversion, Quantile normalization, and
   Median polish algorithm summarization ([54]Bolstad et al., 2003;
   [55]Gautier et al., 2004). Using the bioconductor annotation package to
   convert microarray data probes into gene symbol. If multiple probes
   were mapped to a gene symbol, take the average value as the final
   expression value of the gene ([56]Zhang et al., 2018). Next, | log2fold
   change (FC)| > 1 and adjusted p value <0.05 were used to select the
   DEGs between tumor and normal tissues using the LIMMA package
   ([57]Ritchie et al., 2015; [58]Nagy et al., 2018).

DEGs Validation Using the TCGA Dataset

   The DEGs from the [59]GSE121248 dataset were validated using the
   TCGA-LIHC dataset.

   The edgeR package of R software was applied to normalize and analyze
   the TCGA-LIHC dataset ([60]Robinson et al., 2010). | log2fold change
   (FC)| > 1 and p-value <0.05 were considered significant differences.
   The overlapping DEGs between [61]GSE121248 and TCGA-LIHC datasets were
   clustered using the pheatmap and were retained for further study. The
   overlapping DEGs were analyzed using VennDiagram and ggplot2 packages
   in R software to draw Venn diagrams and volcano plots, to visualize the
   identified DEGs ([62]Chen and Boutros, 2011).

Functional Enrichment Analysis of Overlapping DEGs

   We used the Database for Annotation, Visualization and Integrated
   Discovery (DAVID version 6.8)^[63]3 to elucidate potential GO function
   [including biological processes (BP), molecular functions (MF),
   cellular components (CC)] and signaling pathways (KEGG) related to the
   overlapping DEGs ([64]Dennis et al., 2003; [65]Kanehisa et al., 2017).
   We used threshold p-value 0.05.

Protein–Protein Interaction Network Construction and Module Analysis

   The Search Tool for the Retrieval of Interacting Genes (STRING version
   11)^[66]4 database was one of the largest online databases of known
   protein-protein interactions covering the largest number of species
   ([67]Szklarczyk et al., 2017). The parameter of interactions was set
   with a confidence score >0.7. The confidence score refers to the
   strength of data support in terms of the thickness of the line.
   Confidence score >0.7 means high confidence. Overlapping DEGs were
   entered into Cytoscape software (version 3.7.2)^[68]5 to construct and
   analyze PPI network ([69]Shannon et al., 2003). Moreover, the Cytoscape
   plug-in MCODE was used to screen crucial clustering modules in the
   entire network ([70]Bader and Hogue, 2003).

Identification of Hub Genes

   The Cytoscape plug-in CytoHubba was used to calculate the protein node
   degree ([71]Chin et al., 2014; [72]Cao et al., 2018). The top three
   methods [(Maximal Clique Centrality (MCC), Maximum Neighborhood
   Component (MNC), and Density of Maximum Neighborhood Component (DMNC)]
   were selected to provide the analyzed results. Each method displayed
   their top ten genes. A Venn diagram was generated to visualize common
   hub genes based on these three methods.

Expression Analysis of Hub Genes in Multiple Databases

   The hub genes mRNA expression levels were finally validated in two
   databases, Gene Expression Profiling Interactive Analysis (GEPIA)^[73]6
   ([74]Tang et al., 2017) and Oncomine. Oncomine (Version4.5)^[75]7 is an
   online database that has the comprehensive cancer mutation spectrum,
   gene expression data and related clinical information, which can be
   used to discover new biomarkers or new therapeutic targets ([76]Rhodes
   et al., 2004). In addition to detecting the mRNA expression levels of
   the hub genes, we also investigated the protein levels in HCC tissues
   and normal liver tissues using the human protein atlas database (HPA
   v19)^[77]8 ([78]Thul and Lindskog, 2018).

Survival Analysis

   Gene Expression Profiling Interactive Analysis is a newly developed
   online database for cancer and normal gene expression profiling. In the
   current study, the overall survival of each hub gene was analyzed using
   LIHC dataset in the GEPIA database. The patients were divided into two
   groups (the high- and low-expression group) according to the median
   expression level of each hub gene. This division method could evaluate
   the difference in overall survival probability between these two
   groups. We were drawn the overall survival curves of each hub gene
   using the GEPIA database, with a p-value <0.05.

Drug-Gene Interaction Network Analysis

   The Comparative Toxicogenomics Database (CTD)^[79]9, an online database
   providing information on the interactions between gene products and
   chemotherapeutic drugs, and their relationships to diseases) was used
   to construct the chemotherapeutic drug-gene interaction network
   ([80]Davis et al., 2019). The networks were visualized by Cytoscape
   software 3.7.2^[81]10.

Results

Identification of DEGs

   The gene expression dataset of [82]GSE121248, which contains 70 LIHC
   samples and 37 normal liver samples, was analyzed in the limma package
   using | logFC| > 1 and corrected p-value <0.05 of R software. In total,
   1,518 DEGs (557 high expression genes and 961 low expression genes)
   were identified between HCC tissue samples and normal liver tissue
   samples. The volcano map and heatmap of all DEGs are shown in
   [83]Figures 1A,C. Additionally, compared with normal liver tissues in
   the TCGA-LIHC dataset, 2,898 DEGs were obtained in LIHC tissues,
   comprising 1,299 upregulated genes and 1,599 downregulated genes
   ([84]Figure 1B). Furthermore, 763 overlapping DEGs (247 high expression
   genes and 516 low expression genes) were identified between the
   [85]GSE121248 and TCGA-LIHC datasets using a Venn diagram ([86]Figure
   1D). [87]Figure 1E shows clustering analysis results of the 763
   overlapping DEGs based on the TCGA-LIHC dataset.

FIGURE 1.

   [88]FIGURE 1
   [89]Open in a new tab

   Identification of DEGs. (A,B) show the volcano maps of DEGs for (A)
   [90]GSE121248 dataset, (B) TCGA-LIHC dataset. (C) The heatmap of the
   top 50 DEGs in dataset [91]GSE121248. The green color and red color in
   the heatmap indicate low and high expression of DEGs. (D) Venn diagrams
   of the DEGs between the [92]GSE121248 dataset and the TCGA-LIHC
   dataset. (E) The heatmap of the top 100 overlapping DEGs according to
   the value of | logFC| in TCGA-LIHC dataset. The color in heatmaps from
   green to red shows the progression from down-regulation to
   up-regulation.

Enrichment Analysis of Overlapping DEGs

   We conducted GO and KEGG pathway enrichment analysis to further
   elucidate potential biological functions associated with the 763
   overlapping DEGs of HCC. The GO analysis results of the DEGs were
   classified into molecular functions, biological processes and cellular
   components. For molecular functions, the overlapping DEGs were mainly
   associated with oxidoreductase activity, monooxygenase activity, heme
   binding and oxygen binding ([93]Figure 2A). In the BP category, the
   epoxygenase P450 pathway, oxidation-reduction process, response to drug
   and cell division were enriched ([94]Figure 2B). In the CC category,
   they were enriched in extracellular regions, such as extracellular
   exosomes and the extracellular space ([95]Figure 2C). The pathway
   enrichment analysis results showed that overlapping DEGs mainly
   participated in multiple metabolism pathways, such as fatty acid
   degradation, glycine, serine and threonine metabolism, and tryptophan
   metabolism ([96]Figure 2D).

FIGURE 2.

   [97]FIGURE 2
   [98]Open in a new tab

   Enrichment analysis of the overlapping DEGs. (A–C) illustrate the GO
   enrichment analysis results: (A) molecular function, (B) biological
   process and (C) cellular components. (D) KEGG pathway enrichment
   analysis results.

PPI Network Establishment and Module Analysis

   To further reveal the potential relationships between proteins encoded
   by DEGs, a PPI network was constructed using the STRING database.
   Network analysis of overlapping DEGs revealed 526 nodes and 4,173 edges
   in the PPI network. Additionally, we conducted module analysis using
   the MCODE plug-in to detect crucial clustering modules. In total, 29
   clusters were obtained in MCODE, and the top three modules with the
   highest scores were selected as hub modules. Module 1 contained 63
   nodes and 1,752 edges with the highest score of 56.516 and was mainly
   enriched in cell cycle, oocyte meiosis, P53 signaling pathway and
   progesterone-mediated oocyte maturation ([99]Figure 3A). Module 2
   contained 17 nodes and 80 edges with a score of 10 and mainly
   participated in PPAR signaling pathway and glycerolipid metabolism
   ([100]Figure 3B). Module 3 comprised 28 nodes and 100 edges with a
   score of 7.407 and was mainly implicated in Chemical carcinogenesis,
   Peroxisome, Metabolic pathways and Drug metabolism cytochrome P450
   ([101]Figure 3C).

FIGURE 3.

   [102]FIGURE 3
   [103]Open in a new tab

   Venn diagram and the top three clustering modules of PPI network. (A)
   Module 1 with an MCODE score of 56.5. The red nodes are the hub genes.
   (B) Module 2 obtained a score of 10.0 from MCODE. (C) Module 3 with an
   MCODE score of 7.4. Edges represent the protein-protein associations.
   The higher the module score, the more important the module is in the
   PPI network. (D) Venn diagrams of the hub genes between three methods
   (MNC, MCC, and DMNC).

Hub Genes Selection From the PPI Network

   The Cytoscape plug-in cytoHubba including the top three algorithms
   (MCC, MNC, and DMNC) was applied to select hub genes, and the top 10
   genes were selected by each of the three methods. The Venn diagram
   identified eight overlapping hub genes based on these three methods
   ([104]Figure 3D): cell division cycle protein 20 homolog (CDC20),
   cyclin-dependent kinase1 (CDK1), mitotic spindle assembly checkpoint
   protein MAD2A (MAD2L1), threonine-protein kinase BUB1 (BUB1),
   threonine-protein kinase BUB1 beta (BUB1B), mitotic-specific cyclin-B1
   (CCNB1), mitotic-specific cyclin-B2 (CCNB2) and cyclin-A2 (CCNA2).
   These eight hub genes were used for further analysis.

Validation of Hub Genes in Multiple Databases

   Oncomine and GEPIA were applied to validate the differentially
   expression levels of 8 hub genes between HCC tissues and normal liver
   tissues in HCC. These eight hub genes were all remarkably overexpressed
   in HCC samples ([105]Figure 4). Moreover, a summary of hub genes in
   multiple tumors indicated that hub genes were significantly
   overexpressed in HCC ([106]Figure 5). Furthermore, we also investigated
   the protein expression levels in HCC tissue samples and normal liver
   tissue samples using the human protein atlas database. Because the HPA
   dataset could not provide immunohistochemical information on BUB1 and
   BUB1B, we showed the results of the remaining six staining pairs in
   [107]Figure 6. The protein expression levels of hub genes were agreed
   with the mRNA expression results, and most genes were overexpressed in
   HCC tissue ([108]Figure 7). These findings indicate that the
   overexpression of these hub genes may play a critical role in HCC
   mechanism.

FIGURE 4.

   [109]FIGURE 4
   [110]Open in a new tab

   Validation of eight hub genes mRNA expression levels in HCC tissues vs.
   normal liver tissues using the GEPIA database (A–H). The red color
   represents the tumor samples and the gray color represents the normal
   liver samples.

FIGURE 5.

   [111]FIGURE 5
   [112]Open in a new tab

   An summary of mRNA expression results of 8 hub genes in multiple tumors
   using the Oncomine database. The numbers in colored cells show the
   quantities of datasets with high (red) or low (blue) mRNA expression of
   the hub genes.

FIGURE 6.

   [113]FIGURE 6
   [114]Open in a new tab

   The OS analysis of 8 hub genes in the HCC patients using the GEPIA
   database. The red curve is the high expression group and the blue curve
   is the low-expression group. p-value < 0.05.

FIGURE 7.

   [115]FIGURE 7
   [116]Open in a new tab

   Immunohistochemical staining analysis of hub genes (CCNA2, CCNB1,
   CCNB2, CDC20, CDK1, and MAD2L1) in HCC tissues and normal liver
   tissues.

Survival Analysis

   We further used the GEPIA database to analyze the prognostic value of
   these 8 hub genes in HCC patients. The survival analysis of patients in
   the GEPIA database was based on the TCGA-LIHC data set. We used
   threshold p-value 0.05 and calculated the hazards ratio based on Cox PH
   Model ([117]Xu et al., 2020). The relatively higher expression of CDC20
   (HR = 2.3; P = 3.8e-06), CDK1 (HR = 2; P = 0.00017), MAD2L1 (HR = 1.7;
   P = 0.0047), BUB1 (HR = 1.8; P = 0.001), BUBIB (HR = 1.7; P = 0.0028),
   CCNB1 (HR = 2; P = 0.00015), and CCNA2 (HR = 1.7; P = 0.0037) were
   associated with a poor prognosis in HCC patients, while only CCNB2 (HR
   = 1.4; P = 0.052) showed no statistical significance in the overall
   survival of patients ([118]Figure 6).

Drug-Gene Interaction Network Analysis

   To investigate the potential information on the interactions between
   hub genes and cancer chemotherapeutics drugs, we used the CTD database
   to construct chemotherapeutics drug-gene interaction network. Various
   drugs could influence the mRNA expression level of seven hub genes,
   namely, CDC20, CDK1, MAD2L1, CCNA2, CCNB1, BUB1, and BUB1B ([119]Figure
   8). However, only topotecan, oxaliplatin and azathioprine could reduce
   expression levels of all seven hub genes.

FIGURE 8.

   [120]FIGURE 8
   [121]Open in a new tab

   Drug-gene interactions network with chemotherapeutic drugs and seven
   hub genes was constructed using the CTD database. (A–G) shows the
   relationship between existing chemotherapeutic drugs and the expression
   levels of hub genes. (A) BUB1, (B) BUB1B, (C) CCNA2, (D) CCNB1, (E)
   CDC20, (F) MAD2L1, and (G) CDK1. The red and green arrows represent
   that the chemotherapy drugs will increase or decrease the expression of
   the hub genes. The number of arrows between hub genes and chemotherapy
   drugs indicates the number of references supported by previous studies.