Abstract Hepatocellular carcinoma (HCC) is a leading cause of cancer-related death. The aim of this study was to identify underlying hub genes and dysregulated pathways associated with the development of HCC using bioinformatics analysis. Differentially expressed protein-coding genes were subjected to transcriptome sequencing in 11 pairs of liver cancer tissue and matched adjacent non-cancerous tissue. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were performed, followed by protein-protein interaction (PPI) network construction. Hub genes were identified via centralities analysis and verified using published datasets. In total, 720 significantly differentially expressed protein-coding genes were identified in the samples, including 335 upregulated genes and 385 downregulated genes. The upregulated genes were significantly enriched in cell adhesion, biological adhesion and cell-cell adhesion GO terms under biological process (BP). Conversely, the downregulated genes were significantly enriched in embryonic organ morphogenesis, embryonic organ development and embryonic morphogenesis. The KEGG pathway analysis showed that the upregulated genes were enriched in ECM-receptor interaction and focal adhesion pathways. Furthermore, the downregulated genes were enriched in the ErbB, VEGF and MAPK signaling pathways. The PPI network and centralities analysis suggested that ITGA2 and 12 alternate genes were significant hub genes. These findings improve current understanding of the molecular mechanisms underlying HCC development and may be helpful in identifying candidate molecular biomarkers for use in diagnosing, treating and monitoring the prognosis of HCC. Keywords: hub gene, differentially expressed protein-coding genes, development, hepatocellular carcinoma, transcriptome sequencing INTRODUCTION Liver cancer is a leading cause of malignancy and mortality worldwide. Hepatocellular carcinoma (HCC), the most common malignancy in the liver, accounts for 85-90% of tumors derived from liver tissue [[48]1]. Despite significant advances in early diagnosis and interventional therapies, there remains a need for novel management methods for advanced HCC [[49]2]. HCC develops through a complicated biological process that involves several genomic changes and various pathogenic molecular mechanisms [[50]3]. The process is slow and involves genomic alterations gradually changing the phenotype of liver cells to produce cellular intermediates that evolve to become cancer cells. Studies of the regulation of gene expression are helpful for understanding the pathogenesis of HCC [[51]4]. RNA sequencing (RNA-Seq) has become useful for examining genetic changes on the whole-genome scale and produces large quantities of data [[52]5]. A comprehensive systematic study of differentially expressed pathways and protein-coding gene interactions can more accurately identify the biological changes that occur during HCC carcinogenesis. In addition, protein-protein interaction (PPI) networks can be used to identify highly connected hub genes that play a key role in maintaining network structure [[53]6]. Consequently, analyzing RNA-Seq data using these bioinformatics methodologies can help predict molecular pathogenesis and identify effective biomarkers of cancer. HCC is a highly heterogeneous disease, and diverse changes in gene expression contribute to the progression of this cancer [[54]7]. To obtain complete understanding of the alterations in gene expression that occur during HCC, RNA-Seq has been used to identify many key gen es involved in disease progression. However, knowledge of the genetic changes that lead to HCC initiation and progression remains fragmented, and key drivers of carcinogenesis are still unknown, limiting the development of targeted therapy for HCC [[55]8]. Furthermore, the overlap of the most significantly dysregulated genes among multiple studies is very low. Inconsistencies in results are caused by various factors, including measurement errors, small sample sizes and different statistical methods [[56]9]. Therefore, understanding the pathogenesis of this disease remains a major challenge, and many hub genes must still be identified. In the present study, we identified differentially expressed protein-coding genes from RNA transcriptional profiling performed on 11 paired cancer tissues and adjacent non-cancerous tissues. Then, we conducted GO, KEGG, PPI network and centralities analyses to study and identify changes in pathways and hub genes. The aim of this study was to improve understanding of HCC carcinogenesis by providing information concerning the genetic changes that occur during disease progression and to uncover the expression of biomarkers with potential use for clinical diagnosis, treatment, and monitoring of disease progression. RESULTS Identification of differentially expressed protein-coding genes in HCC and adjacent non-cancerous tissues Fifteen pairs of samples from tumor and adjacent tissues were collected from patients enrolled in Peking Union Medical College Hospital from May 2015 to April 2016. All patients provided consent. RNA transcriptional profiling was performed on the tissue samples. Approximately 542.8 million reads were mapped to the human hg38 genome, with a mean of 15.7 M reads per sample (range 4.2-23.5 M) ([57]Supplementary Table 1). Paired samples A2 and T2 were filtered out because of the small amount of data obtained. Multidimensional scaling analysis (MDS) was performed to show that the tumor samples were distinct from the adjacent non-cancerous tissue samples [[58]10]. Four samples (A5, T5, T6, and T14) were found to be potentially misidentified and were excluded from further analysis ([59]Supplementary Figure 1). Hence, transcriptome data from 11 paired samples was used to identify differentially expressed genes. We identified 720 significantly differentially expressed protein-coding genes, including 335 upregulated genes and 385 downregulated genes; a heatmap showing these genes can be found in Figure [60]1. Figure 1. Heatmap showing significantly differentially expressed protein-coding genes among 11 paired HCC and adjacent non-cancerous tissue. Figure 1 [61]Open in a new tab Rows represent genes, and columns represent samples. Functional characterization of differentially expressed protein-coding genes To gain insight into the functional characteristics of the identified protein-coding genes, we conducted GO and KEGG pathway enrichment analyses [[62]11]. The GO analysis results showed that the upregulated protein-coding genes were significantly enriched in the cell adhesion, biological adhesion and cell-cell adhesion categories under biological process (BP). Under molecular function (MF), the genes were enriched in glycosaminoglycan binding, polysaccharide binding and pattern binding. Additionally, GO cellular component (CC) analysis revealed genes significantly enriched in extracellular region part, extracellular region, and extracellular space. For the downregulated protein-coding genes, for BP, there was significant enrichment in embryonic organ morphogenesis, embryonic organ development and embryonic morphogenesis. For MF, there was enrichment in sequence-specific DNA binding, transcription regulator activity and transcription factor activity. In addition, GO CC analysis revealed genes significantly enriched in cell-cell junction, apical junction complex and apicolateral plasma membrane (Figure [63]2 and [64]Supplementary Table 2). Moreover, we found upregulated protein-coding genes that were significantly enriched in the ECM-receptor interaction and focal adhesion pathways using KEGG pathway enrichment analysis (Figure [65]2 and [66]Supplementary Table 3). Meanwhile, downregulated protein-coding genes were enriched in the ErbB, VEGF, and MAPK signaling pathways (Figure [67]2 and [68]Supplementary Table 3). We then compared these enriched genes with potential HCC driver genes from the Driver DB 2.0 database and identified 15 protein-coding genes that were recorded as HCC driver genes (triangle node) (Figure [69]2). Figure 2. Functional enrichment analysis of significantly upregulated and downregulated protein-coding genes. Figure 2 [70]Open in a new tab GOcluster plot showing a circular dendrogram of the clustering of the expression spectrum. The inner ring indicates the color-coded logFC. Red represents significantly upregulated (A) and blue represents significantly downregulated (B) protein-coding genes. The outer ring displays the assigned functional terms. KEGG pathway enrichment of significantly upregulated (C) and downregulated (D) protein-coding genes. The red node represents significantly upregulated protein-coding genes. The green node represents significantly downregulated protein-coding gene. The yellow node represents enriched pathway symbols. The triangular node represents HCC driver genes from the Driver DB V2 database. The size of the node represents the number of genes. PPI network construction and centralities analysis PPI network analysis is a powerful tool for recognizing critical hub members among a cluster of molecules. Thus, we conducted PPI network analysis using the STRING database to identify critical members among our protein-coding genes. The PPI networks for the upregulated and downregulated genes are shown in Figure [71]3 and [72]Supplementary Figure 2, respectively. Figure 3. PPI network of significantly upregulated protein-coding genes. Figure 3 [73]Open in a new tab The nodes represent the significantly upregulated protein-coding genes. The edges represent the interaction of significantly upregulated protein-coding genes. The red triangles represent the significantly upregulated hub genes. Centralities represent the possibility that a gene is functionally able to keep communicating nodes together in a biological network. There are five types of centralities: degree centrality, betweenness centrality, stress centrality, closeness centrality and clustering coefficient. The five types of centralities were calculated based on the complex network, and we found that the distribution of the top 10 genes was not completely consistent among various centralities analyses. Therefore, the genes that were among the top 10 identified and that were shared more than twice among the five types of centralities were defined as hub genes. Using these criteria, 6 hub genes were obtained for the upregulated protein-coding genes, with the alpha 2 integrin gene (ITGA2), bone morphogenetic protein 4 (BMP4), PLCB1 and PRKG2 identified as the common hub genes across the degree, betweenness and stress centrality analyses. Moreover, SPINK6 and POU3F4 detected in two of the five types of analysis. In addition, 7 hub genes were identified for the downregulated protein-coding genes, with lysine (K)-specific demethylase 6B (KDM6B), GLI1, MYC, CSNK2A1 and POTEF identified as common hub genes across the degree, betweenness and stress centrality analyses, and HSF1 and SCNN1A were commonly obtained in the degree and stress centrality analyses (Figure [74]4). Figure 4. Distribution of hub genes among the significantly upregulated and downregulated protein-coding genes identified by five types of centrality. Figure 4 [75]Open in a new tab (A) Degree centrality; (B) betweenness centrality; (C) stress centrality; (D) closeness centrality; and (E) clustering coefficient. The role of hub genes in the development of HCC To further explore the functional roles of the identified critical hub molecules in the development of HCC, we evaluated changes in the expression spectra of the 13 hub genes using one-way ANOVA (Kruskal-Wallis test) in HBV-related HCC ([76]GSE25097) and HCV-related HCC datasets ([77]GSE6764). We did not identify SCNN1A and PLCB1 in the HBV-related HCC dataset or POTEF and MYC in the HCV-related HCC dataset. The transcriptome profiling data in the [78]GSE25097 dataset contain 557 samples, including 6 healthy livers, 40 cirrhotic tissue samples, 243 adjacent non-tumor samples, and 268 early-to-advanced stage HCC samples. Moreover, the [79]GSE6764 dataset contains a total of 75 tissue samples, including 10 healthy liver samples, 13 cirrhosis samples, 17 dysplastic nodules, 18 early HCC samples and 17 advanced HCC samples. Consistent with our results, ITGA2, BMP4 and PLCB1 were significantly upregulated, and KDM6B and MYC were significantly downregulated during HCC oncogenesis (Figure [80]5 and [81]Supplementary Figure 4). In contrast to our findings, GLI1, CSNK2A1, POTEF and HSF1 were upregulated and SPINK6 downregulated in the HCC cohort ([82]Supplementary Figures 3 and 4). No significant differences were found in the expression of the other 3 genes ([83]Supplementary Figures 3 and 4). Based on these results, the expression of ITGA2, BMP4 and PLCB1 may contribute to cancer development. ITGA2, BMP4, PLCB1, KDM6B and MYC represent the most likely diagnostic or therapeutic biomarkers associated with HCC. Figure 5. Dynamic expression of ITGA2, BMP4 and KDM6B in HBV-related HCC (up) and HCV-related HCC (down); p < 0. 01 (*), p < 0.001 (**), and p < 0.0001 (***). Figure 5 [84]Open in a new tab DISCUSSION HCC is a common malignant tumor with a heterogeneous molecular pathogenesis that has not been fully elucidated. Identifying significantly dysregulated genes and pathways associated with HCC carcinogenesis may improve understanding of the molecular pathogenesis underlying HCC development and could identify potential biomarkers for treatment [[85]4]. In this study, GO and KEGG pathway enrichment analysis showed that cell adhesion, biological adhesion and cell-cell adhesion were significant GO terms in upregulated protein-coding genes and that embryonic organ morphogenesis, embryonic organ development and embryonic morphogenesis were significant GO terms in downregulated protein-coding genes. Moreover, ECM-receptor interactions and focal adhesion pathways were the relevant pathways associated with the upregulated protein-coding genes, and the ErbB, VEGF, and MAPK signaling pathways were the relevant pathways associated with the downregulated protein-coding genes. Hub genes were identified based on degree centrality, betweenness centrality, stress centrality, closeness centrality and clustering coefficients in the PPI network. The outcomes across these five centralities were integrated to address inconsistent results provided by the different methods of analysis. A total of 6 hub genes were identified among the upregulated protein-coding genes, and 7 hub genes were obtained among the downregulated protein-coding genes. ITGA2, BMP4, PLCB1, PRKG2, KDM6B, GLI1, MYC, CSNK2A1 and POTEF were common hub genes across three centrality methods. In addition, the expression spectra of the 13 hub genes were verified in HCC tumorigenesis using GEO datasets [86]GSE25097 and [87]GSE6764. ITGA2, BMP4, and PLCB1 were significantly upregulated, and KDM6B and MYC were significantly downregulated during HCC oncogenesis, consistent with our results. In contrast, GLI1, CSNK2A1, POTEF and HSF1 were upregulated and SPINK6 downregulated in the HCC cohort. No significant differences were found in the expression of the other 3 genes. BMP4 and ITGA2 were identified as critical hub molecules among the upregulated protein-coding genes and had many interactions with their neighbors. KMD6B was considered to be a hub molecule due to its high degree of downregulation when compared to all protein coding genes. Therefore, we mainly discuss the function of these top degree centrality genes. KDM6B is a demethylase that acts on histone H3 at lysine 27 (H3K27), which specifically catalyzes the demethylation of H3 lysine-27 tri/di methylation (H3K27me3/2) [[88]12]. KDM6B has been shown to play a key role in promoting transcription elongation associated with related elongation factors and RNA polymerase II to activate gene transcription [[89]13, [90]14]. Growing evidence suggests that KDM6B regulates cancer development. However, the effect of KDM6B on tumorigenesis is not consistent among different human cancers. Compared with adjacent normal tissues, KDM6B expression was significantly increased in renal cell carcinoma [[91]15]. Overexpression of KDM6B has also been shown to promote invasion-metastasis cascades and induce the expression of mesenchymal genes in breast cancer [[92]16]. In contrast, KDM6B knockdown promotes tumor sphere formation, increases hepatic metastasis and peritoneal dissemination in pancreatic adenocarcinoma, and enhances epithelial-mesenchymal transition and invasiveness in colon cancer cells [[93]17, [94]18]. Similarly, the knockdown of KDM6B inhibits cell apoptosis and promotes cell growth by reducing the nuclear translocation of FOXO1 in non-small cell lung cancer cells [[95]19]. However, literature focused on KDM6B expression in HCC is lacking. In this study, KDM6B was identified as a hub gene among the downregulated protein-coding genes differentially expressed throughout HCC development. The integrin family of α/β heterodimeric transme-mbrane receptors mediates the binding of cells to the extracellular matrix (ECM) and the intracellular signaling events that occur within the ECM [[96]20]. ITGA2 encodes the integrin α2 subunit of the α2β1 integrin receptor, which is a cell adhesion molecule [[97]20]. Previous studies have shown that the ITGA2 gene is associated with various types of cancer, including colorectal cancer, prostate cancer, hepatocellular carcinoma, pancreatic cancer, breast cancer, melanoma and ovarian carcinoma. ITGA2 is expressed in colon cancer cell lines and is involved in the proliferation and migration of these cells [[98]21]. Slambrouk et al. reported that integrin α2 subunits interact with the ECM protein collagen I, increasing prostate cancer cell adhesion and cellular invasion [[99]22]. Integrin α2 subunits also downregulate E-cadherin-mediated cell-cell adhesion architecture, enhancing pancreatic cancer cell invasiveness, migration and proliferation [[100]23]. In addition, integrin α2β1 inhibits mammalian sterile 20-like kinase 1 kinase (MST1) phosphorylation and activates Yes-associated protein (YAP) oncogenic signaling in HCC [[101]24]. It has been previously shown that ITGA2 is enriched in ECM-receptor interactions and pathways in cancer. However, the impact of ITGA2 on the progression of HCC is still unknown. Therefore, it seems necessary to investigate and clarify the basic biological links between ITGA2 and HCC. BMP4 belongs to the transforming growth factor (TGF-β) superfamily of extracellular signaling molecules, which is of great relevance both during development and in adult tissues [[102]25, [103]26]. BMP signaling regulates early liver development and promotes liver bud morphogenesis as well as the migration, proliferation and survival of hepatoblasts [[104]27, [105]28]. Interestingly, BMP4 has been shown to participate in human carcinogenesis. Recent findings have revealed that BMP4 is overexpressed in breast cancer and may promote cell invasion and migration by modulating TGF-β factor signaling [[106]29]. In addition, BMP4 signaling causes direct overexpression of ID3, a proto-oncogene that contributes to the pathogenesis of human ovarian cancer [[107]30]. BMP4 is upregulated in HCC, and the overexpression of BMP4 promotes the metastasis and proliferation of HCC cells by activating the mitogen-activated protein/extracellular signal-regulated kinase (MEK)/extracellular signal-regulated kinase (ERK) signaling pathway [[108]31]. BMP4 was enriched in KEGG pathways associated with pathways in cancer and basal cell carcinoma in this study. Therefore, BMP4 and its associated pathways may be used as a biological indicator for the development of various types of cancers, including HCC. Although we identified and verified hub genes important for the development of HCC using comprehensive bioinformatics technology, there are some limitations to the present study. First, this study lacked experimental validation of genes and their functions in HCC carcinogenesis. Second, we cannot rule out the possibility that these key genes may be involved in non-developmental aspects of HCC. Finally, the sample size for the RNA-Seq was small, which may have led to a high rate of false-positive outcomes. Hence, a larger sample size is needed for further bioinformatics analysis, and experimental studies are necessary to validate our results. In conclusion, we identified several hub genes and systematically presented the biological processes and signaling pathways associated with the development of HCC. Many of these genes were not previously reported but could play important roles in HCC. Further research is required to focus on the clinical application of these genes and pathways for diagnosing, treating and monitoring the prognosis of HCC. MATERIALS AND METHODS Ethics This study was approved by the Clinical Research Ethics Committee of Peking Union Medical College Hospital. Written informed consent with a signature was obtained from each patient. Clinical samples We collected tissue samples from patients with liver cancer undergoing surgery at Peking Union Medical College Hospital, Beijing. The samples were collected in pairs, i.e., cancer tissue and adjacent non-cancerous tissue. The collected tissue samples were stored in liquid nitrogen. RNA preparation and sequencing First, 50 mg of tissue was lysed in TRIzol (Invitrogen) to extract RNA following the manufacturer's instructions. Next, ribosomal RNA was depleted using a RiboZero Gold kit (Epicentre Bio-technologies). RNA integrity was assessed with an Agilent Bioanalyzer 2100. An RNA-Seq library was generated with the rRNA-depleted samples using an Illumina standard RNA Sample Prep kit according to the manufacturer's instructions. The library was subsequently sequenced on an Illumina HiSeq2500 as 125-bp paired-ends with approximately 300-bp size selection. Published dataset and database Liver cancer-related RNA-Seq data are available from the NCBI GEO database, including two independent microarray datasets: HBV-related HCC ([109]GSE25097) and HCV-related HCC ([110]GSE6764). HCC-related driver genes were obtained from the Driver DB 2.0 database [[111]32]. Transcriptome sequencing analysis Sequencing quality was assessed with FASTQC [[112]33]. After removing adaptor and low-quality reads using cutadapt [[113]34] (-q 10--quality-base=32 -e 0.1 -O 10 -m 50), the clean reads were aligned to human (hg38) GRCh38.p5 ([114]http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.31) genome reference sequences using Tophat2 [[115]35] (-a 6 --microexon-search -m 2); bam files were generated and sorted, and duplicate reads were then removed using Samtools [[116]36]. Read counts were tabulated with HT-Seq [[117]37] using “union” mode and the Gencode human v24 GTF file as a reference. edgeR was used to identify the small library size, make MDS plots of the samples, and check reproducibility from replicates to remove small sample sizes and outliers and check for batch effects [[118]38]. Significantly differential gene expression between tumor and adjacent non-cancerous tissue was estimated, with a minimum twofold change and FDR less than 0.01, using the edgeR package from Bioconductor. A heatmap was plotted using the pheatmap package. Cufflinks [[119]39] (v2.21) was also used to estimate the total transcriptional output based on the Gencode gene annotation for human HG38 (version 24) [[120]40]. Gene ontology and KEGG pathway enrichment analysis We performed Gene Ontology and KEGG pathway enrichment analysis using The Database for Annotation, Visualization and Integrated Discovery (DAVID) version 6.7 ([121]https://david-d.ncifcrf.gov/home.jsp) [[122]11]. Unique lists of significantly differentially expressed protein-coding genes and all the expressed genes (FPKM >0 in any sample) were submitted to the web interface as the gene list and background, respectively. Enrichment results were visualized using R and Cytoscape 3.5.0 software [[123]41]. In addition, we compared these enriched genes with potential HCC driver genes from the Driver DB 2.0 database [[124]32]. PPI network analysis Proteins rarely perform their functions independently, and it is therefore important to investigate protein interactions by studying larger functional groups [[125]42]. Genes that were identified as significantly upregulated or downregulated were mapped to the STRING (Search Tool for the Retrieval of Interacting Genes) version 10.0 database ([126]http://www.string-db.org/) and used to evaluate PPI information and construct a PPI network [[127]6]. The STRING database covers 9.6 million proteins from 2031 organisms. In the PPI network, each node represents a gene, and the edges stand for interactions between nodes. Centralities analysis of the PPI network Researchers have revealed strong correlations between PPI networks and the functions of protein/gene components [[128]43]. Topological centrality is effective for identifying molecules that may play important roles in significantly perturbed networks [[129]44]. The PPI data were downloaded from the STRING database. Centralities analysis was performed using Cytoscape 3.5.0 software. We presented the centralities of the PPI network on a local (degree and clustering coefficient) and global scale (betweenness, closeness and stress). Genes identified in the top 10 genes and that were shared more than twice among the five types of centralities were defined as hub genes. Data submission Sequence data has been deposited at the European Genome-phenome Archive (EGA) ([130]https://ega-archive.org), which is hosted by European Bioinformatics Institute (EBI) and the Centre for Genomic Regulation (CRG), under accession number EGAS00001002526. SUPPLEMENTARY MATERIALS FIGURES AND TABLES [131]oncotarget-08-60358-s001.pdf^ (3.2MB, pdf) [132]oncotarget-08-60358-s002.docx^ (21.3KB, docx) Acknowledgments