Abstract Hepatocellular carcinoma (HCC) accounts for approximately 85–90% of all liver cancer cases and has poor relapse-free survival. There are many gene expression studies that have been performed to elucidate the genetic landscape and driver pathways leading to HCC. However, existing studies have been limited by the sample size and thus the pathogenesis of HCC is still unclear. In this study, we performed an integrated characterization using four independent datasets including 320 HCC samples and 270 normal liver tissues to identify the candidate genes and pathways in the progression of HCC. A total of 89 consistent differentially expression genes (DEGs) were identified. Gene-set enrichment analysis revealed that these genes were significantly enriched for cellular response to zinc ion in biological process group, collagen trimer in the cellular component group, extracellular matrix (ECM) structural constituent conferring tensile strength in the molecular function group, protein digestion and absorption, mineral absorption and ECM-receptor interaction. Network system biology based on the protein–protein interaction (PPI) network was also performed to identify the most connected and important genes based on our DEGs. The top five hub genes including osteopontin (SPP1), Collagen alpha-2(I) chain (COL1A2), Insulin-like growth factor I (IGF1), lipoprotein A (LPA), and Galectin-3 (LGALS3) were identified. Western blot and immunohistochemistry analysis were employed to verify the differential protein expression of hub genes in HCC patients. More importantly, we identified that these five hub genes were significantly associated with poor disease-free survival and overall survival. In summary, we have identified a potential clinical significance of these genes as prognostic biomarkers for HCC patients who would benefit from experimental approaches to obtain optimal outcome. Keywords: hepatocellular carcinoma, differentially expression genes, enrichment analysis, survival analysis, prognosis Introduction Liver cancer is the fourth leading cause of cancer-related death worldwide and ranks sixth in terms of incidence ([41]Bray et al., 2018; [42]Villanueva, 2019). Among all types of primary malignant liver tumors, hepatocellular carcinoma (HCC) accounts for approximately 85–90% of all cases. The major risk factors including chronic infections by hepatitis B virus (HBV) and hepatitis C virus (HCV), aflatoxin exposure, smoking, type 2 diabetes, obesity, and so on ([43]Marengo et al., 2016; [44]Bray et al., 2018; [45]Phukan et al., 2018). As a highly heterogeneous cancer disease, localized HCC patients often have poor prognosis with 5-year overall survival (OS) rate of 30%, and this rate drops below 5% for those with distant metastases ([46]Oweira et al., 2017). For patients at early disease stages, liver resection is the most effective treatment option, however, only less than 30% of HCC patients are eligible surgery, and among those around 70% eventually relapse within 5 years after treatment ([47]Waghray et al., 2015). Over the past few decades, despite advances in chemotherapy, targeted therapy, radiation therapy, and immunotherapy in the clinical arena, the survival of HCC patients has not significantly increased, and translational studies to understand the mechanisms and prognosis remain underwhelming to design novel therapeutic strategies ([48]Visvader, 2011; [49]Aravalli et al., 2013; [50]Llovet et al., 2018). Data, information, knowledge and wisdom (DIKW) model has been widely used in life in all aspects including medicine ([51]Song et al., 2018, [52]2020; [53]Duan, 2019a, [54]b; [55]Duan et al., 2019a, [56]b). In recent years, genome-wide profiling has substantially advanced our understanding of the genetic landscape and driver pathways leading to HCC ([57]Totoki et al., 2014; [58]Schulze et al., 2015; [59]Zucman-Rossi et al., 2015; [60]Ally et al., 2017; [61]Villanueva, 2019), revealing Cellular tumor antigen p53 (TP53), Catenin beta-1 (CTNNB1), Axin-1 (AXIN1), Telomerase reverse transcriptase (TERT) promoter and other key genes as driver mutations, and WNT/β-catenin, p53 cell cycle pathway, oxidative stress, PI3K/AKT/MTOR, and RAS/RAF/MAPK pathways as key signaling pathways involved in liver carcinogenesis. However, existing studies have been of limited sample size that failed to create molecular prognostic indices and also the inconsistent computational methods may have restricted the power to identify potential meaningful molecular biomarkers and new therapeutic targets. Therefore, an integrated bioinformatics study combining the most updated genomic data thus providing novel insight into the mechanisms underlying therapeutic resistance and disease progression is highly warranted. Microarray technology has become an indispensable tool to monitor genome wide expression levels of genes in a given organism and has been successfully used to classify different types of cancer and predict clinical outcomes ([62]Trevino et al., 2007). These microarray technologies have also been applied in many studies to define global gene expression patterns in primary human HCC in an attempt to gain insight into the mechanisms of hepatocarcinogenesis ([63]Crawley and Furge, 2002; [64]Woo et al., 2008; [65]Hoshida et al., 2009; [66]Villanueva et al., 2012; [67]Jin et al., 2015). In the present study, we selected four independent datasets consisting a total of 320 HCC cases and 270 cases of normal liver tissues in the Gene Expression Omnibus (GEO) database to identify reliable markers and pathway alterations linked with the pathogenesis of HCC cases ([68]Wurmbach et al., 2007; [69]Mas et al., 2009; [70]Roessler et al., 2010). We identified 89 differential expression genes (DEGs) including 31 up-regulated genes and 58 down-regulated genes. Gene ontology (GO) analysis revealed cellular response to zinc ion in biological process (BP) group, collagen trimer in the cellular component (CC) group, and extracellular matrix (ECM) structural constituent conferring tensile strength in the molecular function (MF) group. Further pathway enrichment analysis revealed that enrichment in protein digestion and absorption, mineral absorption, propanoate metabolism, and ECM-receptor interaction. Finally, the top five hub genes osteopontin (SPP1), Collagen alpha-2(I) chain (COL1A2), Insulin-like growth factor I (IGF1), lipoprotein A (LPA), and Galectin-3 (LGALS3) were identified from the protein–protein interaction (PPI) network and those highly altered genes were validated by western blot assay and Immunohistochemistry (IHC) analysis and found to be associated with clinical outcome of HCC patients. Materials and Methods Data Source and Identification of DEGs Microarrays data were obtained from the Oncomine 4.5 database^[71]1 contains 715 datasets and 86,733 samples. Of which, we filtered four datasets comprising Mas liver ([72]GSE14323, containing 19 liver tissues and 38 HCCs), Roessler liver ([73]GSE14520 based on [74]GPL571 platform, containing 21 liver tissues and 22 HCCs), Roessler liver 2 ([75]GSE14520 based on [76]GPL3921 platform, containing 220 liver tissues and 225 HCCs), and Wurmbach liver ([77]GSE6764, containing 10 liver tissues and 35 HCCs) after using the following criteria: (a) Analysis type: cancer vs. normal analysis; (b) Cancer type: hepatocellular carcinoma; (c) Data type: mRNA; (d) Sample type: clinical specimen; (e) Microarray platform: Human Genome U133A, U133A 2.0, or U133 Plus 2.0. A total of 270 cases of normal liver tissues and 320 cases of HCCs were included in the integrated analysis. To analyze the DEGs between HCC and normal liver tissues, the data were then processed on GEO2R website^[78]2. The differentially expressed genes were identified using limma R package at a cutoff | logFC| > 1 and adjusted p value < 0.05 (Benjamini & Hochberg). GO and Pathways Enrichment Analysis The annotation function of GO analysis is comprised of three categories: BP, CC, and MF. Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database resource for understanding high-level functions and utilities of the genes or proteins ([79]Kanehisa and Goto, 2000; [80]Kanehisa et al., 2012, [81]2016). GO analysis and KEGG pathway enrichment analysis of candidate DEGs were performed using the R package “clusterProfiler.” Reactome^[82]3 was also used for pathway enrichment analysis ([83]Fabregat et al., 2018). Adjusted p-value less than 0.05 was considered as the cut-off criterion for both GO analysis and pathway enrichment analysis. PPI Network and Modular Analysis Protein–protein interaction network was constructed to determine the importance of these DEGs by comparing the interactions between different DEGs. STRING database^[84]4 and Cytoscape software (3.7.2 version) were applied to construct and visualize the PPI networks ([85]Szklarczyk et al., 2017), followed by Molecular Complex Detection (MCODE) plug-in in Cytoscape for selecting significant modules of hub genes from the PPI network ([86]Bader and Hogue, 2003), with the following criteria: degree cutoff (number of connections with other nodes) ≥ 2, node score cutoff (the most influential parameter for cluster size) ≥ 2, K-core (This parameter filters out clusters that do not contain a maximally inter-connected sub-cluster of at least k degrees. For example, a triangle including three nodes and three edges is a two-core representing two connections per node. Two nodes with two edges between them meet the two-core rule as well) ≥ 2 and max depth (this parameter limits the distance from the seed node within which MCODE can search for cluster members) = 100. KEGG pathway enrichment analysis of the modules was carried out using the online DAVID database^[87]5 ([88]Huang da et al., 2009). Hub Gene Selection and Prognostic Analysis Hub genes were selected based on comparison of top 10 genes ranked by degree and betweenness centrality Network of hub genes. Their co-expressed genes were then analyzed using cBioPortal online platform^[89]6 ([90]Gao et al., 2013). Genetic alterations of these hub genes were explored and compared using the cBioPortal database. Biological process analysis of hub genes was then performed and visualized using plug-in Biological Networks Gene Oncology tool (BiNGO) app in Cytoscape software ([91]Maere et al., 2005). Stage-related information analysis based on gene expression was performed in UALCAN^[92]7, a comprehensive web resource for analyzing omics data ([93]Chandrashekar et al., 2017). Disease-free survival (DFS) is a concept used to describe the period after a successful treatment of cancer. OS means the length of time from either the date of diagnosis or the start of treatment for HCC. DFS and OS are both measured to see how well a new treatment works. DFS and OS analysis associated with these hub genes were performed using the Kaplan–Meier Plotter online database^[94]8. Cell Lines and Cell Culture Hepatocellular carcinoma cell lines Hep3B and HepG2 and human normal liver cell line L02 were obtained from Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences. All of these cell lines were cultured in Dulbecco’s modified Eagle’s medium (DMEM) (catalog number 10569010, Gibco) containing 10% (v/v) fetal bovine serum (FBS) (catalog number 10091148, Gibco) supplemented with 1% (v/v) penicillin streptomycin solution (catalog number SV30010, Hyclone) (containing 100 U/ml penicillin and 100 μg/ml streptomycin) in a humidified incubator at 37°C with 5% CO[2]. Protein Preparation and Western Blot Analysis Briefly, HCC cells were lysed with cold M-PER lysate buffer (catalog number 78501, Roche) [containing 1 × protease inhibitors (catalog number 11836153001, Roche) and phosphatase inhibitor cocktail (catalog number 78420, Roche)] and centrifuged at 4°C for 10 min. The protein concentrations of collected supernatants were determined by the BCA protein assay kit (catalog number P0011, Beyotime). Equal amounts of total proteins were separated in 10 or 12% SDS-PAGE and transblotted onto the 0.45 μm PVDF membranes (catalog number 1620177, BIO-RAD). The membranes were blocked in 5% fat-free milk in TBST (150 mM NaCl, 50 mM Tris, pH 7.2) for 1 h at room temperature and subsequently incubated with corresponding primary antibodies as following: anti-SPP1 (catalog number ab8448, Abcam, 1:1000), anti-COL1A2 (catalog number 66761-1-lg, Proteintech, 1:1000), anti-IGF1 (catalog number ab9572, Abcam, 1:1000), anti-LGALS3 (catalog number ab209344, Abcam, 1:1000), anti-GADPH (catalog number 10494-1-AP, Proteintech, 1:10000), and anti-β-Tubulin (catalog number T0023, Affinity, 1:20000) at 4°C overnight, followed by incubation with a donkey anti-mouse (catalog number C61116-02, LI-COR) or goat anti-rabbit (catalog number C80118-05, LI-COR) secondary antibody for 1 h at room temperature. Then membranes were scanned using the Odyssey infrared imaging system (LI-COR) and the images were captured. The gray levels of the bands were determined by Image J software. The expression of proteins was normalized using the GADPH or β-Tubulin values. The assay was performed three independent times. Patient Samples This study was approved by Cancer Hospital of the University of Chinese Academy of Sciences; Zhejiang Cancer Hospital. Twenty formalin-fixed, paraffin-embedded (FFPE) HCC tissues and corresponding adjacent non-cancerous tissues were collected from the Department of Abdominal Surgery, Zhejiang Cancer Hospital. All FFPE HCC tissues were screened by two pathologists independently to confirm the diagnosis of HCC. The most representative tumor and non-cancerous tissues were selected for immunohistochemistry analysis. IHC Analysis Neutral 10% buffered formalin-fixed tissue specimens were embedded in paraffin wax and then sliced to 4-micron thick sections by a microtome. In brief, the tissue slices were firstly deparaffinized, followed by rehydration and a 10-min boiling in 10 mmol/L citrate buffer (pH = 6.4) for antigen retrieval. Then, the sections were treated in methanol containing 3% H[2]O[2] for 20 min to inhibit the endogenous tissue peroxidase activity. After being blocked with 1% bovine serum albumin (BSA) at 37°C for 30 min, IHC staining was carried out for the protein expression of SPP1, COL1A2, IGF1, and LGALS3 using specific primary antibodies at 4°C overnight, followed by staining with species-specific secondary antibodies labeled with horseradish peroxidase (HRP). The slides were developed in diaminobenzidine (DAB) and counter-stained with hematoxylin. Then images of the sections were photographed using an Olympus microscope (Olympus Life Science). The clinical specimen data of LPA were obtained from The Human Protein Atlas database^[95]9. Statistical Analysis Statistical analysis was performed using GraphPad Prism software (version 8.0.1) and R software (version 3.4.2^[96]10). p-value < 0.05 was considered statistically significant. The column diagram was graphed with GraphPad Prism software (version 8.0.1). Results Data Source and Analysis A total of 603, 1,238, 1,095, and 1,722 DEGs have been extracted from the four independent expression datasets Mas liver, Roessler liver, Roessler liver 2, and Wurmbach liver, respectively ([97]Figures 1A–D, [98]Table 1, and [99]Supplementary Tables S1, [100]S2). The comparison of all genes from these datasets identified 89 consistently and significantly dysregulated genes, including 31 up-regulated genes and 58 down-regulated genes in HCC compared to normal liver tissues ([101]Figures 1E,F and [102]Supplementary Table S3). Notably, several genes such as Glypican-3 (GPC3) ([103]Wurmbach et al., 2007) and SPP1 ([104]Roessler et al., 2010) have been reported previously, proving the feasibility of the method. FIGURE 1. [105]FIGURE 1 [106]Open in a new tab The DEGs screened from four independent datasets. Up-regulated DEGs (red-colored dots) and down-regulated (blue-colored dots) DEGs are selected with [logFC] > 1 and adjust p-value < 0.05 from the mRNA expression profiling sets (A) Mas liver, (B) Wurmbach liver, (C) Roessler liver, and (D) Roessler liver 2. Venn diagram showed (E) 31 consistently up-regulated DEGs and (F) 58 consistently down-regulated DEGs in four datasets. TABLE 1. Details of the four HCC datasets. Datasets GSE Tumor Normal References