Abstract

   Integration of public genome-wide gene expression data together with
   Cox regression analysis is a powerful weapon to identify new prognostic
   gene signatures for cancer diagnosis and prognosis. Hepatitis B virus
   (HBV) is a major cause of hepatocellular carcinoma (HCC), however, it
   remains largely unknown about the specific gene prognostic signature of
   HBV-associated HCC. Using Robust Rank Aggreg (RRA) method to integrate
   seven whole genome expression datasets, we identified 82 up-regulated
   genes and 577 down-regulated genes in HBV-associated HCC patients.
   Combination of several enrichment analysis, univariate and multivariate
   Cox proportional hazards regression analysis, we revealed that a
   three-gene (SPP2, CDC37L1, and ECHDC2) prognostic signature could act
   as an independent prognostic indicator for HBV-associated HCC in both
   the discovery cohort and the internal testing cohort. Gene set
   enrichment analysis showed that the high-risk group with lower
   expression levels of the three genes was enriched in bladder cancer and
   cell cycle pathway, whereas the low-risk group with higher expression
   levels of the three genes was enriched in drug metabolism-cytochrome
   P450, PPAR signaling pathway, fatty acid and histidine metabolisms.
   This indicates that patients of HBV-associated HCC with higher
   expression of these three genes may preserve relatively good hepatic
   cellular metabolism and function, which may also protect HCC patients
   from persistent drug toxicity in response to various medication. Our
   findings suggest a three-gene prognostic model that serves as a
   specific prognostic signature for HBV-associated HCC.

   Keywords: Hepatitis B virus associated hepatocellular carcinoma, Robust
   Rank Aggreg analysis, Hub genes, Prognostic signature, Overall
   survival.

Introduction

   In recent years, high-throughput expression profiling data from
   microarray chip and RNA sequencing (RNA-Seq) provide useful information
   regarding the mechanisms and diversity of various cancers, and are
   valuable for the diagnosis, therapeutic response prediction and
   prognosis evaluation in those diseases [43]^1. However, there are many
   obstacles in data preparation, analysis and interpretation because the
   results from one individual study can be rather noisy. Therefore,
   integration of the public genome-wide gene expression data is a
   powerful weapon for cancer diagnosis and prognosis [44]^2.

   Hepatocellular carcinoma (HCC) is one of the most prevalent cancers
   worldwide and the second cause of death among malignancy tumors [45]^3.
   Although more than 40 prognostic gene signatures have been described in
   HCC, none has become a commercial marker useful for clinical
   decision-making [46]^4^, [47]^5. Therefore, identifying a more specific
   gene prognostic signature for HCC with a specific etiology still
   represents a critical approach for the diagnosis and prognosis of HCC.

   Over 50% of HCC cases worldwide are attributed to hepatitis B virus
   (HBV) infection. In areas with high HBV prevalence, such as Southeast
   Asia and sub-Saharan Africa, more than 80% of patients with HCC are due
   to HBV infection [48]^6. Chronic HBV infection plays critical roles in
   the progression of HCC and patients with HBV-associated HCC usually
   have poor clinical recovery and high rate of recurrence after treatment
   [49]^7. Control of HBV infection is effective to reduce the mortality
   of HCC patients in East Asia and Southern Europe [50]^8. Because of the
   integration of HBV DNA into host cellular DNA, and also the expression
   of HBV proteins that have direct effect on cellular functions [51]^9,
   the pathogenesis of HBV-associated HCC has its own specific gene
   signatures as compared to other types of HCC [52]^10^-[53]^12.
   Recently, semi-supervised methods to predict patient survival from gene
   expression data have been widely accepted. These methods allow us to
   use the expression of some critical genes, also known as prognostic
   genes, to predict the survival of future patients and also the
   efficiency of therapeutic treatment [54]^13. As such, screening of HBV
   specific prognostic gene models is ultimately important for the
   diagnosis and prognosis of patients with HBV-associated HCC. However,
   no specific gene prognostic model has been developed for HBV-associated
   HCC until now.

   In this current study, the R package called Robust Rank Aggreg (RRA)
   was used to integrate several HBV-associated HCC experimental sources,
   which is helpful to identify evidence-supported findings and to
   increase the signal as well as to minimize rates of false positive
   findings [55]^14. By using the gene cluster of differential expressed
   genes acquired from RRA, we performed bioinformatics analysis of
   weighted gene co-expression network analysis (WGCNA) [56]^15 to
   identify key gene modules, following by construction of the
   protein-protein interaction (PPI) networks for the key gene modules in
   order to further identify hub genes. Lastly, we identified a three-gene
   prognostic signature in these hub genes by univariate and multivariate
   Cox regression analysis from the [57]GSE14520 dataset.

Materials and Methods

Collection of HBV-associated HCC gene expression datasets

   All HBV-associated HCC datasets were downloaded from GEO
   ([58]http://www.ncbi.nlm.nih.gov/geo/). Every dataset was estimated
   thoroughly through full text. The selection criteria used in this study
   are as follows: 1. gene expression profiling datasets that included
   gene microarray chip technology or RNA-Seq technique; 2. studies
   comparing gene expressions between HBV-associated cancer and
   non-cancerous liver tissue in human samples; 3. expression studies
   using cell lines or body fluid (such as serum, saliva, peripheral
   blood, etc.) were excluded; 4. sample size should be larger than three
   pairs. According to the above screening criteria, 7 datasets were
   finally included in this study: [59]GSE77509 [60]^16, [61]GSE62232
   [62]^17, [63]GSE25097 [64]^18, [65]GSE54238 [66]^19, [67]GSE50579
   [68]^20, [69]GSE14520 [70]^21, and [71]GSE55092 [72]^22 (Table [73]1).

Table 1.

   Characteristics of the studies
   First author and reference Region Number of samples Year GEO accession
   Platforms Data Type
   Yang Y et al [74]^9 China 17 pair 2017 [75]GSE77509 [76]GPL16791
   RNA-seq
   Schulze K et al [77]^10 France 10T,10N 2015 [78]GSE62232 [79]GPL570
   Gene Chip
   Sung WK et al [80]^11 USA 223T,243N 2011 [81]GSE25097 [82]GPL10687 Gene
   Chip
   Yuan SX et al [83]^12 USA 26T,10N 2016 [84]GSE54238 [85]GPL16955 Gene
   Chip
   Neumann O et al [86]^13 Germany 8T,10N 2012 [87]GSE50579 [88]GPL14550
   Gene Chip
   Roessler S et al [89]^14 USA 212 pair 2010 [90]GSE14520 [91]GPL571
   [92]GPL3921 Gene Chip
   Melis M et al [93]^15 USA 39T,81N 2014 [94]GSE55092 [95]GPL570 Gene
   Chip
   [96]Open in a new tab

   Note: T, tumor samples; N, non-cancerous samples; pairs, tumor tissues
   and non-cancerous tissues from the same patient.

Datasets processing and statistical analysis

   Series matrix files of each dataset were downloaded from GEO.
   Normalization of the microarray datasets was performed by the package
   limma of R (version 3.3.3), and normalization of the RNA-seq datasets
   was performed by the package edgeR of R (version 3.3.3) [97]^23. We
   used R package RRA to identify the most significant genes, and the new
   data frame results were constructed with the standard of P value < 0.01
   [98]^14.

Enrichment analysis

   To conduct enrichment analyses, the package cluster Profiler (version
   3.2.14) of R (version 3.3.3) was used for analyzing the Kyoto
   Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology
   (GO) processes, as described previously [99]^24^, [100]^25.

WGCNA

   WGCNA is a typical systemic biological method for describing
   correlation patterns among genes and identifying modules of highly
   correlated genes by using average linkage hierarchical clustering
   coupled with the topological overlap dissimilarity measure based on
   high-throughput chip data or RNA-Seq data [101]^26^-[102]^28. In
   current study, WGCNA package (version 1.60) in R was used to merge
   highly correlated genes and identify key modules based on the
   expression levels of the most significant genes, identified by RRA
   methods above, from the data of [103]GSE77509 [104]^16. Here, the power
   of β = 18 (scale free R^2 = 0.8) was selected as the soft-thresholding
   to ensure a scale-free network. The cut Height = 0.7 and min Size = 10
   were used to identify key modules. The gene modules were labeled with
   different colors and the grey module showed the genes that cannot be
   merged.

PPI network construction and clusters analysis

   STRING ([105]https://string-db.org/) was employed to construct PPI
   networks for distance-related genes by using the genes in the biggest
   module identified by WGCNA [106]^29. The PPI networks were then
   exported from STRING and imported into Cytoscape. We then used the
   “Molecular Complex Detection (MCODE)” Cytoscape app to identify
   discrete clusters from the former PPI networks using default settings
   (K-Core: 2; degree cutoff: 2; max. depth: 100 and node score cutoff:
   0.2 were used as the cutoff criteria for the identification of network
   modules) [107]^30. The top clusters were screened under the conditions
   of minimum size = 6 and minimum score = 6.

Generation and validation of prognostic signature and statistical analysis

   The data of Roessler S's study (GEO DataSet: [108]GSE14520), which
   contained the data of both gene expression levels and clinical
   signatures of 212 HBV-associated HCC patients, were used for the
   generation and validation of prognostic signature using the package of
   survminer and survival of R (version 3.3.3). Firstly, the univariate
   Cox proportional hazards regression model was used to evaluate the
   associations between overall survival (OS) and the expression level of
   each hub gene that belong to the top cluster identified by PPI networks
   and MCODE Cytoscape app (P < 0.05 was considered significant). Hazard
   ratios (HRs) were applied to identify protective (HR < 1) and risk
   genes (HR > 1). Secondly, we divided the data of these 212 patients
   randomly into two groups: a discovery cohort (N = 106) and an internal
   testing cohort (N = 106). Thirdly, in the discovery cohort, protective
   or risk genes identified by univariate Cox analysis were selected to
   calculate the association with OS by multivariable Cox analysis, and
   the risk scores for predicting OS was calculated with the formula based
   on the protective or risk genes: Risk scores = exp[gene1]*β[1] +
   exp[gene2]*β[2] +…+exp[genen]*β[n]. The “β” value is the estimated
   regression coefficient of gene derived from the multivariate Cox
   analysis and “exp” indicates the expression profiles of gene [109]^31.
   Kaplan-Meier survival analysis with the log-rank test (P < 0.05) was
   used to examine the proportional assumptions for Cox proportional
   hazard model. Lastly, the prognostic signature of indicated genes
   generated in the discovery cohort was further confirmed in the internal
   testing cohort. The statistical significance was defined as P < 0.05.

Gene Set Enrichment Analysis (GSEA)

   GSEA was performed with indicated prognostic genes using phenotype
   labels “high-risk” vs “low-risk” by the GSEA software
   ([110]http://software.broadinstitute.org/gsea/index.jsp). Gene sets
   used in this work were c2.cp.kegg.v5.2.symbols.gmt downloaded from the
   Molecular Signatures Database (MSigDB,
   [111]http://software.broadinstitute.org/gsea/msigdb/index.jsp)

Statistical Analysis

   Statistical analyses were accomplished using R (version 3.3.3). The
   association between clinical characteristics and group was determined
   using chi-square test, wilcoxon rank sum test or unpaired t test
   according the data type of clinical characteristics. Univariate and
   multivariate Cox regression models were used to evaluate prognostic
   significance. Survival analysis was performed using Kaplan-Meier and
   log-rank test. The Student's t-test was used for comparisons of two
   independent groups. A P value < 0.05 was considered statistically
   significant.

Results

Identified significance genes by integrated analysis

   In order to identify significance genes in HBV-associated HCC, we setup
   a workflow showing in Figure [112]1. By using the method of RRA to
   integrate 7 datasets, 82 up-regulated genes and 577 down-regulated
   genes were identified as the most significant genes [Table [113]S1, P <
   0.01]. The top 50 significantly up-regulated and down-regulated genes
   were listed in Figure [114]2A. KEGG pathway enrichment analysis of the
   most significant up-regulated genes showed that the top 5 enriched
   pathway is cell cycle, cellular senescence, melanoma, p53 signaling
   pathway, and ECM-receptor interaction. All of these pathways were
   reported with important roles in cancer [115]^32^-[116]^35 (Figure
   [117]2B). Whereas the top 5 KEGG pathway enriched in the down-regulated
   genes, including valine, leucine and isoleucine degradation, retinol
   metabolism, PPAR signaling pathway, fatty acid degradation and
   complement and coagulation cascades, also have been studied to be
   important pathways involved in tumor growth and metastasis (Figure
   [118]2B and Table [119]S2) [120]^36^-[121]^39. Together with the data
   of GO analysis (Figure [122]2C, 2D, and Table [123]S3), these
   significant genes are shown to be mainly involved in tumor metabolism
   and growth.

Figure 1.

   [124]Figure 1
   [125]Open in a new tab

   Flowchart describing the schematic overview of the study design. After
   integrated analysis and different bioinformatics analysis of
   HBV-associated HCC genome expression datasets, we identified hub genes
   in Cluster 1. Hub genes were then analyzed individually for prognostic
   significance by univariate Cox proportional hazards and Kaplan-Meier
   survival analysis. 10 hub genes were significantly associated with the
   survival of patients with HBV-associated HCC (UVA Cox analysis, P <
   0.05, and log-rank test P < 0.05). The HBV-associated HCC cohort
   ([126]GSE14520, N = 212) were randomly divided in to discovery cohort
   (N = 106) and internal testing cohort (N = 106). Next, we used
   multivariable Cox proportional hazards stepwise regression analysis
   with forward selection to build a prognostic model that included 3
   genes: SPP2, ECHDC2, and CDC37L1. This model was used to calculate risk
   scores for discovery cohort (risk score = expSPP2* - 0.1941 +
   expCDC37L1* - 0.5466 + expECHDC2* - 0.4714), and the cut-off point was
   chosen. This risk score calculation and cut-off point were further
   validated in internal testing cohort. Lastly, GSEA analysis of the
   high-risk and low-risk group was used to further inquiry the 3 genes
   prognostic signature. UVA Cox: univariate Cox; MVA Cox: multivariate
   Cox.

Figure 2.

   [127]Figure 2
   [128]Open in a new tab

   Identified significance genes and enrichment analysis. (A) Heatmap
   showed the fold change of the top 100 significantly genes in different
   studies (50 up-regulated genes and 50 down-regulated genes) by Robust
   Rank Aggreg (RRA) methods from 7 different datasets. Each row
   represents the same mRNA and each column represents the same study. The
   fold change intensity of each mRNA in one study is represented in shade
   of red or blue. Red represents the fold change of up-regulated genes
   and blue represents the fold change of down-regulated genes,
   respectively, in comparison to non-tumor tissues. (B) Kyoto
   Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of the
   significant genes. Up-regulated KEGG pathways are showed with
   horizontal axis > 0, and down-regulated genes are showed with
   horizontal axis < 0, respectively. The size of the horizontal axis
   shows the numbers of enriched genes in each KEGG pathway and the color
   shade of bar reflects P value. (C-D) Gene Ontology (GO) enrichment
   analysis of up-regulated (C) and down-regulated genes (D). The vertical
   and horizontal axes represent GO term and -log[10] (P value) of the
   corresponding GO term, respectively. The number in each bar reflects
   the enriched gene number of each GO term. Different colors reflect main
   categories of GO terms: BP, biological process; CC, cellular component;
   MF, molecular function.

WGCNA and PPI analysis

   Co-expression network by WGCNA analysis revealed that the most
   significant genes mentioned above were grouped into 3 modules, of which
   the turquoise module was the most significant one (Figure [129]3A and
   Table [130]S4). KEGG pathway enrichment analysis showed that the
   turquoise module is mainly involved in the PPAR signaling and different
   metabolism pathway, similar to the enriched KEGG pathways involving by
   the most significant genes [131]^40^-[132]^42 (Figure [133]3B). We
   further constructed the PPI networks of the turquoise module by STRING,
   of which the largest connected master network containing 396 nodes
   (Figure [134]4A). In this largest connected master network, three
   discrete sub-clusters were extracted using the MCODE app by Cytoscape
   software [135]^30 (Figure [136]4B). GO enrichment analysis of these
   three discrete sub-clusters revealed that Cluster 1, the largest
   cluster containing 32 genes was mainly involved in platelet
   degranulation, fatty acid metabolic process and exocytosis (Figure
   [137]4C and Table [138]S5). KEGG pathway analysis revealed that Cluster
   1 was mainly involved cellular metabolism: including fatty acid
   degradation, fatty acid metabolism and valine, leucine and isoleucine
   degradation (Figure [139]4C and Table [140]S5). This indicates that
   Cluster 1 may play a critical role in the pathogenesis of HCC, as
   platelet degranulation, fatty acid metabolism and degradation have been
   reported to have important roles in the development of HCC [141]^36^,
   [142]^43^, [143]^44.

Figure 3.

   [144]Figure 3
   [145]Open in a new tab

   WGCNA analysis of the significant genes and KEGG analysis of the key
   modules from WGCNA. (A) Gene clustering and module identification by
   WGCNA analysis based on the dataset of [146]GSE77509. Top: clustering
   dendrogram showed the result of hierarchical clustering, and each line
   represents one gene. Bottom: the colored row below the dendrogram
   indicates module membership identified by the static tree cutting
   method. Different color represents different co-expression network
   modules for the significantly genes. (B) KEGG analysis of the top three
   modules from WGCNA. The vertical and horizontal axes represent the KEGG
   pathways and different modules, respectively. The size and the color
   intensity of a circle represent gene number and -log[10] (P value),
   respectively.

Figure 4.

   [147]Figure 4
   [148]Open in a new tab

   Identified discrete clusters and enrichment analysis of the turquoise
   module in Figure [149]3. (A) PPI network of genes in the turquoise
   module. The color intensity in each node represents the fold change of
   the gene in comparison to non-tumor samples (up-regulation of a gene is
   shown in red and down-regulation of a gene is shown in blue). The size
   of the circle is proportional to the score of PPI based on the STRING
   database. (B) Main sub-clusters from the master PPI networks. The color
   intensity in each node was proportional to fold change of each gene
   expression in comparison to non-tumor samples (up-regulation in red and
   down-regulation in blue). (C) KEGG and GO enrichment analysis of the
   top three sub-clusters. The vertical and horizontal axes represent the
   GO biological process /KEGG pathways and different sub-clusters,
   respectively. The size and the color intensity of a circle represent
   gene number and -log10 (P value), respectively.

Identification of hub genes in Cluster 1 by univariate Cox proportional
hazards regression model

   The above findings that Cluster 1 genes were mainly involved in
   platelet degranulation, fatty acid metabolism and degradation prompted
   us to explore its prognostic value in HBV-associated HCC. To identify
   prognosis-related genes in this network, we first used an univariate
   Cox proportional hazards regression model to evaluate the associations
   between the expression level of the involved genes in Cluster 1 and OS
   by using the data reported by Roessler S et al (GEO DataSet:
   [150]GSE14520) [151]^21. The data revealed that 10 out of 32 genes,
   including acyl-CoA dehydrogenase, C-4 To C-12 straight chain (ACADM),
   acyl-coA synthetase medium-chain family member 3 (ACSM3), cell division
   cycle 37 like 1 (CDC37L1), crystallin lambda 1 (CRYL1), enoyl-CoA
   hydratase domain containing 2 (ECHDC2), coagulation factor VIII (F8),
   glutaryl-CoA dehydrogenase (GCDH), histidine rich glycoprotein (HRG),
   methylmalonyl-CoA mutase (MUT) and secreted phosphoprotein 2 (SPP2),
   were all negatively relevant to OS in patients of HBV-associated HCC (P
   < 0.05, and log-rank test P < 0.05) (Figure [152]5 and Table [153]S6).

Figure 5.

   [154]Figure 5
   [155]Open in a new tab

   Kaplan-Meier survival plots of the association between the expression
   levels of hub genes and overall survival probability in patients of
   HBV-associated HCC. P values were obtained from Log-rank test. Yellow
   and blue line represent the samples with the gene higher expressed and
   lower expressed, respectively. The table below the Kaplan-Meier
   survival plots showed the number of patients at the risk.
   Abbreviations: ACADM, acyl-CoA dehydrogenase, C-4 To C-12 straight
   chain; ACSM3, acyl-coA synthetase medium-chain family member 3;
   CDC37L1, cell division cycle 37 like 1; CRYL1, crystallin lambda 1;
   ECHDC2, enoyl-CoA hydratase domain containing 2; F8, coagulation factor
   VIII; GCDH, glutaryl-CoA dehydrogenase; HRG, histidine rich
   glycoprotein; MUT, methylmalonyl-CoA mutase; SPP2, secreted
   phosphoprotein 2.

Generation and validation of multiple-gene prognostic signature by
multivariable Cox proportional hazards regression analysis

   By randomly dividing the data of 212 HBV-associated HCC patients in
   Roessler et al's study (GEO DataSet: [156]GSE14520) [157]^21 into a
   discovery cohort (N = 106) and an internal testing cohort (N = 106)
   (Table [158]2, and Table [159]S7). The 10 genes that were significantly
   associated with OS of HBV-associated HCC were used for prognostic
   module building by using forward conditional stepwise regression with
   multivariable Cox analysis in the discovery cohort. This procedure
   selected a prognostic model containing three genes, including SPP2,
   CDC37L1 and ECHDC2. The risk scores of these three genes signature for
   each sample in the discovery cohort were calculated by using the
   following formula: Risk score = exp[SPP2]*-0.1941 + exp[CDC37L1]* -
   0.5466 + exp[ECHDC2]* - 0.4714 and ranked according to the values of
   Risk score. Using the median risk score (-9.158) as the cut-off point,
   patients were divided into a high-risk group (score > -9.158, N = 53)
   and a low-risk group (score ≤ -9.158, N = 53) (Figure [160]6A).
   Patients in high-risk group tended to express low levels of these three
   genes and shorter survival time (Figure [161]6B-C). Consistently,
   patients in high-risk group had lower OS rates than those in low-risk
   group, suggesting that patients in the high-risk group was more likely
   to have higher mortality than that in the low-risk group (Log-rank
   test, P < 0.05) (Figure [162]6D).

Table 2.

   Clinical characteristics of the patients
   Clinical Variable Discovery Cohort
   (n = 107) Internal Testing Cohort
   (n = 107) P value
   Sex
   Male 92 91 1^b
   Female 14 15
   Age, y
   Median (range) 51(71-27) 50 (77-21) 0.6054^a
   ALT
   Negative (≤50 U/L) 67 57 0.2097^b
   Positive (>50 U/L) 39 49
   Tumor size
   Large(>5cm) 31 43
   Small(≤5cm) 74 63 0.1245^b
   No data 1 0
   TNM stage
   I 47 42
   II 37 39 1^c
   III-IV 22 25
   AFP
   Low 56 59
   High 50 44 0.6117^b
   No data 0 3
   [163]Open in a new tab

   AFP, α-fetoprotein; ALT, alanine transferase; ^a Unpaired t test; ^b
   chi-squared test (χ2 test); ^c wilcoxon rank sum test.

Figure 6.

   [164]Figure 6
   [165]Open in a new tab

   Three-gene predictor-score analysis of HBV-associated HCC patients in
   both the discovery and internal cohort. (A) and (E) Three-gene risk
   score distribution in both the discovery (A) and internal testing
   cohort (E), respectively. Each point represents one patient; the
   vertical and horizontal axes represent the risk score calculated from
   the three-genes model and results sorted by the size of the risk score,
   respectively; red and blue represent patients with high and low risk
   scores identified by cut-off, respectively. The black dotted line
   represents the median mRNA risk score cut-off dividing patients into
   low-risk or high-risk groups. (B) and (F) Patients' survival status and
   time in both the discovery (B) and internal testing cohort (F),
   respectively. Each point corresponds to the same patient as above; the
   vertical and horizontal axes represent the survival time and results
   sorted by the size of the risk score, respectively; red or blue
   represent patient dead or live in the end, respectively. (C) and (G)
   Heatmap of gene expression profiles in both the discovery (C) and
   internal testing cohort (G), respectively. Each row represents the same
   gene and each column represents the same patient corresponded to the
   above point. The expression intensity of each gene in one patient is
   represented in shade of red or grey, indicating its expression level
   above or below the median expression intensity across all patients,
   respectively. (D) and (H) The Kaplan-Meier overall survival plots for
   HBV-associated HCC risk groups obtained from both the discovery (D) and
   internal testing cohort (H), respectively. Red and green line represent
   the patient with high or low risk, respectively.

   To test our findings, this three-gene risk score model was further
   evaluated using the internal cohort. With the same risk-score formula
   and cut-off point derived from the discovery cohort, the internal
   cohort was divided into high-risk group (N = 61) and low-risk group (N
   = 45). The data in internal cohort revealed a similar finding as that
   in the discovery cohort (Figure [166]6E-H).

GSEA analysis

   According to the above results, this three-gene model can distinguish
   survival difference from HBV-associated HCC patients. GSEA analyses of
   the data of the [167]GSE14520 revealed that the high-risk group showed
   gene-enrichment in bladder cancer and cell cycle pathway (Figure
   [168]7). And the oncogenes in bladder cancer pathway were overexpressed
   in the tumor tissues of HBV-associated HCC patients in the high-risk
   group, such as MYC that can trigger specific gene amplification to
   promote cell growth and proliferation [169]^45.

Figure 7.

   [170]Figure 7
   [171]Open in a new tab

   GSEA analysis of the three prognostic genes. GSEA analysis of the
   differentially expressed genes between the high-risk versus low-risk
   group. Only two significantly functional gene sets were enriched for
   high-risk group (marked in red), whereas only four most significantly
   enriched functional gene sets were listed for low-risk group (markered
   in blue) in HBV-associated HCC samples.

   Other overexpressed genes of bladder cancer pathway in the tumor tissue
   of patient in the high-risk group, such as vascular endothelial growth
   factor B (VEGFB), matrix metallopeptidase (MMP) 1, 2 and 9, are also
   reported to be involved in tumor invasion and metastasis
   [172]^46^-[173]^49. In the cell cycle pathway, genes associated with
   the proliferative capacity of tumor cells were also higher expressed in
   patients of high-risk group, such as minichromosome maintenance (MCM)
   complex [174]^50. While, the low-risk group showed gene-enrichment in
   drug metabolism-cytochrome P450, PPAR signaling, and fatty acid
   metabolism pathway (Figure [175]7). These pathways, associated with
   hepatic metabolism of cholesterol, fatty acid and drug, are usually
   down-regulated in HCC patients [176]^51^, [177]^52. This indicates that
   patients in the low-risk group preserve relative good capacity of
   hepatic function, which might also protect HCC patients from persistent
   drug toxicity in response to various medication. These results showed a
   highly similar manner with the significant changed KEGG pathways of the
   most significant genes in total tumor samples of HBV-associated HCC as
   compared to healthy tissue presented by integrational analysis (Figure
   [178]2B).

Discussion

   Although increasing chip and RNA-seq profiling studies were performed
   to find differentially expressed genes in HBV-associated HCC, these
   reports often presented inconsistent results across different studies.
   In our current work, we used an integrated analysis by using the RRA
   method to identify significant differentially expressed genes from 7
   independent datasets of HBV-associated HCC profiling experiments, which
   could overcome the rather noisy from different individual studies
   [179]^53.

   The expression levels of each gene were re-ranked and their
   significances were re-determined by giving a P-value for each gene.
   Finally, 82 up-regulated and 577 down-regulated integrated significant
   genes were identified. Many of them have been proven to be oncogenic,
   such as aldo-keto reductase family 1 member B10 (AKR1B10) [180]^54,
   maternal embryonic leucine zipper kinase (MELK) [181]^55, and aldehyde
   dehydrogenase 3 family member A1 (ALDH3A1) [182]^56). Some are
   considered as anti-oncogenes, such as hydroxyacid oxidase 2 (HAO2)
   [183]^57, C-type lectin domain family 1 member B (CLEC1B) [184]^58 and
   secreted phosphoprotein 2 (SPP2) [185]^59 in HCC. The result of
   enrichment analysis also showed that these significance genes were
   highly related to HCC, such as the cell cycle pathway and p53 signaling
   pathway [186]^32^, [187]^33. In addition, we also identified many
   significantly expressed genes, the role of which are still unclear and
   need further research in HCC, such as EH domain containing 3 (EHD3),
   aspartoacylase (ASPA) and melanocortin 2 receptor accessory protein 2
   (MRAP2). This method could also be used in other human diseases such as
   neurological disorders [188]^60^, [189]^61.

   In this study, we also used WGCNA combination with PPI network to
   identify hub genes that link with HBV-associated HCC. Three clusters
   identified by PPI network from the most significant module by WGCNA
   were mainly involved in metabolism correlation pathways, cell cycle
   pathways and cytokine-cytokine receptor interaction, which are closely
   related to HCC [190]^33^, [191]^62. In Cluster 1, the expression levels
   of 10 genes were significantly correlated with OS of HBV-associated HCC
   patients by univariate Cox proportional hazards regression model and
   Kaplan-Meier survival analysis, which had differences not only in the
   expression but also in survival. Therefore, seeking genes in the
   largest cluster, Cluster 1, for hub genes is more useful and precise
   for the generation and validation of prognostic signature by
   multivariable cox analysis.

   Using clinical information, pathological classification and the
   expression of protective or risk genes could provide information for
   the diagnosis and prognosis in different types of cancer
   [192]^63^-[193]^65. Recently, some prediction models for risk
   assessment in HCC has been reported, such as 4-aminobutyrate
   aminotransferase (ABAT), alanine-glyoxylate aminotransferase (AGXT),
   aldehyde dehydrogenase 6 family member A1 (ALDH6A1), cytochrome P450
   family 4 subfamily A polypeptide 11 (CYP4A11), D-amino-acid oxidase
   (DAO) and enoyl-CoA hydratase and 3-hydroxyacyl CoA dehydrogenase
   (EHHADH) [194]^66^-[195]^68. Several studies have identified prognostic
   signatures by combining multiple genes, which contained more than 6
   genes, for HCC [196]^66^, [197]^69^, [198]^70. Because many variables
   need to be controlled, these prognostic signatures are difficult to
   develop as diagnosis markers in the future. Li B et al recently
   identified a three-gene prognostic signature (UPB1, SOCS2 and RTN3) for
   HCC without considering the etiology of disease by using datasets from
   The Cancer Genome Atlas project (TCGA) [199]^71. RTN3 is not a
   significant gene in HBV-associated HCC, which may limit its use in
   HBV-associated HCC [Table [200]S1]. In the current study, we discovered
   a three-gene model (SPP2, CDC37L1, and ECHDC2) that contributes a
   specific prognostic signature for HBV-associated HCC, none of which
   were reported in previous prediction models. These three genes are
   specifically highly expressed in healthy liver tissue but lose their
   expression in HCC tissues (Data not shown). SPP2 is localized mainly in
   the extracellular region, which could inhibit the growth of HCC in vivo
   [201]^59. CDC37L1 is localized mainly in cytoplasm and extracellular
   region, which participates in unfolded protein and heat shock protein
   binding [202]^72, and has been considered as a promising marker in HCC.
   ECHDC2 is a mitochondrial-related gene, which takes part in the
   metabolic process and fatty acid biosynthesis [203]^73.

   These three hub genes may have protective roles in HBV-associated
   carcinogenesis mainly by involving drug metabolism-cytochrome P450,
   PPAR signaling pathway and many metabolic pathways, pathways of which
   are enriched in patients of the low-risk group by GSEA analysis.
   Previous studies revealed that abnormality of PPAR signaling and drug
   metabolism-cytochrome P450 pathway can be the characteristics of early
   stages of HCC [204]^51^, [205]^52^, [206]^74. Reprogramming of
   metabolic pathways, such as fatty acid metabolism, is also associated
   with angiogenesis, cell adhesion and invasion of tumor tissue
   [207]^75^, [208]^76. In patients in the low-risk group, higher
   expression levels of genes in these pathways, such as aldehyde
   dehydrogenase family, cytochrome P450 family, enoyl-CoA hydratase and
   3-hydroxyacyl CoA dehydrogenase (EHHADH), help maintain good metabolic
   function in liver [209]^77^-[210]^79, which in turn might protect HCC
   patients from persistent drug toxicity in response to various drug
   treatments. Therefore, this three-gene prognostic signature may be a
   promising prognostic marker for HBV-associated HCC, which warrants
   further functional and mechanistic studies. It may also be beneficial
   to develop a commercial detection kit for diagnosis and prognosis
   prediction based on this three-gene prognosis signature.

   Limitations exist in our study. Only the [211]GSE14520 dataset was
   selected for this work, resulting in limited samples for the 3-gene
   signature model of prognosis. Also, little was known on the functions
   and mechanisms of these three genes in HBV-associated HCC. As a result,
   further validation studies of our model with independent larger cohorts
   and further characterization of the functions of these 3 genes in
   HBV-associated HCC are required in the future.

   In conclusion, WGCNA analysis of differentially expressed genes
   acquired by RRA from 7 datasets following by multivariable Cox
   regression analysis identifies a three hub gene module (SPP2, CDC37L1,
   and ECHDC2) as a specific prognostic signature for HBV-associated HCC.
   The higher expression levels of these three genes may improve the
   prognosis of patients with HBV-associated HCC by influencing drug
   metabolism-cytochrome P450, PPAR signaling pathway and many other
   metabolic pathways.

Supplementary Material

   Supplementary tables.
   [212]Click here for additional data file.^ (323.7KB, pdf)

Acknowledgments