Abstract Integration of public genome-wide gene expression data together with Cox regression analysis is a powerful weapon to identify new prognostic gene signatures for cancer diagnosis and prognosis. Hepatitis B virus (HBV) is a major cause of hepatocellular carcinoma (HCC), however, it remains largely unknown about the specific gene prognostic signature of HBV-associated HCC. Using Robust Rank Aggreg (RRA) method to integrate seven whole genome expression datasets, we identified 82 up-regulated genes and 577 down-regulated genes in HBV-associated HCC patients. Combination of several enrichment analysis, univariate and multivariate Cox proportional hazards regression analysis, we revealed that a three-gene (SPP2, CDC37L1, and ECHDC2) prognostic signature could act as an independent prognostic indicator for HBV-associated HCC in both the discovery cohort and the internal testing cohort. Gene set enrichment analysis showed that the high-risk group with lower expression levels of the three genes was enriched in bladder cancer and cell cycle pathway, whereas the low-risk group with higher expression levels of the three genes was enriched in drug metabolism-cytochrome P450, PPAR signaling pathway, fatty acid and histidine metabolisms. This indicates that patients of HBV-associated HCC with higher expression of these three genes may preserve relatively good hepatic cellular metabolism and function, which may also protect HCC patients from persistent drug toxicity in response to various medication. Our findings suggest a three-gene prognostic model that serves as a specific prognostic signature for HBV-associated HCC. Keywords: Hepatitis B virus associated hepatocellular carcinoma, Robust Rank Aggreg analysis, Hub genes, Prognostic signature, Overall survival. Introduction In recent years, high-throughput expression profiling data from microarray chip and RNA sequencing (RNA-Seq) provide useful information regarding the mechanisms and diversity of various cancers, and are valuable for the diagnosis, therapeutic response prediction and prognosis evaluation in those diseases [43]^1. However, there are many obstacles in data preparation, analysis and interpretation because the results from one individual study can be rather noisy. Therefore, integration of the public genome-wide gene expression data is a powerful weapon for cancer diagnosis and prognosis [44]^2. Hepatocellular carcinoma (HCC) is one of the most prevalent cancers worldwide and the second cause of death among malignancy tumors [45]^3. Although more than 40 prognostic gene signatures have been described in HCC, none has become a commercial marker useful for clinical decision-making [46]^4^, [47]^5. Therefore, identifying a more specific gene prognostic signature for HCC with a specific etiology still represents a critical approach for the diagnosis and prognosis of HCC. Over 50% of HCC cases worldwide are attributed to hepatitis B virus (HBV) infection. In areas with high HBV prevalence, such as Southeast Asia and sub-Saharan Africa, more than 80% of patients with HCC are due to HBV infection [48]^6. Chronic HBV infection plays critical roles in the progression of HCC and patients with HBV-associated HCC usually have poor clinical recovery and high rate of recurrence after treatment [49]^7. Control of HBV infection is effective to reduce the mortality of HCC patients in East Asia and Southern Europe [50]^8. Because of the integration of HBV DNA into host cellular DNA, and also the expression of HBV proteins that have direct effect on cellular functions [51]^9, the pathogenesis of HBV-associated HCC has its own specific gene signatures as compared to other types of HCC [52]^10^-[53]^12. Recently, semi-supervised methods to predict patient survival from gene expression data have been widely accepted. These methods allow us to use the expression of some critical genes, also known as prognostic genes, to predict the survival of future patients and also the efficiency of therapeutic treatment [54]^13. As such, screening of HBV specific prognostic gene models is ultimately important for the diagnosis and prognosis of patients with HBV-associated HCC. However, no specific gene prognostic model has been developed for HBV-associated HCC until now. In this current study, the R package called Robust Rank Aggreg (RRA) was used to integrate several HBV-associated HCC experimental sources, which is helpful to identify evidence-supported findings and to increase the signal as well as to minimize rates of false positive findings [55]^14. By using the gene cluster of differential expressed genes acquired from RRA, we performed bioinformatics analysis of weighted gene co-expression network analysis (WGCNA) [56]^15 to identify key gene modules, following by construction of the protein-protein interaction (PPI) networks for the key gene modules in order to further identify hub genes. Lastly, we identified a three-gene prognostic signature in these hub genes by univariate and multivariate Cox regression analysis from the [57]GSE14520 dataset. Materials and Methods Collection of HBV-associated HCC gene expression datasets All HBV-associated HCC datasets were downloaded from GEO ([58]http://www.ncbi.nlm.nih.gov/geo/). Every dataset was estimated thoroughly through full text. The selection criteria used in this study are as follows: 1. gene expression profiling datasets that included gene microarray chip technology or RNA-Seq technique; 2. studies comparing gene expressions between HBV-associated cancer and non-cancerous liver tissue in human samples; 3. expression studies using cell lines or body fluid (such as serum, saliva, peripheral blood, etc.) were excluded; 4. sample size should be larger than three pairs. According to the above screening criteria, 7 datasets were finally included in this study: [59]GSE77509 [60]^16, [61]GSE62232 [62]^17, [63]GSE25097 [64]^18, [65]GSE54238 [66]^19, [67]GSE50579 [68]^20, [69]GSE14520 [70]^21, and [71]GSE55092 [72]^22 (Table [73]1). Table 1. Characteristics of the studies First author and reference Region Number of samples Year GEO accession Platforms Data Type Yang Y et al [74]^9 China 17 pair 2017 [75]GSE77509 [76]GPL16791 RNA-seq Schulze K et al [77]^10 France 10T,10N 2015 [78]GSE62232 [79]GPL570 Gene Chip Sung WK et al [80]^11 USA 223T,243N 2011 [81]GSE25097 [82]GPL10687 Gene Chip Yuan SX et al [83]^12 USA 26T,10N 2016 [84]GSE54238 [85]GPL16955 Gene Chip Neumann O et al [86]^13 Germany 8T,10N 2012 [87]GSE50579 [88]GPL14550 Gene Chip Roessler S et al [89]^14 USA 212 pair 2010 [90]GSE14520 [91]GPL571 [92]GPL3921 Gene Chip Melis M et al [93]^15 USA 39T,81N 2014 [94]GSE55092 [95]GPL570 Gene Chip [96]Open in a new tab Note: T, tumor samples; N, non-cancerous samples; pairs, tumor tissues and non-cancerous tissues from the same patient. Datasets processing and statistical analysis Series matrix files of each dataset were downloaded from GEO. Normalization of the microarray datasets was performed by the package limma of R (version 3.3.3), and normalization of the RNA-seq datasets was performed by the package edgeR of R (version 3.3.3) [97]^23. We used R package RRA to identify the most significant genes, and the new data frame results were constructed with the standard of P value < 0.01 [98]^14. Enrichment analysis To conduct enrichment analyses, the package cluster Profiler (version 3.2.14) of R (version 3.3.3) was used for analyzing the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) processes, as described previously [99]^24^, [100]^25. WGCNA WGCNA is a typical systemic biological method for describing correlation patterns among genes and identifying modules of highly correlated genes by using average linkage hierarchical clustering coupled with the topological overlap dissimilarity measure based on high-throughput chip data or RNA-Seq data [101]^26^-[102]^28. In current study, WGCNA package (version 1.60) in R was used to merge highly correlated genes and identify key modules based on the expression levels of the most significant genes, identified by RRA methods above, from the data of [103]GSE77509 [104]^16. Here, the power of β = 18 (scale free R^2 = 0.8) was selected as the soft-thresholding to ensure a scale-free network. The cut Height = 0.7 and min Size = 10 were used to identify key modules. The gene modules were labeled with different colors and the grey module showed the genes that cannot be merged. PPI network construction and clusters analysis STRING ([105]https://string-db.org/) was employed to construct PPI networks for distance-related genes by using the genes in the biggest module identified by WGCNA [106]^29. The PPI networks were then exported from STRING and imported into Cytoscape. We then used the “Molecular Complex Detection (MCODE)” Cytoscape app to identify discrete clusters from the former PPI networks using default settings (K-Core: 2; degree cutoff: 2; max. depth: 100 and node score cutoff: 0.2 were used as the cutoff criteria for the identification of network modules) [107]^30. The top clusters were screened under the conditions of minimum size = 6 and minimum score = 6. Generation and validation of prognostic signature and statistical analysis The data of Roessler S's study (GEO DataSet: [108]GSE14520), which contained the data of both gene expression levels and clinical signatures of 212 HBV-associated HCC patients, were used for the generation and validation of prognostic signature using the package of survminer and survival of R (version 3.3.3). Firstly, the univariate Cox proportional hazards regression model was used to evaluate the associations between overall survival (OS) and the expression level of each hub gene that belong to the top cluster identified by PPI networks and MCODE Cytoscape app (P < 0.05 was considered significant). Hazard ratios (HRs) were applied to identify protective (HR < 1) and risk genes (HR > 1). Secondly, we divided the data of these 212 patients randomly into two groups: a discovery cohort (N = 106) and an internal testing cohort (N = 106). Thirdly, in the discovery cohort, protective or risk genes identified by univariate Cox analysis were selected to calculate the association with OS by multivariable Cox analysis, and the risk scores for predicting OS was calculated with the formula based on the protective or risk genes: Risk scores = exp[gene1]*β[1] + exp[gene2]*β[2] +…+exp[genen]*β[n]. The “β” value is the estimated regression coefficient of gene derived from the multivariate Cox analysis and “exp” indicates the expression profiles of gene [109]^31. Kaplan-Meier survival analysis with the log-rank test (P < 0.05) was used to examine the proportional assumptions for Cox proportional hazard model. Lastly, the prognostic signature of indicated genes generated in the discovery cohort was further confirmed in the internal testing cohort. The statistical significance was defined as P < 0.05. Gene Set Enrichment Analysis (GSEA) GSEA was performed with indicated prognostic genes using phenotype labels “high-risk” vs “low-risk” by the GSEA software ([110]http://software.broadinstitute.org/gsea/index.jsp). Gene sets used in this work were c2.cp.kegg.v5.2.symbols.gmt downloaded from the Molecular Signatures Database (MSigDB, [111]http://software.broadinstitute.org/gsea/msigdb/index.jsp) Statistical Analysis Statistical analyses were accomplished using R (version 3.3.3). The association between clinical characteristics and group was determined using chi-square test, wilcoxon rank sum test or unpaired t test according the data type of clinical characteristics. Univariate and multivariate Cox regression models were used to evaluate prognostic significance. Survival analysis was performed using Kaplan-Meier and log-rank test. The Student's t-test was used for comparisons of two independent groups. A P value < 0.05 was considered statistically significant. Results Identified significance genes by integrated analysis In order to identify significance genes in HBV-associated HCC, we setup a workflow showing in Figure [112]1. By using the method of RRA to integrate 7 datasets, 82 up-regulated genes and 577 down-regulated genes were identified as the most significant genes [Table [113]S1, P < 0.01]. The top 50 significantly up-regulated and down-regulated genes were listed in Figure [114]2A. KEGG pathway enrichment analysis of the most significant up-regulated genes showed that the top 5 enriched pathway is cell cycle, cellular senescence, melanoma, p53 signaling pathway, and ECM-receptor interaction. All of these pathways were reported with important roles in cancer [115]^32^-[116]^35 (Figure [117]2B). Whereas the top 5 KEGG pathway enriched in the down-regulated genes, including valine, leucine and isoleucine degradation, retinol metabolism, PPAR signaling pathway, fatty acid degradation and complement and coagulation cascades, also have been studied to be important pathways involved in tumor growth and metastasis (Figure [118]2B and Table [119]S2) [120]^36^-[121]^39. Together with the data of GO analysis (Figure [122]2C, 2D, and Table [123]S3), these significant genes are shown to be mainly involved in tumor metabolism and growth. Figure 1. [124]Figure 1 [125]Open in a new tab Flowchart describing the schematic overview of the study design. After integrated analysis and different bioinformatics analysis of HBV-associated HCC genome expression datasets, we identified hub genes in Cluster 1. Hub genes were then analyzed individually for prognostic significance by univariate Cox proportional hazards and Kaplan-Meier survival analysis. 10 hub genes were significantly associated with the survival of patients with HBV-associated HCC (UVA Cox analysis, P < 0.05, and log-rank test P < 0.05). The HBV-associated HCC cohort ([126]GSE14520, N = 212) were randomly divided in to discovery cohort (N = 106) and internal testing cohort (N = 106). Next, we used multivariable Cox proportional hazards stepwise regression analysis with forward selection to build a prognostic model that included 3 genes: SPP2, ECHDC2, and CDC37L1. This model was used to calculate risk scores for discovery cohort (risk score = expSPP2* - 0.1941 + expCDC37L1* - 0.5466 + expECHDC2* - 0.4714), and the cut-off point was chosen. This risk score calculation and cut-off point were further validated in internal testing cohort. Lastly, GSEA analysis of the high-risk and low-risk group was used to further inquiry the 3 genes prognostic signature. UVA Cox: univariate Cox; MVA Cox: multivariate Cox. Figure 2. [127]Figure 2 [128]Open in a new tab Identified significance genes and enrichment analysis. (A) Heatmap showed the fold change of the top 100 significantly genes in different studies (50 up-regulated genes and 50 down-regulated genes) by Robust Rank Aggreg (RRA) methods from 7 different datasets. Each row represents the same mRNA and each column represents the same study. The fold change intensity of each mRNA in one study is represented in shade of red or blue. Red represents the fold change of up-regulated genes and blue represents the fold change of down-regulated genes, respectively, in comparison to non-tumor tissues. (B) Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis of the significant genes. Up-regulated KEGG pathways are showed with horizontal axis > 0, and down-regulated genes are showed with horizontal axis < 0, respectively. The size of the horizontal axis shows the numbers of enriched genes in each KEGG pathway and the color shade of bar reflects P value. (C-D) Gene Ontology (GO) enrichment analysis of up-regulated (C) and down-regulated genes (D). The vertical and horizontal axes represent GO term and -log[10] (P value) of the corresponding GO term, respectively. The number in each bar reflects the enriched gene number of each GO term. Different colors reflect main categories of GO terms: BP, biological process; CC, cellular component; MF, molecular function. WGCNA and PPI analysis Co-expression network by WGCNA analysis revealed that the most significant genes mentioned above were grouped into 3 modules, of which the turquoise module was the most significant one (Figure [129]3A and Table [130]S4). KEGG pathway enrichment analysis showed that the turquoise module is mainly involved in the PPAR signaling and different metabolism pathway, similar to the enriched KEGG pathways involving by the most significant genes [131]^40^-[132]^42 (Figure [133]3B). We further constructed the PPI networks of the turquoise module by STRING, of which the largest connected master network containing 396 nodes (Figure [134]4A). In this largest connected master network, three discrete sub-clusters were extracted using the MCODE app by Cytoscape software [135]^30 (Figure [136]4B). GO enrichment analysis of these three discrete sub-clusters revealed that Cluster 1, the largest cluster containing 32 genes was mainly involved in platelet degranulation, fatty acid metabolic process and exocytosis (Figure [137]4C and Table [138]S5). KEGG pathway analysis revealed that Cluster 1 was mainly involved cellular metabolism: including fatty acid degradation, fatty acid metabolism and valine, leucine and isoleucine degradation (Figure [139]4C and Table [140]S5). This indicates that Cluster 1 may play a critical role in the pathogenesis of HCC, as platelet degranulation, fatty acid metabolism and degradation have been reported to have important roles in the development of HCC [141]^36^, [142]^43^, [143]^44. Figure 3. [144]Figure 3 [145]Open in a new tab WGCNA analysis of the significant genes and KEGG analysis of the key modules from WGCNA. (A) Gene clustering and module identification by WGCNA analysis based on the dataset of [146]GSE77509. Top: clustering dendrogram showed the result of hierarchical clustering, and each line represents one gene. Bottom: the colored row below the dendrogram indicates module membership identified by the static tree cutting method. Different color represents different co-expression network modules for the significantly genes. (B) KEGG analysis of the top three modules from WGCNA. The vertical and horizontal axes represent the KEGG pathways and different modules, respectively. The size and the color intensity of a circle represent gene number and -log[10] (P value), respectively. Figure 4. [147]Figure 4 [148]Open in a new tab Identified discrete clusters and enrichment analysis of the turquoise module in Figure [149]3. (A) PPI network of genes in the turquoise module. The color intensity in each node represents the fold change of the gene in comparison to non-tumor samples (up-regulation of a gene is shown in red and down-regulation of a gene is shown in blue). The size of the circle is proportional to the score of PPI based on the STRING database. (B) Main sub-clusters from the master PPI networks. The color intensity in each node was proportional to fold change of each gene expression in comparison to non-tumor samples (up-regulation in red and down-regulation in blue). (C) KEGG and GO enrichment analysis of the top three sub-clusters. The vertical and horizontal axes represent the GO biological process /KEGG pathways and different sub-clusters, respectively. The size and the color intensity of a circle represent gene number and -log10 (P value), respectively. Identification of hub genes in Cluster 1 by univariate Cox proportional hazards regression model The above findings that Cluster 1 genes were mainly involved in platelet degranulation, fatty acid metabolism and degradation prompted us to explore its prognostic value in HBV-associated HCC. To identify prognosis-related genes in this network, we first used an univariate Cox proportional hazards regression model to evaluate the associations between the expression level of the involved genes in Cluster 1 and OS by using the data reported by Roessler S et al (GEO DataSet: [150]GSE14520) [151]^21. The data revealed that 10 out of 32 genes, including acyl-CoA dehydrogenase, C-4 To C-12 straight chain (ACADM), acyl-coA synthetase medium-chain family member 3 (ACSM3), cell division cycle 37 like 1 (CDC37L1), crystallin lambda 1 (CRYL1), enoyl-CoA hydratase domain containing 2 (ECHDC2), coagulation factor VIII (F8), glutaryl-CoA dehydrogenase (GCDH), histidine rich glycoprotein (HRG), methylmalonyl-CoA mutase (MUT) and secreted phosphoprotein 2 (SPP2), were all negatively relevant to OS in patients of HBV-associated HCC (P < 0.05, and log-rank test P < 0.05) (Figure [152]5 and Table [153]S6). Figure 5. [154]Figure 5 [155]Open in a new tab Kaplan-Meier survival plots of the association between the expression levels of hub genes and overall survival probability in patients of HBV-associated HCC. P values were obtained from Log-rank test. Yellow and blue line represent the samples with the gene higher expressed and lower expressed, respectively. The table below the Kaplan-Meier survival plots showed the number of patients at the risk. Abbreviations: ACADM, acyl-CoA dehydrogenase, C-4 To C-12 straight chain; ACSM3, acyl-coA synthetase medium-chain family member 3; CDC37L1, cell division cycle 37 like 1; CRYL1, crystallin lambda 1; ECHDC2, enoyl-CoA hydratase domain containing 2; F8, coagulation factor VIII; GCDH, glutaryl-CoA dehydrogenase; HRG, histidine rich glycoprotein; MUT, methylmalonyl-CoA mutase; SPP2, secreted phosphoprotein 2. Generation and validation of multiple-gene prognostic signature by multivariable Cox proportional hazards regression analysis By randomly dividing the data of 212 HBV-associated HCC patients in Roessler et al's study (GEO DataSet: [156]GSE14520) [157]^21 into a discovery cohort (N = 106) and an internal testing cohort (N = 106) (Table [158]2, and Table [159]S7). The 10 genes that were significantly associated with OS of HBV-associated HCC were used for prognostic module building by using forward conditional stepwise regression with multivariable Cox analysis in the discovery cohort. This procedure selected a prognostic model containing three genes, including SPP2, CDC37L1 and ECHDC2. The risk scores of these three genes signature for each sample in the discovery cohort were calculated by using the following formula: Risk score = exp[SPP2]*-0.1941 + exp[CDC37L1]* - 0.5466 + exp[ECHDC2]* - 0.4714 and ranked according to the values of Risk score. Using the median risk score (-9.158) as the cut-off point, patients were divided into a high-risk group (score > -9.158, N = 53) and a low-risk group (score ≤ -9.158, N = 53) (Figure [160]6A). Patients in high-risk group tended to express low levels of these three genes and shorter survival time (Figure [161]6B-C). Consistently, patients in high-risk group had lower OS rates than those in low-risk group, suggesting that patients in the high-risk group was more likely to have higher mortality than that in the low-risk group (Log-rank test, P < 0.05) (Figure [162]6D). Table 2. Clinical characteristics of the patients Clinical Variable Discovery Cohort (n = 107) Internal Testing Cohort (n = 107) P value Sex Male 92 91 1^b Female 14 15 Age, y Median (range) 51(71-27) 50 (77-21) 0.6054^a ALT Negative (≤50 U/L) 67 57 0.2097^b Positive (>50 U/L) 39 49 Tumor size Large(>5cm) 31 43 Small(≤5cm) 74 63 0.1245^b No data 1 0 TNM stage I 47 42 II 37 39 1^c III-IV 22 25 AFP Low 56 59 High 50 44 0.6117^b No data 0 3 [163]Open in a new tab AFP, α-fetoprotein; ALT, alanine transferase; ^a Unpaired t test; ^b chi-squared test (χ2 test); ^c wilcoxon rank sum test. Figure 6. [164]Figure 6 [165]Open in a new tab Three-gene predictor-score analysis of HBV-associated HCC patients in both the discovery and internal cohort. (A) and (E) Three-gene risk score distribution in both the discovery (A) and internal testing cohort (E), respectively. Each point represents one patient; the vertical and horizontal axes represent the risk score calculated from the three-genes model and results sorted by the size of the risk score, respectively; red and blue represent patients with high and low risk scores identified by cut-off, respectively. The black dotted line represents the median mRNA risk score cut-off dividing patients into low-risk or high-risk groups. (B) and (F) Patients' survival status and time in both the discovery (B) and internal testing cohort (F), respectively. Each point corresponds to the same patient as above; the vertical and horizontal axes represent the survival time and results sorted by the size of the risk score, respectively; red or blue represent patient dead or live in the end, respectively. (C) and (G) Heatmap of gene expression profiles in both the discovery (C) and internal testing cohort (G), respectively. Each row represents the same gene and each column represents the same patient corresponded to the above point. The expression intensity of each gene in one patient is represented in shade of red or grey, indicating its expression level above or below the median expression intensity across all patients, respectively. (D) and (H) The Kaplan-Meier overall survival plots for HBV-associated HCC risk groups obtained from both the discovery (D) and internal testing cohort (H), respectively. Red and green line represent the patient with high or low risk, respectively. To test our findings, this three-gene risk score model was further evaluated using the internal cohort. With the same risk-score formula and cut-off point derived from the discovery cohort, the internal cohort was divided into high-risk group (N = 61) and low-risk group (N = 45). The data in internal cohort revealed a similar finding as that in the discovery cohort (Figure [166]6E-H). GSEA analysis According to the above results, this three-gene model can distinguish survival difference from HBV-associated HCC patients. GSEA analyses of the data of the [167]GSE14520 revealed that the high-risk group showed gene-enrichment in bladder cancer and cell cycle pathway (Figure [168]7). And the oncogenes in bladder cancer pathway were overexpressed in the tumor tissues of HBV-associated HCC patients in the high-risk group, such as MYC that can trigger specific gene amplification to promote cell growth and proliferation [169]^45. Figure 7. [170]Figure 7 [171]Open in a new tab GSEA analysis of the three prognostic genes. GSEA analysis of the differentially expressed genes between the high-risk versus low-risk group. Only two significantly functional gene sets were enriched for high-risk group (marked in red), whereas only four most significantly enriched functional gene sets were listed for low-risk group (markered in blue) in HBV-associated HCC samples. Other overexpressed genes of bladder cancer pathway in the tumor tissue of patient in the high-risk group, such as vascular endothelial growth factor B (VEGFB), matrix metallopeptidase (MMP) 1, 2 and 9, are also reported to be involved in tumor invasion and metastasis [172]^46^-[173]^49. In the cell cycle pathway, genes associated with the proliferative capacity of tumor cells were also higher expressed in patients of high-risk group, such as minichromosome maintenance (MCM) complex [174]^50. While, the low-risk group showed gene-enrichment in drug metabolism-cytochrome P450, PPAR signaling, and fatty acid metabolism pathway (Figure [175]7). These pathways, associated with hepatic metabolism of cholesterol, fatty acid and drug, are usually down-regulated in HCC patients [176]^51^, [177]^52. This indicates that patients in the low-risk group preserve relative good capacity of hepatic function, which might also protect HCC patients from persistent drug toxicity in response to various medication. These results showed a highly similar manner with the significant changed KEGG pathways of the most significant genes in total tumor samples of HBV-associated HCC as compared to healthy tissue presented by integrational analysis (Figure [178]2B). Discussion Although increasing chip and RNA-seq profiling studies were performed to find differentially expressed genes in HBV-associated HCC, these reports often presented inconsistent results across different studies. In our current work, we used an integrated analysis by using the RRA method to identify significant differentially expressed genes from 7 independent datasets of HBV-associated HCC profiling experiments, which could overcome the rather noisy from different individual studies [179]^53. The expression levels of each gene were re-ranked and their significances were re-determined by giving a P-value for each gene. Finally, 82 up-regulated and 577 down-regulated integrated significant genes were identified. Many of them have been proven to be oncogenic, such as aldo-keto reductase family 1 member B10 (AKR1B10) [180]^54, maternal embryonic leucine zipper kinase (MELK) [181]^55, and aldehyde dehydrogenase 3 family member A1 (ALDH3A1) [182]^56). Some are considered as anti-oncogenes, such as hydroxyacid oxidase 2 (HAO2) [183]^57, C-type lectin domain family 1 member B (CLEC1B) [184]^58 and secreted phosphoprotein 2 (SPP2) [185]^59 in HCC. The result of enrichment analysis also showed that these significance genes were highly related to HCC, such as the cell cycle pathway and p53 signaling pathway [186]^32^, [187]^33. In addition, we also identified many significantly expressed genes, the role of which are still unclear and need further research in HCC, such as EH domain containing 3 (EHD3), aspartoacylase (ASPA) and melanocortin 2 receptor accessory protein 2 (MRAP2). This method could also be used in other human diseases such as neurological disorders [188]^60^, [189]^61. In this study, we also used WGCNA combination with PPI network to identify hub genes that link with HBV-associated HCC. Three clusters identified by PPI network from the most significant module by WGCNA were mainly involved in metabolism correlation pathways, cell cycle pathways and cytokine-cytokine receptor interaction, which are closely related to HCC [190]^33^, [191]^62. In Cluster 1, the expression levels of 10 genes were significantly correlated with OS of HBV-associated HCC patients by univariate Cox proportional hazards regression model and Kaplan-Meier survival analysis, which had differences not only in the expression but also in survival. Therefore, seeking genes in the largest cluster, Cluster 1, for hub genes is more useful and precise for the generation and validation of prognostic signature by multivariable cox analysis. Using clinical information, pathological classification and the expression of protective or risk genes could provide information for the diagnosis and prognosis in different types of cancer [192]^63^-[193]^65. Recently, some prediction models for risk assessment in HCC has been reported, such as 4-aminobutyrate aminotransferase (ABAT), alanine-glyoxylate aminotransferase (AGXT), aldehyde dehydrogenase 6 family member A1 (ALDH6A1), cytochrome P450 family 4 subfamily A polypeptide 11 (CYP4A11), D-amino-acid oxidase (DAO) and enoyl-CoA hydratase and 3-hydroxyacyl CoA dehydrogenase (EHHADH) [194]^66^-[195]^68. Several studies have identified prognostic signatures by combining multiple genes, which contained more than 6 genes, for HCC [196]^66^, [197]^69^, [198]^70. Because many variables need to be controlled, these prognostic signatures are difficult to develop as diagnosis markers in the future. Li B et al recently identified a three-gene prognostic signature (UPB1, SOCS2 and RTN3) for HCC without considering the etiology of disease by using datasets from The Cancer Genome Atlas project (TCGA) [199]^71. RTN3 is not a significant gene in HBV-associated HCC, which may limit its use in HBV-associated HCC [Table [200]S1]. In the current study, we discovered a three-gene model (SPP2, CDC37L1, and ECHDC2) that contributes a specific prognostic signature for HBV-associated HCC, none of which were reported in previous prediction models. These three genes are specifically highly expressed in healthy liver tissue but lose their expression in HCC tissues (Data not shown). SPP2 is localized mainly in the extracellular region, which could inhibit the growth of HCC in vivo [201]^59. CDC37L1 is localized mainly in cytoplasm and extracellular region, which participates in unfolded protein and heat shock protein binding [202]^72, and has been considered as a promising marker in HCC. ECHDC2 is a mitochondrial-related gene, which takes part in the metabolic process and fatty acid biosynthesis [203]^73. These three hub genes may have protective roles in HBV-associated carcinogenesis mainly by involving drug metabolism-cytochrome P450, PPAR signaling pathway and many metabolic pathways, pathways of which are enriched in patients of the low-risk group by GSEA analysis. Previous studies revealed that abnormality of PPAR signaling and drug metabolism-cytochrome P450 pathway can be the characteristics of early stages of HCC [204]^51^, [205]^52^, [206]^74. Reprogramming of metabolic pathways, such as fatty acid metabolism, is also associated with angiogenesis, cell adhesion and invasion of tumor tissue [207]^75^, [208]^76. In patients in the low-risk group, higher expression levels of genes in these pathways, such as aldehyde dehydrogenase family, cytochrome P450 family, enoyl-CoA hydratase and 3-hydroxyacyl CoA dehydrogenase (EHHADH), help maintain good metabolic function in liver [209]^77^-[210]^79, which in turn might protect HCC patients from persistent drug toxicity in response to various drug treatments. Therefore, this three-gene prognostic signature may be a promising prognostic marker for HBV-associated HCC, which warrants further functional and mechanistic studies. It may also be beneficial to develop a commercial detection kit for diagnosis and prognosis prediction based on this three-gene prognosis signature. Limitations exist in our study. Only the [211]GSE14520 dataset was selected for this work, resulting in limited samples for the 3-gene signature model of prognosis. Also, little was known on the functions and mechanisms of these three genes in HBV-associated HCC. As a result, further validation studies of our model with independent larger cohorts and further characterization of the functions of these 3 genes in HBV-associated HCC are required in the future. In conclusion, WGCNA analysis of differentially expressed genes acquired by RRA from 7 datasets following by multivariable Cox regression analysis identifies a three hub gene module (SPP2, CDC37L1, and ECHDC2) as a specific prognostic signature for HBV-associated HCC. The higher expression levels of these three genes may improve the prognosis of patients with HBV-associated HCC by influencing drug metabolism-cytochrome P450, PPAR signaling pathway and many other metabolic pathways. Supplementary Material Supplementary tables. [212]Click here for additional data file.^ (323.7KB, pdf) Acknowledgments