Abstract Background Traditionally top-down method was used to identify prognostic features in cancer research. That is to say, differentially expressed genes usually in cancer versus normal were identified to see if they possess survival prediction power. The problem is that prognostic features identified from one set of patient samples can rarely be transferred to other datasets. We apply bottom-up approach in this study: survival correlated or clinical stage correlated genes were selected first and prioritized by their network topology additionally, then a small set of features can be used as a prognostic signature. Methods Gene expression profiles of a cohort of 221 hepatocellular carcinoma (HCC) patients were used as a training set, ‘bottom-up’ approach was applied to discover gene-expression signatures associated with survival in both tumor and adjacent non-tumor tissues, and compared with ‘top-down’ approach. The results were validated in a second cohort of 82 patients which was used as a testing set. Results Two sets of gene signatures separately identified in tumor and adjacent non-tumor tissues by bottom-up approach were developed in the training cohort. These two signatures were associated with overall survival times of HCC patients and the robustness of each was validated in the testing set, and each predictive performance was better than gene expression signatures reported previously. Moreover, genes in these two prognosis signature gave some indications for drug-repositioning on HCC. Some approved drugs targeting these markers have the alternative indications on hepatocellular carcinoma. Conclusion Using the bottom-up approach, we have developed two prognostic gene signatures with a limited number of genes that associated with overall survival times of patients with HCC. Furthermore, prognostic markers in these two signatures have the potential to be therapeutic targets. Introduction Hepatocellular carcinoma (HCC) is the third leading cause of cancer-related death in the world, especially in Asia and Africa[[41]1]. Surgical resection is one of the most important curative treatments for HCC, while long-term survival of HCC remains poor because of high recurrence rate. Improvements in early diagnosis and accurate staging systems can help guide patients to take optimum treatment strategies that may suppress recurrence and prolong survival[[42]2]. Currently, beyond tumor-node-metastasis(TNM) staging system, several prognostic algorithms used to predict survival among patients with hepatocellular carcinoma have been established, Barcelona Clinic Liver Cancer (BCLC) and Cancer of the Liver Italian Program (CLIP) systems are among the most commonly used systems worldwide[[43]3–[44]7]. Nonetheless, all these staging systems only select optional clinical and serum biochemical indexes such as tumor size, vascular invasion, alpha fetoprotein (AFP), albumin(ALB), etc. Although these clinicopathologic staging systems have been proven useful, their predictive accuracy remains limited and they failed to provide molecular biological characteristics of HCC that might be genetic and heterogenic. With the recent advances in genome studies, gene expression profiling-based studies have improved our understanding of cancer biology and gene expression signatures have been successfully used as prognostic tools especially in breast cancer[[45]8–[46]10]. Recently, two main strategies have been used for prognostic gene signature identification, symbolized as “top-down” or “bottom-up”. In ‘top-down’ approach firstly genes with different expression patterns between case samples and control samples were sought, secondly gene(s) with expression level(s) significantly correlating to histological grades or biological phenotypes were selected as candidate genes, then a regression model with best predictors was built to construct the final gene signature. Many studies have applied this approach and identified quite a few gene signatures with prognostic values, such as MammaPrint[[47]11]. Unfortunately, prognostic signatures identified by this approach in HCC present minimal overlaps and few of them have been adopted in routine clinical practice[[48]12–[49]14]. The ‘bottom-up’ approach was firstly based on supervised analysis of genes which are directly associated with occurrence of the event studied (metastasis, survival), Secondly some genes were selected by machine learning algorithms or significant enrichment analysis of specific pathways or biological functions, and then the prognostic value of these gene sets could be calculated[[50]15,[51]16]. This ‘bottom-up’ approach was applied by some groups in several cancers but few of assays used this approach to identify HCC prognostic signatures. While the candidate gene set was determined by “top-down” or “bottom-up” approach, various machine learning algorithms including regression models can be used to identify the final gene signatures. However, overfitting and the low accuracy in independent cohort limited the clinical application of these algorithms[[52]17]. Meanwhile, analysis of networks and modular biological processes has shown the effective capacity to estimate key genes which may have impact on patient outcomes[[53]18]. In addition to identifying prognostic signatures in tumor tissues, many research groups showed that genes in adjacent non-tumor tissues appear to be implicated in tumour progression and aggressiveness, gene expression profiles in surrounding non-tumor tissues can be helpful to identify signatures associated with outcomes[[54]9]. This can be especially true with adjacent non-tumor tissues for HCC, where often pathological change such as cirrhosis is present because of long-term inflammation caused by hepatitis B or C virus (HBV or HCV) infection[[55]9,[56]19]. Over the past years, large investments have been input to discover novel markers for valuable biological insights to mechanisms of many common diseases. Nonetheless, the translation from genetic findings to clinical applications remains limited. Recently, some works explored one potential application of these genetic molecular markers: drug repositioning[[57]20,[58]21]. We suppose that prognostic gene signatures could not only be used for predicting patient outcome, but also can be used to gain important information for drug discovery, which can then be utilized to provide appropriate therapeutic advices to patients. In this study, we have mainly implicated the ‘bottom-up’ approach to develop HCC prognostic signatures both in tumor and adjacent non-tumor tissues. The workflow of our analysis is shown in [59]Fig. 1. In the training cohort, we first identified genes significantly associated with clinical staging systems or patient survival times, then selected some representative and important markers by network and pathway enrichment analyses, finally developed two signatures that can predict overall survival of HCC patients. Likewise, the ‘top-down’ approach have also been tried. Differentially expressed genes(DEGs) in tumor tissues compared to adjacent non-tumor tissues were firstly obtained, then we adopted the same bioinformatics methods to identify some core genes, based on which a DEGs-related signature was finally constructed. The three signatures were further tested in an independent cohort. The results showed that signatures developed by the ‘bottom-up’ approach were more efficient compared to signatures identified by the ‘top-down’ approach, and two signatures identified by ‘bottom-up’ approach were validated in the independent cohort. In addition, we have evaluated the potential application of our molecular markers as drug targets by drug repositioning analysis. Fig 1. The workflow of prognostic model construction. [60]Fig 1 [61]Open in a new tab Firstly, prognostic genes were selected for ProgGenes and ProgNGenes sets (bottom-up approach). Differentially expressed genes(DEGs) were selected by comparing the profiles of tumors and non-tumors synchronously (top-down approach). Next, the three gene clusters identified by both approaches went through both network and pathway analysis to obtain most important genes for prognostic signature assembling and model construction. Finally the signatures were validated in one independent clinical data set. Materials & Methods Datasets: Genomic profiles and patient information Publicly available datasets with whole-genome gene expression measures in 225 tumors and 220 adjacent non-tumor tissues from 225 primary HCC patients were downloaded from microarray databases Gene Expression Omnibus (GEO: [62]http://www.ncbi.nlm.nih.gov/projects/geo/), the accession number is [63]GSE14520[[64]14]. Pre-processed series of matrixes originally provided by the authors were used in our analysis. Accessory available clinical and follow-up data were also provided by the authors. Patient and tumor features are detailed in [65]Table 1. Among the 225 patients, four patients were excluded due to the lack of survival time. The validation set that included 80 tumors and 82 adjacent non-tumor liver tissues from 82 primary HCC patients was also retrieved from Gene Expression Omnibus, the accession number is [66]GSE10143[[67]9]. For each sample, the expression values of all probes for a given gene were reduced to a single value by taking the average expression value. Table 1. Clinical, histological, molecular data of HCC in training cohort. Variables Categories Total(n = 221) OS RFS HR p-value HR p-value Age < = 56 112(51%) 0.917 0.256 >56 109(49%) 1.02(0.67–1.56) 1.23(0.86–1.76) Gender female 30(14%) 0.153 0.019[68]^* male 191(86%) 1.7(0.82–3.52) 2.17(1.13–4.14) NA 3(1%) AFP < = 300ng/ml 118(53%) 0.017[69]^* 0.159 >300ng/ml 100(45%) 1.68(1.1–2.58) 1.25(0.82–1.91) ALT < = 50U/L 130(59%) 0.726 0.23 >50U/L 91(41%) 1.08(0.7–1.66) 1.25(0.87–1.78) NA 1(0%) Size < = 5cm 140(63%) 0.002[70]^* 0.067 >5cm 80(36%) 1.94(1.26–2.99) 1.41(0.98–2.04) Multinodular No 176(80%) 0.057 0.428 Yes 45(20%) 1.59(0.99–2.57) 1.19(0.77–1.84) cirrhosis No 18(8%) 0.032[71]^* 0.062 Yes 203(92%) 4.62(1.14–18.8) 2.18(0.96–4.97) NA 2(1%) TNM I 93(42%) <0.001[72]^** <0.001[73]^** II 77(35%) 2.08(1.21–3.58) 1.97(1.29–3.01) III 49(22%) 5.05(2.91–8.79) 3.14(1.97–5.01) NA 2(1%) BCLC 0-A 168(76%) <0.001[74]^** <0.001[75]^** B-C 51(23%) 3.63(2.33–5.65) 2.78(1.88–4.11) NA 2(1%) CLIP 0 97(44%) <0.001[76]^** 0.0017[77]^** 1 74(33%) 1.49(0.86–2.56) 2.21(1.42–3.43) 2–5 48(22%) 3.75(2.24–6.3) 1.25(0.82–1.91) [78]Open in a new tab In univariate analysis, the three staging systems, TNM, BCLC and CLIP, were most significantly associated with overall survival and recurrence-free survival. AFP, a-fetoprotein; ALT, alanine transferase; Prognostic staging system: TNM, Tumor Node Metastasis; BCLC, Barcelona Clinic Liver Cancer; CLIP, Cancer Liver Italian Program. OS, overall survival; RFS, recurrence-free survival; HR: hazard ratio. **P-value <0.01; *P-value <0.05. ‘Top-down’ approach: identification of differentially expressed genes(DEGs) As the first step of the ‘top-down’ approach, we identified the most varying genes between tumors and non-tumor tissues using Linear Models for Microarray Data (limma)[[79]22] analysis. Eventually, only the DEGs with a false discovery rate(FDR) < 0.05 and fold change ≥2 were selected. ‘Bottom-up’ approach: identification of outcome correlated genes in tumor and adjacent non-tumor tissues In this study, the ‘Bottom-up’ approach was mainly applied to develop HCC prognostic signatures. To obtain genes with potential prognostic value in tumor and adjacent non-tumor tissues, we evaluated all the genes in two aspects ([80]Fig. 1). Survival time related genes Survival-directed prognostic genes contains recurrence free survival(RFS) and overall survival(OS) related genes respectively. The correlation to RFS or OS of each gene is tested by both univariate Cox proportional hazards regression and supervised principal components analysis (SPCA) [[81]23,[82]24]. For univariate-Cox analysis, genes with a p-value < 0.05 was selected as Cox RFS or OS prognostic genes. In our work, the patient outcomes were used as response variables and the first principal component of SPCA were selected to determine most important score for each gene. Finally, top 2000 most important genes were selected as SPCA RFS or OS prognostic genes. At last, overlapping genes identified by both univariate-Cox and SPCA were considered as RFS or OS prognostic genes. Stage related genes HCC staging systems are commonly used to indicate outcomes. As noted in [83]Table 1, the three clinical staging systems TNM, BCLC, and CLIP are most significantly associated with RFS and OS. We may reason that genes significantly correlated to TNM, BCLC and CLIP also have prognostic values. Therefore, genes associated with the three clinical staging systems were selected. TNM and CLIP related genes were selected by logistic regression analysis and Kruskal-Wallis test. BCLC related genes were selected using logistic regression analysis and t-test. Genes with a p-value <0.05 were regarded as significant genes. According to the above analysis, five gene sets were identified in tumor tissues as well as adjacent non-tumor tissues, which are related to RFS, OS, TNM, BCLC or CLIP respectively. Finally genes identified in two of five types of prognostic gene sets were defined as prognostic genes([84]Fig. 1), which was termed as ProgGenes in tumors. Similarly, a set of prognostic genes termed ProgNGenes in adjacent non-tumors were also determined. Ranking of candidate gene sets by network prioritization Networks construction To identify the key genes involved in HCC from the aforementioned three larger sets of candidate genes separately, we constructed three protein-protein interaction (PPI) networks and performed topological analysis for each. Each gene set was firstly converted to be the seed proteins. Initial interactions among these proteins with a confidence score > 0.4 were obtained from STRING database (version 9.1) (Search Tool for the Retrieval of Interacting Genes/Proteins; [85]http://string.embl.de/). Then, to ensure these interactions truly exist in the gene expression profiles, only gene pairs with a p-value from Pearson Correlation Test less than 0.05 were retained. After that, an undirected network was finally constructed based on these gene pairs. Topological analysis of protein interaction networks In order to analyze these three networks and to search topologically important nodes, three fundamental measurements in network: degree, betweenness centrality(BC) and closeness centrality(CC) were calculated. Degree measures how many neighbors a node directly links to. A node with high degree centrality may have more influence over others. BC measures how often nodes occur on the shortest paths between other nodes[[86]25], and CC measures the average length from a node to all other nodes[[87]26]. In the PPI network, the nodes with high degree are defined as hub nodes, the nodes with high BC were defined as bottleneck nodes[[88]27], and the nodes with high CC are also important[[89]28], all the three types of nodes are key nodes in the network. In this study, the prioritization of candidate genes was based on the average ranking derived from the three parameters. For each gene in the network, a comprehensive rank score(RS) was calculated as follow: [MATH: RS=average( Rankdegree+RankBC+RankCC) :MATH] Where Rank[degree], Rank[BC] and Rank[CC] represent the rank of the gene in the network according to degree, BC and CC. We selected the top 5% nodes as the key nodes for later analysis. Network visualizations were generated by Cytoscape[[90]29] and network topological parameters were calculated using the Network Analyzer Cytoscape plugin[[91]30]. Pathway Enrichment Analysis By using the online software GSEA(gene set enrichment analysis, [92]http://www.broad.mit.edu/gsea/, v3.87), KEGG pathway enrichment analysis was carried out to find the main functional and metabolic pathways involved in HCC. The most significantly enriched genes (FDR<0.001) were mapped to the corresponding KEGG pathways. Integration of network prioritization and pathway enrichment In order to find vital genes not only playing key roles in networks but also possessing important biological functions in pathways, we integrated the network prioritization result and pathway enrichment result. P-value of hypergeometric probability distribution were calculated to examine whether genes in each significantly enriched pathway were also enriched in key genes selected from PPI network. Significant genes involved in pathways of the same main category were combined into a gene cluster. At last, three types of gene clusters were retrieved, which were representative for DEGs, ProgGenes and ProgNGenes. Construction of prognostic signatures Each gene cluster of DEGs, ProgGenes and ProgNGenes type was tested for its prognostic value separately. Both leave one out cross validation (LOOCV) combined with Cox proportional regression and unsupervised hierarchical clustering were used for patient classification. LOOCV involves using each sample in turn as the validation data for prediction, and the remaining samples as the training set to establish Cox proportional hazards model. The classification of the left-out sample was based on predicted value compared with the mean predicted value of the remaining samples. Finally all samples were divided into two classes, then we evaluate whether the two classes show significantly different overall survival[[93]31]. Unsupervised hierarchical clustering is a more popular classifier in many prognostic studies. Here Ward’s method was applied for clustering[[94]32]. Gene clusters that performed well in both LOOCV and hierarchical clustering method are retained to construct integrated prognostic signature for each type of DEGs, ProgGenes and ProgNGenes. Validation in the independent cohort For each type of DEGs, ProgGenes and ProgNGenes, a best integrated prognostic signature was constructed by combining gene clusters in the same category as a final prognostic signature. To evaluate the consistency and robustness of our signatures, we tested their prognostic power in an independent HCC microarray dataset([95]GSE10143) using LOOCV and hierarchical clustering. Further, we compared the performance of our signatures with other 21 signatures reported with prognostic value for either recurrence or survival of HCC from Molecular Signature Database (v4.0, [96]http://www.broadinstitute.org/gsea/msigdb). Survival Analysis We used log-rank test and Kaplan-Meier method to assess survival. Survival analyses were performed using the Cox proportional regression model. All statistical analyses were conducted using the open-source R software ([97]http://www.r-project.org). All P values were two sided. Drug repositioning analysis Genes identified by our methods play very important roles in HCC and approved drugs targeting these genes might also be effective for treating HCC. Information on drugs, drug targets and drug indications was obtained through a combination of Therapeutic Target Database(TTD, [98]http://xin.cz3.nus.edu.sg/group/ttd/ttd.asp) and DrugBank ([99]http://www.drugbank.ca/). Result Identification of top-down and bottom-up candidate prognostic gene sets in HCC To identify a robust molecular signature to predict outcome of HCC, we analyzed the genome-wide expression profiling of 225 HCC patients from three aspects (see flowchart in [100]Fig. 1). According to the top-down approach, we first compared gene expression profiles between tumor tissues and adjacent non-tumor tissues. A total of 1400 transcripts corresponding to 1079 genes were selected as DEGs. In bottom-up approach, 1312 genes were significantly related to outcome or prognostic staging systems within tumor tissues, which were termed as ProgGenes. Similarly, 872 prognostic genes obtained from adjacent non-tumor tissues were regarded as ProgNGenes. In the following analysis, each of the three candidate gene sets will be analyzed to identify a prognostic gene signature separately. Prioritization of candidate gene clusters by network topology and function analysis In order to excavate stable and important genes involved in the development of hepatocellular carcinoma, we combined network and pathway enrichment analysis. The three candidate gene sets were converted to seed proteins. Then functional linkages among these proteins were acquired from the STRING database. However, these associations were derived through wide diversity of experiments or prediction algorithms, and detected in various types of cell rather than in hepatic cells. Therefore, we next evaluated all the interactive pairs with our expression profiles. Only significantly related pairs were retained to construct subsequent molecular networks. At last, three PPI networks were constructed based on these linkages respectively, as shown in [101]S1 Fig. A-C. The interactions of DEGs network were most closely linked(density: 0.019, and average neighbors: 15.68). And the next is the ProgNGenes network, of which density and average neighbors is 0.009 and 4.96 separately. The density and average neighbors of ProgGenes network is 0.007, 6.35 respectively. To extract key genes from one big network effectively, we analyzed and processed the networks with measurements of degree, betweenness centrality(BC) and closeness centrality(CC). The average ranking of these three measurements for each node in the undirected networks was evaluated. Nodes at top 5% of the networks, which are more important than other less connected nodes, were selected as the core network gene sets ([102]S1 Table). KEGG pathway enrichment analyses were carried out for the three candidate gene sets separately. The DEGs were significantly enriched in 78 pathways, out of which 28 belong to metabolism related subcategories([103]S2 Table). Besides, the ProgGenes were significantly enriched in 86 pathways, and 39 out of which also belong to metabolism related subcategories([104]S3 Table). Relatively, ProgNGenes were enriched in 52 pathways, out of which only 4 related to metabolism, the other enriched pathways belong to 19 subcategories such as signal transduction and immune system([105]S4 Table). In order to find vital genes not only playing key roles in networks but also possessing important biological functions, we examined whether genes in each significantly enriched pathway were at the same time enriched in core genes derived from PPI network analysis. The significant genes are shown in [106]S5–[107]S7 Tables. It can be seen that many genes involved in different pathways belong to the same main category. To generate representative signatures for each type of DEGs, ProgGenes, and ProgNGenes, significant genes belonging to the same main category were combined into one functional gene cluster. Eventually, five gene clusters termed as D1-D5 which were most representative for DEGs, four gene clusters P1-P4 for ProgGenes and five clusters PN1-PN5 for ProgNGenes were obtained. In addition, because that immune microenviroment has been proven with value for predicting outcomes[[108]33] and many immune related pathways were identified in non-tumor tissues, we supposed that genes related to immune system to be important for HCC prognosis prediction. These genes were singled out as PN6. Moreover, as we all know, most cancers exhibit similar characteristics and share some common related genes such as EGFR, P53. We hypothesized that our final identified genes embedded in cancer related pathways may also be indicators for HCC outcome, and these genes in DEGs, ProgGenes and ProgNgenes were group as signature D6, P5 and PN7. Evaluation of prognosis performance for functional gene clusters and construction of final combined prognostic gene signatures For our ultimate goal of prognosis analysis for HCC patients, we first assessed the prognosis performance of all the functional gene clusters. We adopted supervised LOOCV and unsupervised hierarchical clustering methods to test whether each gene cluster could classify patients into two groups with significantly different overall survival(OS) rates (good and poor prognosis respectively). The corresponding p-values for each functional gene cluster are shown in [109]Table 2 and [110]S2 Fig. From the table we can see most of the gene clusters show significant prognostic relevance, clusters in non-tumor perform even better than clusters of DEGs and ProgGenes (from tumor). Among all the clusters identified in DEGs and ProgGenes, metabolism related cluster D1, replication and repair cluster D2, cell growth and death cluster D3, carbohydrate metabolism cluster P1, mTOR signaling cluster P2, cancer related cluster P5 were significant (p<0.05) or showed a tendency (p<0.1) for prognostic value for HCC, evaluated by both LOOCV and hierarchical clustering. In non-tumor tissues, one carbon pool by folate cluster PN1, signal transduction cluster PN2, Human Diseases cluster PN5, immune related cluster PN6 and cancer related cluster PN7 were tested to have prognosis value for HCC. These clusters with a small number of genes showed great prognostic potential for hepatocellular carcinoma. Table 2. Log rank p-values computed for all the gene clusters in training set. Signatures Size LOOCV Hclust D1(metabolism) 9 0.00356[111]^* 0.0308[112]^* D2(Replication and repair) 4 0.00444[113]^* 0.0357[114]^* D3(Cell growth and death) 5 0.0899[115]^# 0.0937[116]^# D4(Organismal Systems) 4 0.0232[117]^* 0.207 D5(Human Diseases) 6 0.119 0.662 D6(cancer related) 5 0.2602 0.5160 P1(Carbohydrate metabolism) 5 0.0181[118]^* 0.054[119]^# P2(Environmental Information Processing) 2 0.0468[120]^* 0.0854[121]^# P3(Cellular Processes) 4 0.00343[122]^* 0.938 P4(human deseases) 14 0.0221[123]^* 0.225 P5(cancer related) 7 0.0799[124]^# 0.0623[125]^# PN1(One carbon pool by folate) 1 0.00653[126]^* 0.00483[127]^* PN2(Signal transduction) 4 0.00424[128]^* 0.00593[129]^* PN3(Cell growth and death) 5 0.0433[130]^* 0.178 PN4(Organismal Systems) 9 0.0237[131]^* 0.108 PN5(Human Diseases) 8 0.00325[132]^* 0.00286[133]^* PN6(immune systems or disease) 8 0.00669[134]^* 0.0379[135]^* PN7(cancer related) 6 0.0201[136]^* 0.0459[137]^* [138]Open in a new tab *p < 0.05, represents the signature is significantly related to outcomes #0.05