Abstract
Background
Traditionally, a top-down method has been used to identify prognostic
features in cancer research: differentially expressed genes, usually
between cancer and normal tissue, are identified first and then tested
for survival prediction power. The problem is that prognostic features
identified from one set of patient samples can rarely be transferred to
other datasets. In this study we apply a bottom-up approach:
survival-correlated or clinical-stage-correlated genes are selected
first and then prioritized by their network topology, so that a small
set of features can be used as a prognostic signature.
Methods
Gene expression profiles of a cohort of 221 hepatocellular carcinoma
(HCC) patients were used as a training set. The ‘bottom-up’ approach
was applied to discover gene-expression signatures associated with
survival in both tumor and adjacent non-tumor tissues, and was compared
with the ‘top-down’ approach. The results were validated in a second
cohort of 82 patients, which was used as a testing set.
Results
Two gene signatures, identified separately in tumor and adjacent
non-tumor tissues by the bottom-up approach, were developed in the
training cohort. Both signatures were associated with overall survival
times of HCC patients, the robustness of each was validated in the
testing set, and each performed better than previously reported gene
expression signatures. Moreover, genes in these two prognostic
signatures gave some indications for drug repositioning in HCC: some
approved drugs that target these markers may have an alternative
indication in hepatocellular carcinoma.
Conclusion
Using the bottom-up approach, we have developed two prognostic gene
signatures, each with a limited number of genes, that are associated
with overall survival times of patients with HCC. Furthermore,
prognostic markers in these two signatures have the potential to be
therapeutic targets.
Introduction
Hepatocellular carcinoma (HCC) is the third leading cause of
cancer-related death in the world, particularly in Asia and Africa [1].
Surgical resection is one of the most important curative treatments for
HCC, but long-term survival remains poor because of the high recurrence
rate. Improvements in early diagnosis and accurate staging systems can
help guide patients toward optimal treatment strategies that may
suppress recurrence and prolong survival [2].
Currently, beyond the tumor-node-metastasis (TNM) staging system,
several prognostic algorithms for predicting survival among patients
with hepatocellular carcinoma have been established; the Barcelona
Clinic Liver Cancer (BCLC) and Cancer of the Liver Italian Program
(CLIP) systems are among the most commonly used worldwide [3–7].
Nonetheless, all these staging systems rely only on selected clinical
and serum biochemical indexes such as tumor size, vascular invasion,
alpha-fetoprotein (AFP), and albumin (ALB). Although these
clinicopathologic staging systems have proven useful, their predictive
accuracy remains limited, and they fail to capture the molecular
biological characteristics of HCC, which may be genetic and
heterogeneous. With recent advances in genome studies, gene expression
profiling has improved our understanding of cancer biology, and gene
expression signatures have been successfully used as prognostic tools,
especially in breast cancer [8–10].
Recently, two main strategies have been used for prognostic gene
signature identification, termed ‘top-down’ and ‘bottom-up’. In the
‘top-down’ approach, genes with different expression patterns between
case and control samples are sought first; next, genes whose expression
levels correlate significantly with histological grades or biological
phenotypes are selected as candidates; finally, a regression model with
the best predictors is built to construct the final gene signature.
Many studies have applied this approach and identified quite a few gene
signatures with prognostic value, such as MammaPrint [11].
Unfortunately, prognostic signatures identified by this approach in HCC
show minimal overlap, and few of them have been adopted in routine
clinical practice [12–14]. The ‘bottom-up’ approach starts from a
supervised analysis of genes that are directly associated with the
occurrence of the event studied (metastasis, survival). Genes are then
selected by machine learning algorithms or by enrichment analysis of
specific pathways or biological functions, and the prognostic value of
the resulting gene sets is calculated [15,16]. This approach has been
applied by some groups in several cancers, but few studies have used it
to identify HCC prognostic signatures. Once the candidate gene set has
been determined by either approach, various machine learning
algorithms, including regression models, can be used to identify the
final gene signature. However, overfitting and low accuracy in
independent cohorts have limited the clinical application of these
algorithms [17]. Meanwhile, analysis of networks and modular biological
processes has proven effective for identifying key genes that may
affect patient outcomes [18].
In addition to identifying prognostic signatures in tumor tissues, many
research groups have shown that genes in adjacent non-tumor tissues
appear to be implicated in tumor progression and aggressiveness, so
gene expression profiles in surrounding non-tumor tissues can help
identify signatures associated with outcomes [9]. This may be
especially true for HCC, where pathological changes such as cirrhosis
are often present in adjacent non-tumor tissue because of long-term
inflammation caused by hepatitis B or C virus (HBV or HCV) infection
[9,19].
Over the past years, large investments have been made to discover novel
markers that offer biological insight into the mechanisms of many
common diseases. Nonetheless, the translation from genetic findings to
clinical applications remains limited. Recently, some studies have
explored one potential application of these molecular markers: drug
repositioning [20,21]. We hypothesize that prognostic gene signatures
can be used not only to predict patient outcome but also to gain
important information for drug discovery, which can then be used to
provide appropriate therapeutic advice to patients.
In this study, we mainly applied the ‘bottom-up’ approach to develop
HCC prognostic signatures in both tumor and adjacent non-tumor tissues.
The workflow of our analysis is shown in Fig. 1. In the training
cohort, we first identified genes significantly associated with
clinical staging systems or patient survival times, then selected
representative and important markers by network and pathway enrichment
analyses, and finally developed two signatures that can predict overall
survival of HCC patients. The ‘top-down’ approach was also tried:
differentially expressed genes (DEGs) in tumor tissues compared with
adjacent non-tumor tissues were obtained first, then the same
bioinformatics methods were used to identify core genes, from which a
DEG-related signature was constructed. The three signatures were
further tested in an independent cohort. The results showed that the
signatures developed by the ‘bottom-up’ approach were more efficient
than the signature identified by the ‘top-down’ approach, and the two
‘bottom-up’ signatures were validated in the independent cohort. In
addition, we evaluated the potential application of our molecular
markers as drug targets by drug repositioning analysis.
Fig 1. The workflow of prognostic model construction.
First, prognostic genes were selected for the ProgGenes and ProgNGenes
sets (bottom-up approach). In parallel, differentially expressed genes
(DEGs) were selected by comparing the profiles of tumors and non-tumors
(top-down approach). Next, the three gene sets identified by the two
approaches went through both network and pathway analysis to obtain the
most important genes for prognostic signature assembly and model
construction. Finally, the signatures were validated in one independent
clinical data set.
Materials & Methods
Datasets: Genomic profiles and patient information
Publicly available datasets with whole-genome gene expression measures
in 225 tumors and 220 adjacent non-tumor tissues from 225 primary HCC
patients were downloaded from the Gene Expression Omnibus microarray
database (GEO: http://www.ncbi.nlm.nih.gov/projects/geo/) under
accession number GSE14520 [14]. The pre-processed series matrix files
originally provided by the authors were used in our analysis. The
accompanying clinical and follow-up data were also provided by the
authors. Patient and tumor features are detailed in Table 1. Among the
225 patients, four were excluded because survival time was missing. The
validation set, which included 80 tumors and 82 adjacent non-tumor
liver tissues from 82 primary HCC patients, was also retrieved from the
Gene Expression Omnibus under accession number GSE10143 [9]. For each
sample, the expression values of all probes for a given gene were
reduced to a single value by taking their average.
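As an illustration only, the probe-averaging step for a single sample could be sketched in Python as follows. The probe identifiers, gene symbols, expression values, and the function name are hypothetical and are not taken from the original analysis; probes without a gene annotation are assumed to be dropped.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_values, probe_to_gene):
    """Average the expression of all probes that map to the same gene."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # assumption: unannotated probes are skipped
            by_gene[gene].append(value)
    return {gene: mean(values) for gene, values in by_gene.items()}


# Hypothetical annotation and expression values for one sample
probe_to_gene = {"p1": "GENE_A", "p2": "GENE_A", "p3": "GENE_B"}
sample = {"p1": 8.0, "p2": 10.0, "p3": 5.5, "p4": 7.0}
gene_level = collapse_probes_to_genes(sample, probe_to_gene)
```

Here the two probes of GENE_A are averaged into one value, and the unannotated probe p4 does not contribute to any gene.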
Table 1. Clinical, histological, and molecular data of HCC patients in the training cohort.
Variables     Categories    Total (n = 221)  OS HR             OS p-value  RFS HR            RFS p-value
Age           ≤56           112 (51%)                          0.917                         0.256
              >56           109 (49%)        1.02 (0.67–1.56)              1.23 (0.86–1.76)
Gender        female        30 (14%)                           0.153                         0.019*
              male          191 (86%)        1.7 (0.82–3.52)               2.17 (1.13–4.14)
              NA            3 (1%)
AFP           ≤300 ng/ml    118 (53%)                          0.017*                        0.159
              >300 ng/ml    100 (45%)        1.68 (1.1–2.58)               1.25 (0.82–1.91)
ALT           ≤50 U/L       130 (59%)                          0.726                         0.23
              >50 U/L       91 (41%)         1.08 (0.7–1.66)               1.25 (0.87–1.78)
              NA            1 (0%)
Size          ≤5 cm         140 (63%)                          0.002*                        0.067
              >5 cm         80 (36%)         1.94 (1.26–2.99)              1.41 (0.98–2.04)
Multinodular  No            176 (80%)                          0.057                         0.428
              Yes           45 (20%)         1.59 (0.99–2.57)              1.19 (0.77–1.84)
Cirrhosis     No            18 (8%)                            0.032*                        0.062
              Yes           203 (92%)        4.62 (1.14–18.8)              2.18 (0.96–4.97)
              NA            2 (1%)
TNM           I             93 (42%)                           <0.001**                      <0.001**
              II            77 (35%)         2.08 (1.21–3.58)              1.97 (1.29–3.01)
              III           49 (22%)         5.05 (2.91–8.79)              3.14 (1.97–5.01)
              NA            2 (1%)
BCLC          0-A           168 (76%)                          <0.001**                      <0.001**
              B-C           51 (23%)         3.63 (2.33–5.65)              2.78 (1.88–4.11)
              NA            2 (1%)
CLIP          0             97 (44%)                           <0.001**                      0.0017**
              1             74 (33%)         1.49 (0.86–2.56)              2.21 (1.42–3.43)
              2–5           48 (22%)         3.75 (2.24–6.3)               1.25 (0.82–1.91)
In univariate analysis, the three staging systems (TNM, BCLC and CLIP)
were most significantly associated with overall survival and
recurrence-free survival. AFP, alpha-fetoprotein; ALT, alanine
aminotransferase; prognostic staging systems: TNM, Tumor Node
Metastasis; BCLC, Barcelona Clinic Liver Cancer; CLIP, Cancer of the
Liver Italian Program. OS, overall survival; RFS, recurrence-free
survival; HR, hazard ratio.
**P-value <0.01;
*P-value <0.05.
‘Top-down’ approach: identification of differentially expressed genes(DEGs)
As the first step of the ‘top-down’ approach, we identified the genes
that varied most between tumor and non-tumor tissues using Linear
Models for Microarray Data (limma) [22]. Only DEGs with a false
discovery rate (FDR) < 0.05 and a fold change ≥ 2 were selected.
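The filtering criteria above can be illustrated with a minimal Python sketch. It is not the limma workflow itself: it assumes that per-gene p-values and log2 fold changes are already available, that the FDR is obtained by the Benjamini-Hochberg procedure (the text does not name the adjustment method), and that a fold change ≥ 2 is applied in either direction. All gene names and numbers are hypothetical.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, p_values, log2_fold_changes,
                fdr_cutoff=0.05, fc_cutoff=2.0):
    """Keep genes with FDR < fdr_cutoff and |fold change| >= fc_cutoff."""
    fdr = benjamini_hochberg(p_values)
    log2_cutoff = math.log2(fc_cutoff)
    return [g for g, q, lfc in zip(genes, fdr, log2_fold_changes)
            if q < fdr_cutoff and abs(lfc) >= log2_cutoff]


# Hypothetical test results for four genes
genes = ["G1", "G2", "G3", "G4"]
p_values = [0.001, 0.01, 0.03, 0.5]
log2_fc = [2.0, -1.5, 0.4, 3.0]
degs = select_degs(genes, p_values, log2_fc)
```

In this toy input, G3 passes the FDR threshold but not the fold-change threshold, and G4 has a large fold change but is not significant, so only G1 and G2 are retained.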
‘Bottom-up’ approach: identification of outcome correlated genes in tumor and
adjacent non-tumor tissues
In this study, the ‘bottom-up’ approach was mainly applied to develop
HCC prognostic signatures. To obtain genes with potential prognostic
value in tumor and adjacent non-tumor tissues, we evaluated all genes
from two aspects (Fig. 1).
Survival time related genes
Survival-directed prognostic genes comprise recurrence-free survival
(RFS) related genes and overall survival (OS) related genes. The
correlation of each gene with RFS or OS was tested by both univariate
Cox proportional hazards regression and supervised principal components
analysis (SPCA) [23,24]. In the univariate Cox analysis, genes with a
p-value < 0.05 were selected as Cox RFS or OS prognostic genes. In the
SPCA, patient outcomes were used as response variables and the first
principal component was used to determine an importance score for each
gene; the 2000 genes with the highest importance scores were selected
as SPCA RFS or OS prognostic genes. Finally, genes identified by both
univariate Cox analysis and SPCA were considered RFS or OS prognostic
genes.
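The overlap rule described above could be sketched as follows, assuming the Cox p-values and SPCA importance scores have already been computed elsewhere. Ranking importance scores by absolute value is an assumption of this sketch, and the gene names, scores, and function name are hypothetical.

```python
def survival_prognostic_genes(cox_p_values, spca_scores,
                              p_cutoff=0.05, top_n=2000):
    """Intersect Cox-significant genes with the top-ranked SPCA genes."""
    cox_genes = {g for g, p in cox_p_values.items() if p < p_cutoff}
    # Assumption: importance is ranked by absolute score
    ranked = sorted(spca_scores, key=lambda g: abs(spca_scores[g]),
                    reverse=True)
    spca_genes = set(ranked[:top_n])
    return cox_genes & spca_genes


# Hypothetical inputs; top_n is reduced to 3 to suit the toy example
cox_p = {"A": 0.01, "B": 0.2, "C": 0.04, "D": 0.03}
spca = {"A": 0.9, "B": -0.8, "C": 0.1, "D": -0.5}
os_genes = survival_prognostic_genes(cox_p, spca, top_n=3)
```

Gene B ranks high in SPCA but is not Cox-significant, and gene C is Cox-significant but falls outside the SPCA top list, so only A and D remain.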
Stage related genes
HCC staging systems are commonly used to indicate outcomes. As noted in
Table 1, the three clinical staging systems TNM, BCLC, and CLIP were
most significantly associated with RFS and OS. We therefore reasoned
that genes significantly correlated with TNM, BCLC and CLIP also have
prognostic value, and selected genes associated with the three staging
systems. TNM- and CLIP-related genes were selected by logistic
regression analysis and the Kruskal-Wallis test. BCLC-related genes
were selected using logistic regression analysis and the t-test. Genes
with a p-value < 0.05 were regarded as significant.
Based on the above analyses, five gene sets, related to RFS, OS, TNM,
BCLC and CLIP respectively, were identified in tumor tissues and
likewise in adjacent non-tumor tissues. Finally, genes identified in
two of the five types of prognostic gene sets were defined as
prognostic genes (Fig. 1); this set was termed ProgGenes in tumors.
Similarly, a set of prognostic genes in adjacent non-tumor tissues,
termed ProgNGenes, was determined.
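A minimal sketch of this final selection step is given below. It reads "two of five" as "at least two of the five gene sets", which is an assumption, and the set contents and function name are hypothetical.

```python
from collections import Counter


def genes_in_multiple_sets(gene_sets, min_sets=2):
    """Return genes that appear in at least min_sets of the given sets."""
    counts = Counter(g for members in gene_sets.values()
                     for g in set(members))
    return {g for g, n in counts.items() if n >= min_sets}


# Hypothetical tumor-tissue gene sets for the five outcome measures
tumor_sets = {
    "RFS": {"A", "B"},
    "OS": {"A", "C"},
    "TNM": {"D"},
    "BCLC": {"C", "E"},
    "CLIP": {"A"},
}
prog_genes = genes_in_multiple_sets(tumor_sets)
```

The same function would be applied to the five non-tumor gene sets to obtain the ProgNGenes set.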
Ranking of candidate gene sets by network prioritization
Networks construction
To identify the key genes involved in HCC from each of the three larger
sets of candidate genes described above, we constructed three
protein-protein interaction (PPI) networks and performed a topological
analysis of each. Each gene set was first mapped to seed proteins.
Initial interactions among these proteins with a confidence score > 0.4
were obtained from the STRING database (Search Tool for the Retrieval
of Interacting Genes/Proteins, version 9.1; http://string.embl.de/).
Then, to ensure that these interactions were supported by the gene
expression profiles, only gene pairs with a Pearson correlation test
p-value below 0.05 were retained. Finally, an undirected network was
constructed from these gene pairs.
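The two-stage edge filter described above can be sketched in dependency-free Python. Two points are assumptions of this sketch rather than the original method: STRING confidence scores are taken to be on a 0 to 1 scale, and a permutation test on the Pearson coefficient stands in for the parametric Pearson correlation test. The edge list, expression values, and function names are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom else 0.0  # constant profiles give r = 0


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def build_network(string_edges, expression, min_score=0.4, alpha=0.05):
    """Keep edges with score > min_score and correlation p < alpha."""
    network = {}
    for gene_a, gene_b, score in string_edges:
        if score <= min_score:
            continue
        if gene_a not in expression or gene_b not in expression:
            continue
        p = permutation_p_value(expression[gene_a], expression[gene_b])
        if p < alpha:
            # Undirected network: store the edge in both directions
            network.setdefault(gene_a, set()).add(gene_b)
            network.setdefault(gene_b, set()).add(gene_a)
    return network


# Hypothetical expression profiles (8 samples) and candidate interactions
expression = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [2, 4, 6, 8, 10, 12, 14, 16],
    "C": [1, -1, -1, 1, 1, -1, -1, 1],
    "D": [8, 7, 6, 5, 4, 3, 2, 1],
}
edges = [("A", "B", 0.9), ("A", "C", 0.8), ("A", "D", 0.3)]
network = build_network(edges, expression)
```

In this toy input the A-D pair is removed by the confidence threshold and the A-C pair by the correlation filter, leaving a single A-B edge. With real data, a parametric test from a statistics library would be the closer match to the text.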
Topological analysis of protein interaction networks
In order to analyze these three networks and find topologically
important nodes, three fundamental network measures were calculated:
degree, betweenness centrality (BC) and closeness centrality (CC).
Degree measures how many neighbors a node is directly linked to; a node
with high degree may have more influence over others. BC measures how
often a node occurs on the shortest paths between other nodes [25], and
CC reflects the average shortest-path length from a node to all other
nodes, with shorter average distances giving higher closeness [26]. In
a PPI network, nodes with high degree are defined as hub nodes, nodes
with high BC are defined as bottleneck nodes [27], and nodes with high
CC are also important [28]; all three types are key nodes in the
network. In this study, the prioritization of candidate genes was based
on the average ranking derived from the three parameters. For each gene
in the network, a comprehensive rank score (RS) was calculated as
follows:
[MATH: