Abstract
Background
Liver hepatocellular carcinoma accounts for the overwhelming majority
of primary liver cancers and its belated diagnosis and poor prognosis
call for novel biomarkers to be discovered, which, in the era of big
data, innovative bioinformatics and computational techniques can prove
to be highly helpful in.
Methods
Big data aggregated from The Cancer Genome Atlas and Natural Language
Processing were integrated to generate differentially expressed genes.
Relevant signaling pathways of differentially expressed genes went
through Gene Ontology enrichment analysis, Kyoto Encyclopedia of Genes
and Genomes and Panther pathway enrichment analysis and protein-protein
interaction network. The pathway ranked high in the enrichment analysis
was further investigated, and selected genes with top priority were
evaluated and assessed in terms of their diagnostic and prognostic
values.
Results
A list of 389 genes was generated by overlapping genes from The Cancer
Genome Atlas and Natural Language Processing. Three pathways
demonstrated top priorities, and the one with specific associations
with cancers, ‘pathways in cancer,’ was analyzed with its four
highlighted genes, namely, BIRC5, E2F1, CCNE1, and CDKN2A, which were
validated using Oncomine. The detection pool composed of the four genes
presented satisfactory diagnostic power with an outstanding integrated
AUC of 0.990 (95% CI [0.982–0.998], P < 0.001, sensitivity: 96.0%,
specificity: 96.5%). BIRC5 (P = 0.021) and CCNE1 (P = 0.027) were
associated with poor prognosis, while CDKN2A (P = 0.066) and E2F1
(P = 0.088) demonstrated no statistically significant differences.
Discussion
The study illustrates liver hepatocellular carcinoma gene signatures,
related pathways and networks from the perspective of big data,
featuring the cancer-specific pathway with priority, ‘pathways in
cancer.’ The detection pool of the four highlighted genes, namely
BIRC5, E2F1, CCNE1 and CDKN2A, should be further investigated given its
high evidence level of diagnosis, whereas the prognostic powers of
BIRC5 and CCNE1 are equally attractive and worthy of attention.
Keywords: Hepatocellular carcinoma, TCGA, Pathways in cancer, Natural
language processing, Gene signatures, BIRC5, E2F1, CCNE1, CDKN2A,
Oncomine
Introduction
Liver cancer ranks the fourth amidst the commonest malignancies
globally with its highest incidence in East and South-East Asia
together with Northern and Western Africa, taking the second lead among
all the cancer-related deaths worldwide ([42]Torre et al., 2015).
According to the latest Global Cancer Statistics, a global estimation
of 782,500 new incidents of liver cancer and 745,500 deaths occurred
during the year 2012, with China alone taking up approximately half of
the number ([43]Torre et al., 2015). Liver hepatocellular carcinoma
(LIHC), commonly known as hepatocellular carcinoma (HCC), accounts for
70–85% of primary liver cancers. However, its prognosis remains
relatively poor with short overall survival due to elevated incidence,
belated diagnosis, constant drug resistance and frequent recurrence
([44]Bupathi, Kaseb & Janku, 2014; [45]Chun et al., 2011;
[46]Dhanasekaran, Limaye & Cabrera, 2012; [47]Li et al., 2015;
[48]Zhong et al., 2014; [49]Zhou et al., 2014). Thus, there is a
pressing need for novel diagnostic and prognostic biomarkers to be
discovered.
The scientific consensus has it that LIHC results from the long-term,
accumulative interactions amongst multiple environmental and genetic
factors, whose progression involves oncogene activation, tumor
suppressor gene inactivation, related gene mutation, and irreversible
cell damage ([50]Dai et al., 2015; [51]He et al., 2015; [52]Pan et al.,
2014; [53]Zhang et al., 2015a). Up to now, plentiful studies have
focused on the mutation and overexpression of abnormal genes which
might promote malignance, such as Cyclin D1(CCND1), epidermal growth
factor receptor (EGFR) and c-myc ([54]Wang et al., 2002; [55]Zender et
al., 2006), as well as the deletion or loss of tumor suppressor genes,
such as PTEN ([56]Buendia, 2000; [57]Wang et al., 2002).
Most recently, innovative bioinformatics and computational techniques
have been well applied in various oncological researches.
High-throughput technology, including microarray analysis and RNA
sequencing, steals the show, which has now enabled researchers to
obtain massive expression data sets, and proven itself to be
advantageous and serviceable for identifying novel tumor markers with
regard to cancer diagnosis and targeted treatment ([58]Cheng et al.,
2013; [59]Horie-Inoue & Inoue, 2013; [60]Xu et al., 2015; [61]Zhang et
al., 2015d). The Cancer Genome Atlas (TCGA), a public database begun in
2006, catalogues genetic information and covers 33 types of cancers,
which helps facilitate related studies on gene signatures and
tumorigenesis mechanisms. To date, only one single paper ([62]Ho, Kai &
Ng, 2015) was published with respect to using data from TCGA to
investigate gene signatures for LIHC ([63]Li et al., 2014;
[64]Lopez-Ayllon et al., 2015; [65]Lu et al., 2014; [66]Wang et al.,
2015a; [67]Wang et al., 2015b; [68]Zhang et al., 2015b; [69]Zhang et
al., 2015c). As presented in the previous study, [70]Ho, Kai & Ng
(2015) employed TCGA whole-transcriptome sequencing data to explore the
significantly dysregulated genes and signaling pathways in LIHC.
However, they only used data from 50 paired samples and no other
bioinformatics platforms were taken into consideration. Equally
attractive is the technology of natural language processing (NLP),
which concerns with the interactions between computers and natural
languages and has been considered auspicious in the field of
human–computer interaction. In the medical profession, researchers
regard NLP as one of the most potential and powerful tools to gather
sporadic laboratory and clinical data trapped in enormous electronic
records, thus accelerating the application of scientific advance to
clinical practice ([71]Yim et al., 2016).
In the study, at first, we combined data from TCGA with genes selected
through NLP to analyze the differentially expressed genes (DEGs) in
LIHC. Later on, relevant signaling pathways of DEGs were also
investigated using Gene Ontology (GO) enrichment analysis, Kyoto
encyclopedia of Genes and Genomes (KEGG) and Panther pathway enrichment
analysis and protein-protein interaction (PPI) network. Lastly, the
pathway ranked top in KEGG enrichment analysis was further
investigated, and selected genes with priority, which might serve as
potential biomarkers for the early diagnosis and prognosis prediction,
were evaluated and assessed in terms of their diagnostic and prognostic
values of several genes ([72]Fig. 1).
Figure 1. General flow chart.
Figure 1
[73]Open in a new tab
Differentially expressed genes (DEGs) were generated by integrating
data from The Cancer Genome Atlas (TCGA) and Natural Language
Processing (NLP). Relevant signaling pathways of DEGs went through
various analyses. The cancers-associated pathway ranked high in KEGG
enrichment analysis was further investigated, and selected genes with
top priority were evaluated and assessed in terms of their diagnostic
and prognostic values.
Materials and Methods
RNA-seq data from TCGA database and DEGs identification
The publicly available RNA-seq data of the mRNA level of Liver
hepatocellular carcinoma (LIHC) samples were downloaded directly from
the TCGA data portal
([74]https://tcga-data.nci.nih.gov/docs/publications/tcga/) via bulk
download mode (LIHC (cancer type), RNASeqV2 (data type), level 3 (data
level), All (preservation) and 1.12.0 (data version)) on January 2,
2017. The data were sequenced based on Illumina Genome Analyzer RNA
Sequencing platform. Gene expression data from RNA-Seq-V2 results were
quantified through RNA-Seq by Expectation-Maximization (RSEM)
([75]Cheng et al., 2015; [76]Guo et al., 2013) using the
“rsem.gene.normalized_results” file type. Extracted data were applied
with no further transformation, except by rounding off values to
integers. These downloaded data included a total of 369 LIHC samples
and 50 non-cancerous liver samples. The DEGs between the LIHC and the
non-cancerous liver control samples were identified by using DESeq
package ([77]Anders & Huber, 2010) in R. Gene expression comparison was
carried out by calculating the level of fold change (FC) in LIHC vs
non-cancerous liver tissue. The genes with a FC value > 2 or < 0.5 and
with P value < 0.05 analyzed with Student’s t-test were selected out as
DEGs in the current study.
Natural language processing
As previously reported ([78]Zhang et al., 2016a), the NLP procedure
consisted of searching through electronic records, managing data and
statistical analysis. At first, a comprehensive search was conducted in
PubMed in order to mine out all the related electronic records, with
publishing date ranging from January 1, 1980 to May 25, 2015. The
followings are the search strategy employed: (hepatocellular carcinoma)
AND (resistance OR prognosis OR metastasis OR recurrence OR survival OR
carcinogenesis OR sorafenib OR bevacizumab) and (“1980/01/01” (PDAT):
“2015/05/25” (PDAT)). Next, all the pertinent molecules, mostly
proteins and genes, were dug out and a list of them was generated
afterwards. Gene mention tagging was managed with A Biomedical Named
Entity Recognizer (ABNER)
([79]http://pages.cs.wisc.edu/ bsettles/abner/). ABNER also assisted us
in conjunction resolution ([80]Settles, 2005). Gene name normalization
conforms to the standard names in Entrez Database set forth by NCBI
([81]Morgan et al., 2008; [82]Smith et al., 2008). At last comes the
statistical analysis. Each gene’s frequency of occurrences was
calculated individually. The higher a certain gene’s frequency appeared
to be, the greater chance the gene harbored to be related to LIHC. N
was defined as the total number of records in PubMed. Meanwhile, m and
n represented the occurrence frequencies of genes and LIHC in the
PubMed respectively. K was determined as the frequency of spontaneous
co-occurrence of the specific gene and LIHC under actual circumstances.
With hypergeometric distribution, we were able to calculate and output
the probability of occurrence frequency of co-citation greater than k
under total randomness. The formulae used were as follows:
[MATH: p=1−∑i=0
k−1pi|n,m,<
mi>N :MATH]
[MATH: pi|n,m,<
mi>N=n!N−n!
m!N−m!n−i!i!n−m!N−n−m+
i!N!
mfrac> :MATH]
Protein–protein interactions network construction
Overlaps of DEGs identified from TCGA and genes obtained from NLP were
calculated using the COUNTIF function in Microsoft Excel 2013. STRING
([83]http://string-db.org/), Search Tool for the Retrieval of
Interacting Genes/Proteins, is a bioinformatics platform and web
resource of known and predicted PPI network. The latest version STRING
10.0 was employed to construct PPI network of proteins encoded by the
extracted DEGs with the cut-off criterion of combined score >0.7. The
PPI data were downloaded from STRING database and a map of the complete
PPI network was drawn. The interactions of proteins derived from four
sources, i.e., (1) literature-reported protein interactions, (2)
high-throughput experiments, (3) genome analysis and prediction and (4)
co-expression studies. Next, the hub genes, namely highly connected
genes in the network, were extracted from PPI data using the R Project
for Statistical Computing ([84]https://www.r-project.org/), an open
access software environment for statistical computation and graphics.
Moreover, STRING was in charge of both the screening of top hub genes
and the visualization of network. The protein product of a certain gene
acts as a node in the PPI network, and the connectivity degree evinces
the number of interplayed proteins. A node with high connectivity
degrees is considered as a hub node. Hub proteins were obtained by
means of analyzing the connectivity degrees of the nodes in PPI
networks.
Functional and signaling pathway analyses
A set of condition-specific genes from the overlaps of TCGA and NLP
further underwent the functional and signaling pathway analyses. The
functional and signaling pathway analyses of the selected DEGs was
performed on a public database platform, the Database for Annotation,
Visualization and Integrated Discovery (DAVID)
([85]https://david.ncifcrf.gov/), which could provide a functional
interpretation of massive gene lists deriving from genomic studies. The
analyses included gene ontology (GO) function analysis
([86]http://www.geneontology.org/), Kyoto Encyclopedia of Genes,
Genomes (KEGG) ([87]http://www.genome.jp/kegg/) analysis and panther
pathway analysis. GO function analysis categorized selected genes into
groups in accordance with three independent classification standards,
namely molecular function (MF), cellular component (CC), and biological
process (BP). A Benjamini P-value of <0.05 was used in the above
pathway enrichment analyses. The results were visualized into three
GO-maps via Cytoscape v3.4.0 ([88]http://cytoscape.org/). Subsequently,
the highly relevant pathway with the top priority was selected for
further evaluation, and the most significantly aberrantly expressed
genes were examined for their prospective diagnostic and prognostic
values. The aberrantly expressed genes selected were bioinformatically
validated using Oncomine Research Edition ([89]Rhodes et al., 2004)
([90]https://www.oncomine.org/).
Statistical analysis
SPSS 22.0 was used for statistical analysis. All the data were
presented as mean ± standard deviation of mean. The receiver operating
characteristic (ROC) curve was drawn to identify the diagnostic
significance of genes separately. The logistic regression contributed
to evaluating the integrated diagnostic value of the combined pool of
the top four genes in LIHC. The standards for assessing the area under
the curve (AUC) in ROC were as follows: 0.5–0.7 (poor evidence for
diagnosis); 0.7–0.9 (moderate evidence for diagnosis); 0.9–1.0 (high
evidence for diagnosis). Significance of difference between LIHC and
para-LIHC non-cancerous liver tissues was analyzed by Student’s t-test.
The scatter plot to exhibit the level of gene expression was
demonstrated by GraphPad Prism 5.0. Overall survival (OS) was estimated
by the Kaplan–Meier method, and the log-rank test was conducted to
compare survival curves. Kaplan–Meier survival curve was drawn to
evaluate the associations between DEGs expressions and survival rates
of 369 patients with LIHC. P value less than 0.05 indicated statistical
significance.
Results
Natural language processing
A series of 64,577 LIHC-related electronic records of titles and
abstracts altogether was retrieved from PubMed. Later on, statistical
analysis highlighted a panel of 1800 LIHC-related genes ([91]Zhang et
al., 2016a).
Overlaps between DEGs from TCGA and LIHC-related genes from NLP
A total of 3278 DEGs were achieved from TCGA dataset between LIHC
patients and non-LIHC liver controls in accordance with the criteria
described in the ‘Materials and Methods’ section. Meanwhile, 1800
LIHC-related genes from NLP procedure proceeded ([92]Zhang et al.,
2016a). The integration witnessed a set of 389 genes by overlapping
DEGs from TCGA and LIHC-related genes from NLP.
PPI network analysis
The PPI network was constructed with STRING in an attempt to
systemically analyze the functions of DEGs. The map of complete PPI
network has been made available in [93]Fig. S1. And a core sub-network
of 22 genes was selected for further analyses, whose connectivity
degrees were more than 20 as presented in [94]Fig. 2.
Figure 2. Interactions of the DEGs in hepatocellular carcinoma (HCC).
[95]Figure 2
[96]Open in a new tab
Interactions of 22 hub genes were illustrated by STRING online database
([97]http://string-db.org) with the cut-off criterion of combined score
>0.7. Network nodes represented proteins and edges stood for
protein-protein associations.
Functional and signaling pathway analyses
GO analysis classified 389 DEGs into three GO categories, namely 353
DEGs along with 613 pathways in BP, 304 DEGs together with 67 pathways
in CC and 333 DEGS with 93 pathways in MF. In BP, the top three
processes which DEGs actively participated in were response to organic
substance (GO: 0010033, P = 1.93 × 10^−22), regulation of
phosphorylation (GO: 0042325, P = 2.34 × 10^−19) and regulation of
phosphorus metabolic process (GO: 0051174, P = 1.36 × 10^−18).
Meanwhile, in CC, extracellular region part (GO: 0044421,
P = 1.81 × 10^−14), extracellular space (GO: 0005615,
P = 7.67 × 10^−13) and spindle (GO: 0005819, P = 4.67 × 10^−12) were
considered the top three. In MF, the top three functional items were
cadmium ion binding (GO: 0046870, P = 7.81 × 10^−5), platelet-derived
growth factor binding (GO: 0048407, P = 1.20 × 10^−4) and kinase
binding (GO: 0019900, P = 2.05 × 10^−4). In addition, the gene
expression profile graph from the Global Cancer Map was also processed
in Gene Set Enrichment Analysis (GSEA) and the Molecular Signatures
Database (MSigDB,
[98]http://software.broadinstitute.org/gsea/datasets.jsp). GSEA is a
computational solution that decides whether an a priori defined set of
genes demonstrates statistically significant, concordant differences
between two biological environments, while MSigDB is a collected
affiliation of annotated gene sets for GSEA software. Three visualized
GO-maps (BP, CC, MF) by Cytoscape v3.4.0 and GSEA were rendered in,
[99]Figs. 3–[100]5, and [101]Fig. S2. We identified the expression of
significantly altered genes among the DEGs, as well as their relative
tendencies of expression in the defined functional groups ([102]Fig.
6).
Figure 3. Directed acyclic graph (DAG) of pathways from the perspective of
biological process (BP) in gene ontology (GO) analysis.
[103]Figure 3
[104]Open in a new tab
The circles represented different terms of biological processes. The
relationships among terms were represented by arrows. A false discovery
rate (FDR) of 10^−13 was selected for the current DAG with 44 nodes and
55 edges included. The darker the color appeared, the greater
significance the term demonstrated.
Figure 5. DAG of pathways from the perspective of molecular function (MF) in
GO analysis.
[105]Figure 5
[106]Open in a new tab
The circles represented different terms of molecular functions. The
relationships among terms were represented by arrows. A false discovery
rate (FDR) of 10^−3 was selected for the current DAG, which held 32
nodes and 35 edges. The color depth indicated the significance of the
corresponding term.
Figure 6. Gene Set Enrichment Analysis (GSEA) for DEGs in HCC.
[107]Figure 6
[108]Open in a new tab
The 389 DEGs were displayed in form of a heat map achieved by GSEA
(Human tissue compendium, Novartis) and MSigDB.
Figure 4. DAG of pathways from the perspective of cellular component (CC) in
GO analysis.
[109]Figure 4
[110]Open in a new tab
The circles represented different terms of cellular components. The
relationships among terms were represented by arrows. A false discovery
rate (FDR) of 10^−6 was selected for the current DAG, which harbored 29
nodes and 45 edges. The color depth indicated the significance of the
corresponding term.
KEGG pathway enrichment analysis demonstrated the significant
enrichments of DEGs in 10 items ([111]Fig. 5). The three prominent
pathways with top significance were cell cycle (hsa04110,
P = 2.58 × 10^−9), pathways in cancer (hsa05200, P = 5.40 × 10^−6) and
Toll-like receptor signaling pathway (hsa 04620, P = 6.31 × 10^−5).
Panther pathway enrichment analysis identified three pathways with top
significance, i.e., p53 pathway ([112]P00059, P = 0.001), Toll receptor
signaling pathway ([113]P00054, P = 0.005) and blood coagulation
([114]P00011, P = 0.007).
Compared with other databases, KEGG is an ontology database
illustrating functional hierarchic domains of various biological
entities, including molecules, cells, organisms, drugs and diseases,
along with relationships among them. Thus, we decided to select the
cancers-specific pathway with prominent significance from KEGG
analysis, ‘pathways in cancer’ with 30 genes included, to further
analyze the diagnostic values of the target genes. The positions of 30
genes in “pathways in cancer” were shown in [115]Fig. 7.
Figure 7. Top 15 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of
DEGs in HCC.
[116]Figure 7
[117]Open in a new tab
The KEGG enrichment analysis was performed with 389 DEGs by ggplot2
package of R. Pathways in cancer was the most significant pathway. The
color tints indicated the P-values. The size of the circle presented
the number of selected genes in the pathway. Thirty genes were found to
be enriched in pathways in cancer. Rich factor expressed the percentage
of the ratio of DEGs in current study vs total genes in the pathway.
Diagnostic values of highlighted DEGs
Thirty significant aberrant genes involved in the highlighted pathway,
pathways in cancer, which is considered the most cancer-related and
ranked top in KEGG analysis, were assessed for diagnostic values by
means of ROC curves. There were 12 genes with AUCs ranging from
0.9–1.0, 15 genes demonstrating AUCs between 0.7–0.9, and three genes
presenting AUCs from 0.5 to 0.7. Delightfully, 12 genes harbored AUCs
exceeding 0.9. The AUCs of the top four genes, i.e., BIRC5, E2F1, CCNE1
and CDKN2A, were 0.974 (95% CI [0.960–0.989], P < 0.001), 0.964 (95% CI
[0.942–0.986], P < 0.001), 0.945(95% CI [0.919–0.971], P < 0.001) and
0,939 (95% CI [0.915–0.963], P < 0.001), respectively. Four of them
have been all marked red in [118]Fig. 8. Genes with AUCs > 0.8 were
displayed in [119]Fig. 9.
Figure 8. Location map of 30 genes from the current study in Pathways in
Cancer.
[120]Figure 8
[121]Open in a new tab
The figure was modified according to Pathway mapping tools
([122]http://www.genome.jp/kegg/mapper.html). Thirty genes from the
current study were colored in green. Four significant genes (BIRC5,
E2F1, CCNE1 and CDKN2A) providing the highest Area Under Curves (AUCs)
were put in red, which located in different parts of the pathway.
Figure 9. ROC curves of DEGs in HCC.
[123]Figure 9
[124]Open in a new tab
There were 22 DEGs with AUCs greater than 0.8. (A–C): ROC curves of the
12 upregulated genes in HCC; (D) and (E): ROC curves of the 10
downregulated genes in HCC; (F): ROC curve of the detection pool of the
four top-listing genes in HCC.
For the purpose of validation, BIRC5, ECF1, CCNE1 and CDKN2A were
searched in Oncomine Research Edition ([125]Rhodes et al., 2004)
([126]https://www.oncomine.org) filtered with the Cancer Type of Liver
Cancer. Oncomine is an online platform gathering cancer expression
datasets and providing data mining services, which currently includes
715 datasets and 86,733 samples as of Jan 15, 2017 ([127]Rhodes et al.,
2004). All these four genes were included in different datasets. Highly
and reasonably consistent with the results from TCGA, all BIRC5
([128]Figs. S3A–[129]S3C), ECF1 ([130]Fig. S3D), CCNE1 ([131]Fig. S3E)
and CDKN2A ([132]Figs. S3F and [133]S3G) showed evidently
over-expressed pattern in LIHC tissues.
Gene expressions between LIHC and para-LIHC non-cancerous tissues were
also compared. The expressions of top four genes in LIHC tissues were
higher than those in the corresponding adjacent non-cancerous tissues,
and the results were as follows: 332.95 ± 530.89 vs 11.51 ± 11.07
(t = 11.612, P < 0.001), 616.43 ± 765.05 vs 41.45 ± 48.14
(t = 14.230, P < 0.001), 239.49 ± 1030.09 vs 10.09 ± 8.23
(t = 4.277, P < 0.001) and 427.99 ± 529.58 vs 21.66 ± 23.55
(t = 14.632, P < 0.001), respectively ([134]Fig. 10). Moreover, the
four genes with the highest AUCs were combined as a pool to distinguish
LIHC tissues from the non-tumor ones. The integrated AUC reached a
satisfying point of 0.990 (95% CI [0.982–0.998], P < 0.001), which
indicated its remarkable significance for LIHC diagnosis. The
sensitivity and specificity of the pool were 96.0% and 96.5%,
respectively.
Figure 10. Expression levels of the top four DEGs in HCC from TCGA data.
[135]Figure 10
[136]Open in a new tab
The figures illustrated the expression levels of the top four DEGs in
HCC tissues from TCGA data as compared to para-HCC non-cancerous
tissues. (A): BIRC5; (B): E2F1; (C): CCNE1 and (D): CDKN2A.
Prognostic values of highlighted DEGs
In order to evaluate the prognostic values of highlighted DEGs, we
further investigated the associations of the top four DEGs with overall
survival (OS) of patients by Kaplan–Meier and log-rank analysis. Among
all the 369 patients involved, 368 had complete followed-up data.
Determined by ROC analyses, the cut-off values for the highlighted
genes were 41.76 for BIRC5, 17.85 for CCNE1, 81.90 for CDKN2A, and
101.37 for E2F1, respectively. Patients with lower level of BIRC5
expression (n = 59) expected longer survival time (67.10 ± 6.30, 95% CI
[54.74–79.44]) than those with higher level of BIRC5 expression
(n = 309) (60.67 ± 4.04, 95% CI [52.75–68.58], P = 0.021). Patients
with lower CCNE1 expression (n = 47) were inclined to demonstrate
longer survival (73.34 ± 7.40, 95% CI [58.83–87.84]) than patients with
higher CCNE1 expression (n = 321) (59.91 ± 4.03, 95% CI [52.01–67.80],
P = 0.027). However, no statistically significant differences between
low and high expression groups of CDKN2A (P = 0.066) and E2F1
(P = 0.088) were observed in terms of survival ([137]Figs.
11A–[138]11D).
Figure 11. Associations of the top four DEGs with overall survival (OS) of
369 patients from TCGA data.
[139]Figure 11
[140]Open in a new tab
Kaplan–Meier survival analysis of OS based on expression status
provided the associations of the top four DEGs with overall survival
(OS) of 369 patients from TCGA data (A): BIRC5 (^∗P = 0.021, log-rank);
(B): CCNE1 (^∗P = 0.027); (C): CDKN2A (P = 0.066); (D): E2F1
(P = 0.088).
Discussion
Genes and proteins usually function via complex networks and have the
ability to affect the operations of biological systems collaboratively.
Furthermore, the impact of multi-gene interactions on cellular
functions through signaling pathways has been proved to be more
influential than that of a single gene. Up to now, the pathogenesis of
LIHC has been widely investigated and the consensus emerged that
multi-gene interactions contribute to the carcinogenesis and
progression of LIHC ([141]Li et al., 2014; [142]Lopez-Ayllon et al.,
2015; [143]Lu et al., 2014; [144]Wang et al., 2015a; [145]Wang et al.,
2015b; [146]Zhang et al., 2015b; [147]Zhang et al., 2015c). In the
present study, multi-gene analyses with various bioinformatics
techniques and platforms, such as NLP and TCGA, were performed to
identify the aberrantly expressed genes in LIHC. The condition-specific
genes were obtained from the overlapping of TCGA and NLP, which further
proceeded with functional and pathways enrichment analysis so as to
render prospective biomarkers for LIHC diagnosis and survival
prediction and to elucidate the potential mechanism of the DEGs in
LIHC.
Antecedently, articles published by [148]Jin et al. (2015a), [149]Ho,
Kai & Ng (2015) and [150]Shangguan, Tan & Zhang (2015) investigated the
deregulated genes and related signaling pathways in LIHC by
bioinformatics techniques. As compared with the previous studies, the
current one is the first to analyze the DEGs in LIHC by the winning
combination of TCGA database and NLP analysis. It should be highlighted
that the sample cohort in this present analysis consisted of 369 LIHC
and 50 para-LIHC non-cancerous liver tissues, outweighing any formerly
reported studies, that is, 117 tissue samples (54 LIHC cases and 63
controls) in the study of [151]Jin et al. (2015a), 50 pairs of human
LIHC tissues reported by [152]Ho, Kai & Ng (2015) and 286 LIHC and 39
non-cancerous liver tissues explored by [153]Shangguan, Tan & Zhang
(2015). More importantly, the current study evaluated the diagnostic
and prognostic values of the featured hub genes in LIHC by means of ROC
curves and survival analysis. Given the above, the study might provide
more profound insights than preceding researches which ignored the
application in diagnosis and prognosis but solely focused on the hub
genes, dysregulated pathways and relevant molecular mechanisms.
Novel signaling pathways analysis enables us to learn more about the
pathogenesis, biological processes and the key pathogenic genes of
cancers. For instance, [154]Ho, Kai & Ng (2015) employed TCGA
whole-transcriptome sequencing data and discovered the significantly
enriched KEGG pathways of cell cycle and p53 signaling, which matches
our results in the present one with DEGs from TCGA and supports the
credibility and reliability of the current research. Other studies
employed Gene Expression Omnibus (GEO) data to perform KEGG pathways
enrichment analysis but their results did not coincide with ours
([155]Jin et al., 2015a; [156]Li, Huang & Wei, 2016). Such findings
might be related to the methods and sample sizes adopted. We not only
used LIHC data from TCGA which harbors a large number of samples, but
also combined them with the genes generated by NLP, both of which make
our research stand out among all the others.
The KEGG Pathway Database is believed to be one of the most
comprehensive knowledge bases to understand organisms as molecular
systems in the development of tumor-genesis, proliferation, metastasis
and apoptosis. Among the signaling pathways presented, pathways in
cancer (P = 5.40E − 06, FDR = 0.06) ranked top among the list of KEGG
pathways released by DAVID and is considered specifically associated
with cancers, which implied that pathways in cancer might be actively
involved in the oncogenesis and development of LIHC and suggested that
its target genes could be potential markers for diagnosis and
prognosis.
Alpha-fetoprotein (AFP) is considered a vital and mature biomarker in
the diagnosis of liver cancer, especially in the case of LIHC. However,
the sensitivity of AFP is subject to the size and stage of tumor, which
makes combined detection of multiple biomarkers relatively superior and
more convincing. Thus, we decided to concentrate on the featured
pathway, i.e., pathways in cancer, as well as its related genes.
Of the 389 DEGs identified in the RNA-seq profiles of TCGA LIHC
database, 30 genes were found to be the most enriched in pathways in
cancer, which indicated an increased possibility of genes functioning
in the factual development and progression of LIHC. Their diagnostic
values were evaluated for by ROC curves, where 12 genes harbored high
diagnostic value by AUCs exceeding 0.9 and another 10 genes
demonstrated moderate diagnostic importance with AUCs ranging from 0.8
to 0.9 ([157]Fig. 9). Four special genes with the highest AUCs, namely
BIRC5, E2F1, CCNE1 and CDKN2A, were validated to be over-expressed in
LIHC tissues using Oncomine ([158]Rhodes et al., 2004). The Oncomine
validation not only boosts the feasibility of using the four-gene pool
as a potential diagnostic strategy but also prove the powerful validity
of the bioinformatics methods used. We consider that the bioinformatics
validation adds much to the significance and credibility of the study.
With the four genes validated, we became curious about the potential to
diagnose LIHC with the assistance of a pool formed by the four genes.
The four-gene pool excelled with an outstanding integrated AUC of 0.990
(95% CI [0.982–0.998], P < 0.001, sensitivity: 96.0%, specificity:
96.5%), denoting its appreciable diagnostic significance for LIHC.
Meanwhile, the results of ROC curves furnished us with more accurate
information to distinguish the survival differences between high and
low expression groups of certain genes. Overall, patients with low
BIRC5 expression (n = 59) showed longer survival time (67.10 ± 6.30,
95% CI [54.74–79.44]) than those with high BIRC5 expression
(n = 309)(60.67 ± 4.04, 95% CI [52.75–68.58], P = 0.021). Similar
results could be witnessed in the case of CCNE1, where patients with
low CCNE1 expression (n = 47) demonstrated longer survival
(73.34 ± 7.40, 95% CI [58.83–87.84]) than patients with high CCNE1
expression (n = 321) (59.91 ± 4.03, 95% CI [52.01–67.80], P = 0.027).
Nonetheless, CDKN2A (P = 0.066) or E2F1 (P = 0.088) displayed no
statistically significant differences between low and high expression
groups.
It is strikingly noteworthy that the four-gene detention pool has been
proved to be auspicious for LIHC diagnosis, which consisted of BIRC5,
E2F1, CCNE1 and CDKN2A. In the meanwhile, the roles that BIRC5 and
CCNE1 might play in the prognosis of LIHC should not be ignored.
Earlier studies regarding the four genes might help account for the
possible molecular mechanisms.
BIRC5 (Baculoviral IAP repeat containing 5), located in 17q25, belongs
to the inhibitor of apoptosis (IAP) gene family, which encodes
apoptosis-preventing proteins via negative regulation. Biologically,
BIRC5 plays a crucial role in malignancy through inhibiting cell
apoptosis, enhancing cellular proliferation and promoting angiogenesis
([159]Ryan, O’Donovan & Duffy, 2009). [160]Yang et al. (2011) found
that the expression of BIRC5 was increased in LIHC tissues as compared
to those adjacent non-tumor samples and non-cancerous liver tissues.
Several articles demonstrated that BIRC5 expression correlated with
unfavorable clinicopathological characteristics and predicted poor
survivals in patients with LIHC ([161]Augello et al., 2009; [162]Liu et
al., 2013; [163]Ye et al., 2007). Recently, the investigation of the
role of non-coding RNAs in cancer has attracted increasing attention.
[164]Chen et al. (2015) illustrated that miR-153 enhanced BIRC5
expression in LIHC cell lines and [165]Wei et al. (2013) verified by
Western blot assay that miR-203 could target survivin to promote the
progression of LIHC. Moreover, hepatitis B virus X protein combined
with BIRC5 could accelerate carcinogenesis of LIHC by modulating
miR-520b and hepatitis B X-interacting protein ([166]Baginski, 1969).
Nonetheless, the correlations between long noncoding RNAs (lncRNAs) and
BIRC5 were investigated only in lung cancer and prostatic cancer
([167]Bialkowska-Hobrzanska et al., 2006; [168]Xia et al., 2015). Thus,
it is necessary to clarify the relationships between lncRNAs and BIRC5
and their functions in the pathogenesis of LIHC.
E2F transcription factor 1, also known as E2F1, has been widely studied
in the generation and progression of several cancers. In liver cancer,
[169]Lu et al. (2016) examined the expression of E2F1 in 143 LIHC
samples using tissue arrays and observed that strong positive staining
of E2F1 took place in 84.62% of LIHC tissues through
immunohistochemical (IHC) staining. Moreover, [170]Wang et al. (2016)
reported that E2F1 is an important downstream gene of ISX in hepatoma
progression. Recently, studies of E2F1 have uncovered its critical
roles in the control of transcription, proliferation and apoptosis
([171]DeGregori & Johnson, 2006; [172]Muller et al., 2001;
[173]O’Donnell et al., 2005), and accumulating evidence showed that
E2F1, a fundamental biological regulator with the ability to activate
transcription in downstream genes, was also involved in the regulation
of other molecules, such as miRNAs and lncRNAs. For instance,
[174]O’Donnell et al. (2005) elaborated that E2F1 transcription was
simultaneously activated by c-Myc, yet negatively regulated by
miR-17-5p and miR-20a. [175]Petrocca et al. (2008) unveiled the
establishment of a negative feedback loop by the interaction of E2F1
and miR-106b-25 cluster, which might influence the development of TGF β
resistance in gastric cancer. [176]Zhang et al. (2016b) demonstrated
that E2F1-mediated overexpression of LINC00668 promotes or facilitates
cell proliferation and cell cycle in gastric cancer. The functions of
E2F1 as a target for other molecules have also been extensively
reported, such as miR-106b in bladder cancer ([177]Jin et al., 2015b),
miR-136 in glioma cells ([178]Chen et al., 2014), miR-331-3p in gastric
cancer ([179]Guo et al., 2010) and so forth.
Cyclin E1 (CCNE1), a cardinal regulator of the G1/S cell-cycle
transition and member of cyclins family, was upregulated in LIHC
([180]Peng, Chou & Hsu, 1998). The increased expression of CCNE1 might
shorten the tumor cell cycle phase, speech up cell proliferation, and
be closely involved in LIHC aggressiveness according to the published
literatures. For example, [181]Zhou, He & Liang (2003) established that
CCNE1 plays a cooperative role in LIHC tumorigenesis, differentiation,
invasiveness and metastasis. [182]Li et al. (2003) used RNA
interference to target CCNE1-overexpressing in LIHC, and found that
CCNE1, overexpressed in 70% of LIHCs, might serve as a novel
therapeutic target. CCNE1, as a spotlighted target gene of micro-RNA,
has already been found to be actively involved in LIHC ([183]Zhang et
al., 2014). All the above indicate that CCNE1 is a crucial gene
involved in the development and progression of LIHC and has the
potential to be an effective biomarker for diagnosis and prognosis of
LIHC.
CDKN2A (p16INK4a), located on chromosome 9p21, is an important tumor
suppressor and DNA repair gene and functions as a major negative
regulator of critical tumor pathways ([184]Song et al., 2013). CDKN2A
encodes p16 protein (p16) that competitively binds to cyclin-dependent
kinase 4 protein (Cdk4), and therefore inhibits the interaction of Cdk4
and cyclin D1 ([185]Sherr, 1996). [186]Liggett & Sidransky (1998)
believed that the p16/cyclin D/CDK/pRB pathway is prominent in many
epithelial malignancies. Inactivation of p16INK4a induced by aberrant
hypermethylation is implicated in the process of carcinogenesis of most
common human cancers and associated with poor prognosis
([187]Sharpless, 2005). It has been reported that the methylation of
CDKN2A gene is connected to the incidence of LIHC ([188]Biden et al.,
1997). Also, the present study indicated that CDKN2A expression in LIHC
tissues was significantly increased as compared to para-LIHC non-tumor
tissues. Furthermore, [189]Lan et al. (2011) discovered that hsalet-7g
might exercise its inhibitory ability via up-regulating p16INK4A and
play a considerably significant part in inhibiting LIHC tumorigenesis
and progression.
Despite the demonstration of multiple significantly aberrant genes in
LIHC, there still existed several limitations in the present study.
First all of, the results were based solely on the sequencing data in
TCGA database without further in vivo and in vitro confirmation.
Moreover, the solely used RNA-seq technology in TGCA confined the
legitimacy of the study, which requires further validation. Lastly, the
samples of LIHC were mainly collected in the western world, which might
fail to represent the worldwide picture of gene signatures in LIHC.
Further investigations of DEGs in LIHC concentrating on Asian and
African countries might help mend the issue.
In summary, the study investigated LIHC gene expression signatures,
related pathways and networks by overlapping data from TCGA and NLP.
One particular pathway, ‘pathways in cancer,’ stood out due to its
massive importance. Furthermore, four key genes highly involved in
‘pathways in cancer,’ BIRC5, E2F1, CCNE1 and CDKN2A, were
bioinformatically validated and selected out for further diagnostic and
prognostic tests. The detection pool formed by the four genes rendered
ideal values in term of diagnosing LIHC. Meanwhile, BIRC5 and CCNE1 can
also be considered promising in the field of LIHC prognosis. In the era
of big data, we now can gather worldwide gene expression data from
multiple databases. A case in point is TCGA, which helps the present
research to gain exclusive access to an unprecedentedly massive sample
size of LIHC patients along with the expression data. Based on that,
the diagnostic and prognostic tests are deemed even more creditable and
convincing. Forthcoming research can target the novel non-invasive
detection methods using serum to diagnose LIHC and predict its
prognosis on the selected genes, whose pool has been proved effective
in the present study. Still, relevant molecular mechanisms in relation
to the mentioned genes, pathways, and networks need further
investigation and validation.
Conclusion
The study has not only illustrated gene expression signatures of LIHC
and related regulatory pathways and networks from the perspective of
big data, featuring the cancers-associated pathway with top priority,
‘pathways in cancer,’ but also analyzed the diagnostic and prognostic
values of four highlighted genes, namely BIRC5, E2F1, CCNE1 and CDKN2A,
which were validated using Oncomine. It is advisable that a detection
pool of the four genes should be further investigated owing to its high
evidence level of diagnosis, whereas we believe that the prognostic
powers of BIRC5 and CCNE1 are equally attractive and thus worthy of
attention.
Supplemental Information
Figure S1. Protein-Protein Interaction (PPI) network analysis of all
DEGs.
Interactions of all the DEGs were shown by STRING online database
([190]http://string-db.org) with the cut-off criterion of combined
score >0.7. Network nodes represented proteins and edges represented
protein-protein associations.
[191]Click here for additional data file.^ (20.4MB, png)
DOI: 10.7717/peerj.3089/supp-1
Figure S2. GO analysis of all 389 genes.
The circles represented different terms from the perspective of
biological process (BP), cellular component (CC), and molecular
function (MF). The relationships among terms were represented by
arrows. A false discovery rate (FDR) of 0.05 was selected for the
current directed acyclic graph (DAG). Altogether, 1583 nodes and 2676
edges were presented. The color depth indicated the significance of the
corresponding term.
[192]Click here for additional data file.^ (6.1MB, pdf)
DOI: 10.7717/peerj.3089/supp-2
Figure S3. Validation of the expression of BIRC5, ECF1, CCNE1 and
CDKN2A in LIHC tissues by the data from Oncomine.
BIRC5, ECF1, CCNE1 and CDKN2A were searched in Oncomine Research
Edition ([193]https://www.oncomine.org) filtered with the Cancer Type
of Liver Cancer. All these four genes were included in different
datasets. Highly and reasonably consistent with the results from TCGA,
all BIRC5 (Fig. A, B and C), ECF1 (Fig. D), CCNE1 (Fig. E) and CDKN2A
(Fig. F and G) showed evidently over-expressed pattern in LIHC tissues.
(A) BIRC5 expression between normal liver and LIHC tissues from
Wurmbach Liver Statistics. Wurmbach Liver Statistics was based on the
platform of Human Genome U133 Plus 2.0 containing 19,574 measured
mRNAs. (B) BIRC5 expression between normal liver and LIHC tissues from
Roessler Liver 2 Statistics. Roessler Liver 2 Statistics was based on
the platform of Affymetrix Human Genome HT U133A Array containing
12,624 measured mRNAs. (C) BIRC5 expression between normal liver and
LIHC tissues from Roessler Liver Statistics. Roessler Liver Statistics
was based on the platform of Human Genome U133A 2.0 Array containing
12,603 measured mRNAs. (D) E2F1 expression between normal liver and
LIHC tissues from Chen Liver Statistics. Chen Liver Statistics was
based on the platform not pre-defined in Oncomine containing 10,802
measured mRNAs. (E) CCNE1 expression between normal liver and LIHC
tissues from Wurmbach Liver Statistics. Wurmbach Liver Statistics was
based on the platform of Human Genome U133 Plus 2.0 containing 19,574
measured mRNAs. (F) CDKN2A expression between normal liver and LIHC
tissues from Wurmbach Liver Statistics. Wurmbach Liver Statistics was
based on the platform of Human Genome U133 Plus 2.0 containing 19,574
measured mRNAs. (G) CDKN2A expression between normal liver and LIHC
tissues from Roessler Liver 2 Statistics. Roessler Liver 2 Statistics
was based on the platform of Affymetrix Human Genome HT U133A Array
containing 12,624 measured mRNAs.
[194]Click here for additional data file.^ (133.6KB, png)
DOI: 10.7717/peerj.3089/supp-3
Table S1. [Raw Data] A complete list of involved genes in GO analysis,
KEGG pathway analysis and Panther pathway analysis.
A complete list was provided for readers’ further reference and
research interests with regard to involved genes in GO analysis, KEGG
pathway analysis and Panther pathway analysis. Genes from GO analysis
were divided into three groups, i.e., Biological Process, Cellular
Component, and Molecular Function, accordingly.
[195]Click here for additional data file.^ (132.9KB, xlsx)
DOI: 10.7717/peerj.3089/supp-4
Funding Statement
This work was partly supported by the Fund of National Natural Science
Foundation of China [NSFC81260222 and NSFC81560386]. There was no
additional external funding received for this study. The funders had no
role in study design, data collection and analysis, decision to
publish, or preparation of the manuscript.
Contributor Information
Gang Chen, Email: chen_gang_triones@163.com.
Zhen-bo Feng, Email: guanghu1963@126.com.
Additional Information and Declarations
Competing Interests
The authors declare there are no competing interests.
Author Contributions
Hong Yang conceived and designed the experiments, performed the
experiments, analyzed the data, contributed reagents/materials/analysis
tools, wrote the paper.
Xin Zhang conceived and designed the experiments, performed the
experiments, analyzed the data, contributed reagents/materials/analysis
tools, wrote the paper, prepared figures and/or tables.
Xiao-yong Cai performed the experiments, analyzed the data, contributed
reagents/materials/analysis tools.
Dong-yue Wen, Zhi-hua Ye, Liang Liang, Lu Zhang and Han-lin Wang
performed the experiments, analyzed the data, contributed
reagents/materials/analysis tools, prepared figures and/or tables.
Gang Chen conceived and designed the experiments, performed the
experiments, analyzed the data, contributed reagents/materials/analysis
tools, wrote the paper, prepared figures and/or tables, reviewed drafts
of the paper.
Zhen-bo Feng conceived and designed the experiments, performed the
experiments, analyzed the data, contributed reagents/materials/analysis
tools, wrote the paper, reviewed drafts of the paper.
Data Availability
The following information was supplied regarding data availability:
The raw data has been supplied as a [196]Supplemental File.
References