Abstract

Background

   Liver cancer is a common malignant tumor in China, with high mortality.
   Its occurrence and development were thoroughly studied by
   high-throughput expression microarray, which produced abundant data on
   gene expression, mRNA quantification and the clinical data of liver
   cancer. However, the hub genes, which can be served as biomarkers for
   diagnosis and treatment of early liver cancer, are not well screened.

Results

   Here we present a new method for getting 6 key genes, aiming to
   diagnose and treat the early liver cancer. We firstly analyzed the
   different expression microarrays based on TCGA database, and a total of
   1564 differentially expressed genes were obtained, of which 1400 were
   up-regulated and 164 were down-regulated. Furthermore, these
   differentially expressed genes were studied by using GO and KEGG
   enrichment analysis, a PPI network was constructed based on the STRING
   database, and 15 hub genes were obtained. Finally, 15 hub genes were
   verified by applying the survival analysis method on Oncomine database,
   and 6 key genes were ultimately identified, including PLK1, CDC20,
   CCNB2, BUB1, MAD2L1 and CCNA2. The robustness analysis of four
   independent data sets verifies the accuracy of the key gene’s
   classification of the data set.

Conclusions

   Although there are complicated differences between cancer and normal
   cells in gene functions, cancer cells could be differentiated in case
   that a group of special genes expresses abnormally. Here we presented a
   new method to identify the 6 key genes for diagnosis and treatment of
   early liver cancer, and these key genes can help us understand the
   pathogenesis of liver cancer more deeply.

   Keywords: Bioinformatics analysis, Differential expressed genes, PPI,
   Oncomine, Survival curve analysis

Background

   As one of the common malignant tumors in China, liver cancer accounts
   for 70–80% of primary liver malignancies, and are the second leading
   cause of cancer-related mortality [[29]1–[30]3]. Current treatments for
   liver cancer mainly include surgery and intervention, etc. However, as
   the underlying mechanism of liver cancer is still unclear, liver cancer
   has a high incidence, strong invasiveness and poor prognosis. Data show
   that about 70% of liver cancers relapse within 5 years after resection
   or ablation [[31]4]. Early detection and treatment can effectively
   improve the prognosis of tumor patients and reduce the mortality rate,
   but there are usually no obvious clinical symptoms in the early stage
   of liver cancer. At present, the commonly used diagnostic methods of
   liver cancer, such as serological tumor markers and imaging techniques,
   are not satisfactory in clinical effect. Therefore, screening markers
   related to early diagnosis, invasion and metastasis of liver cancer is
   of great significance for improving the therapeutic effect and
   prognosis of liver cancer [[32]5].

   In this study, the original data was downloaded from the TCGA database,
   and the R software was used to screen the differentially expressed
   genes (DEGs) in the gene expression profile of liver cancer tissues and
   adjacent tissues. Using bioinformatics technology to analyze DEGs,
   combined with Oncomine database to screen potential molecular targets
   suitable for early diagnosis and immunotherapy of liver cancer [[33]6].
   Survival analysis and literature mining were used to verify our results
   and provided a theoretical basis for further research. Finally, we
   constructed a clinical prediction model to get more convincing results.

Results

   According to the mRNAs matrix data of normal tissues and liver cancer
   tissues in the TCGA database, using the R toolkit, letting
   [MATH: <mrow><mi>α</mi><mo>=</mo><mn>2</mn></mrow> :MATH]
   and
   [MATH: <mrow><mi>β</mi><mo>=</mo><mn>0.05</mn></mrow> :MATH]
   , a total of 1564 DEGs were obtained, of which 1400 were up-regulated
   and 164 were down-regulated.

   1564 DEGs were enriched by GO and KEGG with DAVID, where
   [MATH: <mrow><mi>β</mi><mo>=</mo><mn>0.01</mn></mrow> :MATH]
   was included as inclusion criteria in GO, and p value
   [MATH: <mrow><mo><</mo><mn>0.05</mn></mrow> :MATH]
   was included as inclusion criteria in KEGG. The results of GO
   enrichment analysis indicated that the DEGs were mainly enriched in
   BPs, including sister chromatid cohesion and chemical synaptic
   transmission. CC showed that the DEGs were significantly enriched in
   extracellular region, chromosome, extracellular space, plasma membrane
   and anchored component of membrane, and centromere region. MF showed
   that the DEGs were significantly enriched in hormone activity,
   sequence-specific DNA binding, growth factor activity and calcium ion
   binding. The results of KEGG pathway enrichment analysis showed that
   the DEGs were mainly enriched in Cell cycle, Neuroactive
   ligand-receptor interaction, Hypertrophic cardiomyopathy, Mineral
   absorption, Dilated cardiomyopathy, Adrenergic signaling in
   cardiomyocytes and Cardiac muscle contraction (Seen in Table [34]1).

Table 1.

   Significantly enriched GO terms and KEGG pathways of DEGs
   Category Term Description Count FDR/p value
   BP term GO:0007268 Chemical synaptic transmission 44
   [MATH: <mrow><mn>2.26</mn><mi>e</mi><mo>-</mo><mn>04</mn></mrow> :MATH]
   BP term GO:0007062 Sister chromatid cohesion 24
   [MATH: <mrow><mn>4.38</mn><mi>e</mi><mo>-</mo><mn>03</mn></mrow> :MATH]
   CC term GO:0005576 Extracellular region 198
   [MATH: <mrow><mn>2.66</mn><mi>e</mi><mo>-</mo><mn>08</mn></mrow> :MATH]
   CC term GO:0005615 Extracellular space 158
   [MATH: <mrow><mn>1.08</mn><mi>e</mi><mo>-</mo><mn>04</mn></mrow> :MATH]
   CC term GO:0031225 Anchored component of membrane 26
   [MATH: <mrow><mn>1.86</mn><mi>e</mi><mo>-</mo><mn>03</mn></mrow> :MATH]
   CC term GO:0005886 Plasma membrane 386
   [MATH: <mrow><mn>1.16</mn><mi>e</mi><mo>-</mo><mn>02</mn></mrow> :MATH]
   CC term GO:0000775 Chromosome, centromeric region 16
   [MATH: <mrow><mn>2.77</mn><mi>e</mi><mo>-</mo><mn>02</mn></mrow> :MATH]
   MF term GO:0043565 Sequence-specific DNA binding 79
   [MATH: <mrow><mn>2.65</mn><mi>e</mi><mo>-</mo><mn>06</mn></mrow> :MATH]
   MF term GO:0005179 Hormone activity 24
   [MATH: <mrow><mn>4.46</mn><mi>e</mi><mo>-</mo><mn>04</mn></mrow> :MATH]
   MF term GO:0005509 Calcium ion binding 88
   [MATH: <mrow><mn>6.24</mn><mi>e</mi><mo>-</mo><mn>03</mn></mrow> :MATH]
   MF term GO:0008083 Growth factor activity 29
   [MATH: <mrow><mn>3.83</mn><mi>e</mi><mo>-</mo><mn>02</mn></mrow> :MATH]
   KEGG pathway hsa04080 Neuroactive ligand–receptor interaction 46
   [MATH: <mrow><mn>2.33</mn><mi>e</mi><mo>-</mo><mn>07</mn></mrow> :MATH]
   KEGG pathway hsa04110 Cell cycle 23
   [MATH: <mrow><mn>8.06</mn><mi>e</mi><mo>-</mo><mn>05</mn></mrow> :MATH]
   KEGG pathway hsa04978 Mineral absorption 12
   [MATH: <mrow><mn>2.43</mn><mi>e</mi><mo>-</mo><mn>04</mn></mrow> :MATH]
   KEGG pathway hsa05410 Hypertrophic cardiomyopathy (HCM) 16
   [MATH: <mrow><mn>4.47</mn><mi>e</mi><mo>-</mo><mn>04</mn></mrow> :MATH]
   KEGG pathway hsa04260 Cardiac muscle contraction 15
   [MATH: <mrow><mn>9.36</mn><mi>e</mi><mo>-</mo><mn>04</mn></mrow> :MATH]
   KEGG pathway hsa04261 Adrenergic signaling in cardiomyocytes 22
   [MATH: <mrow><mn>1.00</mn><mi>e</mi><mo>-</mo><mn>03</mn></mrow> :MATH]
   KEGG pathway hsa05414 Dilated cardiomyopathy 16
   [MATH: <mrow><mn>1.01</mn><mi>e</mi><mo>-</mo><mn>03</mn></mrow> :MATH]
   [35]Open in a new tab

   The STRING database was used to construct a protein-regulated
   interaction network diagram and 1498 protein interaction network nodes
   were obtained. Analyze these nodes to further understand the regulatory
   relationship between DEGs and proteins in liver cancer tissues and
   normal tissues. Import the data downloaded in the STRING database into
   Cytoscape software for visualization, and get the PPI network diagram
   of DEGs. The degree of CytoHubba plug-in was used to determine the top
   15 hub genes for further analysis. They were GNGT1, GNG4, LPAR3,
   BDKRB1, PLK1, CDC20, CCNB2, ADCY8, BUB1, MAD2L1, AURKB, AGTR2, RLN3,
   CCNA2, and CENPE.

   The expression levels of these 15 hub genes were analyzed in the
   Oncomine database, and we obtained 6 key genes, PLK1, CDC20, CCNB2,
   BUB1, MAD2L1 and CCNA2, which have significant expression differences,
   as shown in Fig. [36]1. Furthermore, a meta-analysis of these 6 key
   genes was carried out. Taking CDC20 gene as an example, the results of
   differential expression analysis and meta-analysis are shown in
   Fig. [37]2. In the meta-analysis, three studies were screened by
   significant differences and differential expression folds [[38]7,
   [39]8]. Depending on the statistical analysis of Oncomine database, 6
   key genes were highly expressed in liver cancer compared with normal
   tissues, and the differences were statistically significant. Next, we
   used the 5-fold cross-validation method and three classifiers: decision
   tree(Dtree), random forest(RF), and support vector machine(SVM) to
   measure the classification performance of 6 key genes and 15 hub genes
   in 4 data sets. Taking into account the randomness of the classifier,
   the process was repeated 200 times and the average accuracy was
   calculated. The results are shown in Table [40]2.

Fig. 1.

   [41]Fig. 1
   [42]Open in a new tab

   Differential expression analysis of CDC20 gene in Oncomine database
   liver cancer. Where: 1. normal tissue; 2. liver cancer tissue; a, b, c
   respectively represents the results of three studies

Fig. 2.

   [43]Fig. 2
   [44]Open in a new tab

   Meta-analysis of Oncomine database of CDC20 gene in liver cancer

Table 2.

   The classification accuracy of 15 hub genes and 6 key genes using 3
   classifiers in 4 data sets
   Dataset       Number of genes Classifier and classification accuracy
                                 Dtree RF    SVM
   [45]GSE76427  15hub           0.850 0.919            0.832
                 6key            0.878 0.907            0.827
   [46]GSE57957  15hub           0.894 0.917            0.861
                 6key            0.893 0.908            0.856
   [47]GSE39791  15hub           0.916 0.911            0.887
                 6key            0.925 0.915            0.915
   [48]GSE102079 15hub           0.854 0.888            0.898
                 6key            0.843 0.859            0.875
   [49]Open in a new tab

   Finally, using the KM-Plotter database, the survival curves were
   obtained based on the 6 key genes. Conducting research on the
   prognostic value of these key genes for the overall survival of
   patients with liver cancer, the results shown that: PLK1 (
   [MATH:
   <mrow><mi>P</mi><mo>=</mo><mn>7</mn><mi>e</mi><mo>-</mo><mn>07</mn></mr
   ow> :MATH]
   ), CDC20 (
   [MATH:
   <mrow><mi>P</mi><mo>=</mo><mn>5.7</mn><mi>e</mi><mo>-</mo><mn>08</mn></
   mrow> :MATH]
   ), CCNB2 (
   [MATH:
   <mrow><mi>P</mi><mo>=</mo><mn>2.7</mn><mi>e</mi><mo>-</mo><mn>04</mn></
   mrow> :MATH]
   ), BUB1 (
   [MATH:
   <mrow><mi>P</mi><mo>=</mo><mn>4</mn><mi>e</mi><mo>-</mo><mn>05</mn></mr
   ow> :MATH]
   ), MAD2L1 (
   [MATH:
   <mrow><mi>P</mi><mo>=</mo><mn>5.3</mn><mi>e</mi><mo>-</mo><mn>06</mn></
   mrow> :MATH]
   ) and CCNA2 (
   [MATH:
   <mrow><mi>P</mi><mo>=</mo><mn>5.3</mn><mi>e</mi><mo>-</mo><mn>06</mn></
   mrow> :MATH]
   ). The analysis results of survival curve are shown in Fig. [50]3.
   These results show that these six key genes are closely related to the
   occurrence, development and prognosis of liver cancer, which were
   significant for the study of liver cancer.

Fig. 3.

   [51]Fig. 3
   [52]Open in a new tab

   The results of KM-Plotter database survival analysis on 6 key genes

   Based on the clinical prediction model, a univariate cox regression
   analysis was performed on the training group data set, and the p value
   of age and pathology was less than 0.05. Multivariate cox regression
   analysis was performed for these two factors, and a nomogram was drawn
   as shown in Fig. [53]4. The ROC curve of the training group (seen in
   Fig. [54]5) and the validation results of the test population show that
   the model has good prediction ability.

Fig. 4.

   [55]Fig. 4
   [56]Open in a new tab

   Liver cancer overall survival prediction nomogram

Fig. 5.

   Fig. 5
   [57]Open in a new tab

   ROC of the risk score

Discussions

   It is a dynamic biological process with multiple molecules,steps and
   factors during the occurrence and development of liver cancer, and
   people’s understanding of its mechanism is still very limited
   presently. Previous studies mainly focused on the impact of single
   genes on tumors, but the process of cell carcinogenesis often involves
   changes in multiple genes. These genes can play a role by transforming
   network interactions. Therefore, studying cancer gene expression
   profile at the level of multiple genes can help us better explore the
   pathogenesis. Gene chip can detect the expression levels of tens of
   thousands of genes at the same time, and is a powerful tool for
   studying the interaction between genomes and genes.

   In this study, to verify the performance of the 6 key genes in
   classifying cancer and normal samples, four datasets from different
   sources were applied. By using different classifiers for verification,
   higher classification accuracy has been obtained. By comparing the
   classification accuracy of 15 hub gene sets and 6 key gene sets, we can
   find that the prediction accuracy of the two gene sets is not much
   different, but because the number of genes was reduced, the speed of
   clinical diagnosis can be improved. Therefore, our method has strong
   robustness and effectiveness. Compared with normal samples all of the 6
   key genes are upregulated. Of the 6 key genes, PLK1 plays an important
   role in the regulation of cell mitosis and is closely related to
   survival and prognosis. In recent years, it has been found that PLK1 is
   highly expressed in cervical cancer [[58]9], neuroblastoma cells
   [[59]10], acute myeloid leukemia [[60]11], prostate cancer [[61]12] and
   many other malignant tumors. Studies have found that inhibiting the
   expression of PLK1 by interfering with multiple stages of mitosis can
   lead to tumor cell death, so PLK1 is expected to become a potential
   target for cancer treatment [[62]13].

   CDC20 is a key gene to ensure the normal progress of cell mitosis. Its
   abnormal expression can cause normal cells to produce non-integer
   chromosomes by disrupting cell mitosis, thereby promoting malignant
   transformation of cells [[63]14]. Studies have shown that the high
   expression of CDC20 is closely related to the clinical progress of
   human malignant tumors [[64]15, [65]16], such as lung cancer, liver
   cancer, malignant glioma, etc [[66]17–[67]19].

   The proteins encoded by CCNA2 and CCNB2 belong to the highly conserved
   cyclin family and play an important role in the process of cell cycle
   control. The research results of the literature [[68]20, [69]21] show
   that they are all highly expressed in HCC tissues, and the overall
   survival rate of patients is low.

   BUB1 encodes a serine/threonine-protein kinase that play an important
   role in mitosis [[70]22]. The protein partly encoded by this gene
   regulates the formation and activation of the late mitotic complex by
   phosphorylating some components of the mitotic checkpoint complex and
   activating the checkpoint of the spindle [[71]23, [72]24]. The abnormal
   expression of BUB1 is related to the occurrence and development of a
   variety of tumors, and it is also involved in the biological behavior
   of cancer stem cells such as breast cancer [[73]25, [74]26].

   MAD2L1 is an important part of the spindle assembly checkpoint in
   mitosis. Studies have shown that reducing the expression of MAD2L1 can
   inhibit tumor cell migration and invasion, thereby hindering tumor
   occurrence and development [[75]27].

   In the process of cell carcinogenesis, the interaction of multiple gene
   changes is of great significance to the pathogenesis of cancer, not
   only for liver cancer. Therefore, the analysis method in this study is
   also applicable to other cancers and is used to identify biomarkers of
   other complex human diseases, such as the non-coding RNA, microRNAs,
   etc [[76]28–[77]30].

Conclusions

   This study comprehensively explored the molecular functions and
   biological processes of the occurrence and development of liver cancer.
   Based on the analysis of the liver cancer data from the TCGA database,
   firstly compared the gene expression of normal tissues and liver cancer
   tissues, and screened out genes with different expressions. The GO
   enrichment analysis and KEGG pathway enrichment analysis of these
   differential genes shown that DEGs were mainly enriched in chemical
   synaptic transmission, extracellular area, extracellular space,
   neuroactive ligand-receptor Interaction, cell cycle, etc. Then, the PPI
   network of DEGs was constructed using STRING database and Cytoscape
   software, and 15 hub genes were screened out. In addition, in order to
   verify the expression levels of these key genes, differential
   expression analysis and meta-analysis of 15 hub genes were performed
   using Oncomine database, and 6 key genes were obtained, and the
   fivefold verification method was used to verify the effectiveness of
   these six key genes. Finally, Oncomine and survival analysis were used
   to confirm the six key genes that PLK1, CDC20, CCNB2, BUU1, MAD2L1 and
   CCNA2 are closely related to the occurrence, prognosis and mechanism of
   liver cancer, which will help us discover tumor markers and drug
   targets.

Methods

Datasets

   The mRNA data of liver cancer tissues and normal tissues were
   downloaded from the TCGA database. The data includes 371 tumor samples
   and 50 normal samples. The four gene expression profiles ([78]GSE76427
   [[79]31], [80]GSE57957 [[81]32], [82]GSE39791 [[83]33], [84]GSE102079
   [[85]34]) used for the verification test were obtained from the Gene
   Expression Comprehensive System (GEO) [[86]35] Download.

Methods

   This paper aims to find molecular markers for early diagnosis of liver
   cancer and potential molecular targets for liver cancer immunotherapy.
   The methods used in the study were as follows: Screening for the DEGs,
   GO and KEGG enrichment analysis, PPI network construction and hub genes
   identification, Oncomine validation and survival analysis of hub genes.
   These methods were described in detail as follows:

Screening for the DEGs

   Using the R package, we performed background correction,
   standardization and expression value calculation on the original data.
   Index
   [MATH:
   <mrow><mo>∣</mo><mi>l</mi><mi>o</mi><mi>g</mi><mi>F</mi><mi>C</mi><mo>∣
   </mo><mspace width="0.222222em"></mspace><mi>α</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>F</mi><mi>D</mi><mi>R</mi><mo><</mo><mi>β</mi></mrow>
   :MATH]
   were used as conditions to filter out DEGs, where the value of
   [MATH: <mi>α</mi> :MATH]
   was inversely proportional to the number of DEGs selected, and the
   value of
   [MATH: <mi>β</mi> :MATH]
   was proportional to the number of DEGs selected.

GO and KEGG enrichment analysis

   GO is a method for large-scale enrichment studies of gene functions
   including biological processes, molecular functions, and cellular
   components. Statistical analysis was used to calculate the p value and
   FDR in GO. KEGG pathway enrichment analysis calculates the
   hypergeometric distribution of DEGs and pathways, and returns the p
   value and FDR value of each pathway existing in DEGs to find the most
   likely related pathway. In this study, we used DAVID 6.8 database for
   GO and KEGG enrichment analysis. Then visually analyze GO and KEGG
   through bar graphs and ggplot2.

PPI network construction and hub genes identification

   The STRING database is an online analysis tool used to study
   protein-protein interaction (PPI) patterns and provide information on
   related pathways and functions [[87]36]. Import DEGs into the database
   and output network files of protein interactions. Then use the
   Cytoscape software to visualize the protein interaction network and
   obtain the hub genes [[88]37, [89]38].

Oncomine validation and survival analysis of hub genes

   The Oncomine database is a cancer gene chip database used to analyze
   cancer gene information, compare the differential expression between
   cancer and normal tissues, and verify hub genes. The KM-plotter
   database (http://kmplot.com/analysis/) contains information on 364
   liver cancer patients and is used to evaluate the prognostic value of
   hub genes in liver cancer patients.

In silico validation

   In order to verify the validity of the results, we downloaded 4
   independent gene expression profiles of GEO ([90]GSE76427,
   [91]GSE57957, [92]GSE39791, [93]GSE102079). The five-fold
   cross-classification verification method was adopted, and the four data
   sets were respectively verified using decision trees, random forests
   and support vector machines. Each experiment was repeated 200 times and
   the average classification accuracy was calculated.

Construction of clinical prediction model

   The clinical data of liver cancer samples were randomly divided into
   training group and testing group. Univariate cox regression analysis
   was performed using the training group to screen out statistically
   significant factors and draw the histograms. According to the
   calculation results, the training group was divided into high-risk and
   low-risk categories, and the ROC curve was constructed to evaluate the
   model. The testing group was used to verify the model.

   In this study, the DEGs were obtained by differential expression
   analysis of liver cancer gene expression profiles. These differential
   expression genes were studied by GO and KEGG enrichment analysis, and
   the significantly enriched functions and pathways of DEGs were
   obtained. Furthermore, protein interaction network was constructed
   using STRING database to screen key genes. Oncomine database and
   survival analysis were used to confirm the key genes, and the genes
   which suitable for liver cancer markers or potential targets were
   obtained. Finally, we constructed a clinical prediction model, so as to
   get more convincing results.

Acknowledgements