Abstract

   As liver hepatocellular carcinoma (LIHC) has high morbidity and
   mortality rates, improving the clinical diagnosis and treatment of LIHC
   is an important issue. The advent of the era of precision medicine
   provides us with new opportunities to cure cancers, including the
   accumulation of multi-omics data of cancers. Here, we proposed an
   integration method that involved the Fisher ratio, Spearman correlation
   coefficient, classified information index, and an ensemble of decision
   trees (DTs) for biomarker identification based on an unbalanced dataset
   of LIHC. Then, we obtained 34 differentially expressed genes (DEGs).
   The ability of the 34 DEGs to discriminate tumor samples from normal
   samples was evaluated by classification, and a high area under the
   curve (AUC) was achieved in our studied dataset and in two external
   validation datasets (AUC = 0.997, 0.973, and 0.949, respectively).
   Additionally, we also found three subtypes of LIHC, and revealed
   different biological mechanisms behind the three subtypes. Mutation
   enrichment analysis showed that subtype 3 had many enriched mutations,
   including tumor protein p53 (TP53) mutations. Overall, our study
   suggested that the 34 DEGs could serve as diagnostic biomarkers, and
   the three subtypes could help with precise treatment for LIHC.

   Keywords: ensemble of decision trees, diagnostic biomarkers, LIHC
   subtyping

1. Introduction

   Liver hepatocellular carcinoma (LIHC), the primary malignancy of liver,
   is derived from hepatocytes and accounts for over 80% of cases of liver
   cancer. LIHC is predicted to be the sixth most commonly diagnosed
   cancer and the fourth leading cause of cancer deaths worldwide [[32]1].
   LIHC is too often found at the advanced disease stage, and at this
   stage, there is virtually no effective treatment that can improve
   survival rates [[33]2]. Therefore, finding diagnostic biomarkers is
   essential for the early diagnosis and individualized treatments for
   LIHC. In addition, the correct discrimination of the subtypes is
   helpful to provide personalized therapy.

   Biomarkers have many potential applications in oncology, including
   screening, differential diagnosis, determination of prognosis, and
   monitoring the progression of disease [[34]3]. Diagnostic biomarkers
   are used to improve the diagnosis of a disease. Diagnostic biomarker
   discovery, i.e., identifying important features that can discriminate
   tumors from normal samples, is commonly solved by different feature
   selection methods [[35]4]. Many studies focused on identifying
   diagnostic biomarkers by finding differentially expressed genes (DEGs),
   as DEGs are the most informative genes among a large number of
   irrelevant ones. For example, Yin et al. used the integration of DEG
   screening and the method of weighted gene co-expression network
   analysis to identify biomarkers for hepatocellular carcinoma (HCC)
   [[36]5]. Li et al. identified 273 DEGs as candidate targets for the
   diagnosis of HCC using GEO2R
   ([37]http://www.ncbi.nlm.nih.gov/geo/geo2r) [[38]6]. Kaur et al.
   identified a three-gene HCC biomarker based on DEGs [[39]7].

   However, most of these studies did not consider the imbalance problems
   between tumors and normal samples, which are likely to produce
   unreliable biomarkers. Typically, approaches to specifically deal with
   the imbalance problem are proposed from the data and algorithmic
   levels. Sampling methods consist of balancing the original dataset,
   either by under-sampling for the majority class or over-sampling the
   minority class [[40]8,[41]9]. Algorithm level data balancing is widely
   used, such as cost-sensitive learning [[42]10,[43]11], integration
   methods, and single class learning. More notably, other methods were
   also proposed, such as two-stage feature selection [[44]12], which
   provided a new way of thinking about dealing with unbalanced data.

   Molecular subtyping refers to finding clusters of tumors that have
   shared characteristics, which is helpful to the treatment of specific
   cancers. Molecular subtyping could help researchers to identify both
   actionable targets for drug design as well as biomarkers for response
   prediction [[45]13]. Typically, cancers were classified using
   pathological criteria that rely heavily on the tissue site of origin
   [[46]14]. However, new data-driven approaches have been proposed in
   cancer subtyping based on gene expression profiling. For example,
   non-negative factorization clustering with a standard “Brunet” method
   was conducted, and two distinct molecular HCC subtypes were identified
   [[47]15]; statistical analysis was applied to HCC tumor examples to
   judge whether they can be divided into specific clusters with distinct
   features [[48]16]. Therefore, data-driving methods might be useful to
   deal with problems in biology.

   In this study, we first identified 34 DEGs of LIHC by an integration
   method which involved the Fisher ratio, Spearman correlation
   coefficient, classified information index, and the ensemble of decision
   trees (DTs). Then, we discussed two applications of the 34 DEGs: On the
   one hand, they were potential diagnostic biomarkers that showed good
   performance in discriminating tumors from normal samples; on the other
   hand, they were used to classify tumor samples into three subtypes with
   different survival rates and specific mutations. In summary, this study
   revealed 34 prognostic biomarkers and three subtypes of LIHC, which
   might play important roles in precision medicine regarding LIHC.

2. Materials and Methods

2.1. Datasets

   LIHC and HCC represent the same cancer: the most common type of primary
   liver cancer. The multiplatform genomics datasets including mRNA
   expression data, DNA-methylation data, and somatic mutation data, were
   downloaded from the Cancer Genome Atlas (TCGA,
   [49]http://cancergenome.nih.gov/) for LIHC. The clinical information
   was obtained through the TCGA Data Commons
   ([50]https://gdc.cancer.gov/). Two external validation datasets were
   downloaded from Gene Expression Omnibus (GEO,
   [51]https://www.ncbi.nlm.nih.gov/gds) with accession numbers
   [52]GSE39791 and [53]GSE3500. The flowchart of the study is shown in
   [54]Figure 1.

Figure 1.

   [55]Figure 1
   [56]Open in a new tab

   The flowchart of our main work. The process to obtain differentially
   expressed genes (DEGs) between tumors and normal samples is in the left
   panel, and the two main applications of our DEGs are in the right
   panel. Liver hepatocellular carcinoma (LIHC), the Cancer Genome Atlas
   (TCGA), and decision trees (DTs).

2.2. Data Preprocessing

   As a large number of genes do not actually work in discriminating tumor
   samples from normal samples, we first processed the gene expression
   profiles based on the following three strategies: (1) We deleted genes
   that were unexpressed in over 50% samples including normal and tumor
   samples; (2) we performed maximum and minimum normalization for each
   gene; and (3) we replaced zero expression with a random value from the
   range (0, min/10), where “min” is the minimum non-zero expression value
   for a certain gene.

2.3. Identification of Differentially Expressed Genes (DEGs)

   The following steps were conducted to reduce the dimension of the gene
   space and identify DEGs. First, we used the Fisher ratio method
   [[57]17] to delete irrelevant genes. Suppose that two investigating
   classes (i.e., tumor and normal) on gene
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   have means
   [MATH:
   <mrow><mrow><msub><mi>μ</mi><mrow><mi>i</mi><mn>1</mn></mrow></msub></m
   row></mrow> :MATH]
   ,
   [MATH:
   <mrow><mrow><msub><mi>μ</mi><mrow><mi>i</mi><mn>2</mn></mrow></msub></m
   row></mrow> :MATH]
   and variances
   [MATH:
   <mrow><mrow><msubsup><mi>σ</mi><mrow><mi>i</mi><mn>1</mn></mrow><mn>2</
   mn></msubsup></mrow></mrow> :MATH]
   ,
   [MATH:
   <mrow><mrow><msubsup><mi>σ</mi><mrow><mi>i</mi><mn>2</mn></mrow><mn>2</
   mn></msubsup></mrow></mrow> :MATH]
   . The Fisher ratio of gene
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   was defined as the ratio of the variance between the classes to the
   variance within the classes noted by
   [MATH:
   <mrow><mrow><mi>F</mi><mi>i</mi><mi>s</mi><mi>h</mi><mi>e</mi><mi>r</mi
   ><mtext> </mtext><mi>R</mi><mi>a</mi><mi>t</mi><mi>i</mi><mi>o</mi><mro
   w><mo>(</mo><mi>i</mi><mo>)</mo></mrow><mrow><mtext> </mtext><mo>=</mo>
   </mrow><msup><mrow><mrow><mo>(</mo><mrow><msub><mi>μ</mi><mrow><mi>i</m
   i><mn>1</mn></mrow></msub><mo>−</mo><msub><mi>μ</mi><mrow><mi>i</mi><mn
   >2</mn></mrow></msub></mrow><mo>)</mo></mrow></mrow><mn>2</mn></msup><m
   o>/</mo><mrow><mo>(</mo><mrow><msubsup><mi>σ</mi><mrow><mi>i</mi><mn>1<
   /mn></mrow><mn>2</mn></msubsup><mo>+</mo><msubsup><mi>σ</mi><mrow><mi>i
   </mi><mn>2</mn></mrow><mn>2</mn></msubsup></mrow><mo>)</mo></mrow><mo>.
   </mo></mrow></mrow> :MATH]
   (1)

   This step aimed to reject the noise genes that can only provide few
   information to discriminate tumors from normal samples. The Fisher
   ratio were estimated for each gene, then the genes with the top scores
   were selected.

   Secondly, we calculated the spearman correlation coefficients to
   measure the correlation degree between genes. Genes with high
   correlation coefficients tended to have redundant information, which
   was further filtered based on the following defined classified
   information index [[58]18]. The classified information index of gene
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   was defined as
   [MATH: <mrow><mrow><mi>d</mi><mo stretchy="false">(</mo><mi>i</mi><mo
   stretchy="false">)</mo><mo>=</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mf
   rac><mrow><mo>|</mo><msub><mi>μ</mi><mrow><mi>i</mi><mn>1</mn></mrow></
   msub><mo>−</mo><msub><mi>μ</mi><mrow><mi>i</mi><mn>2</mn></mrow></msub>
   <mo>|</mo></mrow><mrow><msub><mi>σ</mi><mrow><mi>i</mi><mn>1</mn></mrow
   ></msub><mo>+</mo><msub><mi>σ</mi><mrow><mi>i</mi><mn>2</mn></mrow></ms
   ub></mrow></mfrac><mo>+</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><mi>ln</
   mi><mrow><mo>(</mo><mrow><mfrac><mrow><msubsup><mi>σ</mi><mrow><mi>i</m
   i><mn>1</mn></mrow><mn>2</mn></msubsup><mo>+</mo><msubsup><mi>σ</mi><mr
   ow><mi>i</mi><mn>2</mn></mrow><mn>2</mn></msubsup></mrow><mrow><mn>2</m
   n><msub><mi>σ</mi><mrow><mi>i</mi><mn>1</mn></mrow></msub><msub><mi>σ</
   mi><mrow><mi>i</mi><mn>2</mn></mrow></msub></mrow></mfrac></mrow><mo>)<
   /mo></mrow><mo>.</mo></mrow></mrow> :MATH]
   (2)

   Formula (2) fully considers the influence of the mean and variance
   between different types of samples. Unlike Formula (1), Formula (2)
   could maintain genes with large differences of variances even when they
   have the same means in tumors and normal samples. The purpose of this
   step was to apply a classified informative index to screen genes with
   high correlation coefficients, which helped us to find the most
   informative genes without redundancy.

   To simplify the data structure and obtain a better classification
   result, we applied the method of an ensemble of decision trees (DTs),
   which calculated feature importance using the classification
   performance. Due to the imbalanced dataset, this tended to select the
   features that had strong correlations with most classes. We used
   bootstrap under-sampling for LIHC to build multiple balanced datasets
   at the beginning. Based on each balanced dataset, we built a decision
   tree, and used perturbation for each gene and cross-validation to
   initially determine the importance score of each gene’s impact on the
   classification accuracy. The feature importance regarding gene
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   in DT
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   was defined by
   [MATH:
   <mrow><mrow><mi>F</mi><msub><mi>I</mi><mrow><mi>i</mi><mi>j</mi></mrow>
   </msub><mo>=</mo><mfrac><mrow><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mi>C</mi><mi>V</mi></mrow></msubsup><mrow><mrow><mo>(
   </mo><mrow><mi>A</mi><mi>C</mi><msub><mi>C</mi><mrow><mi>i</mi><mi>k</m
   i></mrow></msub><mo>−</mo><mi>A</mi><mi>C</mi><mi>C</mi><msub><mi>F</mi
   ><mrow><mi>i</mi><mi>j</mi><mi>k</mi></mrow></msub></mrow><mo>)</mo></m
   row></mrow></mstyle></mrow><mrow><mi>C</mi><mi>V</mi></mrow></mfrac></m
   row></mrow> :MATH]
   (3)

   where CV represents the number of cross-validations, and the accuracy
   before and after perturbation for gene
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   about the kth cross-validation in DT
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   are
   [MATH:
   <mrow><mrow><mi>A</mi><mi>C</mi><msub><mi>C</mi><mrow><mi>i</mi><mi>k</
   mi></mrow></msub></mrow></mrow> :MATH]
   and
   [MATH:
   <mrow><mrow><mi>A</mi><mi>C</mi><mi>C</mi><msub><mi>F</mi><mrow><mi>i</
   mi><mi>j</mi><mi>k</mi></mrow></msub></mrow></mrow> :MATH]
   . The ensemble results of the sample classes were decided by all trees
   together. For example, if the results of most DTs indicated that sample
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   belonged to the tumor samples, then sample
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   was classified as a tumor sample, which was also named as a voting
   mechanism. The voting mechanism was also used for sample discrimination
   on each tree. As for the weight of the decision tree, the consistency
   between the prediction results of each decision tree and the integrated
   results of all decision trees was used as the weight of the decision
   tree. The weight of tree
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   was defined by
   [MATH:
   <mrow><mrow><mi>T</mi><mi>r</mi><mi>e</mi><mi>e</mi><mi>W</mi><mi>e</mi
   ><mi>i</mi><mi>g</mi><mi>h</mi><mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>
   i</mi></msub><mo>=</mo><mfrac><mrow><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1<
   /mn></mrow><mi>s</mi></msubsup><mrow><mi>I</mi><mrow><mo>(</mo><mrow><m
   i>T</mi><mi>r</mi><mi>e</mi><msub><mi>e</mi><mrow><mi>i</mi><mi>j</mi><
   /mrow></msub><mo>=</mo><mi>E</mi><mi>n</mi><mi>s</mi><mi>e</mi><mi>m</m
   i><mi>b</mi><mi>l</mi><msub><mi>e</mi><mi>j</mi></msub></mrow><mo>)</mo
   ></mrow></mrow></mstyle></mrow><mi>S</mi></mfrac><mtext>×</mtext><mi>A<
   /mi><mi>c</mi><mi>c</mi><mi>E</mi><mi>n</mi><mi>s</mi><mi>e</mi><mi>m</
   mi><mi>b</mi><mi>l</mi><mi>e</mi></mrow></mrow> :MATH]
   (4)

   where I is the indicative function;
   [MATH:
   <mrow><mrow><mi>T</mi><mi>r</mi><mi>e</mi><msub><mi>e</mi><mrow><mi>i</
   mi><mi>j</mi></mrow></msub></mrow></mrow> :MATH]
   means the prediction result of sample
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   in tree
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   ;
   [MATH:
   <mrow><mrow><mi>E</mi><mi>n</mi><mi>s</mi><mi>e</mi><mi>m</mi><mi>b</mi
   ><mi>l</mi><msub><mi>e</mi><mi>j</mi></msub></mrow></mrow> :MATH]
   presents the ensemble prediction result for sample
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   ; S is the number of samples; and AccEnsemble is the accuracy of the
   ensemble predicted results.

   Finally, a weighted combination method regarding the feature importance
   and the weight of DTs in each tree was proposed to finally determine
   the feature importance score of each gene. The feature importance score
   of gene
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   was finally defined by
   [MATH:
   <mrow><mrow><mi>F</mi><msub><mi>I</mi><mi>j</mi></msub><mo>=</mo><mstyl
   e
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mi>T</mi><mi>N</mi></mrow></msubsup><mrow><mi>F</mi><
   msub><mi>I</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mtext>×</mtext>
   <mi>T</mi><mi>r</mi><mi>e</mi><mi>e</mi><mi>W</mi><mi>e</mi><mi>i</mi><
   mi>g</mi><mi>h</mi><mi>t</mi><mi>e</mi><msub><mi>d</mi><mi>i</mi></msub
   ></mrow></mstyle></mrow></mrow> :MATH]
   (5)

   where TN represents the number of decision trees. Then, the most
   important DEGs were decided based on the feature importance score as
   defined by Formula (5). The method described above was called the
   ensemble of decision trees (DTs) method.

2.4. Classification between Tumors and Normal Samples by DEGs

   To verify whether the selected DEGs could be used as potential
   diagnostic biomarkers of LIHC, we first conducted hybrid sampling,
   which applied Synthetic Minority Oversampling TEchnique (SMOTE)
   over-sampling to normal samples and used k-means clustering to
   under-sampling the tumor samples to construct multiple balanced
   datasets.

   Then, for each balanced dataset, we adopted a combined classifier,
   which was a combination of naïve bayes (NB) and support vector machine
   (SVM), to determine the final class of each sample by voting
   ([59]Figure S1). Specifically, the TCGA dataset was divided into a
   training set and test set to verify our method. We also used the
   [60]GSE39791 and [61]GSE3500 datasets as independent external
   validation datasets. Receiver operating characteristic (ROC) curves
   [[62]19] were drawn to show the performance of the classification by
   the selected DEGs. Hierarchical clustering was performed using the
   expression of DEGs in both tumors and normal samples, and the results
   are shown as a heat map.

2.5. Subtypes of LIHC

   To determine the subtypes of LIHC, we performed a Bayesian clustering
   method with a spike-and-slab hierarchical model, which was suitable for
   clustering high-dimensional data using the function “bclust” in R
   package “e1071” [[63]20].

   The association of LIHC subtypes with the patients’ overall survival
   was assessed using the Kaplan–Meier survival curve and the log-rank
   test. To determine the representative genes of each subtype, we
   computed the two-side t-test for each gene by comparing each subtype
   with the other subtypes, and then selected the top 100 genes with the
   lowest p-value for each subtype. Principal component analysis (PCA) was
   conducted using the gene expressions of the representative genes to
   compare their expression profiles between the subtypes of LIHC. To
   further explore the biological mechanism behind subtypes, gene ontology
   (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway
   enrichment analyses were conducted for the representative genes. The
   gene ontology related to different subtypes in LIHC was found using the
   database for annotation, visualization and integrated discovery (DAVID)
   database ([64]https://david.ncifcrf.gov/) for the biological process
   (BP), cellular component (CC), and molecular function (MF) aspects. The
   KEGG orthology based annotation system (KOBAS) database
   ([65]http://kobas.cbi.pku.edu.cn/kobas3/genelist/) was used to detect
   the related KEGG pathways. Then, the enriched GO and KEGG pathways were
   compared between subtypes using the R package “cluster Profiler”
   [[66]21].

2.6. Mutational Enrichment Analysis

   To further investigate the mutations related to each subtype,
   mutational enrichment analysis was performed by pairwise and groupwise
   Fisher’s exact test using the function “clinical Enrichment” in R
   package “maftools” [[67]22].

3. Results

3.1. Summary of Datasets

   The TCGA mRNA expression dataset used in this study comprised 24,491
   genes in 371 tumor samples and 50 normal samples. However, 6297 genes
   were deleted as they were not expressed in over 50% of the samples,
   following the data preprocessing mentioned in [68]Section 2.2.
   Therefore, 18,694 genes were left for the study. Two external
   validation sets obtained from GEO were used in this study: [69]GSE39791
   comprised 72 tumor samples and 72 normal samples; and [70]GSE3500
   comprised 104 tumor samples and 76 normal samples.

   There were 370 patients who had clinical information in the clinical
   data, and, among them, 281 patients were alive and 89 patients were
   dead. The information of the 370 patients was used to perform survival
   analysis. Only 358 patients had three types of data, including
   expression data, clinical information, and somatic mutations, which
   were used in the mutational enrichment analysis.

3.2. Identification of Differencially Expressed Genes (DEGs) in LIHC

   The gene expression data were first normalized to identify the DEGs,
   after this step, as described in [71]Section 2.3, the Fisher ratio
   method was performed to delete irrelevant genes. As a result, we
   deleted genes whose score was below 0.5, and kept 5954 genes
   ([72]Figure S2).

   Then, spearman correlation coefficients were calculated to measure the
   correlation between genes. There existed redundant genes classified by
   having an absolute value of the correlation coefficient greater than
   0.7. To remove redundant genes, we adopted the method above, and
   deleted genes with a lower classified information index. After this
   step, we kept 1064 genes.

   Finally, 500 DTs were established using hybrid sampling. We calculated
   feature importance in each tree with five cross-validations and mean
   perturbations. Then, the importance of all genes were calculated using
   Formula (5), and 34 DEGs with non-zero feature importance were selected
   as the diagnostic biomarkers of LIHC ([73]Figure 2A). The details of
   the 34 DEGs are shown in [74]Table S1.

Figure 2.

   [75]Figure 2
   [76]Open in a new tab

   The detected 34 DEGs for LIHC. (A) The bar plot of the importance
   values of the 34 DEGs, in which the genes’ importance scores above 0.05
   were marked pink, and the rest were marked sky blue. (B) The heat map
   showing the expression level of the 34 DEGs in the tumors and normal
   samples. (C) Boxplots comparing the expression level of the 34 DEGs in
   the tumors and normal samples. Three stars (***) marks DEGs whose
   p-value < 0.001.

   Twenty-eight genes of the 34 DEGs were upregulated in LIHC, including
   γ-aminobutyric acid type A receptor subunit delta (GABRD), misato
   mitochondrial distribution and morphology regulator 1 (MSTO1), EBF
   transcription factor 2 (EBF2), and Glucosylceramidase β (GBA), while
   the downregulated genes in LIHC were angiopoietin like 6 (ANGPTL6),
   C-type lectin domain family 4 member M (CLEC4M), C-X-C motif chemokine
   ligand 14 (CXCL14), C-type lectin domain family 1 member B (CLEC1B),
   neurotrophin 3 (NTF3), and cytochrome P450 family 2 subfamily C member
   8 (CYP2C8).

   We created a heat map to show the differentially expressed patterns
   between the tumors and normal samples of the 34 DEGs ([77]Figure 2B).
   Boxplots were drawn to compare the expression of the 34 DEGs in the
   normal samples and LIHC samples ([78]Figure 2C). The expression
   differences between the tumors and normal samples of the 34 DEGs as
   exhibited in both heat map and boxplots suggested that they could be
   used as diagnostic biomarkers for LIHC.

3.3. Evaluating and Comparing the Performance of Our Feature Selection Method

   We randomly divided all samples into two main parts, with 80% of the
   samples for training and 20% of the samples for testing, and then we
   used K-means clustering and the SMOTE over-sampling method to
   constructed nine balanced datasets with a sample size of 100. Then,
   based on the expression profile of the 34 DEGs, we adopted a combined
   classifier model of NB and SVM to discriminate tumor samples from
   normal samples by a voting mechanism. Two external validation datasets
   were used to verify our method. Due to the technical differences
   between expression profiling by microarray (two external validation
   datasets) and expression profiling by high throughput sequencing (TCGA
   dataset), there were no expression values for some of the 34 genes in
   the two external validation datasets. Specifically, [79]GSE39791
   contained 29 of 34 genes, and [80]GSE3500 contained only 23 of 34
   genes.

   ROC curves were drawn to show the prediction accuracy of the DEGs to
   discriminate tumor samples from normal samples ([81]Figure 3). The
   results showed that the 34 DEGs achieved a high area under the curve
   (AUC) for the training dataset (AUC = 0.997), testing dataset (AUC =
   1), and complete dataset (AUC = 0.997) of the TCGA dataset ([82]Figure
   3A); twenty-nine genes in the first external validation dataset
   ([83]GSE39791) also obtained a high AUC for the training dataset (AUC =
   0.966), testing dataset (AUC = 1), and complete dataset (AUC = 0.972)
   of the [84]GSE39791 dataset ([85]Figure 3B); twenty-three genes in the
   second external validation dataset ([86]GSE3500) also had a high AUC
   for the training dataset (AUC = 0.955), testing dataset (AUC = 0.932),
   and complete dataset (AUC = 0.949) of the [87]GSE3500 dataset
   ([88]Figure 3C). The results showed that with the number of genes
   increased, we could obtain a higher area under the curve (AUC) for the
   training, testing, and complete datasets. The accuracies of the
   training, testing, and complete datasets regarding TCGA data were
   0.9941, 1, and 0.9952. In summary, these findings further indicated
   that the 34 DEGs could be potential diagnostic biomarkers of LIHC.

Figure 3.

   [89]Figure 3
   [90]Open in a new tab

   The receiver operating characteristic (ROC) curves and area under the
   curve (AUC) values for the TCGA dataset and two validation sets. (A)
   The results of the TCGA dataset, which contained 34 genes. (B) The
   results of validation set 1 ([91]GSE39791), which included 29 genes.
   (C) The results of validation set 2 ([92]GSE3500), which contained 23
   genes. The left panel, middle panel, and right panel in picture (A–C)
   were the results of the training set, testing set, and complete set,
   respectively.

   To show the effectiveness of our method, we compared our results with
   others’ ([93]Table 1). Notably, we obtained the highest accuracy
   (0.9952) with 34 DEGs compared with other methods. We adopted dataset
   ([94]GSE3500), which is the same dataset used in [[95]23] to verify our
   method. We obtained a lower classification accuracy (0.9553) than the
   reference (0.9944) in the [96]GSE3500 dataset, which may be caused by
   the fact that there were only 23 of 34 DEGs in the [97]GSE3500 dataset.
   Therefore, 34 DEGs achieved the best performance as a whole.

Table 1.

   Comparison of classification results obtained by different methods in
   the hepatocellular carcinoma (HCC) dataset.
   No. Methods Validation Methods No. of DEGs Accuracy Source
   1 Ensemble of DT based on unbalanced dataset Naïve Bayes (NB) and
   support vector machine (SVM) 34 0.9952 Our method
   2 Significance Analysis of Microarrays (SAM)-t test + Gene regulatory
   probability (GRP) SVM (RBF, Radial Basis Function Kernel) 10 0.9944
   [[98]23]
   3 Semi-supervised gene selection Spectral Biclustering 1 0.9870
   [[99]24]
   4 t-test + class separability Fuzzy neural network 2 0.9810 [[100]25]
   5 Univariate cox regression and Lasso penalized cox regression analysis
   Kaplan–Meier 12 0.9311 [[101]26]
   6 Resampling + SAM K nearest neighbor 10 0.93 [[102]27]
   7 Meta Threshold Gradient Descent Regularization Logistic regression 34
   0.8400 [[103]28]
   [104]Open in a new tab

3.4. Gene Ontology and KEGG Terms Enrichment Analysis for DEGs

   To explore the biological mechanism behind DEGs, gene enrichment
   analysis was performed for the 34 DEGs. Here, we adopted gene ontology
   (GO) and KEGG terms to explain the mechanism of the 34 DEGs. From
   [105]Figure 4A, the function annotation showed that 34 DEGs were
   enriched in viral genome replication, protein localization to the
   microtubule organizing center, the cellular response to glucocorticoid
   stimulus, and so on. The detailed information is shown in
   [106]Supplementary Table S2. On the other hand, enriched pathways were
   detected for the 34 DEGs ([107]Figure 4B). The top four enriched
   pathways for the 34 DEGs were the C-type lectin receptor signaling
   pathway, other glycan degradation, glycosylphosphatidylinositol
   (GPI)-anchor biosynthesis, and the linoleic acid metabolism. The genes
   involved in each specific pathway are shown in [108]Supplementary Table
   S3. According to the results of the function and pathway enrichment
   analysis, the 34 DEGs were involved in important biological processes
   and pathways, which are related to cancer.

Figure 4.

   [109]Figure 4
   [110]Open in a new tab

   The enrichment analysis for the 34 DEGs. (A) The bubble diagram of the
   enriched gene ontology. The vertical axis represents the enrichment
   functions, and the horizontal axis showed the corresponding p-value,
   which is below 0.01. (B) The bubble diagram of the enriched Kyoto
   Encyclopedia of Genes and Genomes (KEGG) pathways. The vertical axis
   represents the enrichment pathway, and the horizontal axis shows the
   corresponding p-value, which is below 0.1.

3.5. DNA Methylation Involved in Regulating the Expression of DEGs

   To investigate the underlying mechanisms in the regulation of the
   diagnostic biomarkers in LIHC, we further explored the correlation
   between DNA methylation and the gene expression of DEGs using the
   Spearman correlation coefficient. Among the 34 DEGs, only 27 genes had
   DNA methylation data, and 25 genes of 27 genes (92.65%) showed a
   negative correlation between the mRNA expression and DNA methylation
   ([111]Figure 5). There was significant negative correlation between the
   gene expression and DNA methylation for glycosylphosphatidylinositol
   anchor attachment 1 (GPAA1, s = −0.4544, p-value = 2.2 × 10^−16),
   keratinocyte associated protein 2 (KRTCAP2, s = −0.4005, p-value = 2.2
   × 10^−16), tubulin folding cofactor E (TBCE, s = −0.3408, p-value =
   2.049 × 10^−11), protoporphyrinogen oxidase (PPOX, s = −0.3234, p-value
   = 1.763 × 10^−10), C8orf33 (s = −0.4356, p-value = 2.2 × 10^−16),
   centrosomal protein 72 (CEP72, s = −0.4052, p-value = 4.283 × 10^−16),
   and family with sequence similarity 83 member H (FAM83H, s = −0.4722,
   p-value = 2.2 × 10^−16), where the s indicates the value of the
   Spearman correlation coefficient. The results indicated that DNA
   methylation might contribute to the dysregulation of biomarker
   expression in LIHC.

Figure 5.

   [112]Figure 5
   [113]Open in a new tab

   The association between the gene expression and DNA methylation of
   biomarkers in LIHC. The red label indicates a significant negative
   correlation by the Spearman test (s < −0.3), and the red line and blue
   line represent the regression line.

3.6. Blust Analysis Uncovers Major Subtypes of LIHC

   Applying the gene expression of the DEGs, we performed the R package
   “bclust” method based on the empirical Bayes method and obtained two to
   eight clusters. Then the k = 3 clustering solution was selected for
   further investigation. The k = 3 clustering solution formed three
   different subtypes, reference to here as “subtype 1” through “subtype
   3”. The three subtypes of LIHC included: Subtype 1 with 199 cases
   (comparing 53.64% of tumor samples), subtype 2 with 79 cases (21.29%),
   and subtype 3 with 93 cases (25.07%) of LIHC cases.

   We further explored the differences in the expression patterns of 34
   DEGs between the three subtypes of LIHC. The boxplots presented the
   expression level of 34 DEGs in different subtypes ([114]Figure 6A),
   which displayed that the expression values of 34 DEGs were truly
   fluctuating in the three subtypes of LIHC. For example, GABRD, EBF2,
   Surfactant Associated 1, LncRNA (SFTA1P), and GBA had a wide range of
   values in subtype 1, and their changes in subtype 2 were relatively
   stable, while in subtype 3 they had a large range and amplitude.

Figure 6.

   [115]Figure 6
   [116]Open in a new tab

   The three subtypes of LIHC. (A) Boxplots showing the expression level
   of the 34 DEGs in three subtypes. (B) Survival curves for three
   subtypes of LIHC in the TCGA dataset.

   Survival analysis was performed on 370 tumor samples with the clinical
   data ([117]Figure 6B), which suggested that the overall survival rates
   in the three subtypes of patients showed significant differences
   (p-value = 1.80 × 10^−3). The survival analysis implied that the three
   subtypes of LIHC could help guide clinical treatments.

3.7. Representative Genes of Subtypes in LIHC

   By using a two-sided t-test, DEGs for a given subtype versus the other
   two subtypes were obtained. The top 100 genes with the lowest p-value
   for each subtype were selected as the representative genes. A Venn
   diagram was drawn to show the distribution of the representative genes
   of the three subtypes ([118]Figure 7A). As shown in [119]Figure 7A,
   each subtype had many specific representative genes, even though
   subtype 2 and subtype 3 had 36 representative genes in common. We
   conducted principal component analysis (PCA) using the expression
   profiles of the 300 representative genes, and through the screen plot
   ([120]Figure S3), we finally chose two main components and drew a
   scatter diagram ([121]Figure 7B), which clearly showed that the 300
   representative genes could be used to classify the three subtypes.

Figure 7.

   [122]Figure 7
   [123]Open in a new tab

   Three subtypes of LIHC. (A) Venn plot for the 300 representative genes
   of three subtypes. (B) Scatter plot showing the different spatial
   distribution of subtypes. (C) Comparison of the enriched KEGG pathway
   of representative genes for the three subtypes.

   To reveal the pathological mechanism behind the subtypes, function
   enrichment analysis was performed for the representative genes of each
   subtype. For a detailed explanation, the enriched gene ontology
   ([124]Figure S4) and KEGG pathway ([125]Figure 7C) of the subtypes were
   compared, which demonstrated that the three subtypes were clearly
   enriched in different functions and pathways. For subtype 1, the main
   enriched functions included protein binding, nucleoplasm, cell
   division, and ATP bindings; the related KEGG pathway mainly included
   mismatch repair, DNA replication, cell cycle, and base excision repair.
   As for subtype 2, it related to the nucleus, protein binding,
   nucleoplasm, and so on, as well as related to the pathway of
   glycosaminoglycan biosynthesis-keratan sulfate. While for subtype 3,
   the main functions included extracellular exosomes, poly (A) RNA
   binding, and the nucleoplasm; it was associated with a large number of
   pathways, such as primary bile acid biosynthesis and phenylalanine
   metabolism. Therefore, different subtypes may be undergoing different
   tumor stages and should be treated with different methods.

3.8. Genetic Alteration in Subtypes

   Among the 358 patients with expression data, clinical information, and
   somatic mutations, there were 194, 77, and 87 patients for each
   subtype, respectively. There were 28 mutation genes enriched in
   different subtypes ([126]Figure 8A). Eighteen of the 28 mutation genes
   (64.3%) were enriched in subtype 3, which indicated that subtype 3
   tended to have more mutation genes. In this perspective, mutations
   could be the reason why patients in subtype 3 had the lowest survival
   rate. Specifically, tumor protein p53 (TP53) mutations were enriched in
   subtypes 3 ([127]Figure 8A); the top 10 frequently mutated genes in
   LIHC were shown in [128]Figure 8B, which also showed the detail of TP53
   mutations in different subtypes.

Figure 8.

   [129]Figure 8
   [130]Open in a new tab

   Mutation genes in different subtypes. (A) Mutation genes enriched in
   the different subtypes, ranked by the p-value of Fisher’s exact test.
   (B) Oncoplot showing the top 10 mutation genes in LIHC arranged by
   different subtypes.

4. Discussion

   With the large accumulation of multi-omics data and selective molecular
   targeted therapies, many studies focused on the identification of
   biomarkers through multi-omics data analysis and provided potential
   biomarkers that could play important roles in the clinical management
   of cancer patients [[131]29]. The discovery of biomarkers for LIHC
   could contribute to discriminating LIHC from normal samples. Therefore,
   we formed an integration method to extract the DEGs for LIHC and
   explored their two applications in the diagnosis and subtyping of LIHC.

   We found 34 DEGs from TCGA data mainly through the ensemble of DTs, and
   the information regarding them was summarized in [132]Table S1.
   Overall, ANGPTL6, CLEC4M, CXCL14, CLEC1B, NTF3, and CYP2C8 were
   downregulated in LIHC, while other genes were upregulated. Some of the
   34 DEGs were reported as biomarkers of HCC. For example, CLEC4M,
   CYP2C8, and SFTA1P might be valuable biomarkers for the prognosis of
   HCC [[133]30,[134]31,[135]32]. Calcyclin Binding Protein (CACYBP)
   expression was elevated in HCC and might serve as a promising
   therapeutic and prognostic biomarker [[136]33]. DDX11 Antisense RNA 1
   (DDX11-AS1) and Apelin (APLN) played oncogenic role in HCC and might
   serve as potential therapy target for HCC [[137]34,[138]35]. The
   methylation status of the Cadherin 13 (CDH13) promoter in peripheral
   blood mononuclear cells (PBMCs) was a potential noninvasive biomarker
   to predict the prognosis of HCC patients [[139]36]. Chromosome 8 Open
   Reading Frame 33 (C8orf33) was associated with the survival time of HCC
   patients, and could serve as a potential biomarker for distinguishing
   poorly differentiated from well-differentiated HCC [[140]37]. GABRD
   could serve as a biomarker for HCC stage-IV [[141]38]; SET And MYND
   Domain Containing 3 (SMYD3) promoted the tumorigenicity and
   intrahepatic metastasis of HCC cells, and could be a practical
   prognosis marker or therapeutic target against HCC [[142]39]. EBF2
   might be a candidate biomarker of HCC and potential therapeutic target
   of it [[143]40]. PPOX might act as a tumor suppressor and play a
   crucial role in the development of HCC [[144]41].

   Some of the 34 DEGs were related to liver function or other cancers,
   such as ANGPTL6, whose corresponding mRNA has been detected exclusively
   in the liver of humans [[145]42], and a study showed that normal liver
   tissues produce the highest amounts of ANGPTL6 [[146]43]. Biogenesis of
   Lysosomal Organelles Complex 1 Subunit 3 (BLOC1S3) could induce
   hepatocyte apoptosis [[147]44]. GBA could inhibit liver cancer, and
   reduce the ratio of natural killer T (NKT) lymphocytes in the liver
   [[148]45]. CDC28 Protein Kinase Regulatory Subunit 1B (CKS1B)
   represented a potential research target for therapeutics of
   retinoblastoma [[149]46]. CLEC1B and FAM83H were associated with a poor
   prognosis in LIHC [[150]47,[151]48]. The polymorphisms of gene CXCL14
   were linked with impaired liver function [[152]49]. Endothelial Cell
   Specific Molecule 1 (ESM1) could serve as a biomarker for diagnosing
   and monitoring renal cell carcinoma [[153]50]. GPAA1 could be a
   promising diagnostic biomarker and therapeutic target for gastric
   cancer [[154]51]. MicroRNA 3658 (MIR3658) was involved in the tumor
   progression of bladder cancer and had prognostic values [[155]52].
   Although CEP72, Chromosome 1 Open Reading Frame 35 (C1orf35), Collagen
   Type XV α 1 Chain (COL15A1), Cysteine Rich Protein 3 (CRIP3), HLA
   Complex Group 25 (HCG25), HIG1 Hypoxia Inducible Domain Family Member
   1B (HIGD1B), KRTCAP2, LOC105369632, MicroRNA 4793 (MIR4793), MSTO1,
   NTF3, and TBCE of the 34 DEGs have not been previously related to
   cancers, our study showed that they could be novel potential biomarkers
   of LIHC.

   The pathological mechanisms of the 34 genes were mainly detected by GO
   and KEGG pathway analysis, and the analysis indicated that the
   biomarkers were enriched in metabolic-related biological processes.
   Altered metabolic features were found quite generally across many types
   of cancer cells, and a reprogrammed metabolism is considered a hallmark
   of cancer [[156]53]. The cancer metabolism has been a target of cancer
   therapy since the appearance of chemotherapy [[157]54]. Pathways, such
   as C-type lectin receptors (CLRs), are powerful pattern-recognition
   receptors. Additionally, a study discovered that CLRs play key roles in
   autoimmunity, allergies, and in maintaining homeostasis [[158]55].
   Defects in the glycosylphosphatidylinositol (GPI) biosynthesis pathway
   could result in a group of congenital disorders of glycosylation known
   as the inherited GPI deficiencies (IGDs) [[159]56]. The pathways
   regarding nicotine addiction were involved in a large number of
   dysfunctional protein–protein interaction (PPI) pairs [[160]57].
   Neuroactive ligand–receptor interactions might play a critical role in
   the pathogenesis of pituitary gonadotroph adenomas [[161]58]. The
   modulation of the sphingolipid metabolism and the related signaling
   pathways might represent a potential therapeutic approach for
   devastating conditions [[162]59]. In addition, for many years,
   methylation was believed to play a crucial role in repressing gene
   expression; therefore, we further analyzed the DEG methylation in LIHC.
   The mRNA expression was negatively correlated with the DEG DNA
   methylation, which was consistent with the results found in previous
   studies: the presence of DNA methylation repressed gene expression in
   vivo [[163]60]. Our results suggested that the expression of DEGs might
   be regulated by DNA methylation in LIHC.

   We treated the 34 DEGs as diagnostic biomarkers of LIHC. For further
   analysis, we adopted two validation sets to explore the effectiveness
   of the 34 biomarkers. Though the two datasets contained only a subsect
   of the 34 DEGs, we discovered possible tendencies for the 34 DEGs: When
   the number of DEGs increased, we obtained higher AUC values in the
   training set, testing set, and complete set, which suggested that the
   34 DEGs could be potential biomarkers of LIHC.

   We applied the 34 DEGs to identify subtypes of LIHC, and discovered
   three subtypes of LIHC that were significantly associated with overall
   survival (p-value = 1.8 × 10^−3). Based on this step, we detected the
   representative genes of each subtype, and applied function and KEGG
   pathway enrichment analysis. For Subtype 1, pathways were mainly
   associated with the cell cycle, DNA replication, and mismatch repair,
   which implied that subtype 1 had disorder in the cell cycle processes.
   For subtype 2, pathways, such as endocytosis and the spliceosome, were
   enriched, and research reported that the endocytosis and spliceosome
   pathways could represent the first signs of embryonic activity. With
   the help of the spliceosome, endocytosis acted as one of main
   activators for finding a series of successive waves of maternal pioneer
   signal regulators [[164]61]. The misregulation of endocytosis could
   result in HCC [[165]62].

   As for subtype 3, the enriched pathways played important roles in liver
   cancer; for example, metabolic pathways that were reported as
   therapeutic targets in liver cancer [[166]63]; complement and
   coagulation cascades, which were crucially involved in the inflammatory
   response [[167]64]; and the retinol metabolism, which plays important
   roles in the development of the nervous system, notochord, and other
   embryonic structures, and in the maintenance of epithelial surfaces,
   immune competence, and reproduction [[168]65]. In summary, the above
   analysis suggested that the three subtypes of LIHC displayed different
   biological processes.

   We divided the 371 tumor samples into three subtypes according to the
   gene expression profiles of the identified 34 DEGs. Although these 371
   samples contained three subtypes of histology including HCC,
   fibrolamellar carcinoma, and mixed hepatocellular cholangiocarcinoma,
   almost 97.3% of them were HCC. Explaining the detailed relationship
   between our defined molecular subtypes and the histology subtypes is,
   thus, difficult. In addition, we drew a heat map for the molecular
   subtypes, pathologic stage, and the TNM classification of malignant
   tumors (TNM) staging in [169]Figure S5. Specifically, the percentages
   of pathologic stage I and II are 71.9% (143/199) in subtype 1, 69.6%
   (55/79) in subtype 2, and 61.3% (57/93) in subtype 3, respectively; the
   percentages of pathologic stage III and IV are 20.6% (41/199) in
   subtype 1, 22.8% (18/79) in subtype 2, and 33.3% (31/93) in subtype 3,
   respectively. Although, there was no obvious evidence to show
   inevitable correlation between molecular subtypes and pathologic
   stages, the data suggested that subtype 1 had a higher proportion of
   pathologic stage I and II, while subtype 3 had a higher proportion of
   pathologic stage III and IV, which could help us understanding the
   three molecular subtypes better. However, we just briefly analysed the
   relationship between molecular subtypes and pathologic stage at the
   data level, there might have some inner connection we didn’t find by
   our method. Therefore, further study about it is necessary. From future
   perspective, the molecular subtypes of LIHC will change the traditional
   classification methods and treatment strategies. With the development
   of data science and modern medicine, the molecular subtypes will be
   better defined and hopefully be applied to clinic.

   In summary, our research findings revealed the diagnostic biomarkers of
   LIHC, and determined the subtypes. Considering future research, we must
   admit the limitations of this paper: first, the thresholds in the
   feature selection process were set from experience; second, the
   external validation sets in this study contained only a part of the
   DEGs; however, no dataset containing all 34 DEGs was found.

5. Conclusions

   In this study, we proposed a new framework and identified 34 DEGs that
   could discriminate tumor samples from normal samples, and, through
   validation and enrichment analysis, we verified that some of the 34
   DEGs could serve as novel diagnostic biomarkers for LIHC. We further
   analyzed the 34 DEGs, and applied them to divide LIHC into three
   subtypes that might contribute to the accuracy judgement during the
   treatment of LIHC. The identified representative genes of each subtype
   could be potentially targetable markers for the different subtypes. In
   addition, mutations associated with the subtypes could be potential
   markers for drug development. Therefore, our results could aid in
   realizing future personalized medicine for LIHC.

Supplementary Materials

   The following are available online at
   [170]https://www.mdpi.com/2073-4425/11/9/1051/s1, Figure S1: The
   flowchart regarding the classification of the samples; Figure S2: The
   results of the Fisher ratio; Figure S3: Screen plot of 300
   representative genes; Figure S4: The bubble diagram of functional
   enrichment; Figure S5: The heatplot for the molecular subtypes,
   pathologic stage, and TNM staging; Table S1: The detailed information
   for the 34 DEGs; Table S2: Function enrichment analysis of the 34 DEGs;
   Table S3: Pathway enrichment analysis of the 34 DEGs.
   [171]Click here for additional data file.^ (1.1MB, pdf)

Author Contributions

   Conceptualization, X.O. and F.H.; methodology, X.O. and F.H.; software,
   X.O.; validation, X.O., F.H., and Y.S.; formal analysis, G.L.;
   investigation, Q.F.; resources, F.H.; data curation, X.O. and F.H.;
   writing—original draft preparation, X.O.; writing—review and editing,
   F.H.; visualization, X.O.; supervision, F.H.; project administration,
   F.H. and Q.F.; funding acquisition, F.H., Y.S., G.L., and Q.F. All
   authors have read and agreed to the published version of the
   manuscript.

Funding

   This work was supported by grants from the National Natural Science
   Foundation of China (61903285), the Natural Science Foundation of Hubei
   Province of China (2019CFB559), and the Fundamental Research Funds for
   the Central Universities (WUT: 2020IVB031, WUT: 2020IB003, WUT:
   2020IA005, WUT: 2019IB012, WUT: 2019IB012). The Grants-in-Aid supported
   this study financially only, and had no role in the design of the study
   and collection, analysis, and interpretation of data nor in writing the
   manuscript.

Conflicts of Interest

   The authors declare no conflict of interest.

References