Abstract

   Liver cancer is one of the most common malignancies and the second
   leading cause of cancer-related deaths worldwide, particularly in
   developing countries, where it poses a significant financial burden.
   Early detection and timely treatment remain challenging due to the
   complex mechanisms underlying the initiation and progression of liver
   cancer. This study aims to uncover key genomic features, analyze their
   functional roles, and propose potential therapeutic drugs identified
   through molecular docking, utilizing single-cell RNA sequencing
   (scRNA-seq) data from liver cancer studies. We applied two advanced
   hybrid methods known for their robust identification of differentially
   expressed genes (DEGs) regardless of sample size, along with four
   top-performing individual methods. These approaches were used to
   analyze four scRNA-seq datasets, leading to the identification of
   essential DEGs. Through a protein−protein-interaction (PPI) network, we
   identified 25 hub-of-hub genes (hHubGs) and 20 additional hHubGs from
   two naturally occurring gene clusters, ultimately validating a total of
   36 hHubGs. Functional, pathway, and survival analyses revealed that
   these hHubGs are strongly linked to liver cancer. Based on molecular
   docking and binding-affinity scores with 36 receptor proteins, we
   proposed 10 potential therapeutic drugs, which we selected from a pool
   of 300 cancer meta-drugs. The choice of these drugs was further
   validated using 14 top-ranked published receptor proteins from a set of
   42. The proposed candidates include Adozelesin, Tivozanib, NVP-BHG712,
   Nilotinib, Entrectinib, Irinotecan, Ponatinib, and YM201636. This study
   provides critical insights into the genomic landscape of liver cancer
   and identifies promising therapeutic candidates, serving as a valuable
   resource for advancing liver cancer research and treatment strategies.

1. Introduction

   Liver cancer originates in the liver as a malignant tumor and has one
   of the highest incidence rates among cancers globally [[28]1].
   According to the American Cancer Society, liver cancer accounts for
   over 700,000 deaths annually worldwide, making it the second leading
   cause of cancer-related mortality. The two primary types of liver
   cancer are hepatocellular carcinoma (HCC), the most prevalent form
   [[29]2], and cholangiocarcinoma. Liver cancer incidence and fatality
   rates are notably higher in developing nations [[30]3]. Major risk
   factors for liver cancer include infection with hepatitis B virus or
   hepatitis C virus, hepatic cirrhosis, and alcoholic liver disease
   [[31]4,[32]5,[33]6]. Evidence from various studies suggests that the
   molecular pathogenesis of liver cancer, including HCC, is closely
   linked to genetic predispositions and environmental influences, such as
   oncogene activation, gene mutations, and the inactivation of tumor
   suppressor genes [[34]7,[35]8]. Despite these findings, the mechanisms
   underlying the progression and development of liver cancer remain
   incompletely understood. Moreover, one of the leading causes of liver
   cancer mortality is its asymptomatic nature during the early stages,
   which makes early diagnosis challenging. Consequently, the disease is
   difficult to treat, has a poor prognosis, and is associated with a high
   recurrence rate [[36]2]. Therefore, identifying precise biomarkers and
   accurately predicting liver cancer progression holds significant
   potential to mitigate its severity and improve clinical outcomes.

   The identification of key therapeutic targets offers a viable approach
   to effectively addressing diseases. Numerous studies have attempted to
   identify critical genes associated with liver cancer. To identify the
   key genes and pathways involved in hepatocellular carcinoma, Wu et al.
   conducted a bioinformatics analysis using microarray data and
   conventional methods [[37]2]. Li et al. used the classical t-test and
   microarray data to identify the key genes involved in hepatocellular
   carcinoma [[38]9]. Some other studies have also conducted analyses
   based on conservative microarray data or bulk RNA-seq data to identify
   genes potentially involved in liver cancer [[39]10,[40]11,[41]12].
   However, the findings have often lacked consistency, and analyses based
   on single-cell RNA sequencing (scRNA-seq) data have not yet been
   conducted. However, single-cell RNA-seq analysis has some innovative
   features compared to microarray or bulk RNA-seq data like
   multimodality, excess zeros, and cellular heterogeneity, which are
   important for the accurate detection of differentially expressed genes
   [[42]13,[43]14,[44]15]. Notably, scRNA-seq data analysis offers a
   promising avenue for the study of liver cancer due to its unique and
   complex characteristics, such as excess zeros, multimodality, high
   dispersion, and cellular heterogeneity. Thus, a comprehensive study
   leveraging scRNA-seq data is necessary if we are to accurately identify
   the biomarkers underlying liver cancer and select the most effective
   candidate drugs based on these biomarkers. However, the development of
   novel drugs for liver cancer remains a challenging, time-intensive, and
   costly endeavor. To address this, it is essential to explore target
   proteins, known as receptors, that play a role in liver cancer
   pathogenesis and identify small molecules (drug agents) capable of
   mitigating the disease through interaction with these receptors.
   Consequently, identifying an appropriate drug from the pool of existing
   drugs for a specific disease could significantly reduce time and costs
   compared to developing entirely new drugs.

2. Methods and Materials

2.1. Data Description

   The datasets utilized in this study include [45]GSE146409 [[46]16],
   [47]GSE202069 [[48]17], [49]GSE189935, and [50]GSE98638 [[51]18]. The
   [52]GSE98638 dataset comprises CD8+ T cells from adjacent normal liver
   tissues (NTC) and CD8+ T cells from liver tumor tissues (TTC). To
   ensure data quality, we curated the dataset so that each gene contained
   at least 10% non-zero observations. Initially, this dataset included
   23,387 genes and 1189 samples (412 NTC vs. 777 TTC). After curation,
   the dataset retained 9288 genes. Similarly, the other datasets were
   curated to ensure that each gene contained at least 10% non-zero
   observations. The case vs. control groups in these datasets were as
   follows:

   [53]GSE146409: Non-malignant liver tumor (12) vs. malignant liver tumor
   (63).

   [54]GSE202069: Non-tumor (25) vs. tumor (41).

   [55]GSE189935: Adjacent normal tissues (987) vs. tumor tissues (2199).

   Detailed descriptions of these datasets are provided in [56]Table 1.

Table 1.

   Description of curated liver cancer datasets.
   Accession No. Compared Cell Subset Number of Samples per Group Number
   of
   Features Organism Data Source
   [57]GSE98638 CD8+ T cells from adjacent normal liver tissues (NTC) vs.
   CD8+ T cells from liver tumor (TTC) 412 vs. 777 9288 Human GEO
   [58]GSE146409 Non-malignant liver tumor vs. malignant liver tumor 12
   vs. 63 15061 Human GEO
   [59]GSE202069 Non-tumor vs. tumor patient 25 vs. 41 24492 Human GEO
   [60]GSE189935 Adjacent normal tissues vs. tumor tissues 987 vs. 2199
   15782 Human GEO
   [61]Open in a new tab

2.2. Hybrid Model

2.2.1. scDEA

   scDEA [[62]19] is an ensemble learning-based differential expression
   analysis method designed to produce more accurate and stable results
   than 12 individual differential expression (DE) analysis methods.
   Lancaster’s combination method integrates the p-values generated by
   these individual methods.

   The scDEA method involves three primary steps:

   1. Normalization: Normalize the scRNA-seq count data matrix according
   to the requirements of the 12 individual methods by using the
   count-per-million (CPM) method.

   2. p-value Generation: Compute the p-values for each gene using all 12
   methods.

   3. Integration: Combine the p-values using Lancaster’s combination
   method by adjusting weights based on rank correlation coefficients to
   produce a single, combined p-value for each gene.

   This method demonstrates robustness and stability in identifying DEGs,
   particularly in scenarios with limited sample sizes, outperforming
   individual methods regarding reliability and accuracy. We used a trim
   parameter of 0.2 in scDEA as a default.

2.2.2. scHD4E

   scHD4E [[63]15] is a recently developed hybrid model (2024) that
   outperforms the existing hybrid model (scDEA) and the top-ranked four
   individual methods (ROSeq, TPMM, Seurat, Limma (voom)), as well as the
   12 individual methods. The four top-ranked methods were selected
   through a rigorous evaluation process from a pool of 44 differential
   expression (DE) analysis methods for single-cell RNA-sequencing data
   [[64]15].

   Using Lancaster’s combination method, this model integrates the
   p-values generated exclusively by these four methods. A comprehensive
   experiment to assess its performance was conducted across five
   real-world scRNA-seq datasets. It evaluated the method from multiple
   perspectives, including sample-size effects, batch effects, control of
   type I error, runtime, and gene-enrichment analysis. scHD4E
   demonstrated exceptional performance in all these aspects, surpassing
   the other methods in accuracy, robustness, and efficiency. We applied
   scHD4E to the count data matrix using a trim parameter of 0.2.

2.3. Individual Methods of Differential Expression Analysis

2.3.1. TPMM

   A logistic regression model and linear regression model were used in a
   two-part mixed model (TPMM) for scRNA-seq data analysis, where logistic
   regression was used to model the zero proportions and log-transformed
   non-zero expression values were modeled by linear regression. To
   implement this mixed model, gene-expression values for single-cell
   samples in subjects should be normalized. Covariates such as biological
   condition or clustering effects were incorporated and analyzed using
   the model. We clustered the cells in our study using the k-means
   clustering technique, and the “optimcluster” function from the SwarnSeq
   R package was used to determine the number of clusters. The zlm
   function contains different methods, but we used “bayesglm”. The
   relevant R functions are available from the following github link:
   [65]https://github.com/shilab2017/two_part_mixed_model (accessed on 5
   November 2024). The parameters of the model were estimated using
   marginal maximum likelihood estimation, and differentially expressed
   genes were identified by testing the fixed effects across biological
   condition.

2.3.2. ROSeq

   The ROSeq model was developed based on discrete generalized beta
   distribution (DGBD) and models expression ranks as robust
   representatives if transcript abundance. By applying the DGBD
   rank-order distribution and ordering the ranks with bins based on
   bin-wise cellular frequencies, the authors investigated the advantages
   of discretizing bins of gene expression vectors. The model has two
   shape parameters, estimated by MLEs for each gene, and genes were
   filtered using
   [MATH: <mrow><mrow><msup><mrow><mi
   mathvariant="normal">R</mi></mrow><mrow><mn>2</mn></mrow></msup></mrow>
   </mrow> :MATH]
   as a qualification criterion. To identify the DE genes, the Wald-type
   test was applied. The ROSeq R package is available from Github at the
   following link: [66]https://github.com/krishan57gupta/ROSeq (accessed
   on 6 November 2024).

2.3.3. Limma (Voom)

   Limma (a linear model) was developed primarily for analyzing microarray
   data and conducting DE analysis. To analyze high-dimensional data like
   RNA-seq, it aggregates a set of statistical principles. Differential
   gene expression is analyzed by fitting a linear model to each row of
   the expression matrix, capturing various levels of variability between
   samples, accounting for complex experimental designs, and borrowing
   strength from gene-wise models. In our study, the following procedures
   were used: (i) to design a suitable biological condition for limma, the
   “model.matrix” R function was used; (ii) using “calcNormFactors” and
   “voom”, the count expression matrix was normalized by a factor derived
   from library size; (iii) the “lmFit” R function was used to fit the
   multiple linear regression by applying generalized least squares; (iv)
   to identify the DE genes for the multiple linear regression model, the
   empirical Bayes statistic was used by implementing the “ebayes”
   function. The Limma R packages included all the above functions.

2.3.4. Seurat

   Seurat was developed for quality control, exploration, and analysis of
   scRNA-seq data by multiple DE analysis methods. In our study, we used
   SeuratBimon, a model with two states. The first state represents cases
   where the gene-expression level is not zero and the gene is expressed
   in a single cell. The second state indicates that the gene-expression
   level is zero, meaning that the gene is not expressed. In this model,
   Bernoulli distributions are used to model the zero elements and
   log-normal distributions are used to model the nonzero elements. To
   detect the DE genes, a combined likelihood ratio test was performed.
   The Seurat R package is available from the following link:
   [67]https://github.com/satijalab/seurat (accessed on 10 November 2024).

2.4. Overview of the Analysis Process

   We utilized four scRNA-seq datasets from liver cancer samples with the
   following accession numbers: A = [68]GSE98638, B = [69]GSE146409, C =
   [70]GSE202069, and D = [71]GSE189935. Two of these
   datasets—[72]GSE146409 (case = 12 vs. control = 63) and [73]GSE202069
   (case = 25 vs. control = 41)—have smaller sample sizes. We applied both
   the hybrid models and the top-performing individual methods to all four
   datasets to detect differentially expressed genes (DEGs) with adjusted
   p-values ≤ 0.05. We then identified the intersecting genes between
   scHD4E and scDEA for all datasets (scDEA ∩ scHD4E). Additionally, we
   identified the common genes identified by six methods, including the
   hybrid and individual top four methods (scDEA ∩ scHD4E ∩ Seurat ∩ Limma
   ∩ TPMM ∩ ROSeq).

   Given that we used four datasets, it was possible to identify only a
   limited number of intersecting genes if we considered the intersection
   across all four datasets. To address this, we instead focused on the
   common genes across three datasets for each case, using four
   combinations of datasets, as follows:

   A ∩ B ∩ C

   A ∩ B ∩ D

   B ∩ D ∩ C

   A ∩ D ∩ C

   Similarly, for the six methods, we took the common genes from the same
   four combinations of datasets. For each gene set, we selected 15 hub
   genes using the STRING database and Cytoscape software (version
   3.10.1). This approach yielded 62 genes from the union of all hub gene
   sets. These genes were divided into two clusters using STRING and
   Cytoscape. The first cluster contained 35 genes, and the second cluster
   contained 27 genes. We then selected 25 and 20 hub-of-hub genes
   (hHubGs) using the cytoHubba plugin for Cytoscape. The entire procedure
   is illustrated in [74]Figure 1.

Figure 1.

   [75]Figure 1
   [76]Open in a new tab

   Flow diagram of the entire analysis procedure used to identify drugs
   for suggestion.

2.5. Construction of PPI Network for DEGs

   To identify hub genes (HubGs), constructing a
   protein−protein-interaction (PPI) network is essential. Understanding
   the interactions between proteins in larger protein groups is crucial
   for functional analysis [[77]20]. In a PPI network, vertices represent
   proteins and the interactions between proteins are depicted as edges.
   The PPI network was constructed using the online STRING database
   [[78]21]. Proteins interacting with many other proteins with a high
   degree of connectivity are considered to play a central regulatory role
   and are termed regulatory hubs [[79]22]. For the STRING PPI analysis,
   we set the confidence score threshold to ≥0.40.

   The app cytoHubba in Cytoscape was employed to identify the hub genes
   by ranking the nodes in the PPI network [[80]23,[81]24]. We selected
   the 15 top-ranked genes for each set using cytoHubba from the PPI
   network of intersecting common genes.

2.6. Gene Ontology and Pathway Enrichment Analysis

   To perform gene ontology (GO) and pathway-enrichment analysis of
   hHub-DEGs, we utilized four online databases: DAVID [[82]25], STRING
   [[83]18], Enrichr [[84]26], and WebGestalt [[85]27]. This analysis
   provided valuable insights into the functional roles of the hub-DEGs.
   The top hHub-DEGs identified in our study were used to search these
   databases to analyze the associated biochemistry GO terms, which
   included Biological Process, Cellular Component, and Molecular Function
   [[86]28], as well as pathways potentially linked to the progression and
   development of liver cancer.

   We selected the top-ranked biological processes, molecular functions,
   and cellular components commonly identified by at least two of the four
   online analysis databases. In most cases, the p-values were extremely
   small (p < 0.00001), indicating high statistical significance.
   Similarly, we conducted pathway enrichment analysis to explore pathways
   associated with the prognosis and development of liver cancer. We
   considered widely used pathway databases such as KEGG, WikiPathways,
   and Reactome, which are commonly employed to detect significantly
   enriched terms and pathways related to the hHubGs.

2.7. Validation of Differential Expression Analysis of hHub Genes

   The hub genes identified in our study exhibit differential expression
   based on the differential-expression analysis performed using methods
   such as scHD4E, scDEA, Seurat, and others across all of the liver
   cancer datasets. However, we aimed further to validate the
   differentiability of these genes for liver cancer using a web-based
   database, the Gene Expression Profiling Interactive Analysis (GEPIA2)
   online platform ([87]http://gepia2.cancer-pku.cn) (accessed on 20
   November 2024). We assessed the differentiability of the identified
   genes in liver cancer by constructing boxplots for each gene using the
   GEPIA2 database. This database relies on data from the GTEx project and
   TCGA, which includes thousands of samples from various cancer types.

2.8. Liver Cancer Stage-Wise Relation to Hub Genes

   To illustrate the effects of each hub gene on the development and
   prognosis across different stages of liver cancer, we constructed stage
   plots using the Gene Expression Profiling Interactive Analysis (GEPIA2)
   online platform ([88]http://gepia2.cancer-pku.cn) (accessed on 20
   November 2024) [[89]29]. The stage plots revealed variations in gene
   expression across the different stages of liver cancer. If the p-values
   for these plots are ≤0.05, the data suggest that the genes are
   significantly associated with the progression of liver cancer at the
   respective stage.

2.9. Survival Analysis of hHub Genes for Liver Cancer

   Survival analysis was conducted using three web-based databases, GEPIA2
   [[90]29], Kaplan-Meier (KM) Plotter [[91]30], and UALCAN [[92]31], to
   assess the risk associated with low and high expression of molecular
   signatures (genes) and evaluate the impact of hub genes on the
   prognosis and development of liver cancer. The use of these three
   databases allowed for more robust validation of the association between
   these genes and the survival of liver cancer patients. Overall and
   relapse-free survival analyses were performed using both GEPIA2 and KM
   Plotter. The survival plots displayed the hazard ratio (HR) with a 95%
   confidence interval and log-rank p-values for both GEPIA2 and KM
   Plotter. The UALCAN database focused on the effects of gene expression
   on patient survival, providing p-values. A p-value ≤ 0.05 was
   considered statistically significant.

2.10. Molecular Docking

   Molecular docking analysis was conducted to propose validated,
   efficient candidate drugs to treat liver cancer by targeting the
   identified receptor proteins. The target proteins for the drugs were
   selected based on the hub-of-hub (hHub) genes, and approximately 300
   drugs were considered as meta-drug agents. Additionally, 3D structures
   of both the meta-drug agents and receptor proteins were required for
   molecular docking analysis. We retrieved the 3D structures of the
   target proteins from the Protein Data Bank (PDB) [[93]32] and the 3D
   structures of the meta-drug agents from the PubChem database [[94]33].

   Water molecules and ligand heteroatoms were removed to prepare the
   receptor proteins for molecular docking, and polar hydrogens were added
   [[95]34]. A grid box was constructed to encompass the entire surface of
   the proteins. Using AutoDockTools, ligands were prepared by defining
   torsion trees and specifying the rotatable and non-rotatable bonds. The
   binding affinities between the drug agents and target proteins were
   computed using AutoDock Vina, with an exhaustiveness parameter set to
   10.

3. Results

3.1. Identification of Common Differentially Expressed Genes

   We applied all six methods to four real-world scRNA-seq liver cancer
   datasets and identified differentially expressed genes (DEGs) based on
   a significance cutoff of 0.05. Subsequently, we identified the common
   DEGs by taking the intersection of results from the two-hybrid methods,
   scDEA and scHD4E, and from all six methods (two hybrid and four
   individual). We also identified the common genes by considering all
   possible combinations of three datasets out of four for both the
   two-method and six-method analyses. The total number of combinations
   was eight.

   The procedure is described in the methodology and illustrated in
   [96]Figure 1. The numbers of common genes identified for each
   combination of datasets between scDEA and scHD4E were as follows: A ∩ B
   ∩ C = 165, A ∩ B ∩ D = 226, B ∩ D ∩ C = 1570, and A ∩ D ∩ C = 343.
   Similarly, for the intersection of all six methods, we identified 52,
   114, 534, and 180 common genes for the same dataset combinations. A
   significant number of genes were eliminated from the analysis when only
   the intersection of six methods was used because some of the datasets
   had a small sample size. The results are presented in [97]Figure 1.

3.2. Identification of Hub Genes Using STRING and Cytoscape

   We constructed the protein−protein-interaction (PPI) network for each
   of the eight sets of common genes identified in the previous subsection
   using web-based tools. The PPI networks were organized and visualized
   using cytoHubba in Cytoscape, which allowed us to rank the important
   genes. We selected the top 15 hub genes for each of the eight gene
   sets. The results are presented in [98]Table S4.

   From the table, it is evident that several common genes are shared
   among the sets of hub genes. To ensure that none of the important genes
   were excluded from the study, we took the union of the sets of hub
   genes. This approach resulted in the identification of 62 genes, which
   were the focus of our subsequent analysis.

3.3. Protein−Protein-Interaction (PPI) Network and Selection of the Most
Important Genes

   The PPI network was constructed using the STRING database tools and
   Cytoscape software.

   The resulting network graph indicates that the genes are divided into
   two distinct clusters: cluster one and cluster two. Cluster one
   contains 35 genes, while cluster two consists of 27 genes ([99]Figure
   2). We selected the 25 hub genes for subsequent analysis using
   cytoHubba. Additionally, we identified 20 essential hub genes from
   cluster two. The 25 + 20 top-ranked genes are listed in [100]Table S5.

Figure 2.

   [101]Figure 2
   [102]Open in a new tab

   Protein–protein-interaction (PPI) network for 62 important genes, which
   are divided into two clusters. A deeper orange color indicates the
   genes have more interactions with the other genes.

3.4. Checking Differentiability of hHub Genes Using GEPIA2

   We identified 25 hub genes from Cluster 1 and 20 hub genes from Cluster
   2. The differential expression of these hub genes was analyzed in the
   context of liver cancer to determine their potential roles in
   prognosis, diagnosis, and disease progression.

   Genes exhibiting significant differential expression are hypothesized
   to influence the prognosis, diagnostic accuracy, and development of
   liver cancer. To assess differential expression, boxplots for all 45
   hub genes were generated using the GEPIA online database ([103]Figure 3
   and [104]Figures S1–S3). The results demonstrated that all 25 genes
   from Cluster 1 were significantly differentially expressed between
   liver cancer and control samples, thereby meeting the differential
   expression criteria established through our statistical analyses. A
   similar conclusion was reached for the 20 genes in Cluster 2.

Figure 3.

   [105]Figure 3
   [106]Open in a new tab

   Boxplots of genes from the liver cancer case vs. control groups to show
   the differentiability of the following genes: BUB1, CDK1, CENPE, CENPF,
   CENPK, CENPM, EZH2, HELLS, KNTC1, MCM3, MCM5, MCM6. * indicates that a
   GENE is differentially expressed.

3.5. Impact of Hub Genes on the Stage of Liver Cancer

   Stage plots were generated for each hub gene in Clusters 1 and 2. All
   figures for Cluster 1 exhibited p-values ≤ 0.05, indicating that the
   expression of these hub genes significantly influences the progression
   of liver cancer across different stages ([107]Figure 4 and [108]Figures
   S4–S6). In Cluster 2, most genes were shown to have a significant
   impact, except for A2M (p = 0.305), which showed no notable association
   with liver cancer stage progression ([109]Figures S7–S9).

Figure 4.

   [110]Figure 4
   [111]Open in a new tab

   Violin plots of different stages of liver cancer representing the
   association of progression with the following genes: BUB1, CDK1, CENPE,
   CENPF, CENPK, CENPM.

3.6. GO and Pathway Analysis of Hub Genes

   For gene ontology (GO) and pathway analysis, we utilized four web-based
   tools: DAVID, STRING, Enrichr, and WebGestalt. The top-ranked GO terms
   were selected when at least two tools identified them as significant
   based on p-values. [112]Table 2 presents the gene ontology terms for
   Biological Processes (BPs) and KEGG pathways, along with the
   corresponding hub genes implicated in these processes. These biological
   processes and pathways are associated with liver cancer prognosis,
   diagnosis, and development, as supported by references listed in