Abstract Liver cancer is one of the most common malignancies and the second leading cause of cancer-related deaths worldwide, particularly in developing countries, where it poses a significant financial burden. Early detection and timely treatment remain challenging due to the complex mechanisms underlying the initiation and progression of liver cancer. This study aims to uncover key genomic features, analyze their functional roles, and propose potential therapeutic drugs identified through molecular docking, utilizing single-cell RNA sequencing (scRNA-seq) data from liver cancer studies. We applied two advanced hybrid methods known for their robust identification of differentially expressed genes (DEGs) regardless of sample size, along with four top-performing individual methods. These approaches were used to analyze four scRNA-seq datasets, leading to the identification of essential DEGs. Through a protein−protein-interaction (PPI) network, we identified 25 hub-of-hub genes (hHubGs) and 20 additional hHubGs from two naturally occurring gene clusters, ultimately validating a total of 36 hHubGs. Functional, pathway, and survival analyses revealed that these hHubGs are strongly linked to liver cancer. Based on molecular docking and binding-affinity scores with 36 receptor proteins, we proposed 10 potential therapeutic drugs, which we selected from a pool of 300 cancer meta-drugs. The choice of these drugs was further validated using 14 top-ranked published receptor proteins from a set of 42. The proposed candidates include Adozelesin, Tivozanib, NVP-BHG712, Nilotinib, Entrectinib, Irinotecan, Ponatinib, and YM201636. This study provides critical insights into the genomic landscape of liver cancer and identifies promising therapeutic candidates, serving as a valuable resource for advancing liver cancer research and treatment strategies. 1. Introduction Liver cancer originates in the liver as a malignant tumor and has one of the highest incidence rates among cancers globally [[28]1]. According to the American Cancer Society, liver cancer accounts for over 700,000 deaths annually worldwide, making it the second leading cause of cancer-related mortality. The two primary types of liver cancer are hepatocellular carcinoma (HCC), the most prevalent form [[29]2], and cholangiocarcinoma. Liver cancer incidence and fatality rates are notably higher in developing nations [[30]3]. Major risk factors for liver cancer include infection with hepatitis B virus or hepatitis C virus, hepatic cirrhosis, and alcoholic liver disease [[31]4,[32]5,[33]6]. Evidence from various studies suggests that the molecular pathogenesis of liver cancer, including HCC, is closely linked to genetic predispositions and environmental influences, such as oncogene activation, gene mutations, and the inactivation of tumor suppressor genes [[34]7,[35]8]. Despite these findings, the mechanisms underlying the progression and development of liver cancer remain incompletely understood. Moreover, one of the leading causes of liver cancer mortality is its asymptomatic nature during the early stages, which makes early diagnosis challenging. Consequently, the disease is difficult to treat, has a poor prognosis, and is associated with a high recurrence rate [[36]2]. Therefore, identifying precise biomarkers and accurately predicting liver cancer progression holds significant potential to mitigate its severity and improve clinical outcomes. The identification of key therapeutic targets offers a viable approach to effectively addressing diseases. Numerous studies have attempted to identify critical genes associated with liver cancer. To identify the key genes and pathways involved in hepatocellular carcinoma, Wu et al. conducted a bioinformatics analysis using microarray data and conventional methods [[37]2]. Li et al. used the classical t-test and microarray data to identify the key genes involved in hepatocellular carcinoma [[38]9]. Some other studies have also conducted analyses based on conservative microarray data or bulk RNA-seq data to identify genes potentially involved in liver cancer [[39]10,[40]11,[41]12]. However, the findings have often lacked consistency, and analyses based on single-cell RNA sequencing (scRNA-seq) data have not yet been conducted. However, single-cell RNA-seq analysis has some innovative features compared to microarray or bulk RNA-seq data like multimodality, excess zeros, and cellular heterogeneity, which are important for the accurate detection of differentially expressed genes [[42]13,[43]14,[44]15]. Notably, scRNA-seq data analysis offers a promising avenue for the study of liver cancer due to its unique and complex characteristics, such as excess zeros, multimodality, high dispersion, and cellular heterogeneity. Thus, a comprehensive study leveraging scRNA-seq data is necessary if we are to accurately identify the biomarkers underlying liver cancer and select the most effective candidate drugs based on these biomarkers. However, the development of novel drugs for liver cancer remains a challenging, time-intensive, and costly endeavor. To address this, it is essential to explore target proteins, known as receptors, that play a role in liver cancer pathogenesis and identify small molecules (drug agents) capable of mitigating the disease through interaction with these receptors. Consequently, identifying an appropriate drug from the pool of existing drugs for a specific disease could significantly reduce time and costs compared to developing entirely new drugs. 2. Methods and Materials 2.1. Data Description The datasets utilized in this study include [45]GSE146409 [[46]16], [47]GSE202069 [[48]17], [49]GSE189935, and [50]GSE98638 [[51]18]. The [52]GSE98638 dataset comprises CD8+ T cells from adjacent normal liver tissues (NTC) and CD8+ T cells from liver tumor tissues (TTC). To ensure data quality, we curated the dataset so that each gene contained at least 10% non-zero observations. Initially, this dataset included 23,387 genes and 1189 samples (412 NTC vs. 777 TTC). After curation, the dataset retained 9288 genes. Similarly, the other datasets were curated to ensure that each gene contained at least 10% non-zero observations. The case vs. control groups in these datasets were as follows: [53]GSE146409: Non-malignant liver tumor (12) vs. malignant liver tumor (63). [54]GSE202069: Non-tumor (25) vs. tumor (41). [55]GSE189935: Adjacent normal tissues (987) vs. tumor tissues (2199). Detailed descriptions of these datasets are provided in [56]Table 1. Table 1. Description of curated liver cancer datasets. Accession No. Compared Cell Subset Number of Samples per Group Number of Features Organism Data Source [57]GSE98638 CD8+ T cells from adjacent normal liver tissues (NTC) vs. CD8+ T cells from liver tumor (TTC) 412 vs. 777 9288 Human GEO [58]GSE146409 Non-malignant liver tumor vs. malignant liver tumor 12 vs. 63 15061 Human GEO [59]GSE202069 Non-tumor vs. tumor patient 25 vs. 41 24492 Human GEO [60]GSE189935 Adjacent normal tissues vs. tumor tissues 987 vs. 2199 15782 Human GEO [61]Open in a new tab 2.2. Hybrid Model 2.2.1. scDEA scDEA [[62]19] is an ensemble learning-based differential expression analysis method designed to produce more accurate and stable results than 12 individual differential expression (DE) analysis methods. Lancaster’s combination method integrates the p-values generated by these individual methods. The scDEA method involves three primary steps: 1. Normalization: Normalize the scRNA-seq count data matrix according to the requirements of the 12 individual methods by using the count-per-million (CPM) method. 2. p-value Generation: Compute the p-values for each gene using all 12 methods. 3. Integration: Combine the p-values using Lancaster’s combination method by adjusting weights based on rank correlation coefficients to produce a single, combined p-value for each gene. This method demonstrates robustness and stability in identifying DEGs, particularly in scenarios with limited sample sizes, outperforming individual methods regarding reliability and accuracy. We used a trim parameter of 0.2 in scDEA as a default. 2.2.2. scHD4E scHD4E [[63]15] is a recently developed hybrid model (2024) that outperforms the existing hybrid model (scDEA) and the top-ranked four individual methods (ROSeq, TPMM, Seurat, Limma (voom)), as well as the 12 individual methods. The four top-ranked methods were selected through a rigorous evaluation process from a pool of 44 differential expression (DE) analysis methods for single-cell RNA-sequencing data [[64]15]. Using Lancaster’s combination method, this model integrates the p-values generated exclusively by these four methods. A comprehensive experiment to assess its performance was conducted across five real-world scRNA-seq datasets. It evaluated the method from multiple perspectives, including sample-size effects, batch effects, control of type I error, runtime, and gene-enrichment analysis. scHD4E demonstrated exceptional performance in all these aspects, surpassing the other methods in accuracy, robustness, and efficiency. We applied scHD4E to the count data matrix using a trim parameter of 0.2. 2.3. Individual Methods of Differential Expression Analysis 2.3.1. TPMM A logistic regression model and linear regression model were used in a two-part mixed model (TPMM) for scRNA-seq data analysis, where logistic regression was used to model the zero proportions and log-transformed non-zero expression values were modeled by linear regression. To implement this mixed model, gene-expression values for single-cell samples in subjects should be normalized. Covariates such as biological condition or clustering effects were incorporated and analyzed using the model. We clustered the cells in our study using the k-means clustering technique, and the “optimcluster” function from the SwarnSeq R package was used to determine the number of clusters. The zlm function contains different methods, but we used “bayesglm”. The relevant R functions are available from the following github link: [65]https://github.com/shilab2017/two_part_mixed_model (accessed on 5 November 2024). The parameters of the model were estimated using marginal maximum likelihood estimation, and differentially expressed genes were identified by testing the fixed effects across biological condition. 2.3.2. ROSeq The ROSeq model was developed based on discrete generalized beta distribution (DGBD) and models expression ranks as robust representatives if transcript abundance. By applying the DGBD rank-order distribution and ordering the ranks with bins based on bin-wise cellular frequencies, the authors investigated the advantages of discretizing bins of gene expression vectors. The model has two shape parameters, estimated by MLEs for each gene, and genes were filtered using [MATH: R2 :MATH] as a qualification criterion. To identify the DE genes, the Wald-type test was applied. The ROSeq R package is available from Github at the following link: [66]https://github.com/krishan57gupta/ROSeq (accessed on 6 November 2024). 2.3.3. Limma (Voom) Limma (a linear model) was developed primarily for analyzing microarray data and conducting DE analysis. To analyze high-dimensional data like RNA-seq, it aggregates a set of statistical principles. Differential gene expression is analyzed by fitting a linear model to each row of the expression matrix, capturing various levels of variability between samples, accounting for complex experimental designs, and borrowing strength from gene-wise models. In our study, the following procedures were used: (i) to design a suitable biological condition for limma, the “model.matrix” R function was used; (ii) using “calcNormFactors” and “voom”, the count expression matrix was normalized by a factor derived from library size; (iii) the “lmFit” R function was used to fit the multiple linear regression by applying generalized least squares; (iv) to identify the DE genes for the multiple linear regression model, the empirical Bayes statistic was used by implementing the “ebayes” function. The Limma R packages included all the above functions. 2.3.4. Seurat Seurat was developed for quality control, exploration, and analysis of scRNA-seq data by multiple DE analysis methods. In our study, we used SeuratBimon, a model with two states. The first state represents cases where the gene-expression level is not zero and the gene is expressed in a single cell. The second state indicates that the gene-expression level is zero, meaning that the gene is not expressed. In this model, Bernoulli distributions are used to model the zero elements and log-normal distributions are used to model the nonzero elements. To detect the DE genes, a combined likelihood ratio test was performed. The Seurat R package is available from the following link: [67]https://github.com/satijalab/seurat (accessed on 10 November 2024). 2.4. Overview of the Analysis Process We utilized four scRNA-seq datasets from liver cancer samples with the following accession numbers: A = [68]GSE98638, B = [69]GSE146409, C = [70]GSE202069, and D = [71]GSE189935. Two of these datasets—[72]GSE146409 (case = 12 vs. control = 63) and [73]GSE202069 (case = 25 vs. control = 41)—have smaller sample sizes. We applied both the hybrid models and the top-performing individual methods to all four datasets to detect differentially expressed genes (DEGs) with adjusted p-values ≤ 0.05. We then identified the intersecting genes between scHD4E and scDEA for all datasets (scDEA ∩ scHD4E). Additionally, we identified the common genes identified by six methods, including the hybrid and individual top four methods (scDEA ∩ scHD4E ∩ Seurat ∩ Limma ∩ TPMM ∩ ROSeq). Given that we used four datasets, it was possible to identify only a limited number of intersecting genes if we considered the intersection across all four datasets. To address this, we instead focused on the common genes across three datasets for each case, using four combinations of datasets, as follows: A ∩ B ∩ C A ∩ B ∩ D B ∩ D ∩ C A ∩ D ∩ C Similarly, for the six methods, we took the common genes from the same four combinations of datasets. For each gene set, we selected 15 hub genes using the STRING database and Cytoscape software (version 3.10.1). This approach yielded 62 genes from the union of all hub gene sets. These genes were divided into two clusters using STRING and Cytoscape. The first cluster contained 35 genes, and the second cluster contained 27 genes. We then selected 25 and 20 hub-of-hub genes (hHubGs) using the cytoHubba plugin for Cytoscape. The entire procedure is illustrated in [74]Figure 1. Figure 1. [75]Figure 1 [76]Open in a new tab Flow diagram of the entire analysis procedure used to identify drugs for suggestion. 2.5. Construction of PPI Network for DEGs To identify hub genes (HubGs), constructing a protein−protein-interaction (PPI) network is essential. Understanding the interactions between proteins in larger protein groups is crucial for functional analysis [[77]20]. In a PPI network, vertices represent proteins and the interactions between proteins are depicted as edges. The PPI network was constructed using the online STRING database [[78]21]. Proteins interacting with many other proteins with a high degree of connectivity are considered to play a central regulatory role and are termed regulatory hubs [[79]22]. For the STRING PPI analysis, we set the confidence score threshold to ≥0.40. The app cytoHubba in Cytoscape was employed to identify the hub genes by ranking the nodes in the PPI network [[80]23,[81]24]. We selected the 15 top-ranked genes for each set using cytoHubba from the PPI network of intersecting common genes. 2.6. Gene Ontology and Pathway Enrichment Analysis To perform gene ontology (GO) and pathway-enrichment analysis of hHub-DEGs, we utilized four online databases: DAVID [[82]25], STRING [[83]18], Enrichr [[84]26], and WebGestalt [[85]27]. This analysis provided valuable insights into the functional roles of the hub-DEGs. The top hHub-DEGs identified in our study were used to search these databases to analyze the associated biochemistry GO terms, which included Biological Process, Cellular Component, and Molecular Function [[86]28], as well as pathways potentially linked to the progression and development of liver cancer. We selected the top-ranked biological processes, molecular functions, and cellular components commonly identified by at least two of the four online analysis databases. In most cases, the p-values were extremely small (p < 0.00001), indicating high statistical significance. Similarly, we conducted pathway enrichment analysis to explore pathways associated with the prognosis and development of liver cancer. We considered widely used pathway databases such as KEGG, WikiPathways, and Reactome, which are commonly employed to detect significantly enriched terms and pathways related to the hHubGs. 2.7. Validation of Differential Expression Analysis of hHub Genes The hub genes identified in our study exhibit differential expression based on the differential-expression analysis performed using methods such as scHD4E, scDEA, Seurat, and others across all of the liver cancer datasets. However, we aimed further to validate the differentiability of these genes for liver cancer using a web-based database, the Gene Expression Profiling Interactive Analysis (GEPIA2) online platform ([87]http://gepia2.cancer-pku.cn) (accessed on 20 November 2024). We assessed the differentiability of the identified genes in liver cancer by constructing boxplots for each gene using the GEPIA2 database. This database relies on data from the GTEx project and TCGA, which includes thousands of samples from various cancer types. 2.8. Liver Cancer Stage-Wise Relation to Hub Genes To illustrate the effects of each hub gene on the development and prognosis across different stages of liver cancer, we constructed stage plots using the Gene Expression Profiling Interactive Analysis (GEPIA2) online platform ([88]http://gepia2.cancer-pku.cn) (accessed on 20 November 2024) [[89]29]. The stage plots revealed variations in gene expression across the different stages of liver cancer. If the p-values for these plots are ≤0.05, the data suggest that the genes are significantly associated with the progression of liver cancer at the respective stage. 2.9. Survival Analysis of hHub Genes for Liver Cancer Survival analysis was conducted using three web-based databases, GEPIA2 [[90]29], Kaplan-Meier (KM) Plotter [[91]30], and UALCAN [[92]31], to assess the risk associated with low and high expression of molecular signatures (genes) and evaluate the impact of hub genes on the prognosis and development of liver cancer. The use of these three databases allowed for more robust validation of the association between these genes and the survival of liver cancer patients. Overall and relapse-free survival analyses were performed using both GEPIA2 and KM Plotter. The survival plots displayed the hazard ratio (HR) with a 95% confidence interval and log-rank p-values for both GEPIA2 and KM Plotter. The UALCAN database focused on the effects of gene expression on patient survival, providing p-values. A p-value ≤ 0.05 was considered statistically significant. 2.10. Molecular Docking Molecular docking analysis was conducted to propose validated, efficient candidate drugs to treat liver cancer by targeting the identified receptor proteins. The target proteins for the drugs were selected based on the hub-of-hub (hHub) genes, and approximately 300 drugs were considered as meta-drug agents. Additionally, 3D structures of both the meta-drug agents and receptor proteins were required for molecular docking analysis. We retrieved the 3D structures of the target proteins from the Protein Data Bank (PDB) [[93]32] and the 3D structures of the meta-drug agents from the PubChem database [[94]33]. Water molecules and ligand heteroatoms were removed to prepare the receptor proteins for molecular docking, and polar hydrogens were added [[95]34]. A grid box was constructed to encompass the entire surface of the proteins. Using AutoDockTools, ligands were prepared by defining torsion trees and specifying the rotatable and non-rotatable bonds. The binding affinities between the drug agents and target proteins were computed using AutoDock Vina, with an exhaustiveness parameter set to 10. 3. Results 3.1. Identification of Common Differentially Expressed Genes We applied all six methods to four real-world scRNA-seq liver cancer datasets and identified differentially expressed genes (DEGs) based on a significance cutoff of 0.05. Subsequently, we identified the common DEGs by taking the intersection of results from the two-hybrid methods, scDEA and scHD4E, and from all six methods (two hybrid and four individual). We also identified the common genes by considering all possible combinations of three datasets out of four for both the two-method and six-method analyses. The total number of combinations was eight. The procedure is described in the methodology and illustrated in [96]Figure 1. The numbers of common genes identified for each combination of datasets between scDEA and scHD4E were as follows: A ∩ B ∩ C = 165, A ∩ B ∩ D = 226, B ∩ D ∩ C = 1570, and A ∩ D ∩ C = 343. Similarly, for the intersection of all six methods, we identified 52, 114, 534, and 180 common genes for the same dataset combinations. A significant number of genes were eliminated from the analysis when only the intersection of six methods was used because some of the datasets had a small sample size. The results are presented in [97]Figure 1. 3.2. Identification of Hub Genes Using STRING and Cytoscape We constructed the protein−protein-interaction (PPI) network for each of the eight sets of common genes identified in the previous subsection using web-based tools. The PPI networks were organized and visualized using cytoHubba in Cytoscape, which allowed us to rank the important genes. We selected the top 15 hub genes for each of the eight gene sets. The results are presented in [98]Table S4. From the table, it is evident that several common genes are shared among the sets of hub genes. To ensure that none of the important genes were excluded from the study, we took the union of the sets of hub genes. This approach resulted in the identification of 62 genes, which were the focus of our subsequent analysis. 3.3. Protein−Protein-Interaction (PPI) Network and Selection of the Most Important Genes The PPI network was constructed using the STRING database tools and Cytoscape software. The resulting network graph indicates that the genes are divided into two distinct clusters: cluster one and cluster two. Cluster one contains 35 genes, while cluster two consists of 27 genes ([99]Figure 2). We selected the 25 hub genes for subsequent analysis using cytoHubba. Additionally, we identified 20 essential hub genes from cluster two. The 25 + 20 top-ranked genes are listed in [100]Table S5. Figure 2. [101]Figure 2 [102]Open in a new tab Protein–protein-interaction (PPI) network for 62 important genes, which are divided into two clusters. A deeper orange color indicates the genes have more interactions with the other genes. 3.4. Checking Differentiability of hHub Genes Using GEPIA2 We identified 25 hub genes from Cluster 1 and 20 hub genes from Cluster 2. The differential expression of these hub genes was analyzed in the context of liver cancer to determine their potential roles in prognosis, diagnosis, and disease progression. Genes exhibiting significant differential expression are hypothesized to influence the prognosis, diagnostic accuracy, and development of liver cancer. To assess differential expression, boxplots for all 45 hub genes were generated using the GEPIA online database ([103]Figure 3 and [104]Figures S1–S3). The results demonstrated that all 25 genes from Cluster 1 were significantly differentially expressed between liver cancer and control samples, thereby meeting the differential expression criteria established through our statistical analyses. A similar conclusion was reached for the 20 genes in Cluster 2. Figure 3. [105]Figure 3 [106]Open in a new tab Boxplots of genes from the liver cancer case vs. control groups to show the differentiability of the following genes: BUB1, CDK1, CENPE, CENPF, CENPK, CENPM, EZH2, HELLS, KNTC1, MCM3, MCM5, MCM6. * indicates that a GENE is differentially expressed. 3.5. Impact of Hub Genes on the Stage of Liver Cancer Stage plots were generated for each hub gene in Clusters 1 and 2. All figures for Cluster 1 exhibited p-values ≤ 0.05, indicating that the expression of these hub genes significantly influences the progression of liver cancer across different stages ([107]Figure 4 and [108]Figures S4–S6). In Cluster 2, most genes were shown to have a significant impact, except for A2M (p = 0.305), which showed no notable association with liver cancer stage progression ([109]Figures S7–S9). Figure 4. [110]Figure 4 [111]Open in a new tab Violin plots of different stages of liver cancer representing the association of progression with the following genes: BUB1, CDK1, CENPE, CENPF, CENPK, CENPM. 3.6. GO and Pathway Analysis of Hub Genes For gene ontology (GO) and pathway analysis, we utilized four web-based tools: DAVID, STRING, Enrichr, and WebGestalt. The top-ranked GO terms were selected when at least two tools identified them as significant based on p-values. [112]Table 2 presents the gene ontology terms for Biological Processes (BPs) and KEGG pathways, along with the corresponding hub genes implicated in these processes. These biological processes and pathways are associated with liver cancer prognosis, diagnosis, and development, as supported by references listed in