Abstract This study aimed to select the feature genes of hepatocellular carcinoma (HCC) with the Fisher score algorithm and to identify hub genes with the Maximal Clique Centrality (MCC) algorithm. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis was performed to examine the enrichment of terms. Gene set enrichment analysis (GSEA) was used to identify the classes of genes that are overrepresented. Following the construction of a protein-protein interaction network with the feature genes, hub genes were identified with the MCC algorithm. The Kaplan–Meier plotter was utilized to assess the prognosis of patients based on expression of the hub genes. The feature genes were closely associated with cancer and the cell cycle, as revealed by GO, KEGG and GSEA enrichment analyses. Survival analysis showed that the overexpression of the Fisher score–selected hub genes was associated with decreased survival time (P < 0.05). Weighted gene co-expression network analysis (WGCNA), Lasso, ReliefF and random forest were used for comparison with the Fisher score algorithm. The comparison among these approaches showed that the Fisher score algorithm is superior to the Lasso and ReliefF algorithms in terms of hub gene identification and has similar performance to the WGCNA and random forest algorithms. Our results demonstrated that the Fisher score followed by the application of the MCC algorithm can accurately identify hub genes in HCC. Subject terms: Microarray analysis, Cancer, Targeted therapies, Predictive markers, Applied mathematics Introduction Gene microarray technology, a prospective tool for the classification, diagnosis and aggressiveness prediction of cancer, provides valuable information in understanding the underlying mechanism of multiple cancers^[26]1–[27]4. The data obtained from microarray experiments, such as leukaemia datasets and breast cancer datasets^[28]5,[29]6, are often used for feature selection in machine learning. In comparison with a large number of genes, the training datasets usually have a very small sample size for classification. The limitations of training data constitute a great challenge to certain classification methodologies^[30]7. Gene expression datasets with a large number of variables and only a small number of samples are normally referred to as having the curse of dimensionality in feature selection^[31]8. The prediction performance of feature selection highly depends on the quality and the size of the gene dataset^[32]9. However, some of the gene expression datasets, such as the Wisconsin breast cancer database^[33]10, were constructed approximately thirty years ago and may have defects due to the limitation of instrument performance at that time. In addition, less information in these older datasets may lead to poor feature selection performance. Therefore, the establishment of updated datasets is necessary for the development of feature selection. A large number of microarray gene expression datasets are available in the Gene Expression Omnibus (GEO) database and are updated regularly. GEO microarray datasets are also normally characterized by a small number of samples with high dimensionality. The integration of independent GEO datasets published in recent years can significantly enlarge the sample size, which may be helpful to combat the curse of dimensionality. Feature selection, one of the vital and repeatedly used machine learning techniques in data mining, is the selection of a subset of the most pertinent features for use in the process of model construction^[34]11. In other words, feature selection is generally regarded as an optimization problem with the purpose of maximizing the classification accuracy with relatively fewer features^[35]12,[36]13. To achieve this goal, irrelevant and redundant features of raw datasets are usually eliminated with the application of feature selection^[37]14. Since the 1970s, feature selection techniques have been widely employed in a variety of fields, such as protein structural class prediction^[38]15, classification of traced neurons^[39]16, text classification^[40]17, acoustic event recognition^[41]18 and gene expression data classification^[42]19. Feature selection techniques are also used to select marker genes of cancer that affect the classification accuracy^[43]7. Despite important advances that have been achieved in the microarray-based molecular classification of tumours, it is far from application in clinical practice^[44]20,[45]21. To date, several feature selection algorithms, such as the Fisher score, Lasso, ReliefF and random forest algorithms, have been employed in the selection of feature genes^[46]22–[47]24. Previous studies have demonstrated that the Fisher score has good performance in feature gene selection^[48]21. In this study, we aimed to develop a hepatocellular carcinoma (HCC) hub gene identification method via the analysis of protein-protein interaction (PPI) networks. To build the PPI network, several individual genes that contribute to the classification of HCC are needed. Unlike some other feature selection algorithms, such as principal component analysis (PCA), in which the selected features are a combination of some raw features, the Fisher score algorithm selects each gene independently based on their scores under the Fisher criterion, which eventually leads to a subset of the most representative individual genes^[49]25,[50]26. Therefore, this algorithm may be an appropriate method for the feature selection of high dimensional gene expression profile data. To date, this algorithm has received less attention in the field of HCC feature gene selection. We constructed and integrated an HCC gene expression dataset from five independent HCC gene expression datasets and utilized the Fisher score algorithm to select feature genes for HCC. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis^[51]27 were performed to examine the molecular functions (MFs), cellular components (CCs), biological processes (BPs) and pathways of the selected feature genes. Gene set enrichment analysis (GSEA)^[52]28 was carried out to evaluate the feature selection performance of the Fisher score algorithm at the gene set level. To explore the association between the feature genes, the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database was applied to establish the PPI^[53]27 network, which was then analysed with the Maximal Clique Centrality (MCC)^[54]29 method to select the top ten hub genes of HCC. The Kaplan–Meier plotter was utilized to assess the role of the selected hub genes in liver cancer prognosis. To further evaluate the performance of the Fisher approach, weighted gene co-expression network analysis (WGCNA), one of the most widely used hub gene identification approaches, along with the Lasso, ReliefF and random forest algorithms, were used as comparison algorithms. Results Integrated dataset of liver microarray gene expression profiles A significant batch effect was identified in the raw data of 5 independent gene expression datasets (Supplementary Fig. [55]1(a)). After correcting the batch effect with the removeBatchEffect function of the limma package, no batch effect was observed based on the observation of PCA components (Supplementary Fig. [56]1(b)). The ultimate integrated dataset after the correction of the batch effect has 396 samples and 54613 variables (probe IDs) and takes up 214 MB of computer memory (Supplementary Dataset [57]1). Each sample is labelled with both its GSE ID and GSM ID. Feature gene selection with the fisher score algorithm Feature selection techniques can avoid the curse of dimensionality and thus enable the simplification of models, making interpreting experimental results easier for researchers. As one of the supervised feature selection methods, the Fisher score algorithm selects each feature independently in accordance with their scores. Here, Feature selection using the Fisher score algorithm results in a list of genes that are ranked by their importance. A previous study showed that approximately 1000 genes were biologically relevant to HCC^[58]30. Therefore, we also selected the top 1000 (Supplementary Dataset [59]2) feature genes with higher Fisher scores as the optimal feature subset for further analysis. GO and KEGG enrichment analysis A total of 37 significantly enriched BPs were observed in the current study, which involved many cancer-associated BPs, such as cell division, mitotic nuclear division, positive regulation of cell proliferation, cell proliferation, negative regulation of the apoptotic process, sister chromatid cohesion, DNA replication, regulation of the apoptotic process, the cell cycle, and the G2/M transition of the mitotic cell cycle (GO IDs: 0051301, 0007067, 0007062, 0008284, 0008283, 0043066, 0007062, 0006260, 0042981, and 0000086, respectively) (Fig. [60]1(a)). GO CCs were significantly enriched in the condensed chromosome kinetochore, chromosome, nucleoplasm, cytosol, nucleus, microtubule, kinetochore and so on (GO IDs: 0000777, 0000775, 0005654, 0005829, 0005634, 0005874 and 0000776, respectively) (Fig. [61]1(b)). The GO MFs were mainly enriched in protein binding, flavin adenine dinucleotide binding, microtubule motor activity, ATP binding, chromatin binding, enzyme binding, RNA polymerase II core promoter proximal region sequence-specific binding, and microtubule binding (GO IDs: 0008017, 0042802, 0003777, 0005524, 0003682, 0019899, 0001078, and 0008017, respectively) (Fig. [62]1(c)). Figure 1. [63]Figure 1 [64]Open in a new tab GO and KEGG enrichment analysis of feature genes in HCC. The numbers below each panel are reference P values (−log10). (a) GO biological process enrichment analysis of the feature genes in HCC. (b) GO cellular component enrichment analysis of the feature genes in HCC. (c) GO molecular function enrichment analysis of the feature genes in HCC. (d) KEGG pathway enrichment analysis of the feature genes in HCC. Figure [65]1(d) shows that the most significantly enriched KEGG pathway was the cell cycle (pathway ID: 04110), which is directly associated with cancer. In addition, the biosynthesis of amino acids, carbon metabolism, the p53 signalling pathway and metabolic pathways (pathway IDs: 01230, 01200, 04115, and 01100, respectively) were also closely linked to the progression of liver cancer. GSEA To evaluate the feature selection performance of the Fisher score algorithm at the gene set level, GSEA was carried out based on hallmark gene sets and GO gene sets. Most of the parameters used for GSEA were set as default. The number of permutations was 1000, and the permutation type was gene set. The min and max size of the selected gene sets were 10 and 500, respectively. When performing GSEA based on hallmark gene sets, 18/50 gene sets were upregulated in the cancer phenotype, 11 gene sets were significant at false discovery rate (FDR) < 25%, and 9 gene sets were significantly enriched at nominal p value < 5%. The top 3 upregulated gene sets were E2F targets, G2M checkpoint and mitotic spindle (Fig. [66]2(a–c), (i)). For GSEA based on GO gene sets, 1105/3322 gene sets were upregulated in the cancer phenotype, 471 gene sets were significant at FDR < 25%, and 400 gene sets were significantly enriched at nominal p value < 5%. The most significantly enriched genes included DNA replication, sister chromatid segregation, DNA-dependent DNA replication, nuclear chromosome segregation, and mitotic nuclear segregation (Fig. [67]2(d–h,i)). Figure 2. [68]Figure 2 [69]Open in a new tab GSEA. (a–c) Displays the top 3 most upregulated gene sets of GSEA based on the hallmark gene sets. (d–h) shows the top 5 most upregulated gene sets of GSEA based on the GO gene sets. The enrichment score (NES), false discovery rate (FDR) and the normalized enrichment score are shown for each gene set. The bars at the bottom of the panels are the corresponding genes of certain gene sets. (i) Displays the relative location of genes in the ranked list. PPI network establishment and hub gene identification A PPI network was constructed with all genes significantly enriched in BPs with the STRING database. A total of 365 nodes and 4326 edges were involved in the PPI network (Supplementary Fig. [70]2). After PPI network establishment, the PPI data were then imported into Cytoscape software. CytoHubba, an app of Cytoscape, is usually employed to predict important nodes or subnetworks in a given network based on several topological algorithms. Here, the top ten hub genes were selected based on the MCC algorithm in cytoHubba (Fig. [71]3). The results showed that the top ten genes contributing to HCC were ASPM, MELK, CCNB1, NDC80, BUB1B, NCAPG, CDK1, NUSAP1, CCNB2 and TPX2. Figure 3. Figure 3 [72]Open in a new tab The PPI network of the top ten hub genes in HCC. The nodes represent the selected feature genes, and the edges represent the interactions between two genes. Survival analysis The Kaplan–Meier plotter was utilized to assess the effect of the top ten hub genes on liver cancer prognosis. A total of 364 liver cancer cases were available for overall survival analysis. Our study showed that the overexpression of the hub genes selected with the Fisher score was correlated with a significant reduction in the overall survival time of liver cancer patients (P < 0.01, Fig. [73]4). Figure 4. [74]Figure 4 [75]Open in a new tab Kaplan–Meier survival analysis of the hub genes selected with the Fisher score. (a–j) shows a Kaplan–Meier plot of the top ten HCC hub genes. In comparison with that in normal subjects, the overexpression of these hub genes in HCC patients was associated with a significant reduction in overall survival time (P < 0.05). Comparison of the fisher score with other algorithms To assess the effect of the Fisher score algorithm, WGCNA, Lasso, ReliefF and random forest were also employed to select feature genes for HCC with the same integrated dataset followed by the identification of hub genes with the MCC algorithm. A Venn diagram (Fig. [76]5) showed that six hub genes selected with the Fisher algorithm were the same as those selected with the WGCNA or random forest algorithms and that no common genes were selected between the Fisher score algorithm and the Lasso or ReliefF algorithms. Figure 5. Figure 5 [77]Open in a new tab A Venn diagram showing the overlapping hub genes selected with the Fisher score, WGCNA, Lasso, ReliefF and random forest algorithms. The role of the selected hub genes with the abovementioned algorithms was also subject to survival analysis with the Kaplan–Meier plotter. Since 6 hub genes selected by WGCNA or random forest were the same as those selected by the Fisher algorithm (Fig. [78]4), we thus displayed only the unique genes selected by WGCNA or random forest. Survival analysis showed that the overexpression of the unique hub genes selected with WGCNA (Fig. [79]6(a–d), P < 0.05) or with random forest (Fig. [80]6(e–h), P < 0.05) were significantly correlated with a decrease in the survival time of HCC patients. In contrast, most hub genes selected with Lasso were either associated with an increased survival time (Fig. [81]7(a,d), P < 0.05) or had no relationship with the survival time (P > 0.05, Fig. [82]7(b,g,h,j)). Only four hub genes (Fig. [83]7(c,e,f,i), P < 0.05) were linked to the poor prognosis of HCC. Regarding the hub genes selected with the ReliefF algorithm, half were involved in increased survival time (P < 0.05, Fig. [84]8(b,c,g,i,j)), and the other half had no effect on the survival time (P > 0.05, Fig. [85]8(a,d,e,f,h)). Figure 6. [86]Figure 6 [87]Open in a new tab Kaplan–Meier survival analysis of hub genes selected with WGCNA and random forest. (a–d) shows that the unique hub genes selected with WGCNA, and (e–h) are the unique hub genes selected with random forest. The overexpression of all the unique hub genes was significantly correlated with a decrease in the survival time of HCC patients (P < 0.05). Figure 7. [88]Figure 7 [89]Open in a new tab Kaplan–Meier survival analysis of hub genes selected with Lasso. Most of the hub genes selected with Lasso were either associated with increased survival time ((a,d), P < 0.05) or had no relationship with survival time (P > 0.05, (b,g,h,j)). Only four hub genes ((c,e,f,i), P < 0.05) were linked to the poor prognosis of HCC. Figure 8. [90]Figure 8 [91]Open in a new tab Kaplan–Meier survival analysis of hub genes selected with ReliefF. One-half of the hub genes were associated with increased survival time in HCC patients (P < 0.05, (b,c,g,I,j)), and the other half had no effect on survival time (P > 0.05, (a,d,e,f,h)). Discussion Cancer encompasses many diseases that are characterized by the spread of abnormal cells and uncontrolled growth^[92]31. The overall occurrence of cancer is rapidly growing globally. An estimated 18.1 million new cancer cases and 9.6 million cancer deaths occurred in 2018^[93]32. Among all cancers, HCC is the fifth most frequently diagnosed cancer, ranking as the third leading cause of cancer-related death^[94]33. Currently, the main HCC treatment strategies include surgical resection, microwave ablation, radiofrequency ablation and transcatheter arterial chemoembolization (TACE)^[95]34,[96]35. Regarding the prospects of a cure, surgical resection is believed to have a definitive curative effect^[97]36. However, most HCC cases are detected in advanced stages with the invasion of major blood vessels, obvious extrahepatic metastases or poor liver function, making them unfit for surgical resection. A prospective study conducted from December 2009 to December 2010 indicated that recurrent HCC patients are ineligible for percutaneous ablation^[98]37. Conventional TACE is fit for advanced HCC treatment and involves the delivery of chemotherapeutic agents that target cancer cells, which may cause the release of cytotoxic agents^[99]38 and acute pancreatitis^[100]39, as those drugs do not target the expression of the hub genes of cancer. For this reason, discovering the hub genes of advanced HCC is necessary for the purpose of treatment with specific drugs. In this study, the feature genes that contribute to the occurrence of HCC were selected using the Fisher score algorithm. GO and KEGG enrichment analysis was performed to interpret the functions and pathways of the feature genes. The enriched BPs included cell division, mitotic nuclear division, positive regulation of cell proliferation, cell proliferation, negative regulation of the apoptotic process, sister chromatid cohesion, DNA replication, regulation of the apoptotic process, the cell cycle, and the G2/M transition of the mitotic cell cycle. These processes are typically representative features of HCC progression. The selected genes with the Fisher score algorithm were matched with the genes implicated in the abovementioned complex process of cancer development, indicating that the Fisher score algorithm is an effective method for selecting feature genes in HCC. The effectiveness of the Fisher score algorithm was further confirmed by GO CCs and GO MFs, which were related to cell proliferation and division. The top enriched KEGG pathway was the cell cycle-related signalling cascade, which contributes to the molecular mechanisms of hepatocarcinogenesis. Moreover, other enriched pathways, such as the biosynthesis of amino acids, carbon metabolism, p53 signalling and metabolism, are also associated with HCC proliferation and progression. Hence, the Fisher score algorithm is very efficient in feature gene selection. GSEA confirmed that the proliferation-related genes showed significant differences between the HCC and normal states. A PPI network was established with the STRING database. With the application of the Cytoscape app cytoHubba, the top ten hub genes contributing to HCC were predicted and are as follows: ASPM, MELK, CCNB1, NDC80, BUB1B, NCAPG, CDK1, NUSAP1, CCNB2 and TPX2. Traditionally, the identification of biomarkers is mainly based on the metabolism of a pharmaceutical agent or the biology of the tumour and surrounding environment as performed in biological experiments^[101]40. Evidence from previous studies supports the effectiveness of the Fisher score algorithm in gene identification. Reverse transcription-PCR assays showed that ASPM is a marker for early recurrence and vascular invasion and that ASPM overexpression is correlated with poor prognosis in hepatocellular carcinoma^[102]41. The role of ASPM, MELK, NUSAP1, CCNB2 and NCAPG in HCC was validated by predictions performed with other bioinformatic tools as well as by real-time quantitative PCR experiments^[103]42. Regarding CCNB2 and CDK1, a recent study with primary HCC tissue samples showed that the downregulation of CCNB2 and CDK1 led to the inhibition of cell proliferation and cell cycle shutdown in the G2/M phase, indicating that the overexpression of CCNB2/CDK1 may promote tumour cell proliferation^[104]14. The experimental results of western blotting and real-time PCR showed that NDC80 contributed to HCC progression by reducing apoptosis and overcoming cell cycle termination^[105]43. Both BUB1B and TPX2 are associated with the separation of sister chromatids, which is the most abnormal phase in the progression of HCC^[106]44. In support of the accuracy of the Fisher score algorithm in predicting prognosis, Kaplan–Meier survival analysis revealed that the overexpression of the selected hub genes was correlated with reduced survival time. WGCNA, a systems biology method for the analysis of correlation patterns among genes, has been heavily used in the field for hub gene identification. Based on a PubMed literature search, we found that more than 6000 WGCNA-related studies have been published so far. This finding demonstrated that WGCNA is a dominant method and is popular among researchers. In the current study, the Fisher score was demonstrated to be a method with similar performance to that of WGCNA, providing another viable methodological option in the field of hub gene identification. Random forest, with similar performance to that of WGCNA and the Fisher score, may also serve as a potential method for hub gene identification. In contrast, the Lasso and ReliefF algorithms do not seem to be good methods for hub gene identification, since the survival analysis showed that most of the hub genes identified by these methods were not relevant to the poor prognosis of HCC. The reason for the poor performance of Lasso and ReliefF may be that they randomly select one gene from correlated genes, which results in unstable performance in feature selection^[107]24. In summary, we established an HCC dataset of a relatively large sample size by integrating five independent HCC datasets and demonstrated that the Fisher score algorithm is a suitable and accurate method for feature selection, thus providing an excellent option for hub gene identification in HCC patients. Methods Selection of datasets A total of 31468 HCC expression arrays were available in GEO^[108]45. Only 5 (0.0159%) of the datasets were selected. The number of datasets was determined based on the following considerations. On the one hand, the integration of multiple datasets is helpful in fighting against the curse of dimensionality for feature selection in gene expression data. On the other hand, our study indicated that 5 datasets were sufficient for the identification of valid hub genes, and any further increase in datasets did not seem to be necessary. Although the selected datasets were obtained from the same microarray platform, the microarray platform was only one of the criteria for data selection. Since this study mainly aimed to develop an effective method for the identification of HCC hub genes, only liver tissue datasets of Homo sapiens were considered. In addition, the sample size and data acquisition time were also important criteria for data selection. The datasets utilized in this study were obtained within the last 6 years with sample sizes greater than 40. Based on the above criteria, five datasets were selected in the current study. Due to a lack of techniques for RNA-Seq data analysis in our lab, RNA-Seq data were not included in this study. The microarray gene expression profiles ([109]GSE41804, [110]GSE69715, [111]GSE90653, [112]GSE98383, and [113]GSE107170) were downloaded from the GEO repository ([114]http://www.ncbi.nlm.nih.gov/geo/). The platform information of these microarray data is as follows: [115]GPL570, Affymetrix Human Genome U133 Plus 2.0 Array (Affymetrix Inc., Santa Clara, CA, USA). Since these observations were obtained on the same platform, these series of gene expression data share the same probe ID. All the files were integrated based on their probe ID. Batch effect identification and correction The integrated microarray gene expression data originated from researchers of independent institutes. Therefore, there may be a batch effect that can cause a decrease in the repeatability and reproducibility of the experimental results. To detect the possible batch effects, PCA was performed to identify the batch effect. The batch effect was eliminated with the removeBatchEffect function of the limma package^[116]46. The visualization of the top two PCA components was assessed before and after the batch effect correction. Feature selection using the Fisher score algorithm Before the application of the Fisher score algorithm^[117]21, the Affymetrix probe set IDs were converted to official gene symbols. Affymetrix probe set IDs without official gene names or corresponding to multiple official gene names were omitted. If multiple gene IDs corresponded to one official gene name, the expression value of the official gene was taken from the mean expression value of multiple gene IDs. The Fisher score algorithm is a feature ranking algorithm applied to eliminate the irrelevant and redundant features from the gene expression profiles. The process of feature selection can be briefly described as follows. Assuming NG = (U, C, D, δ) is a neighbourhood decision system for gene expression data, the corresponding matrix is X ∈ R^m×n, where m represents the number of genes, and n represents the number of samples. Then, the Fisher score is computed by [MATH: f(Z)=tr(Ab)tr(Aw) :MATH] where tr () represents the trace of a matrix, A[w] is the scatter matrix within the same category, and A[b] is the scatter matrix between the HCC samples and their paired normal controls. To address the prolonged issue of traditional combination optimization methods, a heuristic strategy is normally utilized to calculate a score for each gene separately by using some criteria. Then, the Fisher score of the l-th gene is calculated by [MATH: f(k)= k=1Cnk(μk lμl)2k=1C nk(σk l)2 :MATH] where n[k] represents the sample number of the k-th category, [MATH: μkl :MATH] and [MATH: σkl :MATH] are the mean and standard deviation of the samples from the k-th category corresponding to the l-th gene, respectively, and μ^l represents the mean of the samples of the l-th gene. GO and KEGG analysis GO analysis, a regular method in the annotation of large-scale functional enrichment studies, is normally classified into MF, BP, and CC categories^[118]27. KEGG^[119]27 is a widely utilized database for diseases, drugs, genomes, chemical substances and biological pathways. The GO and KEGG enrichment analysis of the selected feature genes in this study was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) ([120]https://david.ncifcrf.gov/) online tools^[121]47,[122]48. P values less than 0.05 and gene counts more than 10 were considered statistically significant. GSEA GSEA is a computational method that functions to identify classes of genes that are overrepresented in a large set of genes that may have a connection with disease phenotypes^[123]28. In this study, to further evaluate the performance of the Fisher score algorithm in selecting feature genes, the raw data of the integrated HCC gene expression data were applied in GSEA based on two major collections of MSigDB gene sets: hallmark gene sets and GO gene sets^[124]49. Hallmark genes were used since the hallmark gene sets can reduce redundancy and generate more robust enrichment analysis results. Establishment of a PPI network and identification of hub genes The significantly enriched genes in the GO BPs were employed to establish the PPI network using the online PPI establishment tool STRING ([125]http://string-db.org). The PPI data were then exported to Cytoscape version 3.4.0 ([126]http://cytoscape.org)^[127]50. CytoHubba, a Java plugin for Cytoscape, provides a user-friendly interface that enables the topology analysis of complex networks^[128]29. CytoHubba provides 11 topological algorithms for identifying network hub genes. Among all the algorithms, MCC has a better performance in predicting PPI network hub genes^[129]29. Therefore, the MCC algorithm was employed to identify the HCC hub genes in this study. Survival analysis The Kaplan–Meier plotter (KMplot, [130]http://www.kmplot.com/analysis) can be employed to assess the effect of 54675 genes on survival with 10293 cancer samples^[131]51. The samples included in this database were obtained from 1648 ovarian, 5143 breast, 1065 gastric and 2437 lung cancer patients, with an average follow-up of 40, 69, 33, and 49 months, respectively^[132]52,[133]53. The primary goal of this tool is to perform meta-analysis-based biomarker assessments. The HCC hub genes in this study were imported into the KMplot database to explore their relationship with the 5-year survival rates of patients. Identification of the hub genes with control algorithms To further evaluate the performance of the Fisher score algorithm, a series of control feature selection algorithms were utilized to select feature genes from the current integrated HCC dataset. The algorithms for comparison included WGCNA^[134]54, Lasso^[135]55, ReliefF^[136]56 and random forest^[137]57. To select the feature genes of HCC with the WGCNA algorithm, a series of procedures were carried out as follows. After loading the gene expression dataset, missing values and outlier microarray samples were checked to ensure that the data were appropriate for further analysis. A total of 3600 genes with the highest expression were then screened out based on the average gene expression value. We then selected the soft threshold using the network topology analysis function pickSoftThreshold. The gene expression matrix was then converted to an adjacency matrix and a topological overlap matrix (TOM). Hierarchical clustering of gene expression data was then performed based on the TOM-based dissimilarity distance. A dynamic tree cut function was employed to identify the modules (minimum module size = 30). GO and KEGG enrichment analysis was applied to select the HCC-related modules. Finally, the genes from the selected module were then used to construct a PPI network with the STRING database. The PPI network was visualized with Cytoscape followed by the identification of hub genes with the MCC algorithm. For the identification of hub genes with Lasso, Relief and random forest, the procedure was identical to that of the Fisher score algorithm except that the feature selection algorithm was replaced with Lasso, Relief or random forest. We therefore only offered information related to feature gene selection according to these methods. Regularization helps to address bias and variance as well as stabilize the estimates in a model. Lasso regression is one form of regularized regression. With the use of l1 regularizations, the coefficient of a variable can be reduced to zero. In the current Lasso approach for comparison, we used previously published methods by Regina et al.^[138]55 to select HCC feature genes. The feature gene selection procedures we followed are listed as follows. First, by tuning the parameter α, a list of genes with nonzero coefficients were selected with the Lasso. Second, the genes were sorted according to the absolute value of the coefficients in descending order. Third, the top 1000 genes of the sorted genes were selected for further analysis. The built-in Lasso algorithm in Scikit-learn (one of the machine learning libraries for the Python programming language) was utilized to select the feature genes from the current HCC dataset. The ReliefF algorithm is a feature selection method proposed to handle multi-class classification problems that evaluates the importance of each feature by assessing the role of the features for classification between sample classes. By default, ReliefF assigns the same weight to each feature at the beginning. To score the weight for each feature, it randomly selects a sample T from a training set E and then finds the nearest neighbour sample B from the same class of sample T, called Near Hit; it then searches the nearest neighbour sample R from a class different from that of sample T, called Near Miss. Afterwards, it updates the weight of each feature according to the following rules. If the distance between T and Near Hit of a feature is less than the distance between T and Near Miss, this indicates that this feature is beneficial for distinguishing the nearest neighbours of the same class and different classes, so the weight of this feature will be increased. Conversely, if the distance between T and Near Hit is greater than the distance between T and Near Miss, this feature plays a negative role in distinguishing the nearest neighbours of the same class and different classes, and the weight of this feature will be reduced. The above process can be repeated m times, and finally, the weight of each feature is obtained. Here, the ReliefFAttributeEval function of Weka (version 3.83) was used to obtain the weight of all HCC genes in the current dataset, and the top 1000 weighted genes were screened out for further analysis. For feature gene selection with random forest, the feature importance of feature X in the random forest was calculated as follows. First, for each decision tree in the random forest, the corresponding out-of-bag (OOB) data were used to calculate the OOB error, which is denoted as errOOB1. Second, random noise interference was added to the OOB data of feature X and the OOB data error was calculated once more, which is denoted as errOOB2. Third, assuming that there were N trees in the random forest, then the importance of the feature X = Σ (errOOB2-errOOB1)/N. In this way, random forests were performed that generate a list of all the variables based on their feature importance. Finally, the unimportant variables in the ranking list were deleted, leaving only the top 1000 important features. The RandomForestClassifier built-in in scikit-learn was applied for the feature selection of the current supervised classification HCC dataset. The comparison among these methods is based on the prognostic value of the hub genes. A bioinformatics online tool ([139]http://bioinformatics.psb.ugent.be/webtools/Venn/) was employed to obtain the intersections of the hub genes produced with the various approaches and to draw a Venn diagram. The hub genes were then also subjected to survival analysis with Kaplan–Meier plotter. Supplementary information [140]Supplementary figures^ (949.8KB, pdf) [141]Dataset 1^ (3.6MB, xlsx) [142]Dataset 2^ (24.4KB, xlsx) Acknowledgements