Abstract

Background

Non-invasive diagnostic methods, including medical imaging techniques and blood biomarkers such as alpha-fetoprotein (AFP), have been crucial in detecting hepatocellular carcinoma (HCC). However, imaging techniques are only effective for tumors larger than 2 cm, and AFP measurement remains unsatisfactory due to high rates of misdiagnosis and underdiagnosis. Therefore, new reliable biomarkers and better non-invasive diagnostic approaches are necessary for HCC identification.

Methods

Differentially expressed genes were identified using multiple public RNA-seq datasets of liver tissues from healthy individuals and HCC patients, including peritumoral and tumor tissues. Hub genes for HCC diagnosis were identified by combining pathway enrichment analysis and protein–protein interaction network analysis. The performance of the hub genes for non-invasive HCC diagnosis was analyzed in plasma of healthy individuals, HBV-infected patients, and HCC patients based on exosomal RNA-seq data. A multi-layer perceptron (MLP) model based on exosomal hub genes was developed for non-invasive HCC diagnosis.

Results

Through differential gene expression and pathway enrichment analysis on multiple public RNA-seq datasets, we first identified 30 dysregulated genes in HCC tissues. Protein–protein interaction analysis further narrowed this list to 10 key genes: BRCA2, CDK1, MCM4, PLK1, DNA2, BLM, PCNA, POLD1, BRCA1 and FEN1. After further evaluation using additional public HCC tissue datasets, POLD1 and MCM4 were excluded from consideration as potential biomarkers due to their suboptimal performance. Notably, CDK1, FEN1, and PCNA were found to be significantly elevated in the plasma exosomes of HCC patients compared to non-HCC individuals, including those with HBV-infected hepatitis and healthy controls.
The MLP model, based on the three biomarkers, showed an area under the curve (AUC) of 0.85 and 0.84 in the training and test datasets, respectively, after adjusting for the covariates sex and age.

Conclusion

We identified three key genes, CDK1, FEN1, and PCNA, as exosomal biomarkers for non-invasive diagnosis of HCC. The MLP model using the three biomarkers showed good differentiation between non-HCC individuals and HCC patients and exhibits promising potential as a non-invasive diagnostic tool for detecting HCC. Additional validation with a larger sample size is essential to thoroughly assess the reliability of the biomarkers and the model's performance.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12885-024-13332-0.

Keywords: Hepatocellular carcinoma, RNA-seq, Bioinformatics analysis, Biomarkers, Non-invasive diagnosis

Background

The global burden of liver cancer is substantial. According to estimates for 2020, liver cancer is the sixth most commonly diagnosed cancer and the third most common cause of cancer death. Liver cancer is also the second leading cause of cancer-related premature mortality [1, 2]. The number of new cases of liver cancer is predicted to increase by 55.0% between 2020 and 2040, and a predicted 1.3 million people could die from liver cancer in 2040 (56.4% more than in 2020) [3].

HCC can be diagnosed by several imaging techniques such as ultrasound, computed tomography (CT) and magnetic resonance imaging (MRI) [4]. Ultrasound is low-cost and well accepted by patients. However, its sensitivity for HCC detection is not high (58–70%), and HCC tumors smaller than 1 cm are undetectable [5]. CT screening showed a sensitivity and specificity of about 62% and 74%, respectively. MRI screening is superior to CT, with a sensitivity and specificity of around 79% and 78%, respectively [6].
The sensitivity of MRI is greater for tumors larger than 2 cm (almost 100%) but drops to 60% for tumors smaller than 2 cm, and it is even lower for tumors smaller than 1 cm [7]. The high cost of MRI also limits its application in HCC diagnosis.

Alpha-fetoprotein (AFP) is extensively used as a biomarker for HCC screening. However, its diagnostic precision is constrained by a significant occurrence of false-negative results. Most patients with AFP-negative HCC have small tumors in the early stages with atypical imaging features. Also, not all HCCs secrete AFP, and nearly 30% of HCC patients were AFP-negative [8]. As a result, AFP-negative HCC has a high rate of misdiagnosis and underdiagnosis [9]. On the other hand, AFP may be elevated in cirrhosis or hepatitis cases and result in false positives [10]. The sensitivity of AFP ranges from 25% for tumors < 3 cm to 50% for lesions > 3 cm in diameter [11]. The use of AFP for HCC screening has therefore been controversial, and the discovery of new convenient, economical and non-invasive serum biomarkers is necessary for HCC diagnosis [12, 13].

Extracellular vesicles (EVs) are lipid bilayer membrane-encapsulated structures with a diameter of 40–160 nm that are secreted by most cells and circulate stably in body fluids [14]. EVs play key roles in physiological balance and in disease processes [15, 16]. Exosomes are a subset of small extracellular vesicles that are actively produced in tumor cells. Tumor-derived exosomes contain a large number of cancer-related serological markers. Many studies have shown that exosomes can be a reliable source of non-invasive biomarkers for early detection and diagnosis of cancers including HCC [17, 18].

A correct understanding of the molecular mechanism of HCC occurrence is key to finding effective biomarkers. Here, we investigate the molecular mechanisms associated with HCC development based on multiple public liver tissue RNA-seq datasets.
Bioinformatics analysis identified several potential biomarkers for HCC, which were subsequently evaluated in plasma exosomes as non-invasive biomarkers for HCC diagnosis.

Materials and methods

Data collection

Public gene expression data and clinical information of tissues and plasma samples were acquired from the data portals of Prof. Huang's laboratory [19, 20] (http://www.gepliver.org/, http://www.exoRBase.org). The public data included 226 normal liver tissues, 243 adjacent non-tumor tissues, 270 HCC tumor tissues, 118 normal plasma samples and 81 HCC plasma samples. The detailed source of these samples is listed in supplementary Table 1. Because these data were downloaded directly from public databases, there was no requirement for ethical approval. In addition to public data, we also collected 20 plasma samples from non-HCC (viral hepatitis, HBV-infected) patients at the First Affiliated Hospital of Soochow University. The detailed clinical information of these viral hepatitis samples is described in supplementary Table 1 (Column "Phenotype" labelled with "HBV-infected"). Ethical approval to conduct this research was obtained from the ethical committee of the First Affiliated Hospital of Soochow University. Informed consent was obtained from all participants in the study. All experiments were conducted in compliance with the principles and regulations established by the ethics committee.

Plasma exosomes RNA isolation and RNA-seq library preparation

Total exosomal RNA from 1 to 2 ml of plasma was isolated using the exoRNeasy Midi/Maxi Kit (QIAGEN, Germany) in accordance with the manufacturer's instructions. The detailed exosomal RNA extraction protocol is provided in Supplementary File 1. All isolated RNAs were either subjected to RNA-seq library preparation or stored at -80 °C.
Because strand-specific RNA-seq provides a more accurate estimate of transcript expression than non-stranded RNA-seq [21], we constructed strand-specific libraries for RNA-seq of plasma exosomes. Because of the low exosomal RNA level, the SMARTer® Stranded Total RNA-Seq Kit (Clontech, USA), which is suitable for picogram amounts of RNA, was used to prepare the libraries. Briefly, DNase I (NEB, UK) was used to remove DNA from the RNA samples. The cDNA was pre-amplified, and R-probes with ZapR were used to deplete the ribosomal and mitochondrial cDNA. Next, purified dsDNA was subjected to 16 cycles of PCR amplification. Quality control of the libraries was conducted using Qubit (Thermo Fisher Scientific, USA) and Qsep100 (BiOptic Inc., China). The libraries were then sequenced on the Illumina sequencing platform in a 150 bp paired-end run.

Differentially expressed genes (DEGs) analysis

The data used for DEG analysis were all from the public Gepliver database (http://www.gepliver.org/), including 226 normal liver tissues and 167 pairs of adjacent-HCC non-tumor tissues and HCC tissues (see supplementary Table 1, Column "TYPE" labelled with "DEG"). The RNA-seq data were analyzed using the limma package in R (http://www.bioconductor.org/packages/release/bioc/html/limma.html) to identify DEGs in two comparisons: normal tissues vs. HCC tissues, and adjacent-HCC non-tumor tissues vs. HCC tissues. DEGs were selected based on the following criteria: q-value < 0.05 and |log2 fold change| (|log2FC|) > 1. Heatmaps were generated using the pheatmap package, and volcano plots were also produced in R. The upregulated genes and the downregulated genes shared by the two comparisons were selected for further pathway analysis.
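
The selection rule above (q-value < 0.05 and |log2FC| > 1, followed by the overlap between the two comparisons) can be illustrated with a minimal Python sketch. This is not the authors' limma pipeline: the gene names, fold changes and q-values below are made up for illustration, and the function name select_degs is hypothetical.

```python
from typing import Dict, Set, Tuple

# Each comparison maps a gene symbol to (log2 fold change, q-value).
# All values are invented for illustration; they are not study data.
DegTable = Dict[str, Tuple[float, float]]


def select_degs(table: DegTable, q_cut: float = 0.05,
                lfc_cut: float = 1.0) -> Tuple[Set[str], Set[str]]:
    """Return (upregulated, downregulated) genes with q < q_cut and |log2FC| > lfc_cut."""
    up, down = set(), set()
    for gene, (log2fc, q) in table.items():
        if q < q_cut and abs(log2fc) > lfc_cut:
            (up if log2fc > 0 else down).add(gene)
    return up, down


# Hypothetical results for the two comparisons described in the text.
normal_vs_hcc = {"GENE_A": (2.3, 0.001), "GENE_B": (-1.8, 0.01),
                 "GENE_C": (1.2, 0.20), "GENE_D": (1.5, 0.03)}
adj_vs_hcc = {"GENE_A": (1.6, 0.004), "GENE_B": (-0.7, 0.02),
              "GENE_C": (1.4, 0.01), "GENE_D": (1.1, 0.04)}

up1, down1 = select_degs(normal_vs_hcc)
up2, down2 = select_degs(adj_vs_hcc)

# Only genes that pass the thresholds in BOTH comparisons are carried forward.
overlap_up = up1 & up2
overlap_down = down1 & down2
print(sorted(overlap_up), sorted(overlap_down))
```

In this toy input, GENE_C is dropped because it fails the q-value threshold in one comparison, and GENE_B is dropped because its fold change is too small in the other, which is the practical effect of requiring overlap between the two comparisons.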

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses

To investigate the biological pathways that might be involved in the occurrence and development of HCC, the overlapping upregulated and downregulated DEGs described above were subjected to pathway enrichment analysis. Gene Ontology (GO) analysis, covering the three categories molecular function (MF), cellular component (CC) and biological process (BP), and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis were performed with a threshold of q-value < 0.05 using the clusterProfiler R package [20], which facilitates biological term classification and gene cluster enrichment. The overlapping genes enriched in the top 10 GO and KEGG pathways were selected as initial candidate genes for protein–protein interaction network analysis.

Protein–protein interaction (PPI) network analysis and candidate gene screening

A protein–protein interaction (PPI) network of the initial candidate genes was constructed using STRING [22]. Subnetworks of the larger protein interaction network were extracted by calculating the degree score of each node. The degree score is an important network topological metric that describes the number of interactions between a node (protein) and other nodes (proteins) in a network. In the analysis of PPI networks, proteins with the highest degree scores are regarded as key proteins because they are the central nodes connecting most peripheral proteins. In our study, highly connected nodes with a degree score > 15 were screened as candidate genes.

Evaluation of potential biomarkers in independent tissue samples

To further verify the performance of the identified potential biomarkers in other datasets, a total of 179 tissue samples were collected from the GEO database, comprising 76 adjacent non-tumor tissues and 103 HCC tumor tissues (see supplementary Table 1, Column "TYPE" labelled with "Independent Evaluation").
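
The degree-based screening described in the PPI subsection above amounts to counting each node's distinct interaction partners and keeping nodes above a threshold. The following Python sketch shows that idea on a toy edge list. It is an illustration under assumptions, not the STRING workflow used in the study: the protein names P1 to P5, the edge list, and the function names degree_scores and hub_nodes are hypothetical, and the toy threshold of 2 stands in for the study's threshold of 15.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def degree_scores(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct interaction partners for every node of an undirected network."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def hub_nodes(edges: Iterable[Tuple[str, str]], threshold: int) -> List[str]:
    """Return nodes whose degree score is strictly greater than the threshold."""
    scores = degree_scores(edges)
    return sorted(node for node, degree in scores.items() if degree > threshold)


# Hypothetical toy network; ("P2", "P1") duplicates ("P1", "P2") and is counted once.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
             ("P2", "P3"), ("P4", "P5"), ("P2", "P1")]

print(degree_scores(toy_edges))
print(hub_nodes(toy_edges, threshold=2))
```

Using sets of neighbours rather than raw edge counts keeps duplicated or reversed edges from inflating a node's degree, which matters when an edge list is exported from a database that may list each interaction in both directions.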
Boxplots and covariate-adjusted (for sex and age) receiver operating characteristic (ROC) curve analyses were performed. The R package RISCA was used to adjust the ROC curves for the covariates sex and age.

Evaluation of potential biomarkers in plasma and construction of multi-layer perceptron model for HCC diagnosis

The candidate genes verified in the independent tissue RNA-seq data were then examined in the plasma of normal, HBV-infected and HCC individuals (see supplementary Table 1, Column "TYPE" labelled with "Plasma Evaluation"). The Wilcoxon rank-sum test (Mann-Whitney U test) was used to test for differential expression of the candidate genes in plasma between the non-HCC (n = 138) and HCC (n = 81) groups, and the genes that were significantly differentially expressed in plasma (p < 0.05) were selected and combined for model construction. Traditional machine learning often requires prior knowledge of the data for feature engineering, whereas a multi-layer perceptron (MLP) model can learn feature patterns without manual feature engineering. Further, an MLP can handle complex datasets that are not linearly separable. Therefore, our study proposed a diagnostic model based on a multi-layer perceptron. The training and test datasets comprised 70% and 30% of the data, respectively. We used scikit-learn [23] to build the MLP classifiers for our data with default parameters. Detailed methods are described in the official documentation (https://scikit-learn.org/). Covariate-adjusted ROC curves (for sex and age), generated with the RISCA package, were plotted to evaluate the performance of the MLP model. ROC curves provide a measure of how well a biomarker can distinguish between tumor and non-tumor states. The area under the curve (AUC) is a common metric: an AUC of 1 represents perfect discrimination between the two groups, and an AUC of 0.5 indicates no discrimination ability, which is equivalent to random guessing.
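
The interpretation of the AUC given above can be made concrete with a short Python sketch. It computes an unadjusted AUC as the fraction of (HCC, non-HCC) pairs in which the HCC sample has the higher score, which is the Mann-Whitney U statistic divided by the number of pairs. This is only an illustration: it does not reproduce the covariate adjustment performed with RISCA, and the scores and the function name auc_from_scores are hypothetical.

```python
from typing import Sequence


def auc_from_scores(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Unadjusted AUC: probability that a random positive outscores a random negative.

    Ties contribute 0.5, so the result equals U / (n_pos * n_neg),
    where U is the Mann-Whitney U statistic of the positive group.
    """
    if not pos or not neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores (for example, predicted probabilities from a classifier).
perfect = auc_from_scores([0.9, 0.8], [0.1, 0.2])          # complete separation
uninformative = auc_from_scores([0.5, 0.5], [0.5, 0.5])    # all scores tied
partial = auc_from_scores([0.9, 0.4, 0.7], [0.1, 0.5, 0.3, 0.6])

print(perfect, uninformative, round(partial, 3))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable for cohorts of a few hundred samples such as the one described here; a rank-based implementation would be preferable for much larger datasets.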

Results

Differentially expressed genes (DEGs) between HCC and non-HCC tissues

To identify DEGs in HCC samples, gene expression patterns in HCC tissues were compared with both peritumoral (adjacent non-tumor) tissues and normal tissues. When HCC tumor tissues were contrasted with normal liver tissues, a volcano plot revealed 8,876 significant DEGs, comprising 8,224 upregulated and 652 downregulated genes (Fig. 1A). In the comparison between HCC tumor tissues and peritumoral tissues, the volcano plot identified 7,520 significant DEGs, including 5,927 upregulated and 1,593 downregulated genes (Fig. 1B). An absolute log2 fold change (|log2FC|) > 1 and an adjusted p-value < 0.05 were used as the criteria for determining DEGs, as specified in the Methods. Venn diagrams indicated that 4,828 upregulated DEGs (Fig. 1C) and 429 downregulated DEGs (Fig. 1D) overlapped between the two comparison groups (normal tissues vs. HCC tissues, adjacent-HCC non-tumor tissues vs. HCC tissues). These overlapping genes were selected for further analysis. To visualize these overlapping DEGs, heatmaps were constructed for both comparison groups (Fig. 1E-F).

Fig. 1. Differentially expressed genes between HCC (n = 167) and non-HCC tissues, including normal (n = 226) and peritumoral tissues (n = 167). (A) Volcano plot of differentially expressed genes between normal and HCC tissues. The red dots represent 8224 significantly upregulated genes and the blue dots represent 1593 significantly downregulated genes. Gray dots represent no statistical significance. (B) Volcano plot of differentially expressed genes between ADJ_HCC and HCC tissues, which includes 5931 significantly upregulated genes and 652 significantly downregulated genes. (C) Venn diagram of overlapping upregulated DEGs between the two compared groups (group "ADJ_HCC VS HCC" and group "normal VS HCC"). A total of 4824 overlapping genes were found.
(D) Venn diagram of overlapping downregulated DEGs between the two compared groups (group "ADJ_HCC VS HCC" and group "normal VS HCC"). A total of 429 overlapping genes were found. (E) Heatmap of significantly differentially expressed genes between normal and HCC tissues. (F) Heatmap of significantly differentially expressed genes between adjacent-HCC non-tumor and HCC tissues. The color in the heatmaps from blue to red represents the progression from low expression to high expression. Abbreviations: ADJ_HCC, adjacent-HCC non-tumor tissues; HCC, hepatocellular carcinoma tissue.

GO and KEGG pathway analysis of DEGs

To elucidate the biological functions of the 4,848 upregulated and 429 downregulated DEGs, we employed GO and KEGG pathway analyses to explore the genes involved in HCC-related biological responses. In the GO analysis of upregulated DEGs (Fig. 2A), the top 10 biological process (BP) terms were chromosome segregation, mitotic nuclear division, sister chromatid segregation, DNA replication, nuclear division, nuclear chromosome segregation, mitotic sister chromatid segregation, organelle fission, covalent chromatin modification, and regulation of cell cycle phase transition. The top 10 cellular component (CC) terms were: condensed chromosomal region; spindle; chromosome, centromeric region; condensed chromosome; kinetochore; condensed chromosome, centromeric region; spindle pole; condensed chromosome kinetochore; microtubule; and methyltransferase complex. In the molecular function (MF) analysis, the top 10 terms were: ATPase activity; DNA-dependent ATPase activity; helicase activity; DNA helicase activity; catalytic activity, acting on DNA; histone binding; tubulin binding; 3'-5' DNA helicase activity; microtubule binding; and Ras GTPase binding.
KEGG pathway analysis revealed that the top five enriched pathways for the upregulated DEGs were cell cycle, phosphatidylinositol signaling system, DNA replication, nucleocytoplasmic transport and ubiquitin mediated proteolysis (Fig. 2C). The remaining pathways in the GO and KEGG functional enrichment analysis are shown in Supplementary Table S2.

Fig. 2. Pathway and functional enrichment of overlapping upregulated and downregulated DEGs. (A) Gene Ontology analysis of 4824 overlapping upregulated genes; the figure shows biological process (BP), cellular component (CC) and molecular function (MF) terms. (B) Gene Ontology analysis of 429 overlapping downregulated genes; the figure shows biological process, cellular component and molecular function terms. (C) The most significant Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of 4824 overlapping upregulated genes. (D) The most significant Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways of 429 overlapping downregulated genes.

In the GO analysis of downregulated DEGs (Fig. 2B), the top 10 BP terms were organic acid catabolic process, carboxylic acid catabolic process, small molecule catabolic process, complement activation, carboxylic acid biosynthetic process, organic acid biosynthetic process, humoral immune response, alpha-amino acid catabolic process, alpha-amino acid metabolic process, and cellular amino acid catabolic process. In the CC analysis, the top 10 enriched terms were blood microparticle, collagen-containing extracellular matrix, plasma lipoprotein particle, lipoprotein particle, high-density lipoprotein particle, immunoglobulin complex, protein-lipid complex, endocytic vesicle lumen, collagen trimer, and haptoglobin-hemoglobin complex.
In the MF analysis, the top 10 enriched terms were heme binding, monooxygenase activity, tetrapyrrole binding, arachidonic acid monooxygenase activity, oxidoreductase activity, iron ion binding, steroid hydroxylase activity, oxidoreductase activity, aromatase activity and vitamin binding. KEGG pathway analysis showed that the top 5 enriched pathways of the downregulated DEGs were retinol metabolism, fatty acid degradation, chemical carcinogenesis-DNA adducts, complement and coagulation cascades, and drug metabolism-cytochrome P450 (Fig. 2D). The remaining pathways in the GO and KEGG functional enrichment analysis are shown in Supplementary Table S2.

PPI network construction of DEGs and identification of candidate genes

A PPI network is used to analyze protein–protein interactions. Proteins with the most connections to other proteins are often considered hub nodes in the network [24] and play a crucial role because they interact with many other proteins. These hub proteins are often considered key biomarkers because of their centrality in biological processes, which may be closely related to the onset and progression of diseases. Thirty candidate genes were first identified by overlapping the upregulated genes of the top 10 GO and KEGG pathways (Fig. 3A). A PPI network of these 30 genes was then constructed (Fig. 3B). The central 10 proteins (BLM, BRCA1, BRCA2, CDK1, DNA2, FEN1, MCM4, PCNA, PLK1, POLD1) showed the highest degree scores and were the most connected to the other 20 proteins. Therefore, the genes encoding these 10 proteins were selected as candidate genes for further evaluation in the independent tissue and plasma samples (Table 1).

Fig. 3. PPI network analysis of the 30 overlapping DEGs. (A) Venn plots of the upregulated genes in the top 10 GO and KEGG enrichment pathways. The 30 overlapping DEGs were used for PPI network analysis. (B) PPI network analysis of the 30 overlapping genes from panel A.

Table 1.
List of top 10 genes from PPI network with a degree score greater than 15

Gene symbol   Gene description                                  Degree score
BLM           BLM RecQ like helicase                            15
BRCA1         BRCA1 DNA repair associated                       20
BRCA2         BRCA2 DNA repair associated                       17
CDK1          Cyclin dependent kinase 1                         26
DNA2          DNA replication helicase/nuclease 2               21
FEN1          Flap structure-specific endonuclease 1            21
MCM4          Minichromosome maintenance complex component 4    15
PCNA          Proliferating cell nuclear antigen                18
PLK1          Polo like kinase 1                                15
POLD1         DNA polymerase delta 1                            16

Evaluation of 10 potential biomarkers using other public HCC tissue samples

To explore the performance of the 10 genes selected from the PPI network across various datasets, we additionally collected 179 public samples for evaluation. Boxplots and covariate-adjusted ROC curves (for sex and age) were used to evaluate their performance. The boxplots revealed that the expression levels of the 10 candidate genes (BLM, BRCA1, BRCA2, CDK1, DNA2, FEN1, MCM4, PCNA, PLK1, POLD1) were significantly upregulated in HCC tissues, distinguishing HCC tissues from adjacent-HCC non-tumor tissues (Fig. 4A-J). As shown in Fig. 4K, the AUC values of 8 candidate genes (BLM, BRCA1, BRCA2, CDK1, DNA2, FEN1, PCNA, PLK1) were greater than 0.8, indicating good discrimination ability. In contrast, the AUC values of POLD1 and MCM4 were 0.73 and 0.74, respectively, indicating lower discrimination ability. Thus, POLD1 and MCM4 were excluded from consideration as potential biomarkers for further evaluation.

Fig. 4. Performance of 10 candidate genes between adjacent-HCC and HCC tissues in the independent tissue dataset. (A-J) The 10 candidate genes were analyzed in boxplots to distinguish adjacent-HCC tissue from HCC tissue. ns p ≥ 0.05; ****p < 0.0001; data are shown as mean ± SD.
(K) Covariate-adjusted ROC curves of the 10 candidate genes in distinguishing peritumoral tissues (ADJ-HCC) from HCC tissues.

Evaluation of potential biomarkers in plasma for diagnosis

To assess the potential value of the 8 verified genes as non-invasive biomarkers, we analyzed their differential expression in plasma exosomes of HCC patients (n = 81) and non-HCC individuals (healthy and HBV-infected hepatitis) (n = 138). The boxplots and Wilcoxon test showed that 3 genes (CDK1, FEN1, PCNA) were significantly upregulated in the RNA expression data of HCC plasma exosomes compared with non-HCC (normal, HBV-infected hepatitis) (Fig. 5B-D). The covariate-adjusted ROC curves (for sex and age) showed that the AUC of each of the 3 hub genes for HCC detection was between 0.5 and 0.7 (Fig. 5A), indicating poor performance of any single gene in HCC diagnosis. Therefore, the three hub genes were analyzed jointly by developing an MLP model. The AUC of the MLP model for HCC detection was 0.85 and 0.84 in the training and test datasets, respectively, after adjusting for the covariates sex and age (Fig. 5A).

Fig. 5. Performance of 3 candidate genes in plasma. (A) Covariate-adjusted ROC curves comparing the MLP model and single biomarkers in the training dataset (n = 153) and test dataset (n = 66). The adjusted covariates are sex and age. Combined_AUC is the AUC of the MLP model combining the 3 genes. (B-D) Boxplots of the expression of the three genes in plasma exosomes of non-HCC (healthy and HBV-infected, n = 138) and HCC (n = 81) individuals. *p < 0.05, **p < 0.01. The p value was calculated by the Mann-Whitney U test.

Discussion

As the predominant type of liver cancer, HCC has a very high mortality rate, which is mainly due to tumor heterogeneity and a lack of early diagnosis [25]. It is important to find accurate and reliable biomarkers for the diagnosis of HCC.
Although many HCC-related biomarkers have been identified in previous studies, most of those studies did not evaluate biomarker performance in diverse populations [26, 27]. Further, most studies did not assess the potential value of these markers as non-invasive diagnostic biomarkers using clinical plasma samples [28]. In the present study, we integrated multiple public datasets from different studies to identify HCC-related hub genes using various bioinformatic tools. The potential value of the identified hub genes in HCC diagnosis was evaluated using plasma exosome RNA-seq data. Combined with a machine learning algorithm, the identified hub genes showed potential value for non-invasive HCC detection.

In our study, most of the top 10 enriched terms in the GO analysis of upregulated DEGs were related to cell cycle regulation. Similarly, the top 5 enriched pathways in the KEGG analysis of upregulated DEGs were predominantly linked to cell cycle regulation. This suggests that the cell cycle regulation pathway is important for HCC development. In agreement, previous findings report that disruption of the cell cycle may lead to cell cycle arrest, dysfunction of transcription and uncontrolled cell growth, which can be the basis of tumorigenesis and can affect the prognosis of cancer [29–31]. Zhili Zeng et al. reported that HCC-related genes were also enriched in the cell cycle regulation pathway, using 35 HCC tissues and 35 normal tissues [32]. Similar findings were also reported by Lanyi Zhang in 2022 [33].

Three hub genes, CDK1, FEN1 and PCNA, were identified by combining GO, KEGG and PPI network analysis. These genes have been reported to be associated with HCC. For instance, mRNA expression of CDK1 was upregulated in HCC tissues [34]. Several studies based on functional assays showed a correlation between CDK1 and the proliferation and migration of HCC cells [35, 36].
Aberrant expression or mutation of FEN1 has been identified in numerous solid tumors, where it has facilitated the migration and proliferation of tumor cells [37, 38]. FEN1 overexpression was observed to promote cell proliferation, migration, and invasion of HCC cells, whereas FEN1 suppression inhibited these processes. This suggests that FEN1 is significantly involved in the pathogenesis of HCC [39–41]. In addition, research demonstrated a positive correlation between FEN1 expression and tumor size, distant metastasis, and vascular invasion in HCC [42]. PCNA has been suggested as a potential diagnostic biomarker for early HCC [43–45]. One study showed that the expression intensity of PCNA is much higher in HCC tissues than in paracarcinoma tissues and that PCNA is also associated with AFP, albumin, tumor number, clinical grade, vascular invasion, and tumor-node-metastasis (TNM) stage (all p < 0.0001) [46]. Li et al. found, by analyzing RNA-seq data from HCC samples and adjacent non-tumor samples, that PCNA expression levels increased with increasing histological stage and tumor size [47]. In addition, PCNA has been reported to regulate apoptosis and glycolysis, both of which are critical in HCC tumorigenesis [48].

In our study, we found that CDK1, FEN1 and PCNA were significantly increased in the plasma exosomes of HCC patients. The covariate-adjusted ROC curves revealed that the diagnostic efficacy of a single gene in plasma was not high, with an AUC between approximately 0.5 and 0.7. The MLP model combining the three genes substantially improved the accuracy of HCC diagnosis, suggesting its potential value for HCC detection. Prior to the clinical application of these biomarkers and the MLP model, it is essential to collect a larger number of samples to validate their efficacy and reliability.
To integrate the exosome RNA detection and MLP model into the current clinical workflows as an aided diagnosis, it is necessary to standardize all the experimental processes including exosome RNA extraction, RNA quantification and set up automated prediction processes of MLP model. This is important for consistency in results. Numerous kits are available for extracting RNA from plasma exosomes, and the expression levels of CDK1, FEN1, and PCNA genes can be detected using real-time qPCR machines, which are commonly employed in hospitals for diagnostic purpose. The MLP model can be installed on a computer to predict the risk score for HCC. A risk score below the predetermined cutoff value is classified as low-risk for HCC, while a score above it is considered high-risk. High-risk individuals should undergo regular follow-up and monitoring to facilitate early detection and treatment, thereby enhancing therapeutic outcomes and survival rates. Implementing these workflows in hospitals presents challenges, primarily in determining the appropriate cutoff value for the risk score. The market offers a diverse range of blood collection tubes, exosome RNA extraction kits, and qPCR machines. Variations in these tools can impact the efficiency of exosome RNA extraction and the determination of risk score cutoff values. Consequently, the cutoff value must be fine-tuned based on the specific workflows, kits, and machines used. Also, large sample size is needed to determine the appropriate risk score. Despite these challenges, transforming the potential of exosomal RNA biomarkers to clinical diagnostics is significant. The limitation of the study is the small size of plasma samples. This means the plasma samples used in our study may not be fully representative of the entire HCC population. 
This partial representation could mean that the results are not generalizable to all HCC cases, potentially missing genetic or molecular variations present in HCC cases not included in the study. Further, the study collected multiple tissue datasets from various studies, which could have been subject to different experimental conditions and sample processing methods among laboratories. These variations, known as batch effects, can introduce bias and complicate the interpretation of the data. Therefore, in future work, we will collect more HCC and adjacent non-tumor tissue samples, as well as plasma samples. This will help to validate the performance of the identified biomarkers and the model in a larger and more diverse cohort, which can increase the robustness and generalizability of the model. This is crucial for the development of effective diagnostic tools for HCC. In addition, we plan to conduct longitudinal studies that collect plasma from the same subjects over a longer period, allowing a more in-depth evaluation of the association between the biomarkers and the diagnosis and progression of HCC.

Conclusions

In conclusion, our study integrated multiple public RNA-seq datasets to identify three significantly upregulated biomarkers (CDK1, FEN1, and PCNA) of HCC. Notably, the expression levels of the three biomarkers were significantly increased in plasma exosomes of HCC patients. We developed a non-invasive diagnostic model based on the three hub genes, which showed good differentiation between HCC patients and non-HCC individuals, including healthy individuals and those with HBV-infected hepatitis. However, before this model can be integrated into clinical practice, further validation with a larger sample size is necessary to assess the reliability of the three biomarkers and the model's performance.

Electronic Supplementary Material

Below is the link to the electronic supplementary material.
Supplementary Material 1 (13.7 KB, docx)
Supplementary Material 2 (46.6 KB, xlsx)
Supplementary Material 3 (230.3 KB, xlsx)

Acknowledgements