Abstract Introduction Cirrhosis is one of the most important risk factors for development of hepatocellular carcinoma (HCC). Recent studies have shown that removal or well control of the underlying cause could reduce but not eliminate the risk of HCC. Therefore, it is important to elucidate the molecular mechanisms that drive the progression of cirrhosis to HCC. Materials and Methods Microarray datasets incorporating cirrhosis and HCC subjects were identified from the Gene Expression Omnibus (GEO) database. Differentially expressed genes (DEGs) were determined by GEO2R software. Functional enrichment analysis was performed by the clusterProfiler package in R. Liver carcinogenesis-related networks and modules were established using STRING database and MCODE plug-in, respectively, which were visualized with Cytoscape software. The ability of modular gene signatures to discriminate cirrhosis from HCC was assessed by hierarchical clustering, principal component analysis (PCA), and receiver operating characteristic (ROC) curve. Association of top modular genes and HCC grades or prognosis was analyzed with the UALCAN web-tool. Protein expression and distribution of top modular genes were analyzed using the Human Protein Atlas database. Results Four microarray datasets were retrieved from GEO database. Compared with cirrhotic livers, 125 upregulated and 252 downregulated genes in HCC tissues were found. These DEGs constituted a liver carcinogenesis-related network with 272 nodes and 2954 edges, with 65 nodes being highly connected and formed a liver carcinogenesis-related module. The modular genes were significantly involved in several KEGG pathways, such as “cell cycle,” “DNA replication,” “p53 signaling pathway,” “mismatch repair,” “base excision repair,” etc. These identified modular gene signatures could robustly discriminate cirrhosis from HCC in the validation dataset. In contrast, the expression pattern of the modular genes was consistent between cirrhotic and normal livers. The top modular genes TOP2A, CDC20, PRC1, CCNB2, and NUSAP1 were associated with HCC onset, progression, and prognosis, and exhibited higher expression in HCC compared with normal livers in the HPA database. Conclusion Our study revealed a highly connected module associated with liver carcinogenesis on a cirrhotic background, which may provide deeper understanding of the genetic alterations involved in the transition from cirrhosis to HCC, and offer valuable variables for screening and surveillance of HCC in high-risk patients with cirrhosis. Keywords: cirrhosis, hepatocellular carcinoma, transcriptome, module, prognosis Introduction Hepatocellular carcinoma accounts for 90% of all primary liver malignancies ([27]Mittal and El-Serag, 2013; [28]Galle et al., 2018), posing a serious threat to human health and quality of life. Worldwide, most patients with HCC have underlying cirrhosis of various etiologies ([29]Fattovich et al., 2004; [30]Beste et al., 2015; [31]Walker et al., 2016). Growing clinical evidence shows that removal or control of the injurious factors, such as hepatitis B or C virus, can reduce but not eliminate the risk of HCC ([32]Casado et al., 2013; [33]Marcellin et al., 2013; [34]Xu et al., 2015; [35]Sun et al., 2017). Therefore, it is important to understand the molecular mechanisms that drive the progression of cirrhosis to HCC. Hepatocellular carcinoma occurs as a consequence of the complex interplay between multiple genetic determinants ([36]Sanyal et al., 2010). Previous studies have found that aberrations in genetic molecules pertaining to oxidative stress, EMT, inflammatory response, cellular senescence, or telomere dysfunction may contribute to the progression of cirrhosis to HCC ([37]Ramakrishna et al., 2013). In addition, the Wnt/β-catenin, p53, pRb, MAPK, RAS, and JAK/STAT pathways are also reported to be canonical molecular pathways in HCC development ([38]Aravalli et al., 2008). However, different studies often yield diverse results and the global view on the landscape of genomic changes is still not very clear. With the aid of high-throughput detection techniques, all expressed genetic molecules in a given liver tissue sample can be simultaneously detected over a wide quantitative range ([39]Mas et al., 2009; [40]Villanueva et al., 2011; [41]Yildiz et al., 2013; [42]Wang et al., 2014; [43]Lee, 2015; [44]Schulze et al., 2015; [45]Villanueva et al., 2015; [46]Diaz et al., 2018; [47]Shen et al., 2018). High-throughput sequencing and microarray technologies allow investigators to simultaneously measure the changes of genome-wide genes under certain biological conditions. These approaches usually generate large “interesting” gene lists. By using biological knowledge accumulated in public databases (e.g., KEGG^[48]1), it is possible to systematically dissect large gene lists in an attempt to assemble a summary of the most enriched and pertinent biology. Therefore, integrated analyses of multiple datasets generated from different studies may help us to identify reliable and reproducible genetic alterations involved in the development of HCC on a cirrhotic background. Therefore, our present study used multiple bioinformatics tools to systematically integrate publicly available transcriptomic datasets and performed high-throughput gene expression comparisons between HCC and benign cirrhotic tissues. Materials and Methods Retrieval of Microarray Datasets on Cirrhosis and HCC From Public Database First, we searched and retrieved transcriptome profiles of cirrhotic and HCC tissues from GEO which is a public functional genomics data repository, allowing users to query, locate, review, and download studies and gene expression profiles of interest ([49]Barrett et al., 2013). The search terms we used included “cirrhosis” and “HCC.” Studies were considered eligible for analysis if: (1) studies contained both cirrhosis and HCC tissues; (2) species was limited to Homo sapiens; and (3) platform was limited to microarray. Then the retrieved datasets were further screened by manual retrieval. Our workflow for bioinformatics analysis of publicly available datasets is illustrated in [50]Figure 1. FIGURE 1. [51]FIGURE 1 [52]Open in a new tab Workflow for bioinformatics analysis. GEO, Gene Expression Omnibus; HCC, hepatocellular carcinoma; STRING, Search Tool for the Retrieval of Interacting Genes; MCODE, Molecular Complex Detection; HCL, hierarchical clustering; PCA, principal component analysis; KEGG, Kyoto Encyclopedia of Genes and Genomes; HPA, Human Protein Atlas. ROC, receiver operating characteristic. Identification of DEGs Related to Liver Carcinogenesis From the Retrieved Microarray Datasets Differentially expressed genes between cirrhosis and HCC tissues were defined as liver carcinogenesis-related genes that may have important implications in driving cirrhosis to HCC. Gene expression in all the datasets was normalized by the antilog-transformed RMA algorithm. GEO query and limma R packages contained in GEO2R, which allows gene expression analysis of published microarray datasets, was used to determine the DEGs between cirrhosis and HCC tissues ([53]Davis and Meltzer, 2007). FDR < 0.05 and FC > 1.5 were considered as the cutoff values for DEG screening. The overlapping DEGs in datasets were retained for further analyses. Functional Specification of the Identified DEGs Related to Liver Carcinogenesis To identify and visualize enriched KEGG pathways for the candidate gene sets, clusterProfiler, which is an R package for comparing biological themes among gene clusters, was employed ([54]Yu et al., 2012). Fisher’s exact test followed by the Benjamini correction was performed and an adjusted P-value of <0.05 was set as the cutoff criterion. Establishment of the LiverCarcinogenesis-Related Network andIts Modules The internal regulatory relationships between the identified liver carcinogenesis-related genes were predicted by the STRING database (confidence score > 0.4) ([55]Szklarczyk et al., 2017). Liver carcinogenesis-related network was established and visualized with Cytoscape software ([56]Shannon et al., 2003). We used the MCODE plug-in in the Cytoscape software ([57]Bader and Hogue, 2003) to screen the modules concealed in the liver carcinogenesis-related network with the following criteria: Max. depth = 100, K-Core = 2, mode score cutoff = 0.2, and degree cutoff = 2. Likewise, the functional specification of the identified module was determined with the clusterProfiler package as mentioned above. An adjusted P-value of <0.05 was considered statistically significant. Verification of the Identified Modules for Discriminating Cirrhosis From HCC We used three of datasets ([58]GSE89377, [59]GSE17548, and [60]GSE98383) to mine modules from the liver carcinogenesis-related network; and used the remaining dataset ([61]GSE56140) to validate the findings. To verify the ability of the identified modules to discriminate cirrhosis from HCC subjects, we performed hierarchical clustering analysis by using R with the complete linkage method and the expression of modular genes as a distance metric. To verify the results of hierarchical clustering, we applied the identified modular genes that were considered as observed variables to PCA plots. PCA was conducted with the ggbiplot package in R. The first two principal components (PCs) were then subjected to binary logistic regression analysis to calculate the predicted probability which was applied to the receiver operating characteristic (ROC) curve analysis implemented by SPSS 20 (IBM, United States). Area under curve (AUC) was calculated to determine the predictive performance of the identified gene module. In order to reduce sampling bias, the modules were screened from any three out of the four datasets and repeated evaluations of their discriminant ability were performed using the remaining dataset. Comparison of the Identified Modular Genes in Normal and Cirrhotic Samples We used GEO2R software to determine the expression differences between any two groups in dataset. An FDR of <0.05 and an FC of >1.5 were considered as the cutoff values for DEG screening. Modular gene expression in normal, cirrhotic, and HCC samples were visualized by using a heatmap drawn with MeV software^[62]2. Analyses of the Association Between the Top Modular Genes and HCC Histological Grade or Clinical Outcome Modular genes with FC > 3 between cirrhosis and HCC tissues in all GEO datasets were considered as the top modular genes. Association of the top modular genes and HCC grades or prognosis was analyzed by using UALCAN ([63]Chandrashekar et al., 2017), which is an interactive web-portal for exploring the association between tumor subgroup gene expression and survival in TCGA^[64]3. Expression differences of top modular genes between normal and different tumor grades were analyzed using the statistical method built-in the UALCAN web-software; a P-value of <0.05 was considered significant. According to the TPM expression values, the top modular genes were divided into a high expression group (with TPM values above upper quartile) and a low expression group (with TPM values below lower quartile). With information on the association between the gene expression and survival profiles documented in TCGA, Kaplan–Meier survival analyses were performed and overall survival plots were generated. The difference between high gene expression and low gene expression was compared by log-rank test; a P-value of < 0.05 was considered significant. In silico Analysis of the Top Modular Members in Normal and HCC Specimens Protein expression and distribution of the top modular genes in human liver tissue were searched in the HPA ^[65]4 database ([66]Uhlen et al., 2015). Results Retrieved Microarray Datasets Pertaining to Cirrhosis and HCC According to the retrieval criteria, four microarray datasets ([67]Table 1) containing a total of 95 benign (cirrhosis) and 98 malignant (HCC) subjects were found from the GEO database. [68]GSE98983 dataset was produced by Affymetrix Human Genome U133 Plus 2.0 Array ([69]GPL570); [70]GSE89377 by Illumina HumanHT-12 V3.0 expression beadchip ([71]GPL6947); [72]GSE17548 by Affymetrix Human Genome U133 Plus 2.0 Array ([73]GPL570); and.[74]GSE56140 by Illumina HumanHT-12 V3.0 expression beadchip. Table 1. Detailed information of the four microarray datasets used in the present study. ID Platform Sample Etiology References