Abstract Introduction The heterogeneity-specific nature of the available colorectal cancer (CRC) biomarkers is significantly contributing to the cancer-associated high mortality rate worldwide. Hence, this study was initiated to investigate a system of novel CRC biomarkers that could commonly be employed to the CRC patients and helpful to overcome the heterogenetic-specific barrier. Methods Initially, CRC-related hub genes were extracted through PubMed based literature mining. A protein-protein interaction (PPI) network of the extracted hub genes was constructed and analyzed to identify few more closely CRC-related hub genes (real hub genes). Later, a comprehensive bioinformatics approach was applied to uncover the diagnostic and prognostic role of the identified real hub genes in CRC patients of various clinicopathological features. Results Out of 210 collected hub genes, in total 6 genes (CXCL12, CXCL8, AGT, GNB1, GNG4, and CXCL1) were identified as the real hub genes. We further revealed that all the six real hub genes were significantly dysregulated in colon adenocarcinoma (COAD) patients of various clinicopathological features including different races, cancer stages, genders, age groups, and body weights. Additionally, the dysregulation of real hub genes has shown different abnormal correlations with many other parameters including promoter methylation, overall survival (OS), genetic alterations and copy number variations (CNVs), and CD8+T immune cells level. Finally, we identified a potential miRNA and various chemotherapeutic drugs via miRNA, and real hub genes drug interaction network that could be used in the treatment of CRC by regulating the expression of real hub genes. Conclusion In conclusion, we have identified six real hub genes as potential biomarkers of CRC patients that could help to overcome the heterogenetic-specific barrier across different clinicopathological features. 1. Introduction Colorectal cancer (CRC) is the most common cancer and is one of the leading causes of cancer-related deaths worldwide [[30]1]. Genetic mutations and altered expression levels of various tumor suppressor genes including Adenomatous Polyposis Coli (APC), Tumor Protein 53 (TP53) and BReast CAncer gene (BRCA1) have long been considered as an important cause of CRC, which affect the development, progression, and metastasis of CRC through a variety of regulatory pathways [[31]2]. Advances in biomarkers screening technologies have greatly helped to discover the reliable novel diagnostic and prognostic biomarkers for the early detection and treatments of CRC. However, due to the heterogeneity-specific nature of the available diagnostic and prognostic biomarkers the recurrence and metastasis of CRC in patients of different races, different cancer stages, genders, age groups, and body weight are still, not fully addressed and remain the major challenge to clinical treatment [[32]3,[33]4]. In the recent times, microarray technology has become very popular among the scientist as it enables them to screen thousands of differentially expressed messenger RNAs (mRNAs), microRNAs (miRNAs), and long noncoding RNAs (lncRNAs) together [[34]5–[35]7], which play an important role in the development and progression of a disease. Besides, this technology has also helped to perform the in-depth analyses of the key genes to explore potential molecular targets and diagnostic biomarkers [[36]8]. The Gene Expression Omnibus database (GEO, available at: [37]http://www.ncbi.nlm.nih.gov/geo/) is an open-access gene expression database, which is created, and maintained by the National Center for Biotechnology Information (NCBI) [[38]9,[39]10]. This database is used to store and freely distribute the microarrays, next generation sequencing, and various other forms of high-throughput functional genomic datasets to the researchers worldwide [[40]10]. GEO database is the most attractive platform for the researchers to re-evaluate and re-analyze the microarray datasets through different bioinformatics approaches for the identification of disease-specific novel potential biomarkers. Many researchers have previously utilized the CRC microarray expression datasets to identify the potential biomarkers as hub genes. However, by considering the fact that, biomarkers are highly race, cancer stage, genders, age, and body weight-specific biomolecules and knowing that CRC microarray expression datasets, available in the GEO database, contain information extracted from patients of different races, different cancer stages, genders, age groups, and body weights. Moreover, on daily basis new datasets are added to the GEO database and each dataset suggests different CRC related hub genes. Hence, it has become clinically impossible to globally employ the hub genes which have been reported by individual studies through extensive analysis of specific GEO’s-CRC related microarray expression datasets [[41]11] as potential diagnostic and prognostic biomarkers to the CRC patients of all the races, cancer stages, genders, age groups, and body weights. Therefore, in this study, we planned to re-analyze the already reported CRC-related hub genes through a multi-layered bioinformatics based strategy to detect few more closely CRC-linked hub genes (real hub genes) that could commonly be used as potential diagnostic and prognostic biomarkers for CRC patients of different clinicopathological features. For this purpose, the already reported hub genes will be extracted from those published studies that utilized the CRC-related GEO expression datasets. Then, all these hub genes will be added to a single pool to establish a consolidated set of most significantly dysregulated hub genes exhibiting the high degree of centrality among the analyzed genes. Followed by this step, the pool of hub genes will be subjected to pathway enrichment analysis, PPI network construction and identification of the most centralized genes (real hub genes) and their underlying pathways [[42]12]. Next, the differential expression and validation analysis of the identified real hub genes in normal and CRC patients of different clinicopathological features will be carried out via multiple authentic platforms such as GEPIA database [[43]13], GENT2 database, UALCAN database [[44]14] and cBioPortal database [[45]15] using numerous TCGA colon adenocarcinoma (COAD) datasets consisting of a large cohort of normal individuals and COAD patients. Following this, we will investigate the correlation of the real hub genes expressions with their promoter methylation level, genetic alterations, copy number variations (CNVs), overall survival (OS) and CD8+ T immune cells levels in CRC patients relative to normal controls. Additionally, in an attempt to understand the regulatory mechanisms, miRNAs interaction patterns will be studied to explore the role of miRNAs, if any, as mediators of the real hub genes’ expression behavior. Similarly the effect of various chemotherapeutic drugs on expression profile of the identified real hub genes will be deduced through hub genes-drug interaction network analysis. The information obtained could help to regulate the real hub genes expression during the treatment of CRC. Taken together, this detailed mega-scale study based upon information retrieved from the analysis of large number of datasets and reported hub genes is therefore expected to find some common CRC biomarkers which can be exploited for the diagnostic and prognostic purposes of worldwide CRC patients of different clinicopathological features and thus helpful to overcome the heterogenetic-specific barrier. The information retrieved can further help in predicting the treatment outcomes in CRC patients. 2. Material and methods 2.1 Literature search and hub genes extraction A PubMed based search was performed to search all the studies which analyzed the CRC expression microarray datasets available on GEO (available at: [46]https://www.ncbi.nlm.nih.gov/geo/) database and identified the hub genes. For this purpose, two keywords “Hub genes AND colon cancer” and “Hub genes AND colorectal cancer” were searched on PubMed separately with the “Original article” filter. In total 108 studies appeared which were further explored to filter out the studies having desired information. In total, 21 studies were selected which were found to collectively analyze more than 30 CRC expression based microarray datasets retrieved from the GEO database and identify numerous hub genes. We extracted all the hub genes from these studies and assembled them in the form of a single pool. 2.2 Pathway enrichment analysis The Database for Annotation, Visualization and Integrated Discovery (DAVID, available at: [47]https://david.ncifcrf.gov) integrates biological data as well as analytical tools to systematically and comprehensively annotate the biological functions [[48]16]. We used the DAVID database for pathway enrichment analysis of the pooled hub genes. A p-value < 0.05 indicated the statistically significant differences. 2.3 Protein–protein interaction (PPI) network construction and mining the real hub genes The protein–protein interaction (PPI) analysis is important to interpret the molecular mechanisms of the key pathways in carcinogenesis. In the present study, we utilized the Search Tool for the Retrieval of Interacting Genes (STRING) database (available at: [49]https://string-db.org/) [[50]17] to construct the PPI network of all the pooled hub genes. The six real hub genes present in the PPI network were then identified through Cytohubba application of the Cytoscape tool (version:3.7.1) [[51]18], which can explore important nodes and fragile motifs in a network by several topological algorithms including degree-edge percolated component and degree of centrality. 2.4 GEPIA dataset analysis GEPIA (available at: [52]http://gepia.cancer-pku.cn/) is an online platform of retrieved data from the UCSC Xena database (available at: [53]https://xena.ucsc.edu/), which in-houses the expression data of 9736 tumor samples and 8587 normal samples [[54]13]. In this study, the transcriptional expression levels of the real hub genes were analyzed in COAD patients relative to control. For this purpose the Colon adenocarcinoma (COAD) dataset was utilized which includes 275 tumor and 349 normal samples. A t-test is used for the statistics purpose in GEPIA. The expression level of real hub genes in GEPIA was normalized as transcript per million (TPM) reads, and a p-value < 0.05 was considered to be statistically significant. We also utilized this database for the correlational analysis between real hub genes expression and OS duration of the COAD patients. 2.5 GENT2 dataset analysis Gene Expression database of Normal and Tumor tissues 2 (GENT2, available at: [55]http://gent2.appex.kr) is an online platform that provides a user-friendly overview of the gene expression patterns across different normal and tumor tissues compiled from publically available GEO datasets. GENT2 contains the expression data of more than 68,000 samples and has several useful functions. For example, GENT2 provides gene expression analysis option across 72 different cancerous tissues. GENT2 also provides an option to study the differential expression and its prognostic significance based on tumor subtypes. Additionally, GENT2 provides a meta-analysis of survival information to provide users more reliable prognostic value of a gene of interest. A t-test is used for the statistics purpose in GENT2. In the present study, this platform was used for further validation of the GEPIA based results of real hub genes expression patterns examined in COAD patients relative to controls [[56]19]. The expression level of real hub genes in GENT2 was normalized as transcript per million (TPM) reads, and a p-value of < 0.05 was considered to be statistically significant. 2.6 UALCAN dataset analysis The UALCAN (available at; [57]http://ualcan.path.uab.edu/) is an online publicly available web-portal that offers in-depth analysis of data from TCGA. In the present study, this database was used for the genes promoters’ methylation analysis and validation of variations detected in the real hub genes’ mRNA and protein expression profiles of COAD patients of different clinicopathological features relative to normal controls. In UALCAN t-test was used for the statistics purpose. The mRNAs’ expression levels of real hub genes were normalized as transcript per million (TPM) reads. While corresponding proteins expression levels were normalized as z-value, the promoters’ methylation levels were normalized as beta (β) value, and a p-value of < 0.05 value was considered to be statistically significant. 2.7 cBioPortal analyses An open-source tool, cBioPortal (Available at: [58]http://www.cbioportal.org) developed by the Computation Biology Center located at Sloan Kettering, was utilized to summarize all the possible transcriptional changes, mutual expression tendencies, and overall survival through Kaplan-Meier analysis, by presenting the results as OncoPrint. In this study, the cBioPortal database was used to analyze genetic variations such as (amplifications, deep deletions, and mutations) in the real hub genes in COAD patients. 2.8 Real hub genes and infiltrating level of CD8+ T cells in COAD patients TIMER (available at: [59]https://cistrome.shinyapps.io/timer/) is a web resource for systematical evaluations of the clinical impact of different immune cells in diverse cancer types [[60]20]. In the present study, this database was used to find the Spearman correlation between the levels of real hub genes’ expression and CD8+ T immune cells. In TIMER a t-test was used for the statistics purpose and a p-value of < 0.05 was considered to be statistically significant. 2.9 The miRNA–real hub gene interaction network analysis The miRNAs, targeting the real hub genes, were predicted through miRNA target prediction databases “Regulatory Network Repository of Transcription Factor and microRNA Mediated Gene Regulations (RegNetwork) database” (available at: [61]http://regnetworkweb.org/). The RegNetwork database contains information of experimentally validated regulatory elements of gene expression including transcription factors (TFs) and miRNAs. This platform provides a user-friendly interface for the submission of query of interest and allows the finding of combinatorial and synergic regulatory relationships among TFs, miRNAs and genes [[62]21]. A co-expression network based on the correlation analysis of real hub genes and miRNAs associated with the cancer was then developed by Cytoscape software. In the network, interaction between the miRNAs and real hub gene was represented by an arrow. The numbers of arrows in the networks indicated the contribution of one miRNA in the expression regulation of the surrounding genes. 2.10 Real hub gene-drug interaction network analysis The Comparative Toxicogenomics Database (CTD, available at: [63]http://ctdbase.org/) has been employed to obtain the information of chemotherapeutic drugs that could reduce or enhance the mRNAs or proteins expression levels of the genes of interest [[64]22]. Briefly, all the real hub genes were searched in the CTD database, and hub gene-drug interaction networks were visualized using Cytoscape software. 3. Results 3.1 Literature search and hub genes extraction In total 21 studies were selected, some of them have identified the hub genes in individual microarray dataset of CRC [[65]23–[66]25] while others have used the combination of multiple microarray datasets of CRC [[67]26–[68]28]. We extracted all hub genes reported in literature and pooled them after normalizing the duplicated genes, hence, a pool of 210 hub genes from 31 microarray datasets, containing 3128 CRC and 877 normal samples, were selected for further analysis ([69]Table 1). Raw data without normalization can be seen in [70]S1 Table (Supporting Information). Table 1. Details of CRC microarray based expression datasets and the identified hub genes. Dataset No. samples C/N Source of origin Extracted hub genes References