Abstract

Background
Prostate cancer is one of the most common complex diseases and a leading cause of death in men. Identifying prostate cancer-associated genes and biomarkers is therefore essential, as they can provide insight into the mechanisms underlying disease progression and support early diagnosis and the development of effective therapies.

Methods
In this study, we present an integrative analysis of gene expression profiling and the protein interaction network at a systematic level to reveal candidate disease-associated genes and biomarkers for prostate cancer progression. First, we reconstructed the human prostate cancer protein-protein interaction network (HPC-PPIN); the network was then integrated with prostate cancer gene expression data to identify modules related to different phases of prostate cancer. Finally, the candidate module biomarkers were validated by their ability to predict prostate cancer progression.

Results
Phase-specific modules were identified for prostate cancer. Among these modules, transcriptional Androgen Receptor (AR) nuclear signaling and the Epidermal Growth Factor Receptor (EGFR) signaling pathway were shown to be pathway targets for prostate cancer progression. The identified candidate disease-associated genes predicted prostate cancer progression better than published biomarkers. In the functional enrichment analysis, the candidate disease-associated genes were, interestingly, enriched in the nucleus, and different functions were encoded by potential transcription factors, for example key players such as AR, Myc and ESR1, and a hidden player, Sp1, which was considered a potential novel biomarker for prostate cancer.
Conclusions
The successful results on prostate cancer samples demonstrate that integrative analysis is a powerful and useful approach for detecting candidate disease-associated genes and modules that can serve as potential biomarkers for prostate cancer progression. The data, tools and supplementary files for this integrative analysis are deposited at http://www.ibio-cn.org/HPC-PPIN/.

Keywords: Biomarker, Disease-associated genes, Integrative analysis, Prostate cancer, Transcription factor

Background
Prostate cancer is the second leading cause of morbidity and mortality in men [1,2]. In recent years, the incidence of prostate cancer has increased dramatically [3], largely owing to the lack of diagnosis and treatment of the disease at an early stage [4]. Successful clinical biomarkers for the early diagnosis of prostate cancer are therefore urgently needed to reduce the risk of death from the disease [5,6]. In the post-genomic era, high-throughput technologies have generated an explosion of biological data and rapidly provided unprecedented multi-level omics data [7]. Transcriptomics, also referred to as gene expression profiling, can now comprehensively survey the entire human genome. Moreover, enormous efforts have been made to identify biomarkers for various cancers by analysing different transcriptomics data [8-12]. For example, our previous studies showed that integrative transcriptomics data could be used to identify putative novel prostate cancer-associated pathways, such as Endothelin-1/EDNRA transactivation of the EGFR pathway, which would provide essential information for the development of network biomarkers and individualized therapy strategies for prostate cancer [11-13]. Among other relevant cancer transcriptomics studies, a large-scale expression study by Wang et al.
identified a set of gene markers for predicting breast cancer metastasis [14], and Chari et al. subsequently demonstrated an approach based on multiple concerted disruptions (MCD) analysis that identified genes and pathways in cancer [15]. Furthermore, transcriptomics can be used to identify metabolic biomarkers through altered metabolic pathways at different cancer phases [16]. At other omics levels, proteomics in the context of protein-protein interaction networks can also be used to characterize and diagnose a pathological process [17]. As reported by Ideker and Sharan [18], genes that act as biomarkers in complex diseases tend to cluster together in well-connected protein interaction sub-networks. In the following years, Chuang et al. showed that integrating transcriptomics data with protein-protein interactions to extract co-expressed functional sub-networks for breast cancer metastasis yields higher classification accuracy [19]. Later, Taylor et al. studied altered protein interaction modularity to predict breast cancer progression by examining the biochemical structure of the interactome [20]. In addition, similar analyses of sub-networks and/or hub proteins have been helpful for understanding cancer metastasis at the molecular level [18]. For prostate cancer, several reports have identified disease-related gene modules, sub-networks or dysfunctional pathways by combining global characteristics of the interactome with gene expression data through newly developed algorithms and methods [21-23]. Nonetheless, there are still few studies on identifying prostate cancer biomarkers for early detection of the disease and of its progression [20].
The relationships among potential prostate cancer genes and their associated functions and pathways are still poorly characterized, for example how the genes interact with and regulate each other and what roles they play within network modules. Such investigations are warranted for a comprehensive understanding of the molecular mechanisms underlying prostate cancer progression. Hence, it remains a challenge to perform an integrative analysis of different data types, such as gene expression profiles, protein-protein interaction (PPI) data, pathway information and clinical information, which can offer different perspectives on the biological problems in prostate cancer and support the identification of potential biomarkers [24,25].

In this study, we therefore aim to reveal candidate disease-associated genes and biomarkers for prostate cancer progression through integrative gene expression profiling and network analysis at a systematic level. We first reconstructed the human prostate cancer protein-protein interaction network and used it as a scaffold for integrative analysis with prostate cancer gene expression data. Gene expression profiling of prostate cancer was analysed at different disease phases. Through modular analysis, the modules associated with each disease phase were then identified. Finally, we identified significant genes through these modules, which were taken to be gene expression signatures highly relevant to specific phases of prostate cancer. By overlapping the genes identified in the different modules, we obtained common genes that were expected to help uncover novel prostate cancer-related pathways and transcription factors that could be candidate biomarkers for prostate cancer progression. Our study thereby demonstrates a practical workflow for integrative analysis of prostate cancer at the systematic level.
For genome-wide studies, this work is a basic effort towards the future development of translational biomedical informatics, which ultimately aims to improve patient outcomes and diagnostics with omics datasets through integrative systems biology [26].

Methods

Human prostate cancer protein interaction network reconstruction and annotation
The human prostate cancer protein-protein interaction network (HPC-PPIN) was first reconstructed for use in the integrative analysis, as illustrated in Figure 1. To reconstruct the HPC-PPIN, we used two types of datasets. The first was the set of genes associated with prostate cancer, derived from a collection of prostate cancer databases and other relevant resources (e.g. Dragon Database of Genes associated with Prostate Cancer (DDPC) [27], GeneGo [28], OMIM [29], KEGG [30], PGDB [31], CCDB [32], and Gene Ontology (GO) [33]). The second was the human (Homo sapiens) protein-protein interaction data, which was downloaded from the BioGRID database [34].

Figure 1. The modular analysis pipeline. The diagram shows the identification of candidate disease-associated genes as potential module biomarkers based on integrative analysis of the reconstructed human prostate cancer protein-protein interaction network (HPC-PPIN) and the gene expression profiles of the different phases of prostate cancer. The greedy algorithm of the Cytoscape jActiveModules (jAM) plugin, used to find the most significant core sub-networks in each gene expression profile, was set to three iterations and the top ten ranks.

For the annotation of the HPC-PPIN, we used the Database for Annotation, Visualization and Integrated Discovery (DAVID) system [35,36].
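The reconstruction step of this subsection combines a curated list of prostate cancer-associated genes with BioGRID interactions, but the text does not state the exact rule used to combine them. The sketch below illustrates one plausible rule, stated here as an assumption and not as the authors' procedure: keep only interactions whose two partners both belong to the curated gene list. The function names `induced_subnetwork` and `degree_table` are hypothetical, and the interaction pairs are made-up toy input, not BioGRID records.

```python
from collections import defaultdict


def induced_subnetwork(interactions, seed_genes):
    """Return undirected edges whose two partners are both seed genes.

    ASSUMPTION: an "induced subnetwork" rule; the paper does not specify
    how the gene list and the interaction data were combined.
    """
    seeds = {gene.upper() for gene in seed_genes}
    edges = set()
    for a, b in interactions:
        a, b = a.upper(), b.upper()
        # Drop self-interactions and pairs with a partner outside the seed set.
        if a != b and a in seeds and b in seeds:
            # Sort the pair so (A, B) and (B, A) collapse into one edge.
            edges.add(tuple(sorted((a, b))))
    return edges


def degree_table(edges):
    """Count the number of interaction partners of each gene in the network."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


# Toy input for illustration only (not real database content).
seed_genes = ["AR", "EGFR", "MYC", "SP1", "ESR1"]
interactions = [
    ("AR", "SP1"),
    ("SP1", "AR"),    # duplicate in reverse order
    ("AR", "EGFR"),
    ("MYC", "TP53"),  # TP53 is not a seed gene, so this pair is dropped
    ("ESR1", "SP1"),
    ("AR", "AR"),     # self-interaction, dropped
]

network = induced_subnetwork(interactions, seed_genes)
degrees = degree_table(network)
print(sorted(network))
print(degrees)
```

A looser rule, such as keeping any interaction with at least one curated partner, would produce a larger network; the choice affects which modules can be found later, so it should be checked against the deposited supplementary files before reuse.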
First, the functional annotation clustering tool of the DAVID system was applied to group annotated genes within the HPC-PPIN across the three GO categories: molecular function, biological process, and cellular component. The tool was then used to identify the enriched GO terms within these three categories. To annotate detailed functions in the context of pathways covering metabolism, cellular processes, environmental information processing and genetic information processing, the KEGG database was used (http://www.genome.jp/kegg/pathway.html).

Prostate cancer gene expression data collection and analysis
Gene expression profiles generated on different array platforms from different stages of prostate cancer (i.e. disease stages I, II, III and IV) were collected from various laboratories. Table 1 lists the available information on the collected gene expression profiles (431 samples) of prostate cancer progression. Because fewer samples were available for stage I than for the other disease stages, stages I and II were combined into one phase (Table 1). All expression datasets were analysed statistically. The statistical processing methods were invoked through the limma (Linear Models for Microarray Data) package in R [37,38], with scripts run under R version 2.9.0 (R Development Core Team). The limma package [37] was applied to perform moderated Student's t-tests for all pairwise comparisons of disease phases, i.e. early-middle, middle-late and early-late phases, to determine significantly differential gene expression. An empirical Bayes method was applied to moderate the standard errors of each gene, and the Benjamini-Hochberg method was then applied to adjust for multiple testing [39] and to obtain adjusted p-values.

Table 1. Gene expression profiles of prostate cancer used for integrative analysis#
No. Exp. Platform No. Probes Samples Series No. Samples of prostate cancer stages References