Abstract

Urine has emerged as an attractive biofluid for the noninvasive detection of prostate cancer (PCa). There is a strong imperative to discover candidate urinary markers for the clinical diagnosis and prognosis of PCa. The rising flood of omics profiles presents immense opportunities for the identification of prospective biomarkers. Here we present a simple and efficient strategy to derive candidate urine markers for prostate tumors by mining cancer genomic profiles from public databases. The prostate, bladder and kidney are the three major tissues from which cellular material can be released into urine. To identify urinary markers specific for PCa, upregulated entities that might be shed in exosomes of bladder cancer and kidney cancer are first excluded. Through ontology-based filtering and further assessment, a reduced list of 19 entities encoding urinary proteins was derived as putative PCa markers. Among them, pathway enrichment analysis identified 10 entities closely associated with tumor cell growth and development. Further, using these 10 entities as seeds, we constructed a protein-protein interaction (PPI) subnetwork and suggested a few urine markers as preferred prognostic markers to monitor the invasion and progression of PCa. Our approach can be applied to discover and prioritize potential markers present in a variety of body fluids for a spectrum of human diseases.

Introduction

Prostate cancer (PCa) remains the most common malignancy and the second leading cause of cancer-related death for men worldwide [1]. Particularly in the western world, the number of men diagnosed with PCa has increased by 30% over the last 25 years and is expected to double by 2030 [2]. PCa is generally curable while the primary lesion is still in an early, localized state, but it is very difficult or no longer possible to cure once the tumor has spread to distant sites.
Therefore, early detection is essential for the successful clinical treatment of PCa. Currently, the combination of the DRE (digital rectal exam) and the PSA (prostate-specific antigen) blood test is commonly used as a screening test to detect PCa in the absence of symptoms. Unfortunately, it is well recognized that the usefulness of PSA is limited by its low specificity and low positive predictive value in early PCa detection. For example, the upper cut-off of the PSA reference level at 4.0 ng/ml fails to detect a large number of PCa cases, and many men with PSA values <4.0 ng/ml actually have PCa [3]. Moreover, it has been demonstrated that PSA can also be secreted into the bloodstream by other cancerous cells [4]. Hence, there is a clear need to identify putative molecular signatures that can facilitate accurate and non-invasive clinical detection of PCa.

Urine is an accessible and appealing body fluid for the early detection of PCa [5]. First, urine can be used to detect the presence of PCa because secreted prostatic products or exfoliated cancerous cells are released directly into the genitourinary tract. Second, urine can be collected easily, noninvasively and repeatedly in large amounts, which makes it an attractive material for the analysis of prostate malignancy. To date, a number of urine biomarkers such as GSTP-1 (glutathione-S-transferase P1), DD3 (prostate cancer antigen 3, PCA3) and TB-15 (thymosin β15) have been proposed as potential diagnostic agents for early PCa detection [6]. Moreover, with recently developed sophisticated mass-spectrometry (MS) technology, it has become possible to detect certain endogenous metabolites in urine for the early diagnosis of PCa. For instance, Sreekumar et al. [7] identified sarcosine (N-methylglycine) as a key metabolite in urine that could potentially be used as a marker for PCa malignancy.
Although these results are promising, few studies have assessed urine markers for PCa detection, and only a few candidate urine markers are under consideration for future clinical development. Further, no single marker is adequate for the accurate detection of PCa owing to the complexity and heterogeneity of the disease. Hence, a panel of urine markers is required for the successful diagnosis of PCa.

The explosion of biological data generated by high-throughput 'omics' technologies such as microarrays has provided unprecedented opportunities for researchers to uncover biomarkers and phenotypic pathways of clinical importance [8]. For instance, Kim et al. reported mining public gene profiles from the CGAP and GEO databases to identify seven putative markers for lung cancer [9]. Analogously, we have successfully identified lists of blood-borne markers for six common human cancer types through a combined mining strategy in the Oncomine microarray database and a pathway knowledgebase. Using a filter-based approach and comparison analysis, we retrieved disease-specific blood-based markers for each of the tumor types, as well as common markers shared between different tumors. Notably, a large portion of the retrieved genomic-based markers have been confirmed in the literature to be associated with the phenotypic pathways of tumor progression and invasiveness. Such findings should be useful for delineating potential targets with regard to the diagnosis, prognosis and pathogenesis of human solid tumors.

Here we present an integrative mining approach to analyze public genomic profiles for the discovery of potential urine markers for PCa detection. Our strategy has been developed so that a vast body of cancer genomic profiles can be analyzed in the context of other biological data such as gene ontology, metabolic pathways and gene-gene/protein-protein interaction (PPI) networks (see Figure 1).
To identify disease-specific markers for PCa, we retrieved upregulated genes in PCa, bladder cancer and kidney cancer from public cancer genomic databases. We mined for upregulated genes as PCa markers mainly because one of the prevailing hypotheses is that the most promising biomarkers for clinical use will be upregulated genes or their protein products. However, we recognize that this might not be generally true, and we do not rule out the possibility that downregulated genes could also be interesting candidate markers. Other researchers could choose to mine downregulated genes for their specific purposes by applying a strategy similar to the one used in this work. These upregulated genes were then filtered through a collection of ontology terms indicating presence in urine and through the Ingenuity Knowledgebase. A comparison analysis was performed across prostate, bladder and kidney, and only those entities unique to prostate were kept in the list as potential urinary markers for PCa. This is because entities present in bladder cancer and kidney cancer may interfere with the detection of PCa-derived markers shed into the human urinary system. Finally, the putative urine markers for PCa were analyzed and prioritized within metabolic pathways and protein-protein interaction networks. Our strategy highlights the significance of combining a variety of biological data to derive putative markers that are present in body fluids and have the disease specificity needed to detect common and lethal types of human cancers.

Figure 1. Workflow of integrative mining from public cancer genomic profiles for discovery of putative urinary markers for the specific detection of PCa. In the comparison pie graphs, "B" represents bladder, "K" represents kidney and "P" represents prostate.

Materials and Methods

The focus of our analysis approach is to retrieve putative markers present in urine for the specific detection of PCa.
Therefore, we need to retrieve genes that are significantly upregulated in PCa and encode urinary proteins, and filter them down to a manageable gene list. The choice of microarray platform or database, statistical cut-off criteria, and controlled ontology terms (Gene Ontology terms) in the mining strategy is variable, depending on the particular interest and requirements of the user.

Microarray data preparation and analysis

In brief, for each of the three tumor types (PCa, bladder cancer and renal cancer), MeSH terms (prostate cancer, prostatic cancer; bladder cancer; kidney cancer, renal cancer) were used to search for and obtain microarray experiments characterizing these disease conditions from two popular cancer genomic databases, the Oncomine database [10] and the ArrayExpress database [11]. Oncomine and ArrayExpress were chosen because they are two of the largest public cancer microarray repositories. In particular, Oncomine has incorporated 534 independent microarray datasets, which span 35 cancer types. It also unifies a large compendium of other published cancer microarray data, including the Gene Expression Omnibus (GEO) and the Stanford Microarray Database (SMD). ArrayExpress stores well-annotated raw and normalized cancer microarray data from more than 300 studies. The advantage of using Oncomine and ArrayExpress is that medical researchers can easily perform differential expression analyses comparing most major types of cancer with their respective normal or benign tissues. Microarray experiments comparing cancer vs. normal conditions, including malignant vs. benign conditions, measured in equivalent tissues in the same experiments were retained. We chose a relatively stringent FDR (false discovery rate) cut-off of 0.05 [12] in the analysis process, and only those overexpressed genes with an FDR value less than 0.05 were kept in the final list. Overexpressed genes in Oncomine and ArrayExpress were collected using the same FDR cut-off value.
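The FDR filter described above can be illustrated with a minimal sketch. The text does not state how the FDR values were computed (Oncomine and ArrayExpress report them directly), so the Benjamini-Hochberg procedure used here is an assumption, as are the per-gene p-values and the placeholder gene names; this is not the authors' implementation.

```python
from typing import Dict, Set


def benjamini_hochberg(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Return Benjamini-Hochberg adjusted p-values keyed by gene name.

    Assumption: BH is used as the FDR procedure; the source only states
    that an FDR cut-off of 0.05 was applied.
    """
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    m = len(ranked)
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values can be forced to be monotonically non-decreasing in rank.
    for rank in range(m, 0, -1):
        gene, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[gene] = running_min
    return adjusted


def genes_below_fdr(pvalues: Dict[str, float], cutoff: float = 0.05) -> Set[str]:
    """Keep only the genes whose adjusted p-value is below the cut-off."""
    adjusted = benjamini_hochberg(pvalues)
    return {gene for gene, q in adjusted.items() if q < cutoff}


# Hypothetical p-values for overexpressed genes in one cancer-vs-normal comparison.
example_pvalues = {
    "GENE_A": 0.001,
    "GENE_B": 0.01,
    "GENE_C": 0.03,
    "GENE_D": 0.2,
}

retained = genes_below_fdr(example_pvalues, cutoff=0.05)
print(sorted(retained))
```

With these illustrative inputs, the three genes with the smallest p-values pass the 0.05 cut-off and the fourth is discarded. In practice the same cut-off would be applied to each database export separately before the lists are merged.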
In addition, a customary fold change threshold of 2.0 was applied to retain only significantly overexpressed genes in the list. Redundant (duplicate) gene entries were removed from the list. By comparison analysis across the upregulated genes of the three tumor types using a C# program (see File S1), only those genes specifically upregulated in PCa were retrieved for further analysis.

Functional annotation enrichment and biomarker filtering

Functional annotation (Gene Ontology assignment) of the retrieved overexpressed genes was conducted using the DAVID system [13]. Next, a set of controlled GO terms implying presence in the urinary proteome was chosen according to a GO clustering analysis of 1273 urinary proteins (see Figure 2) collected from the MAPU urinary proteome database [14]. The GO clustering analysis was performed within the DAVID system and was used to measure how frequently each GO term appears among the urinary proteins. Specifically, these controlled GO terms and their frequencies are: Extracellular region part: 34.8%; Response to stimulus: 25.5%; Cell adhesion: 13.2%; Calcium ion binding: 11.3%; Cell communication: 5.5%; Amine metabolic process: 1.9%. According to the GO clustering analysis, these controlled GO terms are enriched and overrepresented in the urinary proteome. Further, the retrieved putative urine markers were looked up in the Sys-BodyFluid database [15], the MAPU proteome database and the Ingenuity Knowledgebase [16] to confirm their presence in urine. Entities that are not present in urine were removed from the list. These databases represent three of the most comprehensive public body fluid proteome resources and contain over 10,000 proteins with detailed annotations. Researchers can easily download and analyze protein targets present in various body fluids from these online databases.

Figure 2.
Pie chart of GO term frequencies among the urinary proteins, obtained by clustering analysis of 1273 urinary proteins performed within the DAVID system.

Pathway enrichment analysis

The derived list of putative urine markers was then subjected to pathway enrichment analysis by importing the entities into a few PPI (protein-protein interaction) databases, including Pfam [17], InterPro [18], the Ingenuity Knowledgebase and the KEGG Knowledgebase [19]. These PPI databases were chosen because they are widely used as reference knowledgebases for practical applications with network or pathway-based views of proteins, diseases and drugs. Moreover, the millions of pathway interactions stored in these knowledgebases were acquired by curation of scientific publications covering information on genes or proteins. The 19 entities were first imported as seeds to identify overrepresented biological functions and signaling pathways. Entities with direct physical interactions and co-expression supported by the literature were identified and used to construct a PPI network. In particular, those entities associated with tumor cell growth, development and proliferation were used as seeds to construct a PPI subnetwork related to the invasion and metastasis of PCa. Subnetworks were constructed such that the genes (proteins) were nodes, with edges between genes indicating the direct and indirect biological interactions between entities.

Literature Review of the Candidate Entities

Further analysis and assessment of the resulting putative markers was performed retrospectively using GeneCards (www.genecards.org), a curated database that provides links and cited articles for genes/proteins. The entities obtained were checked by carefully reading the associated literature references or original publications. The accuracy of the