Abstract

   Urine has emerged as an attractive biofluid for the noninvasive
   detection of prostate cancer (PCa). There is a strong imperative to
   discover candidate urinary markers for the clinical diagnosis and
   prognosis of PCa. The rising flood of various omics profiles presents
   immense opportunities for the identification of prospective biomarkers.
   Here we present a simple and efficient strategy to derive candidate
   urine markers for prostate tumors by mining cancer genomic profiles from
   public databases. Prostate, bladder and kidney are three major tissues
   from which cellular material can be released into urine. To identify
   urinary markers specific for PCa, upregulated entities that might be
   shed in exosomes of bladder cancer and kidney cancer are first
   excluded. Through ontology-based filtering and further assessment,
   a reduced list of 19 entities encoding urinary proteins was derived as
   putative PCa markers. Among them, we have found 10 entities closely
   associated with the process of tumor cell growth and development by
   pathway enrichment analysis. Further, using the 10 entities as seeds,
   we have constructed a protein-protein interaction (PPI) subnetwork and
   suggested a few urine markers as preferred prognostic markers to
   monitor the invasion and progression of PCa. Our approach can be
   applied to discover and prioritize potential markers present in a
   variety of body fluids for a spectrum of human diseases.

Introduction

   Prostate cancer (PCa) remains the most common malignancy and the
   second leading cause of cancer-related death among men worldwide [1].
   Particularly in the western world, the number of men diagnosed with PCa
   has increased by 30% over the last 25 years and is expected to double
   by 2030 [2]. PCa is generally curable while the primary lesion is
   still localized, but it is very difficult or impossible to cure once
   the tumor has spread to distant sites. Early detection is therefore
   essential for successful clinical treatment of PCa. Currently, the
   combination of the digital rectal exam (DRE) and the prostate-specific
   antigen (PSA) blood test is commonly used to screen for PCa in the
   absence of symptoms. Unfortunately, it is well recognized that the
   usefulness of PSA is limited by its low specificity and low positive
   predictive value in early PCa detection. For example, the upper
   cut-off of the PSA reference level at 4.0 ng/ml fails to detect a
   large number of PCa cases, and many men with PSA values <4.0 ng/ml
   actually have PCa [3]. Moreover, it has been demonstrated that PSA can
   also be secreted into the bloodstream by other cancerous cells [4].
   Hence, there is a clear need to identify molecular signatures that
   can facilitate accurate and noninvasive clinical detection of PCa.

   Urine represents an accessible and appealing body fluid for the early
   detection of PCa [5]. First, urine can be used to detect the presence
   of PCa because secreted prostatic products or exfoliated cancerous
   cells are released directly into the genitourinary tract. Second,
   urine can be collected easily, noninvasively and repeatedly in large
   amounts, making it an attractive material for the analysis of prostate
   malignancy. To date, a number of urine biomarkers such as GSTP-1
   (glutathione-S-transferase P1), DD3 (prostate cancer antigen 3, PCA3)
   and TB-15 (thymosin β15) have been proposed as potential diagnostic
   agents for early PCa detection [6]. Moreover, with recently developed
   mass-spectrometry (MS) technology, it has become possible to detect
   certain endogenous metabolites in urine for the early diagnosis of
   PCa. For instance, Sreekumar et al. [7] identified sarcosine
   (N-methylglycine) as a key metabolite in urine that could potentially
   be used as a marker for PCa malignancy. Although these results are
   promising, few studies have assessed urine markers for PCa detection,
   and only a few candidate urine markers are under consideration for
   future clinical development. Further, no single marker is adequate
   for the accurate detection of PCa owing to the complexity and
   heterogeneity of the disease. Hence, a panel of urine markers is
   required for the successful diagnosis of PCa.

   The explosion of biological data generated from high-throughput
   ‘Omics’ technologies such as microarrays has provided unprecedented
   opportunities for researchers to uncover biomarkers and phenotypic
   pathways of clinical importance [8]. For instance, Kim et al. reported
   mining public gene profiles from the CGAP and GEO databases to
   identify seven putative markers for lung cancer [9]. Analogously, we
   have identified lists of blood-borne markers for six common human
   cancer types through a combined mining strategy using the Oncomine
   microarray database and a pathway knowledgebase. Using a filter-based
   approach and comparison analysis, we retrieved disease-specific
   blood-based markers for each of the tumor types as well as common
   markers shared between different tumors. Notably, a large portion of
   the retrieved genomic-based markers have been confirmed in the
   literature to be associated with the phenotypic pathways of tumor
   progression and invasiveness. Such findings should help delineate
   potential targets with regard to the diagnosis, prognosis and
   pathogenesis of human solid tumors.

   Here we present an integrative mining approach to analyze public
   genomic profiles for the discovery of potential urine markers for PCa
   detection. Our strategy is designed so that a vast body of cancer
   genomic profiles can be analyzed in the context of other biological
   data such as gene ontology, metabolic pathways and
   gene-gene/protein-protein interaction (PPI) networks (see Figure 1).
   To identify disease-specific markers for PCa, we retrieved
   upregulated genes in PCa, bladder cancer and kidney cancer from public
   cancer genomic databases. We mined for upregulated genes mainly
   because one of the prevailing hypotheses is that the most promising
   biomarkers for clinical use will be upregulated genes or their protein
   products. However, we recognize that this might not be generally
   true, and we therefore do not rule out the possibility that
   downregulated genes could also be interesting candidate markers. Other
   researchers could mine downregulated genes for their specific purposes
   by applying a strategy similar to the one in this work. The
   upregulated genes were then filtered through a collection of ontology
   terms indicating presence in urine and through the Ingenuity
   Knowledgebase. A comparison analysis was performed across prostate,
   bladder and kidney, and only those entities unique to prostate were
   kept in the list as potential urinary markers for PCa, because
   entities present in bladder cancer and kidney cancer may interfere
   with the detection of PCa markers shed into the human urinary system.
   Finally, the putative urine markers for PCa were analyzed and
   prioritized within metabolic pathways and protein-protein interaction
   networks. Our strategy highlights the significance of combining a
   variety of biological data to derive disease-specific putative
   markers present in body fluids for detecting common and lethal types
   of human cancers.

Figure 1. Workflow of integrative mining of public cancer genomic profiles
for the discovery of putative urinary markers for the specific detection of PCa.

   In the comparison pie graphs, “B” represents bladder, “K” represents
   kidney and “P” represents prostate.

Materials and Methods

   The focus of our analysis approach is to retrieve putative markers
   present in urine for the specific detection of PCa. Therefore, we need
   to retrieve genes that are significantly upregulated in PCa and encode
   urinary proteins, and to filter them down to a manageable gene list.
   The choice of microarray platform or database, statistical cut-off
   criteria, and controlled ontology terms (Gene Ontology terms) in the
   mining strategy can vary, depending on the particular interests and
   requirements of the user.

Microarray data preparation and analysis

   In brief, for each of the three tumor types (PCa, bladder cancer and
   renal cancer), MeSH terms (prostate cancer, prostatic cancer; bladder
   cancer; kidney cancer, renal cancer) were used to search for and obtain
   microarray experiments characterizing these disease conditions from two
   popular cancer genomic databases, the Oncomine database [10] and the
   ArrayExpress database [11]. Oncomine and ArrayExpress were chosen
   because they are two of the largest public cancer microarray
   repositories. In particular, Oncomine has incorporated 534 independent
   microarray datasets spanning 35 cancer types. It also unifies a large
   compendium of other published cancer microarray data, including data
   from Gene Expression Omnibus (GEO) and Stanford Microarray Database
   (SMD). ArrayExpress stores well-annotated raw and normalized cancer
   microarray data from more than 300 studies. The advantage of using
   Oncomine and ArrayExpress is that medical researchers can easily
   perform differential expression analyses comparing most major types of
   cancer with their respective normal or benign tissues. Microarray
   experiments comparing cancer vs. normal conditions, including
   malignant vs. benign conditions, measured in equivalent tissues within
   the same experiment were retained. We chose a relatively stringent
   FDR (false discovery rate) cut-off of 0.05 [12] for the analysis, and
   only overexpressed genes with an FDR value below 0.05 were kept in the
   final list. The same FDR cut-off was used to collect overexpressed
   genes from both Oncomine and ArrayExpress. In addition, a customary
   fold-change threshold of 2.0 was applied to retain significantly
   overexpressed genes. Redundant gene entries were removed from the
   list. The upregulated genes of the three tumor types were then
   compared using a C# program (see File S1), and only those genes
   specifically upregulated in PCa were retrieved for further analysis.
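
   The thresholding and comparison steps described above can be
   illustrated with a minimal Python sketch. This is not the C# program
   of File S1; the record layout, the function names and the choice to
   treat the fold-change threshold as inclusive (>= 2.0) are assumptions
   made only for illustration, and the gene symbols are placeholders.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class GeneResult(NamedTuple):
    """One differential-expression record (hypothetical layout)."""
    symbol: str
    fold_change: float  # cancer vs. normal expression ratio
    fdr: float          # false discovery rate for the comparison


def upregulated(records: Iterable[GeneResult],
                fdr_cutoff: float = 0.05,
                min_fold_change: float = 2.0) -> set[str]:
    """Return unique gene symbols with FDR below the cut-off and
    fold change at or above the threshold (inclusive bound is assumed)."""
    # Building a set also collapses redundant entries for the same gene.
    return {
        r.symbol
        for r in records
        if r.fdr < fdr_cutoff and r.fold_change >= min_fold_change
    }


def prostate_specific(prostate: set[str],
                      bladder: set[str],
                      kidney: set[str]) -> set[str]:
    """Keep genes upregulated in PCa but in neither bladder nor kidney cancer."""
    return prostate - bladder - kidney


if __name__ == "__main__":
    # Placeholder records, not real expression data.
    pca = upregulated([
        GeneResult("GENE_A", 3.1, 0.01),
        GeneResult("GENE_A", 2.5, 0.02),   # duplicate entry, collapsed
        GeneResult("GENE_B", 1.5, 0.01),   # fails the fold-change threshold
        GeneResult("GENE_C", 4.0, 0.20),   # fails the FDR cut-off
        GeneResult("GENE_D", 2.2, 0.04),
    ])
    bladder = {"GENE_D"}
    kidney: set[str] = set()
    print(sorted(prostate_specific(pca, bladder, kidney)))
```

   Representing each tumor type as a set makes the cross-tissue
   comparison a plain set difference, so the same two functions can be
   reused with other cut-off values or other tissues.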

Functional annotation enrichment and biomarker filtering

   Functional annotation (Gene Ontology assignment) of the retrieved
   overexpressed genes was conducted using the DAVID system [13]. Next, a
   set of controlled GO terms implying presence in the urinary proteome
   was chosen according to a GO clustering analysis of 1273 urinary
   proteins (see Figure 2) collected from the MAPU urinary proteome
   database [14]. The GO clustering analysis was performed within the
   DAVID system and was used to measure how frequently each GO term
   appears among the urinary proteins. Specifically, the controlled GO
   terms and their frequencies are: Extracellular region part: 34.8%;
   Response to stimulus: 25.5%; Cell adhesion: 13.2%; Calcium ion
   binding: 11.3%; Cell communication: 5.5%; Amine metabolic process:
   1.9%. According to the GO clustering analysis, these controlled GO
   terms are enriched and overrepresented in the urinary proteome.
   Further, the retrieved putative urine markers were checked against
   the Sys-BodyFluid database [15], the MAPU proteome database and the
   Ingenuity Knowledgebase [16] to confirm their presence in urine.
   Entities not present in urine were removed from the list. These
   databases represent three of the most comprehensive public body fluid
   proteome resources and contain over 10,000 proteins with detailed
   annotations. Researchers can easily download and analyze protein
   targets present in various body fluids from these online databases.
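
   The two-stage filter described above (controlled GO terms, then
   confirmation of presence in urine) can be sketched as follows. The
   function name and input layout are hypothetical, and two rules are
   assumptions rather than statements from this work: a gene passes the
   GO stage if it carries at least one controlled term, and the set of
   urine proteins is taken as already merged from the consulted
   databases.

```python
from __future__ import annotations

# Controlled GO terms listed in the text, lower-cased for matching.
URINE_GO_TERMS = frozenset({
    "extracellular region part",
    "response to stimulus",
    "cell adhesion",
    "calcium ion binding",
    "cell communication",
    "amine metabolic process",
})


def filter_urinary_candidates(go_annotations: dict[str, set[str]],
                              urine_proteins: set[str]) -> list[str]:
    """Keep genes that carry at least one controlled GO term (assumed rule)
    and that are also reported as present in urine."""
    kept = []
    for gene, terms in go_annotations.items():
        normalized = {t.strip().lower() for t in terms}
        has_controlled_term = bool(normalized & URINE_GO_TERMS)
        if has_controlled_term and gene in urine_proteins:
            kept.append(gene)
    return sorted(kept)


if __name__ == "__main__":
    # Placeholder annotations, not real DAVID output.
    annotations = {
        "GENE_A": {"Cell adhesion", "nucleus"},
        "GENE_B": {"nucleus"},
        "GENE_C": {"Calcium ion binding"},
    }
    reported_in_urine = {"GENE_A", "GENE_B"}
    print(filter_urinary_candidates(annotations, reported_in_urine))
```

   Requiring every controlled term, or requiring confirmation in all
   databases rather than any of them, would be stricter variants of the
   same filter.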

Figure 2. Pie chart of GO term frequencies among urinary proteins, obtained by
clustering analysis of 1273 urinary proteins performed within the DAVID system.

Pathway enrichment analysis

   The derived list of putative urine markers was then subjected to
   pathway enrichment analysis by importing it into several protein
   annotation and protein-protein interaction (PPI) knowledgebases,
   including Pfam [17], InterPro [18], the Ingenuity Knowledgebase and
   the KEGG Knowledgebase [19]. These resources were chosen because they
   are widely used as reference knowledgebases for practical applications
   involving network- or pathway-based views of proteins, diseases and
   drugs. Moreover, the millions of pathway interactions stored in these
   knowledgebases were acquired by curation of scientific publications
   covering information on genes or proteins. The 19 entities were first
   imported as seeds to identify overrepresented biological functions and
   signaling pathways. Entities with direct physical interactions and
   co-expression supported by the literature were identified and used to
   construct a PPI network. In particular, those entities associated with
   tumor cell growth, development and proliferation were used as seeds to
   construct a PPI subnetwork related to the invasion and metastasis of
   PCa. Subnetworks were constructed such that the genes (proteins) were
   nodes, with edges between genes indicating direct and indirect
   biological interactions between entities.
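
   As a rough illustration of the seed-based subnetwork described above,
   the sketch below keeps every interaction that touches at least one
   seed and records whether it is direct or indirect. The edge-list
   format, the function name and the "touches a seed" inclusion rule are
   assumptions for illustration; the knowledgebases named above apply
   their own network-building procedures.

```python
from __future__ import annotations

from collections import defaultdict


def seed_subnetwork(interactions: list[tuple[str, str, str]],
                    seeds: set[str]) -> dict[str, dict[str, str]]:
    """Build an undirected adjacency map from edges touching a seed.

    Each interaction is (protein_a, protein_b, kind), where kind is
    "direct" or "indirect". Nodes are genes (proteins); each edge stores
    its interaction kind.
    """
    graph: dict[str, dict[str, str]] = defaultdict(dict)
    for a, b, kind in interactions:
        if a in seeds or b in seeds:
            graph[a][b] = kind
            graph[b][a] = kind
    return dict(graph)


if __name__ == "__main__":
    # Placeholder interactions, not curated data.
    edges = [
        ("SEED_1", "PROT_X", "direct"),
        ("PROT_X", "PROT_Y", "direct"),    # no seed involved, dropped
        ("SEED_2", "SEED_1", "indirect"),
    ]
    network = seed_subnetwork(edges, {"SEED_1", "SEED_2"})
    for node in sorted(network):
        print(node, sorted(network[node].items()))
```

   A nested dictionary keeps the example free of third-party
   dependencies; a graph library would be a natural replacement for
   larger networks.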

Literature Review of the Candidate Entities

   Further analysis and assessment of the resulting putative markers was
   performed retrospectively using GeneCards (www.genecards.org), a
   curated database that provides links and cited articles for
   genes/proteins.
   The entities obtained were checked by carefully reading the associated
   literature references or original publications. The accuracy of the