Abstract

Muscle-invasive bladder carcinoma is a complex, multifactorial disease caused by disruptions and alterations of several molecular pathways that result in heterogeneous phenotypes and variable disease outcome. Combining this disparate knowledge may offer insights for deciphering relevant molecular processes and may support targeted therapeutic approaches guided by molecular signatures that allow improved phenotype profiling. The aim of the study is to characterize muscle-invasive bladder carcinoma on a molecular level by incorporating scientific literature screening and signatures from omics profiling. Public domain omics signatures together with molecular features associated with muscle-invasive bladder cancer were derived from literature mining to provide 286 unique protein-coding genes. These were integrated in a protein-interaction network to obtain a molecular functional map of the phenotype. This feature map pointed to three novel disease-associated pathways with plausible involvement in bladder cancer, namely Regulation of actin cytoskeleton, Neurotrophin signalling pathway and Endocytosis. Systematic integration approaches make it possible to study the molecular context of individual features reported as associated with a clinical phenotype and could potentially help to improve the molecular mechanistic description of the disorder.

Introduction

Bladder cancer (BC) accounted for an estimated 72,570 new cases and 15,210 deaths across the United States in 2013 [1], clearly demonstrating a need for improved diagnosis and therapy. Bladder cancer is the ninth most frequent malignancy, with an approximate ratio of 5:1 of non-muscle-invasive to muscle-invasive phenotypes [2]. Major risk factors are smoking and occupational exposures along with genetic predispositions, such as N-acetyltransferase 1 (NAT1), N-acetyltransferase 2 (NAT2) and glutathione S-transferase µ1 (GSTM1) polymorphisms [3].
Although they vary between bladder cancer patients, initial symptoms include haematuria and flank pain, commonly presenting at advanced cancer stages and caused by ureteric obstruction due to invasion of the bladder muscular wall or ureter, together with recurrent urinary tract infections [4, 5]. Evidence suggests that malignant transformation of the bladder is multifactorial and that a multitude of genes are involved in the development of the muscle-invasive or non-muscle-invasive phenotype [6, 7]. The major histological type is transitional cell carcinoma, occurring in approximately 90% of diagnosed bladder tumours (with the rest being mainly squamous cell carcinomas and adenocarcinomas), with categories of non-invasive papillary (Ta) or flat (Tis), subepithelial invasive (T1), muscle-invasive (T2–T4) and metastatic (N+, M+) disease, all differing in biology, progression characteristics and hence clinical management. The majority of cases are non-muscle-invasive (Tis, Ta, T1) and 10–15% are muscle-invasive tumours (T2–T4), the latter being associated with fast recurrence and poorer prognosis owing to progression towards metastasis formation. Cystoscopy is the gold standard for detection, with a reported sensitivity and specificity in the range of 62–84% and 43–98%, respectively [8]. Because of the invasive nature of the procedure, but also to add accuracy to detection, biomarkers assessed in blood or urine are considered beneficial for supporting clinical assessment [9]. This is also relevant for disease prognosis, as biomarkers measured at the DNA, RNA and/or protein level have the potential to guide the choice of surveillance measures and treatment regimens for specific patient populations, with the aim of halting the development of muscle-invasive disease [10].
Treatment of papillary and non-muscle-invasive high-grade carcinoma involves endoscopic transurethral resection of visible tumours followed by adjuvant treatment with intravesical instillation therapy (Mitomycin/Epirubicin or Bacillus Calmette-Guérin (BCG)), depending on the estimated risk of progression. Despite aggressive treatment and vigorous follow-up, 70% of these tumours recur, and 25% of high-grade non-muscle-invasive cancers progress into invasive phenotypes [2, 11]. Comparison of the genetic characteristics of muscle-invasive and non-invasive tumours revealed that non-invasive tumours over-express HRAS and FGFR3 or produce highly activated forms of these proteins. As a result, the Ras/MAPK pathways are up-regulated in non-invasive tumours [12]. Muscle-invasive BC is associated with alterations of p53, retinoblastoma protein (RB1) and tumour suppressors controlling cell cycle processes, in addition to elevated expression of epidermal growth factor receptor (EGFR), human epidermal growth factor receptor 2 (HER2/ErbB2), matrix metallopeptidase 2 (MMP2) and MMP9, and deletions in p16Ink4a and p15Ink4b [3]. High-throughput experimental platform technologies ranging from genomic sequencing to proteomic and metabolomic profiling are now being used for molecular characterization of clinical phenotypes [13–19]. A variety of datasets have become available, e.g. in ArrayExpress/Gene Expression Omnibus (GEO) for transcriptomics, Human Proteinpedia for proteomics, or in large data consolidation platforms such as GeneCards [20]. With regard to disease-specific omics data, valuable general sources in oncology include TCGA (http://cancergenome.nih.gov/), Oncomine [21], and OMIM [22].
Although omics profiling has provided an abundance of data, technical boundaries, namely the incompleteness of the individual molecular catalogues together with the static representation of cellular activity, limit the insights into molecular processes and their interaction dynamics [23–25]. Despite these challenges, omics-based profiling has significantly advanced bladder cancer research, providing the basis for an integrative analysis approach to delineate a more comprehensive overview of the molecular processes and pathways that characterize variations of muscle-invasive urothelial carcinoma [12]. On the effector level, proteins interact and co-operatively form specific molecular processes and pathways. Intermolecular interactions of various types can be represented as networks (graphs), with molecular features denoted as nodes (vertices) and their interactions as edges. A large number of biological pathway resources have become available, including KEGG [26], PANTHER [27], REACTOME [28] and AmiGO [29], as described in PathGuide (http://www.pathguide.org/), all displaying well-defined human molecular metabolic and signalling pathways together with disease-specific pathways (e.g. pathways in cancer). Molecular features identified as associated with bladder cancer can be interpreted on the level of such pathways, adding to a functional interpretation of the molecular feature sets characterizing the phenotype. To add to our understanding of muscle-invasive bladder carcinoma (MIBC), we derived a phenotype-specific network model (interactome) by integrating omics signatures characterizing MIBC, as reported in scientific literature and databases. Our procedure incorporated scientific literature screening and signatures from omics profiling, resulting in 1,054 protein-coding genes associated with MIBC, which were further consolidated to 286 genes on the interactome level.
The results demonstrate the derivation of a systems-level model for molecular phenotyping of bladder cancer muscle invasion, presented as multiple affected pathways.

Materials and Methods

Data sources for characterizing bladder cancer pathophysiology

For consolidating molecular features associated with muscle-invasive bladder cancer, NCBI PubMed, Web of Science, Google Scholar and the omics repositories Gene Expression Omnibus (GEO) [30] and ArrayExpress [31] were queried. The keywords for the literature search included “bladder OR urothelial OR transitional cell” AND “neoplasm OR tumor OR carcinoma” AND “muscle” AND “invas* OR aggress* OR progress* OR inflammation” (database versions as of April 2014). By construction, this search query focused specifically on muscle-invasive bladder neoplasm. For extracting protein-coding genes associated with these publications, gene-2-pubmed as provided by NCBI was used [32]. The list of publications relevant to bladder cancer muscle invasion was isolated from the complete list of papers indexed in PubMed along with the associated gene IDs (ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz). Profiling experiments were further screened for adequacy in sample size (at least 50 samples included in the study design), magnitude of differential abundance (>2-fold change) and the specific phenotypic conditions: T1, T2[a/b], T3[a/b], T4[a/b] (Figs. 1 and 2). In addition, only papers mentioning the keywords “molecular” and “biomarker” were retained for deriving the literature-mined MIBC molecules and pathways.

Figure 1. Data assembly workflow. PubMed, Google Scholar and Web of Science literature analysis and omics data source screening with focus on transcriptomics. From the 4263 abstracts screened, 3979 articles were excluded because they did not specifically focus on the muscle-invasive bladder cancer phenotype (stages T2–T4).
188 studies out of 285 articles were discarded because they did not meet the requirements on study design and on a 2-fold change in magnitude of differential abundance of the identified features. This restriction resulted in 1,279 protein-coding genes, which were further used in the systems-based analysis for MIBC.

Figure 2. Feature set overlap. A. Redundant features were discarded from the 1,279 protein-coding genes, resulting in 1,054 unique features. The overlap between individual omics studies and literature was calculated. B. The 1,054 protein-coding genes were further reduced to 592 by discarding enzymes linked to metabolites as well as miRNA-targeted gene symbols; the remaining genes were used for deriving the induced MIBC subgraph based on BioGRID, IntAct and Reactome protein interaction information.

Interaction data and induced subgraph

Protein interaction information was obtained by querying IntAct [33], BioGRID [34], and Reactome [28], leading to a total of 233,794 interactions covering 13,907 protein-coding genes within the human interactome (database versions as of April 2014). Mapping the MIBC-associated molecular features onto this consolidated interaction network [13] provided an MIBC-specific induced subgraph. MIBC-associated features not connected to at least one other such feature were excluded from further analysis.

Functional analysis

The Cytoscape plug-ins ClueGO and CluePedia were used to identify pathways that are over-represented in the set of features located in the induced subgraph [35, 36]. KEGG pathway terms served as the clustering criterion, using a two-sided hypergeometric test followed by Bonferroni correction (significance level of 0.05) for identifying significantly affected pathways. General disease pathways (such as pathways in cancer, miRNAs in cancer, bladder cancer etc.) were discarded to obtain a set of generic pathway terms [13].
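The two computational steps described above, deriving an induced subgraph and testing pathways for over-representation, can be illustrated with a minimal Python sketch on toy data. All gene and pathway identifiers, function names and the background size are hypothetical and not taken from the study. As a simplification, the sketch computes a one-sided (enrichment) hypergeometric tail probability instead of the two-sided test applied in ClueGO, and it should not be read as the ClueGO implementation.

```python
from math import comb

def induced_subgraph(edges, features):
    """Keep interactions whose two endpoints are both in the feature set."""
    features = set(features)
    return {frozenset((a, b)) for a, b in edges
            if a != b and a in features and b in features}

def connected_features(sub_edges):
    """Features with at least one interaction partner in the subgraph."""
    nodes = set()
    for edge in sub_edges:
        nodes |= edge
    return nodes

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k): n genes drawn from N background genes, K of them in the pathway."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / comb(N, n)

def enriched_pathways(genes, pathways, background_size, alpha=0.05):
    """Return Bonferroni-adjusted p-values of pathways below the significance level."""
    n_tests = len(pathways)
    significant = {}
    for name, members in pathways.items():
        overlap = len(genes & members)
        p = hypergeom_upper_tail(overlap, len(members), len(genes), background_size)
        p_adj = min(1.0, p * n_tests)  # Bonferroni: multiply by the number of tests
        if p_adj < alpha:
            significant[name] = p_adj
    return significant

# Toy data (hypothetical identifiers, assumed for illustration only)
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G3", "X9"), ("G4", "X8")]
toy_features = {"G1", "G2", "G3", "G4"}
toy_pathways = {"P1": {"G1", "G2", "G3"}, "P2": {"G5", "G6", "G7"}}

subgraph = induced_subgraph(toy_edges, toy_features)
mapped = connected_features(subgraph)  # G4 has no partner among the features and drops out
result = enriched_pathways(mapped, toy_pathways, background_size=10)
```

In this toy setting only the features with at least one interaction partner among the other features enter the enrichment step, mirroring the exclusion rule stated above. The background size is passed explicitly because the choice of reference set strongly affects the resulting p-values.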
Protein coding gene selection based on literature mining

From the set of MIBC-associated protein-coding genes, each gene symbol was evaluated for being a member of the MIBC pathway set. The evidence for identified pathways and extracted genes being involved in MIBC was assessed based on the level of annotation depth, defined as the number of individual studies identifying such protein-coding genes as involved in MIBC. Specifically, such evidence was derived from metadata available in PubMed. Gene-2-pubmed was used for linking the molecules contained in the induced subgraph to publications relevant to bladder cancer muscle invasion. The quality of the publications obtained for each molecule was assessed by manual review. Only papers demonstrating a direct link of the molecule to bladder cancer muscle invasion were retained. For the entire pathway set, the ratio between the number of molecules linked to at least one urinary bladder neoplasm publication and the number of features in the pathway was computed and used for relevance ranking. For individual protein-coding genes identified in literature, the number of linked urinary bladder neoplasm publications was used as the relevance ranking criterion.

Results

Data Mining

Mining of published articles and omics repositories led to a collection of 285 references after manual screening (Fig. 1). This screening