Abstract Background The dreadful prognosis of nonmuscle invasive bladder cancer mainly results from the delay in recognition of individuals with a high risk of progression. Thus, the emphasis of this work lies in developing valuable biomarkers that is conducive to accurately predicting the progression of NMIBC. Methods Microarray data from [32]GSE32894 including 209 NMIBC samples were performed by weighted gene coexpression network analysis (WGCNA), which could find modules of highly correlated genes and relate modules to external sample traits. Besides, we constructed a protein–protein interaction to facilitate screening the hub gene. At last, we used RNA‐seq and microarray data and clinical information from ArrayExpress (E‐MTAB‐4321) and [33]GSE13507 to select and validate the candidate gene. Results In current paper, blue module of 13 gene coexpression clusters we identified was selected as the key modules. Seven genes namely: CDCA8, CENPF, MCM6, MELK, PRC1, STIL, and TPX2 have been identified as candidate genes. Notably, among them, only elevated CENPF in NIMBC tissue was closely associated with low progression‐free survival (PFS) and overall survival (OS) rate in three datasets and had a large area under receiver operating characteristic (ROC) curve. Finally, CENPF was identified as an effective biomarker in NMIBC. Conclusion Therefore, our findings submit a new progressive and prognostic molecular marker and therapeutic target for NMIBC. Moreover, these genes that deserve to be further researched may improve the comprehension about the occurrence and development of superficial bladder cancer. Keywords: biomarkers, CENPF, nonmuscle invasive bladder cancer (NMIBC), progression, weighted gene coexpression network analysis (WGCNA) 1. INTRODUCTION Bladder cancer is the ninth most common cancer in the world, with the 13th most common cause of cancer‐associated mortality (Ferlay et al., [34]2015). Depending on the guidelines of European Association of Urology (EAU) and American Urological Association (AUA), bladder cancer (BCa) can be divided into muscle invasive bladder cancer (MIBC) and nonmuscle invasive bladder cancer (NMIBC)—which are comprised of Ta (noninvasive papillary carcinoma), Tis (carcinoma in situ: “flat tumor”), and T1 (tumor invades subepithelial connective tissue) and represented the majority of primary BCa with approximately 85% (Babjuk et al., [35]2017; Chang et al., [36]2016). The prevailing management strategy for NMIBC is complete transurethral resection of the bladder tumor (TURBT) and adjuvant intravesical treatment (Babjuk et al., [37]2017; Chang et al., [38]2016). Despite this, it is reported that the progression rate of NIMBC can reach 10%–30%, especially those patients with T1G3 (Cambier et al., [39]2016). Van den Bosch and Alfred pointed out that two‐thirds of the secondary MIBC from progression died to BCa within 48 months even after radical treatment, and cancer‐specific survival is significantly dreadful (van den Bosch & Alfred, [40]2011). Fortunately, European Organization for Research and Treatment of Cancer (EORTC) risk tables were recommended for predicting the progression of NMIBC after TURBT (Babjuk et al., [41]2017; Chang et al., [42]2016). However, there are certain deficiencies in these risk tables. Numerous researches in recent years have established that remarkable the molecular heterogeneity has been shown in bladder cancer (Cancer Genome Atlas Research Network, [43]2014; Choi et al., [44]2014; Sanchez‐Carbayo, Socci, Lozano, Saint, & Cordon‐Cardo, [45]2006; Sjodahl et al., [46]2012; van Kessel et al., [47]2018). NMIBC is a heterogeneous set of tumors with thoroughly different oncologic outcomes. Furthermore, variability in staging and grading assessment is a recognized problem (Bol et al., [48]2003). Unfortunately, EORTC risk tables do not take these factors into account. Due to rapid technological advancements in the next‐generation sequencing, whole‐genome and RNA sequencing have been used to research pathological mechanisms and related biomarkers of BCa (Cancer Genome Atlas Research Network, [49]2014; Choi et al., [50]2014; Sanchez‐Carbayo et al., [51]2006; Sjodahl et al., [52]2012). According to the latest research, certain molecular markers can improve the accuracy of predictive progression of EAU risk stratification (van Kessel et al., [53]2018). Even though molecular profiles of NMIBC subtypes and considerable diagnostic indicators have been investigated as biomarkers of the risk of disease progression (Choi et al., [54]2014; Ding et al., [55]2014; Kim et al., [56]2010; Sanchez‐Carbayo et al., [57]2006; Sjodahl et al., [58]2012) and play a role in clinical management, none are able to accurately predict the behavior of NMIBC. Hence, identifying molecular biomarkers or therapeutic targets to address those problems is imminent. Unlike most other studies that focus on the screening of differentially expressed genes, this paper explores the affiliation between gene sets and clinical features by WGCNA—a systems biology method and enables the description of the correlation patterns between microarray samples and gene expression profiles (Langfelder & Horvath, [59]2008). Here, by this method, we find some candidate biomarkers associated with NMIBC progression and prognosis or therapeutic targets. 2. MATERIALS AND METHODS 2.1. Ethical compliance and study design We retrospectively analyzed the gene expression profiles of four public BCa cohorts, therefore ethical approval is not required. This study design was performed in a flow diagram (Figure [60]1). Figure 1. Figure 1 [61]Open in a new tab Flowchart of data collection, preparation, processing, analysis, and validation in this report 2.2. Data collection For consistency in the microarray platform, we searched in GEO database ([62]http://www.ncbi.nlm.nih.gov/geo/) for the BCa‐related gene expression datasets measured by the Illumina HumanHT‐12 V3.0 expression beadchip (Illumina, Inc.) and obtained two independent datasets. For one, non‐normalized microarray gene expressing profiles and corresponding clinical data were downloaded from [63]GSE32894 (n = 213) (Sjodahl et al., [64]2012) and used to screen probesets, construct of coexpression networks and identify hub genes in the present study. For another, the processed expression matrix of [65]GSE48075 (n = 54) (Choi et al., [66]2014) was obtained for module preservation analysis. Meanwhile, to verify the analysis results, expression data and related clinicopathologic information were acquired from [67]GSE13507 (n = 103) (Kim et al., [68]2010) and ArrayExpress ([69]https://www.ebi.ac.uk/arrayexpress/) (n = 460, E‐MTAB‐4321) (Hedegaard et al., [70]2016) which was performed on Illumina human‐6 v2.0 expression beadchip (48K) and paired‐end RNA‐Seq (101 + 7 + 101 bp) analysis on an Illumina HiSeq 2000 (Illumina, Inc.), respectively. In addition, the ONCOMINE database ([71]www.oncomine.org) (Rhodes et al., [72]2004; Sanchez‐Carbayo et al., [73]2006) were used for further analysis to validating the result of mRNA levels compared with normal tissue. All NMIBC (Ta/Tis/T1, 2009, 7th edition) patients were saved. 2.3. Data preprocessing and genes screening Non‐normalized data sets from [74]GSE32894 and ArrayExpress (E‐MTAB‐4321) were performed log2 transformation and standardized by quantile normalization used “preprocessCore” package (Bolstad, [75]2018) in R. Then, the reference genome (GRCh38.86) was used to convert Gene ID in the later dataset. Processed gene expression profiles from other datasets were utilized directly in the further analysis. Furthermore, according to the variance of the probe sets in all samples, the genes relevant to the probes of ranking in the top 20% were chosen for subsequent analysis. All the above operations are conducted on the programming software R and could be found in Appendix Code [76]S1. 2.4. Weighted coexpression network construction Following the instructions below, expression matrix of 9,761 probes was used to build coexpression network by the “WGCNA” packages in R (Langfelder & Horvath, [77]2008). Firstly, a gene coexpression resemblance measure (absolute value of the Pearson correlation) was utilized to estimate for all pairwise gene–gene relationship. Next, a weighted adjacency matrix was constructed using a “soft” power adjacency function a[ij] = |cor (x[i], x[j])| ^β. a[ij] indicates the weighted Pearson's correlation coefficient that measures the connection strength between gene i and gene j. The soft threshold power β, is the lowest integer where the resulting gene coexpression networks satisfy approximate scale‐free topology (Zhang & Horvath, [78]2005). The adjacency matrix was converted to topological overlap matrix (TOM) subsequently, which could evaluate the direct correlation of gene pairs and the degree of agreement in association with other genes in the data set. (Yip & Horvath, [79]2007). After that, average linkage hierarchical clustering was conducted in accordance with the TOM‐based dissimilarity measure. An appropriate minimum gene module size for the gene dendrogram was set to classify similar genes into one module (Ravasz, Somera, Mongru, Oltvai, & Barabasi, [80]2002). Moreover, module eigengene (ME) that can be regarded as a representative of the gene expression profiles of a module, is defined as the first principal component of an interesting module (Langfelder & Horvath, [81]2008). Similar modules would be merged by the calculation of ME. More detailed WGCNA method processing can be found in these references (Langfelder &