Abstract

   Bladder cancer (BC) is one of the most common malignancies in urinary
   system with a common malignancy in urinary system with a high mortality
   and recurrence rate, so we attempt to construct a gene signature to
   predict the prognosis of BCs. We initially established a co-expression
   network by performing WGCNA analysis and further identified magenta
   module as key module (P = 8e-05, R2 = 0.4). Subsequently, we screened
   12 genes associated with survival from the key module, which were
   selected to construct an eight-gene signature by establishing a LASSO
   Cox model. Moreover, we reckoned the risk score (RS) of each sample,
   through which we could divide samples into two groups (the high-risk
   and low-risk groups) and verify the signature, in the training set and
   3 validation sets (internal test set, GSE13507and E-MTAB-4321). This
   signature could distinguish between the high- and low- risk patients
   well (survival analysis: P = 0.015; AUC: 0.61 at 1 year, 0.61 at 3
   years and 0.61 at 5 years). In the validation sets, this signature also
   showed good performance, which was consistent with the training test.
   Furthermore, we plotted a nomogram to predict the possibility of the
   overall survival (OS) and three calibration curves to predict the
   effectiveness of the nomogram, which suggested good value and clinical
   utility of the nomogram. In conclusion, we established an eight-gene
   signature, which was probably effective in the prediction of prognosis
   of patients with BC.

   Keywords: bladder cancer, LASSO, WGCNA, grade, survival, gene
   signature, nomogram, overall survival, prognosis

Introduction

   BC is the tenth most common type of cancer worldwide, with
   approximately morbidity of 3.0% and mortality of 2.1% [35]^1. In
   addition, BC is a malignancy with a poor prognosis and a high
   recurrence rate (30-70%), the five-year survival rate of which is
   approximately 50-70% as reported [36]^2. Thus, many patients may lose
   the best opportunities to be diagnosed and cured because of the high
   morbidity and mortality as well as the poor prognosis, which means that
   some new methods to diagnose and predict the prognosis of BC are
   required.

   We have noticed that some studies identified novel prognostic
   biomarkers for BC by using bioinformatics methods [37]^3^, [38]^4,
   which means it is promising to utilize biomarkers to predict the
   prognosis of patients [39]^5. But almost all these studies only explore
   the predictive value of a single biomarker [40]^6. Actually, the effect
   of just a single biomarker for predicting prognosis is probably not
   sufficient in BC. Thus, by performing least absolute shrinkage and
   selection operator (LASSO) Cox regression model, the aim of this study
   was to develop and validate a more effective and useful multi-gene
   signature and further establish a prognostic nomogram based on several
   biomarkers obtained through the up-front bioinformatics analysis, which
   might be of great and predictive value for the prognosis of BC
   patients.

Materials and methods

Data collection and data preprocessing

   A flow chart of the study was shown in Figure [41]1. Data [42]GSE31684
   [43]^7^, [44]^8 including 93 BCs was downloaded from Gene Expression
   Omnibus (GEO) database ([45]https://www.ncbi.nlm.nih.gov/geo/) for
   weighted gene co-expression network analysis (WGCNA), which was
   performed on Affymetrix Human Genome U133 Plus 2.0 Array ([46]GPL570).
   We firstly annotated [47]GSE31684 by transforming probes into genes and
   then calculated variances of genes across all samples. Only the top
   5,000 most variant genes were picked out for constructing a
   co-expression network. Moreover, based on The Cancer Genome Atlas
   (TCGA) database (https:// gdc‐portal.nci.nih.gov/), the mRNA-seq
   expression profile and clinical information of BC patients were
   downloaded for LASSO Cox regression analysis. Totally 391 samples were
   included after getting rid of samples without complete survival
   information. 391 BCs were randomly assigned into two groups at the rate
   of 1:1. 195 samples formed the training set, which was used to
   construct a LASSO Cox model. The test set including 196 BCs was used to
   verify this model, which was regarded as an internal validation set.
   Furthermore, we also downloaded two other independent datasets
   [48]GSE13507 [49]^9^, [50]^10 including 165 BCs from GEO database and
   E-MTAB-4321 including 476 BCs from ArrayExpress database
   ([51]https://www.ebi.ac.uk/arrayexpress/) for external validation.
   “DEseq.2” [52]^11 in R software was used for normalization and log2
   transformation for TCGA-BLCA data displayed as count number. As for
   [53]GSE13507 and [54]GSE31684, we downloaded the Raw data and annotated
   the probes based on the corresponding annotation files. Then
   normalization and log2 transformation were performed by using R package
   “affy” [55]^12. Normalized expression matrix of E-MTAB-4321 was
   directly downloaded from ArrayExpress database for subsequent analysis.
   Some important clinical features including age, gender, grade, and
   stage were available in Table [56]1.

Figure 1.

   [57]Figure 1
   [58]Open in a new tab

   Flow chart indicating the process used to select target genes included
   in the analysis.

Table 1.

   Clinical characteristics of patients with BC in each study.
   Characteristics Training dataset Internal test dataset Entire TCGA
   dataset [59]GSE31684 [60]GSE13507 E-MTAB-4321
   Number of patients 195 196 391 93 165 476
   Age(median, IQR*) 68(61, 76) 70(61, 77) 69(61, 77) 69(62, 77) 65(59,74)
   69(62, 76)
   ≥ 65 120 136 256 67 96 323
   < 65 74 60 134 26 69 153
   NA 1 0 1 0 0
   Gender
   Male 142 145 287 68 135 367
   Female 53 51 104 25 30 109
   Stage
   I 1 1 2 15 104 457
   II 55 62 117 17 31 16
   III 70 69 139 42 19 0
   IV 67 64 131 19 11 0
   NA 2 0 2 0 0 3
   Grade
   Low 5 7 12 6 105 277
   High 187 189 376 87 60 192
   NA 3 0 3 0 0 7
   [61]Open in a new tab

   *IQR: interquartile range.

WGCNA analysis to screen out a key module

   We initially checked the expression data profile of top 2,500 genes
   whether they fit the co-expression network construction by using sample
   network method. A sample was considered to be array outlier when its
   Z.Ku < -2.5, which was weeded out from the profile. Then based on R
   package “WGCNA” [62]^13, we constructed a co-expression network. In
   order to classify genes into gene modules, three independent methods
   including manual (interactive) branch cutting approach, automatic
   single block analysis and 2 block analysis were used. In the process,
   we set a relatively large minimum module size (minClusterSize = 30) and
   a medium sensitivity (deepSplit = 2) for branch splitting. Furthermore,
   we identified key modules correlated with grade which interested us
   most through two methods. We firstly quantized the correlation between
   module eigengenes and traits, and further quantify the relationship by
   calculating Gene Significance (GS). Through the data processing, we got
   the average value of GS of all the genes in a module, called Module
   Significance (MS). The most positively relevant module was considered
   to be key module according to the results.

Functional enrichment and pathway enrichment analysis

   Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes
   and Genomes (KEGG) pathway analysis were performed for genes within the
   key module based on “clusterProfiler” [63]^14 in R software. Gene sets
   with P value less than 0.05 were referred to be significantly enriched.

Genes associated with survival identification

   Having completed aforesaid identification of gene module, we
   preliminarily picked out genes according to the standard:
   |cor.geneModuleMembership| > 0.5 and |cor.geneTraitSignificance| > 0.2.
   Furthermore, we utilized overall survival (OS) analysis to pick out
   genes related to survival from those preliminarily picked genes by an
   online tool called Kaplan-Meier Plotter
   ([64]https://kmplot.com/analysis/). Genes with significant P value (P <
   0.05) in OS analysis would be thought as survival associated genes.

Constructing and verifying a multi-gene signature

   For the construction of LASSO Cox model, we extracted the expression
   data of survival associated genes based on the training set. We
   afterwards performed LASSO Cox regression analysis by R package
   “glmnet” [65]^15, after which genes signature containing most helpful
   biomarkers for prognosis was obtained and then the risk score of each
   sample in all the datasets was calculated through the signature.
   Furthermore, we divided samples in all the datasets into high- and low-
   risk groups relying on their own median risk score. Then we performed
   survival analysis by using R package “survival” [66]^16 and
   time-dependent (1-year, 3-year, and 5-year) receiver operating
   characteristic (ROC) analysis by using R package “timeROC” [67]^17 in
   all the datasets (training set, internal test set, [68]GSE13507, and
   E-MTAB-4321) to verify the prognostic value of the multi-gene-based
   classifier.

Univariate and multivariate Cox regression analysis

   To complete the Cox proportion hazard regression analysis, the
   multi-gene signature (using this signature as feature and the risk
   score as feature value) and other significant clinical features
   including gender, age, pathologic stage, and histologic grade were
   included in cox univariate analysis of overall survival (OS) by using
   TCGA-BLCA data, and elements in aforesaid analysis whose P value was
   less than 0.1 were brought into cox multivariate analysis. R package
   “forestplot” [69]^18 was used to visualize the results.

Nomogram construction and validation

   Before the nomogram construction, cross-validation was performed to
   deal with over-fitting of the model. Package “rms” in R software was
   utilized to plot not only the nomogram but also calibrate curve which
   could test the nomogram. 45° line in the calibrate curve represented
   the best prediction. In addition, decision curve analysis (DCA) was
   conducted through R package “rmda” [70]^19 to ensure if the multi-gene
   signature was clinically helpful. Moreover, we calculated the C-index
   (Concordance index) and AUC (area under curve, based on R package
   “pROC” [71]^20) between the actual observation frequency and the
   predicted probability to estimate the accuracy of the nomogram.
   TCGA-BLCA was used for internal validation and [72]GSE13507 and
   E-MTAB-4321 were used for external validation.

Gene set enrichment analysis (GSEA)

   We firstly reckoned the median risk score on the basis of aforesaid
   dataset [73]GSE31684 (the dataset used for WGCNA analysis) for the
   reason of the understanding of underlying functions of the
   multiple-gene signature. Afterwards, 93 BC samples were split into two
   groups (high-risk group and low-risk group) through the median risk
   score. We set “c2.cp.kegg.v6.2.symbols.gmt”as the reference gene set,
   and further performed GSEA [74]^21 among the two groups. KEGG signaling
   pathways were filtered by a cut-off criteria of nominal P < 0.05, |ES|
   > 0.6, and FDR < 25%.

Results

Key module identification

   3 outlier samples were weeded out by using sample network methods,
   totally 90 samples were used for constructing a co-expression network
   ([75]Figure S1). As shown in [76]Figure S2, beta (β) = 4 (scale free
   R^2 = 0.85) was further set as the soft-thresholding for adjacencies
   calculation. Furthermore, 11 modules were identified in total by
   classifying genes into gene modules and merging modules ([77]Figure
   S3). Genes without being classified into any other module formed the
   grey module, which was abandoned for further analysis. Among these
   modules, the magenta module was the most correlated one with grade
   positively (P =8e-05, R^2 = 0.4) (Figure [78]2A), which was considered
   to be key module in this study. In addition, the MS of magenta module
   was higher than those of other modules suggested by Figure [79]2B-C. A
   network heatmap and a classical MDS plot were showed in [80]Figure S4.

Figure 2.

   [81]Figure 2
   [82]Open in a new tab

   Identification of modules associated with the clinical traits of BC and
   bioinformatics analysis of genes in the hub module. (A) Heatmap of the
   correlation between module eigengenes and clinical traits of BC. (B)
   Distribution of average gene significance and errors in the modules
   associated with grade of BC. (C) Scatter plot of module eigengenes
   related to grade in the magenta module. (D) Biological process of genes
   in the hub module. (E) KEGG pathway enrichment of genes in the hub
   module.

GO and KEGG pathway analysis

   Totally 137 biological processes (BPs) were shown in [83]Table S1,
   which were enriched by genes in the key module. The top 10 BPs were
   organelle fission, chromosome segregation, nuclear division, mitotic
   nuclear division, sister chromatid segregation, nuclear chromosome
   segregation, microtubule cytoskeleton organization involved in mitosis,
   spindle organization, mitotic spindle organization, and mitotic sister
   chromatid segregation as shown in Figure [84]2D. Moreover, six KEGG
   pathways including progesterone-mediated oocyte maturation, cell cycle,
   starch and sucrose metabolism, carbohydrate digestion and absorption,
   oocyte meiosis, and cellular senescence were significantly enriched by
   genes in the key module suggested by Figure [85]2E.

Survival associated gene identification

   At first, we preliminarily selected 45 genes from the key module, which
   contained 65 genes originally, through the
   criterion:|cor.geneModuleMembership| > 0.5, and
   |cor.geneTraitSignificance| > 0.2. Moreover, we used Kaplan-Meier
   Plotter to preform OS analysis for the 45 genes, and 12 genes
   associated with survival were eventually identified for subsequent
   analysis. The results of OS analysis of the 45 genes were showed in
   [86]Table S2.

Developing an eight-gene signature for predicting OS

   After identifying 12 biomarkers significantly correlated to survival of
   BCs, we further calculated the relative regression coefficient of the
   12 genes by performing LASSO analysis. Eight genes including ASPM,
   C4orf46, CCNB1, DIAPH3, MLF1, MTFR2, NSUN6, and OIP5 were screened out
   to establish a multi-gene signature on account of the coefficient
   suggested by the LASSO model (Figure [87]3A). In addition, we also
   calculated the risk score (RS) of each sample by combining the relative
   expression levels (represented by relative expression values) and
   relative regression coefficients of the genes in the classifier as
   follows: RS = -0.135 × ASPM expression value - 0.142 × C4orf46
   expression value + 0.632 × CCNB1 expression value + 0.105 × DIAPH3
   expression value + 0.803 × MLF1 expression value - 0.456 × MTFR2
   expression value - 0.287 × NSUN6 expression value + 0.05 × OIP5
   expression value. Moreover, among the eight biomarkers, CCNB1, DIAPH3,
   MLF1, and OIP5 had positive coefficients meanwhile ASPM, C4orf46,
   MTFR2, and NSUN6 had had negative coefficients. According to the
   median-RS (RS = 7.373) in training set, 195 samples were divided into
   high-risk group (n = 97) and low-risk group (n = 98). The result of
   survival analysis indicated that high-risk group had a poor OS of BC
   patients (P = 0.015, Figure [88]3B). In the training set, the AUC
   values of the eight-gene signature were 0.61 at 1 year, 0.61 at 3
   years, as well as 0.61 at 5 years suggested by Figure [89]3C.

Figure 3.

   [90]Figure 3
   [91]Open in a new tab

   (A) Plot of partial likelihood deviance for the 12 genes associated
   with survival in the training set. Survival analysis of the association
   between risk score and overall survival time of BC in (B) training set,
   (D) internal test set, (F) entire set. Time dependent ROC analyses at
   1,3, and 5 years in (C) training set, (E) internal test set, (G) entire
   set.

Prognostic value of the eight-gene signature

   Besides the training set, we also produced 3 validation sets (internal
   test set, [92]GSE13507 and E-MTAB-4321) to confirm the results we
   obtained from the training set. With the same methods we mentioned
   above, we calculated the risk score of each sample in these validation
   sets and the median-RS in each validation set was set as the cut-off
   criteria to split samples into high- and low- risk groups (median-RS:
   internal test set: 7.370; [93]GSE13507: 5.558; E-MTAB-4321: 8.193).
   According to the result of survival analysis, high-risk group was more
   correlated to a lower survival rate in internal test set (P = 0.018,
   Figure [94]3D), entire set (consisting of the internal test set and
   training set, P = 0.001, Figure [95]3F), [96]GSE13507(Cancer-specific
   survival (CSS): P = 0.002; OS: P = 0.034, Figure [97]4A-B) and
   E-MTAB-4321 (Progression-free survival (PFS): P < 0.001, Figure
   [98]4C). Moreover, the prognostic accuracy of the eight-gene signature
   in the internal test set was 0.61 at 1 year, 0.58 at 3 years, and 0.57
   at 5 years as shown in Figure [99]3E (entire set: 0.61 at 1 year, 0.59
   at 3 years, and 0.59 at 5 years (Figure [100]3G)). The AUC values of
   the multi-gene signature for CSS in [101]GSE13507 were 0.71 at 1 year,
   0.65 at 3 years, and 0.64 at 5 years meanwhile 0.71 at 1 year, 0.58 at
   3 years, and 0.54 at 5 years for OS (Figure [102]4D-E). In addition, as
   shown in Figure [103]4F, the prognostic accuracy of the 8-gene
   signature in E-MTAB-4321 was 0.67 at 1 year, 0.72 at 3 years and 0.78
   at 5 years. Furthermore, the eight-gene signature and significant
   clinical factors were included in the univariate Cox analysis. Figure
   [104]5A indicated that risk score (Hazard Ratio(HR) = 3.322, 95%CI:
   1.981-5.571, P < 0.001), age (HR = 1.744, 95%CI: 1.132-2.687, P =
   0.012) and pathologic stage (HR = 3.140, 95%CI: 1.788-5.514, P < 0.001)
   were influential features of OS among those important factors, which
   were used for multivariate Cox analysis. The result of multivariate Cox
   analysis suggested that these three variables were still associated
   with OS of BC patients even through the adjustment of other features
   (Figure [105]5B). In addition, the DCA analysis indicated that this
   eight-gene signature showed good potential for clinical application
   whatever the Threshold Probability (Pt) as shown in Figure [106]5C.
   Furthermore, although this signature also showed better performance
   than any other single biomarker (Figure [107]5D), we could not
   distinguish this signature from age and pathologic stage well when Pt
   was approximately less than 0.35 (Figure [108]5E).

Figure 4.

   [109]Figure 4
   [110]Open in a new tab

   (A) Survival analysis of the association between risk score and
   cancer-specific survival (CSS) time of BC in [111]GSE13507. (B)
   Survival analysis of the association between risk score and overall
   survival (OS) time of BC in [112]GSE13507. (C) Survival analysis of the
   association between risk score and progression-free survival time of BC
   in E-MTAB-4321. Time dependent ROC analyses at 1,3, and 5 years in (D)
   [113]GSE13507-CSS, (E) [114]GSE13507-OS, (F) E-MTAB-4321.

Figure 5.

   [115]Figure 5
   [116]Open in a new tab

   (A) Forest plot summary of analyses of OS univariate analysis of Risk
   score, gender, age, pathologic stage and histologic grade by using
   TCGA-BC data. (B) Forest plot summary of analyses of OS univariate
   analysis of Risk score, age and pathologic stage by using TCGA-BC data.
   DCA for assessment of the clinical utility of the 8-gene signature (C),
   single biomarkers (D), and clinical factors (E). The x‐axis represents
   the percentage of threshold probability, and the y‐axis represents the
   net benefit. DCA: decision curve analysis.

Nomogram construction and its clinical utility

   In the present study, the 8-gene signature, pathologic stage, and age
   were incorporated to construct a nomogram with the aim of creating a
   quantitative method for the possibility prediction of the OS at 1, 3,
   and 5 years for patients with BC (Figure [117]6A). In the calibration
   curve, diagonal line (ideal model) represented the best prediction.
   Nomogram-predicted probability of survival was plotted on the x-axis
   meanwhile the actual survival was plotted on the y-axis. The higher the
   coincidence degree of the fitting line (red line) and the diagonal line
   was the better performance the nomogram exhibited. According to the
   calibration curve, the nomogram owned fine prediction effectiveness
   compared with the ideal model especially for nomogram's 1-year or
   3-year OS estimates (Figure [118]6B-D). In the nomogram, lower total
   points indicated a worse outcome. The nomogram showed high accuracy as
   the C-index and AUC suggested. This nomogram could effectively predict
   OS of patients with BC by using TGCA-BLCA data (C-index: 0.671; AUC:
   0.707; Figure [119]7A) and [120]GSE13507 (C-index: 0.666; AUC: 0.632;
   Figure [121]7C). The C-index was 0.772 and AUC was 0.759 accurately
   (based on [122]GSE13507; Figure [123]7B), which determined that the
   nomogram also performed well to predict CSS as well as PFS
   (E-MTAB-4321: C-index: 0.708; AUC: 0.704; Figure [124]7D).

Figure 6.

   [125]Figure 6
   [126]Open in a new tab

   The nomogram for predicting proportion of patients with 1-, 3- or
   5-year OS (A). The calibration plots for predicting 1- (B), 3- (C) or
   5- (D) year OS.

Figure 7.

   [127]Figure 7
   [128]Open in a new tab

   Receiver operating characteristic (ROC) curves and area under the curve
   (AUC) statistics to evaluate the diagnostic efficiency of the nomogram
   in TCGA-BLCA data (A, overall survival (OS)), [129]GSE13507 (B,
   cancer-specific survival (CSS)), [130]GSE13507 (C, OS), and E-MTAB-4321
   (D, progression-free survival (PFS)).

Identification of eight-gene signature associated biological pathways

   After splitting samples into high- and low- risk groups, we further
   performed GSEA based on [131]GSE31684. Six risk score-related KEGG
   signaling pathways including basal transcription factors, DNA
   replication, glyoxylate and dicarboxylate metabolism, mismatch repair,
   nucleotide excision repair, and progesterone-mediated oocyte maturation
   were enriched as shown in Figure [132]8.

Figure 8.

   [133]Figure 8
   [134]Open in a new tab

   Gene set enrichment analysis. The six significant KEGG pathways
   including basal transcription factors (A), DNA replication (B),
   glyoxylate and dicarboxylate metabolism (C), mismatch repair (D),
   nucleotide excision repair (E), and progesterone-mediated oocyte
   maturation (F).

Discussion

   As the 10th most common form of cancer worldwide, BC carries a poor
   prognosis with an estimated 549,000 new cases and 200,000 deaths
   [135]^1. Until now, cystoscopy is still the golden standard for
   diagnosis of BC [136]^22. BC has caused a lot of troubles; the most
   major problem is that BC is of high rate of recurrence. 30-70% of BCs
   will relapse as reported [137]^23. Thus, we aimed to screen several
   effective biomarkers and further establish a multi-gene signature for
   the prognosis of patients with BC.

   With the development of high-throughput sequencing and bioinformatics,
   more and more bioinformatics methods have been used to develop novel
   biomarkers associated with tumor progression [138]^24, survival
   [139]^25 or prognosis [140]^26 of malignancies, which might greatly aid
   the early diagnosis and evaluation of prognosis in malignant tumors.
   Among these methods, WGCNA has been successfully applied in the
   identification of novel diagnostic and prognostic biomarkers for
   malignancies [141]^27^, [142]^28. We have made great efforts to apply
   this method in finding out prognosis biomarkers of patients with BC in
   our previous study [143]^29. Therefore, we started with identifying key
   modules in this study by constructing a co-expression network and 12
   genes associated with overall survival of BC were further selected by
   using Kaplan-Meier Plotter. Considering that a lot of studies have
   focused on establishing a multi-gene signature rather than using a
   single biomarker as a diagnostic or prognostic biomarker in present
   [144]^30^, [145]^31, we attempted to develop a multi-gene signature in
   this study. LASSO is a more popular approach compared with Cox
   proportional hazard regression analysis for constructing prognostic
   gene signature models, with an advantage of preventing overfitting
   [146]^32. Thus, based on the 12 biomarkers, we established an
   eight-gene signature which could predict prognosis of BC patients by
   constructed a LASSO Cox model. We further validated this signature by
   performing survival analysis and ROC analysis based on three other
   independent datasets (test set, [147]GSE13507, E-MTAB-4321). We firstly
   divided patients into high- and low- risk groups according to the risk
   score of each patient. In the training set, the eight-gene-based
   classifier had the ability to distinguish the high-risk group patients
   from those in the low-risk group effectively. The results of survival
   analysis and ROC curve in the validation datasets were consistent with
   our findings in the training set, which made our results convincing.

   Previous study had proved that tumor stage was significant associated
   with patient prognosis [148]^33. The incidence of BC increased with
   age, with high risk age from 50 to 70 [149]^34. Our study indicated
   that both of tumor stage and age were significant prognostic features
   by performing Cox regression analysis using the TCGA-BLCA data, which
   was consistent with the results from previous studies. Interestingly,
   we found that this signature was not only independent of tumor stage
   and age, but also was associated with OS even had been adjusted by
   other clinical features. These results demonstrated that the signature
   could distinguish high- and low- risk groups well, which might have
   great predictive value of prognosis for patients with BC.

   The biological role of genes in the multi-gene signature was explored
   by performing GSEA analysis. Six KEGG pathways were finally identified
   including basal transcription factors, DNA replication, glyoxylate and
   dicarboxylate metabolism, mismatch repair, nucleotide excision repair,
   and progesterone mediated oocyte maturation. We further explore genes
   in this signature by carrying out a literature review. Abnormal spindle
   microtubule assembly (ASPM), the human ortholog of the Drosophila
   melanogaster 'abnormal spindle' gene (asp), was the most common mutant
   gene associated with microcephaly primary type 5 [150]^35. ASPM was
   essential for normal mitotic spindle function, which could correctly
   guide the movement of spindles and maintain symmetric division of
   cytoplasm in mitosis [151]^36. ASPM was found to up-regulate in some
   malignancies, such as breast cancer and gastric cancer [152]^37. Some
   studies proved that the up-regulation of ASPM was associated with
   invasion, recurrence and poor prognosis of tumors [153]^38^, [154]^39.
   Chromosome 4 open reading frame 46 (C4orf46), also known as renal
   cancer differentiation gene 1 (RCDG1), encoded a small, conserved
   protein of unknown function which was expressed in some tissues
   [155]^40. RCDG1 was significantly down-regulated in renal cell
   carcinoma (RCC) suggested by a recent study [156]^40. As for cyclin B1
   (CCNB1), the protein encoded by cyclin B1 (CCNB1) was a regulatory
   protein involved in mitosis, which was necessary for proper control of
   the G2/M transition phase in cell cycle [157]^41. CCNB1 could combine
   with p34 (cdc2) and further form the maturation-promoting factor (MPF)
   [158]^42. As for diaphanous related formin 3 (DIAPH3), the protein
   encoded by this gene was a binding protein of actin, which participated
   in composing cytoskeleton proteins [159]^43. DIAPH3 played an important
   role in the infiltration and metastasis of cancerous cells, which might
   be a potential therapeutic target for cancer treatment [160]^44.
   Myeloid leukemia factor 1 (MLF1) encoded an oncoprotein which was
   associated with the phenotypic determination of hemopoietic cells
   [161]^45. Previous study had proved that the translocations between
   MLF1 and nucleophosmin were related to myelodyplastic syndrome and
   acute myeloid [162]^46. Mitochondrial fission regulator 2 (MTFR2) was
   poor studied in malignancies, a recent study identified this gene as
   one of the most correlated genes to dual specificity protein kinase TTK
   (TTK) in glioblastoma (GBM) [163]^47. Expression of TTK was
   up-regulated in GBM, which was thought to be associated with poor
   prognosis of patients with GBM [164]^47. NOP2/Sun RNA methyltransferase
   6 (NSUN6) was a human RNA methyltransferase that catalyzed formation of
   m5C72 in specific tRNAs [165]^48. NSUN6 localized to the cytoplasm,
   which was largely colocalized with marker proteins for the Golgi
   apparatus and pericentriolar matrix [166]^49. Opa interacting protein 5
   (OIP5) was a member of cancer-testis antigen (CTA) family, which might
   be a novel therapeutic target for cancer therapy because of the high
   expression of this gene in colorectal cancer [167]^49, lung cancer
   [168]^50, and esophageal cancer [169]^50. Protein encoded by OIP5 was
   essential for recruitment of centromere protein A (CENPA) by the
   mediator Holliday junction recognition protein in centromeres [170]^51.

   Although our conclusions have been well validated by multiple
   independent data sets, we have not further validated our conclusions
   through large-scale prospective clinical and molecular biology
   experiments, which we believe may be the largest limitation in this
   study. So, we will conduct large-scale prospective clinical trials and
   molecular biology experiments in subsequent studies to further confirm
   our conclusions and related molecular biological mechanisms.

Conclusions

   To sum up, we have presented a systematic and comprehensive analysis
   for data from TCGA, GEO, and ArrayExpress databases by using WGCNA and
   LASSO as the main methods. We identified 12 genes associated with
   overall survival of BC patients and further developed an eight-gene
   signature which was of great value to predict prognosis of BC. This
   eight-gene-based classifier might be of great value for making
   prognostic evaluations. However, the eight-gene signature must be
   validated by using clinical trials and more advanced methods in
   bioinformatics field.

Supplementary Material

   Supplementary figures and tables.
   [171]Click here for additional data file.^ (6.1MB, pdf)

Acknowledgments