Abstract

   Glioblastomas (GBMs), are the most common intrinsic brain tumors in
   adults and are almost universally fatal. Despite the progresses made in
   surgery, chemotherapy, and radiation over the past decades, the
   prognosis of patients with GBM remained poor and the average survival
   time of patients suffering from GBM was still short. Discovering robust
   gene signatures toward better understanding of the complex molecular
   mechanisms leading to GBM is an important prerequisite to the
   identification of novel and more effective therapeutic strategies.
   Herein, a comprehensive study of genome-scale mRNA expression data by
   combining GBM and normal tissue samples from 48 studies was performed.
   The 147 robust gene signatures were identified to be significantly
   differential expression between GBM and normal samples, among which 100
   (68%) genes were reported to be closely associated with GBM in previous
   publications. Moreover, function annotation analysis based on these 147
   robust DEGs showed certain deregulated gene expression programs (e.g.,
   cell cycle, immune response and p53 signaling pathway) were associated
   with GBM development, and PPI network analysis revealed three novel hub
   genes (RFC4, ZWINT and TYMS) play important role in GBM development.
   Furthermore, survival analysis based on the TCGA GBM data demonstrated
   38 robust DEGs significantly affect the prognosis of GBM in OS
   (p < 0.05). These findings provided new insights into molecular
   mechanisms underlying GBM and suggested the 38 robust DEGs could be
   potential targets for the diagnosis and treatment.

Introduction

   Glioblastomas (GBMs) are the most common and highly aggressive
   malignant brain tumors^[32]1,[33]2. Worldwide, in developed countries,
   an estimated 3~5 GBM cases per 100,000 inhabitants are diagnosed each
   year^[34]1,[35]3. 10,000 new cases of malignant GBM are diagnosed each
   year in the United States^[36]3. Despite enormous advances in knowledge
   and therapies over the decades, survival of patients diagnosed with GBM
   has not significantly improved, only around 5.1% of glioblastoma
   patients have a 5-year survival rate^[37]4,[38]5. Therefore,
   understanding the molecular mechanism of GBMs is an important
   prerequisite for discovering a novel and effective therapeutic
   strategy^[39]5–[40]8.

   High-throughput genomic technologies have been widely applied to
   facilitate to understand the mechanisms involved in the genesis of
   disease processes^[41]9. Among which, DNA microarray is recognized as
   very important and powerful tool for identifying the diversity of
   functional genes and identifying in-depth characterization of changes
   in gene expression because it can provide invaluable information on
   gene transcription by simultaneously measuring expression of thousands
   of genes within a particular biological sample^[42]10. For example, Li
   et al.^[43]11 identified that EZH2 could regulate neuroblastoma cell
   differentiation via NTRK1 promoter epigenetic modifications using DNA
   microarrays. And Dmitriy et al. discovered listeria species based on
   the iap gene sequence^[44]12.

   Numerous studies have examined gene expression profiles of individuals
   with GBM compared with healthy controls and demonstrated that highly
   proliferation^[45]13,[46]14, migration^[47]13, and invasion^[48]15
   nature of GBM cell are key factors hindering effective treatment of
   gliomas. However, owing highly complexity and intrinsically
   heterogeneity of GBMs at a molecular level, the specific molecular
   mechanisms underlying GBM are still poorly understood^[49]16,[50]17.
   Recent studies have shown that a robust signature comprising of genes
   can provide essential basis to study molecular mechanisms that underpin
   the process of disease^[51]18. It is reported that a robust signature
   critically depended on the sample sizes studied^[52]19, and even need
   thousands of samples. However, the number of normal control samples in
   public gene expression databases are disproportionally small compared
   to tumor samples in a variety of datasets^[53]20. In other words, the
   number of normal samples is inadequate for directly identifying the
   robust differential expression genes associated with GBM.

   Herein, the most comprehensive set of genome-scale mRNA expression data
   was constructed by combining GBM and normal samples from multiple
   studies. In total, thousands of samples were analyzed to compile the
   accurate and robust relevant genes towards insight into the molecular
   mechanisms. In particular, a list of genes with well robustness
   significantly different between GBM and normal tissue samples was
   firstly identified. Secondly, functional analysis base on these robust
   gene sets was performed and certain deregulated gene expression
   programs (e.g. cell cycle, immune response, p53 signaling pathway) are
   identified in glioblastoma process. Moreover, enrichment analysis of
   transcription factors and targeted miRNAs revealed three novel hub
   genes including RFC4, ZWINT, and TYMS and three transcriptional factors
   TATA, E2F4DP1 and HFH4, and two microRNA hsa-mir-519E and hsa-mir-527
   driving GBM tumorgenicity. Furthermore, survival analysis was applied
   for evaluating the prognostic performance of these robust differential
   expression genes using the clinical information of TCGA GBM data. In
   sum, the identified robust genes may facilitate the understanding of
   glioblastoma’s etiology and the discovery of novel hub genes,
   transcriptional factors and two microRNA driving GBM tumorgenicity
   would have therapeutic implications.

Materials and Methods

Data collection and pre-processing

   Genome-wide expression data sets were collected based on Human Genome
   U133 Plus 2.0 Array from Affymetrix GeneChip. In particularly, all raw
   CEL files of analyzed samples were directly downloaded from two
   well-known Gene Expression Omnibus (GEO) and ArrayExpress (AE)
   databases. Annotations information of each sample studied was carefully
   inspected, including GSM files from GEO database and sdrf files from AE
   database. All CEL files then were processed using single-channel array
   normalization (SCAN) method by SCAN.UPC package^[54]21 with default
   option in R software. In addition, version 17.0 of BrainArray was used
   for addressing expression of the same gene with several probes. For
   gene expression of duplicated samples and only one sample was retained.
   The final data matrix consisted of expression values for 22215 probes
   sets and 1588 samples. All detailed descriptions could be found in
   Lee’s pioneer study^[55]20.

Statistical modelling for robust gene signature (RGS) between GBMs and normal
samples

   Gene expression matrix after combining all studies was furtherly
   analyzed using bioinformatic methods. Firstly, computing test
   statistics for expression value of each gene using mt.teststat function
   by multtest package of the R statistical computing environment.
   Secondly, p-value of each gene set were computed using one-sided tests.
   Then, the resulting p-value for the up and downregulated genes were
   further adjusted for multiple testing using Benjamini and Hochberg
   method (BH). Moreover, to identify robust DEGs between GBM and normal
   controls, the samples size 722 (361 samples each groups)^[56]22 were
   randomly selected from a data set of 1,588 samples, and a gene set was
   prepared by selecting top 500 genes with the lowest p-value from t test
   analysis. The random sampling was performed 200 times. Secondly, an
   overlap between two gene-sets was computed for each pair of 200 gene
   sets. Overlap is the fraction of shared features that appear on both
   two lists of markers which determined the robustness of the identified
   markers by measure the similarity of two lists of identified
   markers^[57]23. All procedures aimed to identify robust gene signature
   (RGS) between GBM and normal control samples. Genes repeatedly selected
   during random sampling are defined as robust^[58]22,[59]24.

Hierarchical clustering based on the robust gene signature (RGS)

   To determine the specificity of RGS between GBM and normal samples,
   unsupervised hierarchical clustering analysis^[60]25 (HCA) was utilized
   for clustering distinct sample groups. The GBMs and normal samples were
   clustered by HCA based on the manhattan distance, and the ctc packages
   in R was furtherly used for converting hclust objects to newick format
   file. Then, the resulting output was used by the version 3 of
   Interactive Tree Of Life (iTOL) software to generate the associated
   heatmap and clustering dendrogram^[61]26.

Functional category enrichment analysis

   In order to explore biological functions of these differential
   expression genes, gene set enrichment analysis^[62]27 (GSEA) was
   performed based on the 1% most up- and downregulated genes between GBMs
   and normal samples. GSEA is a computational method that measures
   whether a known gene set shows significant differences between
   different biological conditions. Particularly, gene ontology term
   enrichment analysis was first conducted based the 160 DEGs, which
   including enrichment for GO ‘Biological Process’, ‘Molecular Function’
   and ‘Cellular Component’ terms. Secondly, KEGG pathways enrichment
   analysis based these genes were also implemented. To investigate the
   top enriched biological functions and pathways of up and down-regulated
   genes. One thousand random permutations were performed for each
   analysis and the threshold of false discovery rate (FDR) was set at
   0.05 to allow for investigative discovery.

Transcription factor and target miRNAs enrichment analysis

   GSEA based on DEGs was also carried out to elucidate the significant
   enriched transcription factor (TF) and miRNAs. One thousand random
   permutations were performed for each analysis. In addition,
   TFmiR^[63]28 was applied for performing integrated analysis of
   transcription factors (TFs), microRNAs (miRNAs) and genes.

Construction of gene/protein interaction network and analysis

   The Search Tool for the Retrieval of Interacting Genes (STRING)
   database^[64]29 has been widely used for exploring protein-protein
   interactions (PPI). Therefore, the PPI network of DEGs between GBMs and
   normal samples was constructed and visualized using the STRING online
   tool, which only included interactions with combined
   score ≥ 0.4^[65]29,[66]30. Secondly, the property of PPI network was
   analyzed using the NetworkAnalyzer module in Cytoscape v3.6.1
   software^[67]31, which could be useful in visualizing biological
   networks and integrating PPI data. The nodes of PPI network indicated
   genes and degree suggested the number of interactions of the gene with
   other genes. In PPI network established, these genes with large degrees
   (connectivity degree > 5) and high betweenness centrality^[68]32 were
   selected and regarded as the hub genes.

Results and Discussion

A large-scale GBMs and normal tissues samples is completely collected in this
work

   Thousands of samples were needed for generating robust differential
   expression genes associated with disease, which could contribute to
   understand molecular pathologies and mechanisms of disease^[69]33.
   Normal tissues samples can be widely used for cancer-associated studies
   by providing in invaluable clue for abnormal gene expression patterns
   in cancer compared to normal^[70]20. However, the number of normal
   sample were often small, while the number of cancer samples are
   relatively large. The unbalance of sample size of distinct groups may
   limit the study of the disease. Thus, we systematically searched two
   databases GEO and AE containing publicly-available microarray data sets
   to obtain the most complete datasets of GBMs and normal tissues
   samples. The overall numbers of GBMs and normal tissues samples
   integrating two well-known GEO and ArrayExpress databases was
   illustrated in Fig. [71]1. In sum, we collected a total of 48
   expression profiling studies, including 723 samples of GBMs and 865
   samples of normal controls. Among which, the majority of studies either
   contain GBMs or normal tissue samples separately, e.g. [72]GSE7307 only
   contained 174 GBMs samples, and [73]GSE68848 only consisted of 228
   normal samples.

Figure 1.

   [74]Figure 1
   [75]Open in a new tab

   Statistics of datasets studied in this work. Expression profiles of all
   analyzed samples were collected by Gene Expression Omnibus (GEO) and
   ArrayExpress (AE) databases. E-MTAB- indicates the AE source; GSE
   indicates the GEO source. Datasets were ascending ordered by their
   total number of samples.

Sample size consideration

   Statistical power analysis was performed for demonstrating the
   statistical power of this study. As known, sample size that is too
   small could reduce the power of a study and increases the probability
   of error, which can render the study meaningless^[76]34. Thus, in
   transcriptomic study, statistical power analysis was typically used in
   estimating sufficient sample sizes to achieve adequate
   power^[77]35,[78]36. For the studied dataset, it has over 90% power to
   detect differential expression genes at an overall significance level
   of 0.01 with Bonferroni’s adjustment. As reported in Mapstone’s pioneer
   study, a power of 0.9 is well suited for detecting differential genes
   in transcriptomic study^[79]35. Therefore, the statistical power
   analysis suggested that the studied dataset (sample size) is well
   suited for identifying the discriminating genes between GBM cases and
   normal controls.

Identify robust and reliable change in gene expression based on thousands of
samples

   Identification of robust and reliable differential expression genes
   could provide immense help for understanding molecular mechanisms
   underlying complex disease. To facilitate the identification of
   differential expression genes between GBMs and normal samples, p-values
   were estimated to identify gene sets that were differential expressed
   between GBMs and control samples. After multiple test correction, the
   160 DEGs were selected using the 1% most up and downregulated genes at
   a false discovery rate of 0.001. The top 10 most significantly up- or
   down-regulated DEGs were provided in Table [80]1. Based on the analysis
   of robust genes above description, a median overlap value was obtained
   (greater than 0.9), which suggested the identified DEGs are likely to
   be well robust. In tol, 678 robust genes were repeatedly selected
   during random sampling. Among of these 678 genes, 147 genes were common
   identified in the 160 gene-set (1% most up- and downregulated genes
   between GBMs and normal samples). Namely, 147 robust gene signatures
   were identified. The relationship between these 147 robust DEGs with
   GBM were investigated using manually searching PubMed database. We
   found that 68% (100 of the 147 unique genes have known connections to
   GBM (Supplementary Table [81]S1). Among 100 robust DEGs associated with
   GBM, 80 genes were identified to were differential expression in GBM
   samples, which included 60 were overexpressed or upregulated in GBM
   samples,12 were downregulated in GBM samples and 8 were differential
   expression in GBM without the upregulated or downregulated information.
   The relationship between the top 10 most significant genes and GBMs was
   listed in Table [82]2. These genes have been reported to be associated
   with the survival, growth, invasive and proliferation characteristics
   of GBMs cells, for example, suppressing of TMEM45A expression in glioma
   cells remarkably suppressed cell migration and cell invasion, and GJB6,
   also known as Cx30 has the potential to influence growth, proliferation
   and migration of glioma cells. Moreover, downregulated two neural
   subtype expressed markers were identified such as GABRA1 (Gamma-Amino
   Butyric Acid A Receptor, Alpha1) and SLC12A5 (solute carrier family 12
   member 5)^[83]37. In addition, some genes have not been reported in
   GBMs-associated studies such as the down-regulated gene TMEM130 with
   the lowest P-value (P = 2.51E-286). However, recent studies have shown
   that overexpression of transmembrane protein could increase migration
   capacity toward glioblastoma cells such as TMEM18^[84]38. Thus, gene
   TMEM130 could be associated with GBM and need further validation in the
   future.

Table 1.

   The top 10 most significantly up- or down-regulated DEGs between GBM
   and normal samples.
   Gene symbol Gene description Fold Change
   (1) Upregulated in GBMs
   COL3A1 collagen, type III, alpha 1 5.628
   TOP2A topoisomerase (DNA) II alpha 170 kDa 8.713
   CRISPLD1 cysteine-rich secretory protein LCCL domain containing 1 4.072
   RRM2 ribonucleotide reductase M2 9.195
   COL1A2 collagen, type I, alpha 2 3.750
   FCGBP Fc fragment of IgG binding protein 4.004
   CDCA7L cell division cycle associated 7-like 4.539
   SMC4 structural maintenance of chromosomes 4 5.165
   TMEM45A transmembrane protein 45 A 4.857
   PTX3 pentraxin 3, long 8.298
   (2) Downregulated in GBMs
   MAL2 mal, T-cell differentiation protein 2 (gene/pseudogene) 0.252
   GJB6 gap junction protein, beta 6, 30 kDa 0.285
   NEFM neurofilament, medium polypeptide 0.419
   SYNPR synaptoporin 0.407
   TMEM130 transmembrane protein 130 0.334
   GABRA1 gamma-aminobutyric acid (GABA) A receptor, alpha 1 0.335
   RBFOX1 RNA binding protein, fox-1 homolog (C. elegans) 1 0.399
   SLC12A5 solute carrier family 12 (potassium/chloride transporter),
   member 5 0.440
   NEFH neurofilament, heavy polypeptide 0.406
   AK5 adenylate kinase 5 0.398
   [85]Open in a new tab

   A final set of linear models were used to identify genes that were
   differential expressed between glioblastoma and control samples. After
   multiple test correction we identified 1% most up and downregulated
   genes at a false discovery rate of 0.001.

Table 2.

   The top 10 most significantly up- or down-regulated DEGs between GBMs
   and normal samples are associated with the GBMs.
   Gene symbol Descriptions of gene is associated with GBM UP/Down Ref.
   COL3A1 COL3A1 may be suitable biomarkers for diagnostic or therapeutic
   strategies for GBM DN ^[86]56
   TOP2A Over-expression of TOP2A as a prognostic biomarker in patients
   with GBM UP ^[87]57
   CRISPLD1 UN UN
   RRM2 BRCA1-mediated RRM2 expression protects GBM cells from endogenous
   replication stress UP ^[88]58
   COL1A2 COL1A2 is highly expressed genes in GBM spheroids as compared
   with normal brain UP ^[89]6
   FCGBP Primary glioblastomas exhibited higher expression of
   extracellular response-associated gene FCGBP UP ^[90]59
   CDCA7L It has been reported that CDCA7L is correlation to GBM patient
   survival time UP ^[91]60
   SMC4 Overexpression of SMC4 activates TGFβ/Smad signaling and promotes
   aggressive phenotype in GBM cells UN ^[92]61
   TMEM45A Suppressing of TMEM45A expression in glioma cells remarkably
   suppressed cell migration and cell invasion UN ^[93]62
   PTX3 Knockdown of PTX3 significantly decreases GBM8401 cell migration
   and invasion UN ^[94]63
   MAL2 UN UN
   GJB6 GJB6 (Cx30) has the potential to influence growth, proliferation
   and migration of GBM cells. UN ^[95]64
   NEFM KLF6 inhibits the malignant phenotype of GBM in vitro and
   upregulates neuronal marker NEFM. UP ^[96]65
   SYNPR SYNPR is downregulated differently expressed genes (DEGs) in GBM
   tissue samples. Down ^[97]66
   TMEM130 UN UN
   GABRA1 Upregulation of miR-155 in GBM could may downregulate GABRA1
   which renders tumor cells unresponsive to GABA signaling. Down ^[98]67
   RBFOX1 Downregulated RBFOX1 is identified in GBMs compared with normal
   brain. Down ^[99]68
   SLC12A5 UN UN
   NEFH miR-25 promotes GBMs cell proliferation and invasion by directly
   targeting NEFL. UN ^[100]69
   AK5 UN UN
   [101]Open in a new tab

   UP indicated that the gene was identified as up-regulated in GBMs; Down
   indicated that the gene was reported as down-regulated. UN suggested
   the gene has not been reported in current GBM-associated studies.

Hierarchical clustering analysis of differentially expressed GBMs signature
genes

   Unsupervised hierarchical clustering analysis is one of the most
   powerful methods for the exploratory analysis of gene expression data
   and was widely used to reflecting distinct gene expression patterns or
   modules of highly co-expressed genes. Therefore, in this work,
   hierarchical clustering with ward algorithm^[102]39 was applied to
   cluster the expression profile of differentially expressed genes in
   each sample group based on these 147 robust DEGs including upregulated
   and downregulated genes in GBMs. As shown in Fig. [103]2, two subtypes
   of all samples studied were identified by unsupervised hierarchical
   clustering. The heatmap demonstrated the most of GBMs and normal
   samples could be separated based these DEGs.

Figure 2.

   [104]Figure 2
   [105]Open in a new tab

   Heatmap of 723 glioblastoma and 865 normal samples based on identified
   147 robust differential expression (up and downregulated) genes. The
   highest expression values of DEGs are displayed in green and the lower
   gradually fading toward black color. The lowest expression values of
   DEG are shown in red, higher ones gradually fading toward black color.
   Glioblastoma samples were highlighted with red; Normal control samples
   were highlighted with blue.

Functional analysis of differentially expressed GBMs signature genes

   Functional analysis is secondary analysis of differential expression
   genes identified and can collectively provide biological function
   underlying these genes. Understanding dysregulated biological process
   and pathway in cancer cells are essential for the development of
   complex diseases^[106]40, and can provided immense assistance in the
   understanding the pathology^[107]41. Therefore, GSEA was performed to
   investigate biological function and pathways of genes associated with
   GBM. As shown in Fig. [108]3, the BP terms of GO (Fig. [109]3A) showed
   that the up-regulated genes were enriched over 50 terms and the top 10
   terms were associated with cell cycle and immune response. And the
   down-regulated genes showed 13 terms enrichment for cell signaling,
   anion transport, neurotransmitter transport and so on. The CC terms of
   GO (Fig. [110]3B) showed that the up-regulated genes were significantly
   enriched in 32 terms and the top 10 terms were associated with
   extracellular space, extracellular matrix, complex of collagen trimer,
   cell surface, golgi apparatus, collagen trimer and banded collagen
   fibril. And the down-regulated genes showed 31 terms enrichment for
   neuron projection, cell projection, intermediate filament and synapse,
   and so on. The MF terms of GO (Fig. [111]3C) showed that downregulated
   genes could be associated with transporter activity, anion channel
   activity, receptor activity and structural molecule activity, whereas
   the most upregulated genes were associated with protein complex,
   receptor, growth factor, enzyme binding. In addition, the KEGG pathways
   analysis based on these DEGs suggested that downregulated genes were
   significantly enriched in ecm receptor interaction, complement and
   coagulation cascades, p53 signaling pathway, focal adhesion, immune
   network and so on. And the down-regulated DEGs showed 2 pathways
   enrichment for neuroactive ligand receptor interaction and amyotrophic
   lateral sclerosis als.

Figure 3.

   [112]Figure 3
   [113]Open in a new tab

   Functional enrichment analysis of gene ontology terms and kegg
   biological pathway enrichment analysis of DEGs. Gene Ontology covers
   three domains: cellular component, molecular function and biological
   process. A-C GO analysis according to biological process, cellular
   component and molecular function, respectively. (A) Enrichment for GO
   ‘Biological Process’ terms of genes detected. The y-axis displays the
   fraction relative to all GO Biological Process terms. (B) Enrichment
   for GO ‘Molecular Function’ main terms of genes detected. The y-axis
   displays the fraction relative to all GO Cellular Component terms. (C)
   Enrichment for GO ‘Molecular Function’ main terms of genes detected.
   The y-axis displays the fraction relative to all GO Molecular Function
   terms. The figure shows terms on the x-axis that are significantly
   enriched (FDR < 0.05). (D) Enrichment for kegg ‘Biological Pathway’
   terms of genes detected.

Transcription factor and target miRNAs analysis

   Transcription factor (TF) and microRNA (miRNA) are essential for
   regulating the expression of gene^[114]42. Differentially expressed TFs
   in GBM, and their downstream gene targets, may be potential therapeutic
   biomarkers of GBM. Therefore, we perform the transcription factors
   enrichment analysis based on these DEGs. As the demonstrated
   Supplementary Table [115]S2, top 10 TFs based unregulated DEGs and top
   10 TFs based on downregulated DEGs are enrichment and listed by GSEA
   software at FDR < 0.05. The most of TFs reported that the directly
   associated GBMs. For example, the identified transcription factors
   FOXD3 could inhibit proliferation, migration, and invasion of GBM
   cells^[116]42. Moreover, numerous studies showed that miRNAs played
   important roles in development of cancer and could be potential targets
   for therapy^[117]5. Therefore, to investigate the regulatory
   mechanisms, we also performed miRNAs enrichment analysis based on these
   DEGs. As the demonstrated Supplementary Table [118]S3, 16 miRNAs sets
   are enriched and list by GSEA software at FDR < 0.05. Similarity, the
   most of miRNAs has reported that the directly associated GBMs. For
   example, miR-196b was upregulated in GBM compared with normal control
   samples and associated with cell proliferation^[119]42.

   In addition, combination analysis of TFs and miRNAs play important
   roles in understanding the pathogenic mechanisms associated with GBM
   tumorigenesis^[120]28. Therefore, we constructed and analyzed
   co-regulatory network based on enriched TFs and miRNAs using well known
   web server TFmiR^[121]28. As the illustrated Fig. [122]4, a total of 55
   regulatory interactions were identified which included 29 nodes (miRNAs
   and TFs/genes). Among which, 28 interactions were experimentally
   validated. The over representation analysis of the full interaction
   network showed 9 targeted miRNAs including hsa-mir-106a, hsa-mir-130a,
   hsa-mir-196b, hsa-mir-20a, hsa-mir-20b, hsa-mir-29a, hsa-mir-29c,
   hsa-mir-381 and hsa-mir-19a involvement in cancerogenesis of GBMs.

Figure 4.

   [123]Figure 4
   [124]Open in a new tab

   Glioblastoma-specific miRNA/transcription factor co-regulatory
   networks. The miRNAs are from the enrichment result based on DEGs (top
   1% upregulated) at a false discovery rate of 0.05. Green hexagon
   indicates the transcript factor, the yellow circle represents miRNA,
   the orange quadrilateral suggests target gene.

PPI network construction

   Deciphering the structure of complex network of protein-protein
   interactions (PPI) can facilitate to understand of molecular mechanism
   behind the disease^[125]43. The hub genes of whole PPI network
   identified play a vital role in this signal transduction network.
   Therefore, in this work, we preformed PPI network analysis by choosing
   a well-known high-throughput STRING dataset^[126]44, which can further
   promote to select reliable edges of network. Particularly, the PPI
   network was constructed and visualized based on 1% up and down
   regulated DEGs, which included 158 nodes and 378 edges. As shown in
   Supplementary Fig. [127]S1, nodes with high betweenness centrality and
   large degree (connectivity degree > 5) are selected as hub genes and
   were displayed in Table [128]3. Table [129]3 showed the top 10 crucial
   hub genes involved in the development of GBMs, which included TOP2A,
   RFC4, ZWINT, NUF2, UBE2C, TYMS, MYC, PBK, MELK and MCM2. Overall
   expression values of these hub genes were visualized by boxplot for the
   E-MTAB-3892 dataset. The obvious gene expression difference between GBM
   and normal samples could be seen in Fig. [130]5. Among which, the most
   of hub genes have been experimentally validated in GBM-associated
   studies, which reflected the hub genes identified is well
   reproducibility with previous findings. To be more specific, three hub
   genes including PBK (role in cell cycle), MELK (stem cell marker), and
   TOP2A (proliferation marker) have been validated in previous
   GBM-associated studies^[131]45. PBK was candidate can be a promising
   molecular target for GBM treatment^[132]46. MELK was identified for
   encoding other ABC transporters as well as Akt3 kinase in developing
   resistance of GBM to TMZ^[133]47. TOP2A was the hub protein of whole
   network, which have been demonstrated to its expression was correlated
   with aggressive and highly proliferating cancers, which were accordance
   with Horvath et a’ work. In addition, recent studies have showed that
   overexpression of MCM2 gene could be highly associated with survival of
   GBM^[134]48. Upregulated UBE2C gene was associated with the aggressive
   progression of GBM^[135]49. And siRNA-mediated knockdown against NUF2
   may be a potential therapeutic method for treatment of GBM^[136]50.

Table 3.

   The top 10 hub genes with a connectivity degree >5 were selected and
   listed.
   Hub gene Gene description Degree Betweenness centrality
   TOP2A topoisomerase (DNA) II alpha 170 kDa 30 0.2268
   RFC4 replication factor C (activator 1) 4, 37 kDa 27 0.0491
   ZWINT ZW10 interactor 22 0.0069
   NUF2 NUF2, NDC80 kinetochore complex component, homolog (S. cerevisiae)
   22 0.0263
   UBE2C ubiquitin-conjugating enzyme E2C 22 0.0559
   TYMS thymidylate synthetase 21 0.0508
   MYC v-myc myelocytomatosis viral oncogene homolog (avian) 21 0.4453
   PBK PDZ binding kinase 21 0.0415
   MELK maternal embryonic leucine zipper kinase 20 0.0016
   MCM2 minichromosome maintenance complex component 2 19 0.0010
   [137]Open in a new tab

   Given that the majority of the networks were scale-free, hub genes with
   a connectivity degree >5 were selected, as described previously. The
   connectivity degree represents the number of lines linked to a given
   node, and nodes with a high connectivity degree (≥5) are defined as hub
   genes that possess important biological functions. All the properties
   were computed based on these 1% most up and downregulated genes by
   NetworkAnalyzer module in Cytoscape software.

Figure 5.

   Figure 5
   [138]Open in a new tab

   Box plot of intensities after Scan normalization based on top 10 hub
   genes. Box plot showing median, interquartile range, minimum and
   maximum intensities with GBMs (blue boxes) compared to those with
   normal tissue sample (yellow boxes). Corresponding intensities values
   are displayed as dots. The p-value indicated significant differences
   between the distinct groups, which is calculated using t-test based on
   stat_compare_mean function in R ggpubr library.

   However, the association between three hub genes including RFC4, ZWINT
   and TYMS expression and GBMs has not been reported. A recent study by
   Jiang et al.^[139]51 showed that miR-127-3p and its targeted gene SKI
   could be promising targets for GBM therapy. The present study revealed
   that hub gene ZWINT also is a target of miR-127-3p, which has
   functional annotation related to cell cycle, cell division and nuclear
   division. Therefore, the gene may be a key regulator in GBM
   development.

Survival analysis

   Investigating the clinical significance (e.g. prognosis) of gene
   expression in GBM is crucial important for diagnosis and molecular
   target therapy of GBM^[140]52. As known, survival analysis was widely
   applied method to evaluate the prognostic performance of new biomarkers
   using the clinical data of oncological patients^[141]53. The Cancer
   Genome Atlas (TCGA) project is one of the largest available resources
   that accumulates genomic, transcriptomic and methylomic data for
   several types of cancer^[142]54. The TCGA provide a useful source of
   information for identification of prognostic markers^[143]55.
   Therefore, to investigate the oncogenic role of the robust differential
   expression genes in GBM progression, survival analysis by Kaplan-Meier
   estimates stratified by their expression was made based on the data of
   521 GBM cases provided by TCGA. The expression values of these genes
   were classified as either high (expression value ≥ median) or low
   (expression value < median). As the demonstrated Supplementary
   Fig. [144]S2, we found 38 robust DEGs were significantly related to the
   prognosis of GBM (OS, P < 0.05) based on Kaplan-Meier estimates
   (log-rank test). Figure [145]6 only listed top six significant genes.
   Thus, these genes are possible candidate genes for diagnosis and
   molecular target therapy of GBM. Moreover, these 38 robust DEGs
   significantly associated with overall survival (p < 0.05) in TCGA were
   retained for further analysis. Cox multivariate model was carried out
   with function “coxph” in the R package “survival” to develop the risk
   score model. As demonstrated Table [146]4, 20 robust DEGs were
   identified to be with positive coefficients, which could indicate their
   high expression positively correlated the risk score value, thus, these
   genes might be tumor genes. While 18 robust DEGs were identified to be
   with negative coefficients, which could indicate their high expression
   negatively correlated the risk score value, thus, these genes might be
   tumor suppressor genes. The performance of the risk score was evaluated
   by dividing the GBM samples in the TCGA into two subgroups, high-risk
   and low-risk, using the median risk score as a cutoff (2.992). As
   illustrated Supplementary Fig. [147]S3, the survival time of the low
   risk score group is significantly longer than the high-risk score
   group.

Figure 6.

   [148]Figure 6
   [149]Open in a new tab

   Univariate survival analysis in GBM stratified by robust differential
   expression gene expression based on the TCGA data as determined by
   Kaplan-Meier estimates. 521 GBM cases with full data of both clinical
   and gene expression were collected from the TCGA database. The
   expression values of these genes were classified as either high
   (expression value ≥ median) or low (expression value < median).
   Kaplan-Meier estimates (log-rank test) were made and found 38 genes
   expression were significantly affect the prognosis of GBM in OS
   (p < 0.05) (only listed top six genes). More relevant genes were shown
   in Supplementary Fig. [150]S2.

Table 4.

   Parameters of gene symbol, Hazard ratio, p values, coefficients and 95%
   confidence interval of 38 genes according to Cox multivariate
   regression.
   Gene symbol Hazard ratio p-value Coefficients 95% confidence interval
   ABCA1       1.064        0.571   0.062        0.858~1.319
   AEBP1       1.144        0.025   0.134        1.017~1.287
   ALOX5AP     1.023        0.775   0.023        0.875~1.195
   CD14        1.41         0.006   0.343        1.101~1.805
   CD163       1.027        0.747   0.027        0.872~1.21
   CD44        1.13         0.19    0.122        0.941~1.356
   CFI         0.989        0.864   −0.011       0.872~1.122
   CHI3L2      0.986        0.775   −0.014       0.896~1.086
   CLIC1       0.972        0.836   −0.029       0.742~1.274
   COL1A1      0.991        0.914   −0.009       0.842~1.166
   COL1A2      0.846        0.024   −0.167       0.732~0.979
   CXCR4       1.119        0.122   0.112        0.97~1.29
   ECM2        0.982        0.769   −0.018       0.872~1.106
   FCER1G      0.908        0.573   −0.096       0.65~1.268
   FNDC3B      1.07         0.579   0.068        0.843~1.358
   GPNMB       0.991        0.85    −0.009       0.901~1.09
   HLA.DMA     0.705        0.002   −0.349       0.568~0.876
   HMOX1       0.891        0.081   −0.115       0.783~1.014
   IFI44       1.052        0.514   0.05         0.904~1.224
   IGFBP2      1.1          0.073   0.095        0.991~1.22
   IGFBP3      1.044        0.368   0.043        0.951~1.147
   LY96        1.234        0.007   0.21         1.06~1.437
   MMP2        0.988        0.863   −0.012       0.859~1.136
   MTHFD2      0.892        0.245   −0.114       0.736~1.081
   MYD88       1.108        0.417   0.103        0.865~1.42
   NMI         1.122        0.305   0.115        0.9~1.398
   PLSCR1      1.058        0.617   0.056        0.848~1.32
   PTX3        0.98         0.705   −0.02        0.883~1.088
   PXDN        0.981        0.775   −0.02        0.858~1.121
   PYGL        0.874        0.101   −0.134       0.745~1.026
   RBBP8       0.912        0.468   −0.093       0.71~1.171
   SERPINE1    0.958        0.506   −0.043       0.845~1.087
   SOD2        0.836        0.042   −0.179       0.704~0.994
   SRPX        0.946        0.218   −0.056       0.865~1.034
   TENT5A      1.214        0.015   0.194        1.039~1.418
   TGFBI       0.98         0.791   −0.02        0.847~1.135
   TIMP1       1.079        0.454   0.076        0.884~1.318
   VSIG4       1.004        0.977   0.004        0.77~1.309
   [151]Open in a new tab

   All gene symbols were ordered alphabetically.

Conclusions

   The most comprehensive set of genome-scale mRNA expression data was
   constructed by combining GBM and normal control samples from 48
   studies, resulting thousands of samples for generating robust genes
   signature. Based on large-scale gene expression data of GBMs, we have
   identified 147 robust differential expression genes, which showed the
   underlying gene expression level differences between NC and GBMs
   samples. Moreover, the most of identified robust DEGs (67%) were
   reported that closed to associated with GBM, which suggested high
   reproducibility with published papers. Furthermore, the GO term and
   KEGG pathway enrichment results based these robust DEGs may contribute
   to better understand the molecular mechanisms of GBM. More importantly,
   based on these robustness DEGs, three new hub genes including RFC4,
   ZWINT, and TYMS and three top transcriptional factors TATA, E2F4DP1 and
   HFH4, and two miRNA hsa-mir-519E and hsa-mir-527 were identified in the
   present study. Furthermore, survival analysis based on the TCGA GBM
   data revealed 38 genes expression significantly affect the prognosis of
   GBM in OS (p < 0.05). In sum, these hub genes, transcriptional factors
   and microRNAs may be potential molecular targets for therapies of GBMs.

Electronic supplementary material

   [152]Supplementary Information^ (2.3MB, pdf)

Acknowledgements