Abstract

Background

   Using gene co-expression analysis, researchers were able to predict
   clusters of genes with consistent functions that are relevant to cancer
   development and prognosis. We applied a weighted gene co-expression
   network (WGCN) analysis algorithm on glioblastoma multiforme (GBM) data
   obtained from the TCGA project and predicted a set of gene
   co-expression networks which are related to GBM prognosis.

Methods

   We modified the Quasi-Clique Merger algorithm (QCM algorithm) into
   edge-covering Quasi-Clique Merger algorithm (eQCM) for mining weighted
   sub-network in WGCN. Each sub-network is considered a set of features
   to separate patients into two groups using K-means algorithm. Survival
   times of the two groups are compared using log-rank test and
   Kaplan-Meier curves. Simulations using random sets of genes are carried
   out to determine the thresholds for log-rank test p-values for network
   selection. Sub-networks with p-values less than their corresponding
   thresholds were further merged into clusters based on overlap ratios
   (>50%). The functions for each cluster are analyzed using gene ontology
   enrichment analysis.

Results

   Using the eQCM algorithm, we identified 8,124 sub-networks in the WGCN,
   out of which 170 sub-networks show p-values less than their
   corresponding thresholds. They were then merged into 16 clusters.

Conclusions

   We identified 16 gene clusters associated with GBM prognosis using the
   eQCM algorithm. Our results not only confirmed previous findings
   including the importance of cell cycle and immune response in GBM, but
   also suggested important epigenetic events in GBM development and
   prognosis.

Background

   The rapid development of high throughput gene expression profiling
   technology such as microarray and high throughput sequencing has
   enabled the development of many new bioinformatics data analysis
   methods for identifying disease related genes, characterizing disease
   subtypes and discovering gene signatures for disease prognosis and
   treatment prediction. For instance, in breast cancer research, a
   supervised approach was adopted to select 70 genes as biomarkers for
   breast cancer prognosis [[28]1,[29]2] and was successfully tested in
   clinical settings [[30]3]. However, a major drawback of such approach
   is that the selected gene features are usually not functionally related
   and hence cannot reveal key biological mechanisms and processes behind
   the difference of the two patient groups.

   In order to overcome this issue and identify functionally related genes
   associated with disease development and prognosis, several approaches
   have been adopted. One of such approaches is to use gene co-expression
   analysis. For instance, in [[31]4] and [[32]5] , we carried out gene
   co-expression network analysis for biomarker discovery in different
   types of cancers.

   The goal of gene co-expression network (GCN) analysis is to identify
   group of genes which are highly correlated in expression levels across
   multiple samples. The genes in the same co-expression sub-network are
   often enriched with similar functions. The metric to measure the
   correlation is usually the correlation coefficient (e.g., Pearson
   correlation coefficient or PCC) between expression profiles of two
   genes [[33]6-[34]8]. Then for each dataset, a weighted graph can be
   derived with the vertices being the genes and the weights of the edges
   being the PCC values between the two gene expression profiles. However,
   many network mining algorithms take only binary edges by imposing a
   threshold on the PCC values (i.e., two genes are connected by an edge
   only if the PCC value between them is higher than a pre-defined
   threshold) and transforming the network into a sparse unweighted gene
   co-expression network (UWGCN). For instance, in [[35]6] , an algorithm
   called CODENSE was developed to identify frequent UWGCNs from multiple
   datasets and this method has been applied to cancer biomarker
   discovery. Issues with the UWGCN approach include how to determine the
   threshold of PCC values and the threshold may be too rigid to include
   edges with weights around that threshold. Thus weighted GCN (WGCN)
   methods have been developed.

   For WGCN, Stephen Horvath's group has developed a series of methods for
   identifying gene clusters which are highly correlated using
   hierarchical clustering based approach [[36]7,[37]9,[38]10]. This
   method was applied to identify disease associated genes such as the
   ASPM gene in glioblastoma [[39]9]. However, there are several drawbacks
   of using the hierarchical clustering approach. First, it does not allow
   direct control over the intracluster connectivity such that the
   vertices within a cluster have high correlations on average. Second,
   the clustering approach does not allow shared genes between two
   sub-networks even though in biology, many genes have multiple functions
   and can be shared by multiple functional groups and dense sub-networks.
   Finally, clusters identified using this approach are often large (e.g.,
   more than 100 genes), thus smaller gene networks which contain subtle
   functional information may not be detected.

   In this paper, we take advantage of the dense sub-network finding
   method in the graph mining community and apply it to mine functional
   networks using the WGCN approach to identify dense co-expression
   sub-networks in glioblastoma. Specifically, using The Cancer Genome
   Atlas (TCGA) data sets, we identified co-expressed sub-networks
   (sub-networks for short in the following) for genes then we tested if
   these sub-networks can be used as features to separate patients into
   groups with different survival times. Using this approach, we
   identified 16 gene networks associated with GBM prognosis. Our results
   not only confirmed previous findings in GBM, but also suggested
   important epigenetic events (histone acetylation) in GBM development
   and prognosis.

Methods

Gene expression dataset for GBM

   We downloaded gene expression data from the Cancer Genome Atlas (TCGA)
   project webpage ([40]http://cancergenome.nih.gov, downloaded on
   11/24/2010) for all GBM patients with gene expression data generated
   using Affymetrix HU133 Genechip. We also downloaded all available
   public clinical data including survival information. In total, we
   selected 361 patients with complete data (i.e., each has one set of
   gene expressions, one set of microRNA expressions, and public clinical
   information). Among them, 345 have a valid vital status (i.e., either
   LIVING or DECEASED) and they are good for survival tests. The gene
   expression data were normalized using RMA normalization as described in
   the TCGA NCI Wiki.

Building WGCN for genes

   After normalization a total of 12,042 unique genes were available. PCC
   were computed between every pair of genes. We then set the genes to be
   the vertices of the WGCN with the absolute values of PCC (|PCC|) being
   weights of the edges.

Identify quasi-cliques in the WGCNs

   We first define the density of a weighted network with N vertices with
   w[ij ]being the weight, normalized between 0 and 1, between vertices
   v[i ]and v[j ](i = 1, 2,..., N, j = 1, 2,..., N, and i≠j ) as
   [MATH: <mi>d</mi><mo
   class="MathClass-rel">=</mo><mfrac><mrow><msubsup><mrow><mo
   class="MathClass-op">∑</mo></mrow><mrow><mi>i</mi><mo
   class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow><mi>N</mi><mo
   class="MathClass-bin">-</mo><mn>1</mn></mrow></msubsup><msubsup><mrow><
   mo class="MathClass-op"> ∑</mo></mrow><mrow><mi>j</mi><mo
   class="MathClass-rel">=</mo><mi>i</mi><mo
   class="MathClass-bin">+</mo><mn>1</mn></mrow><mrow><mi>N</mi></mrow></m
   subsup><msub><mrow><mi>w</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></
   msub></mrow><mrow><mi>N</mi><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>N</mi><mo
   class="MathClass-bin">-</mo><mn>1</mn></mrow><mo
   class="MathClass-close">)</mo></mrow><mo
   class="MathClass-bin">/</mo><mn>2</mn></mrow></mfrac> :MATH]

   For mining densely connected networks in the WGCN, our approach is
   based on an existing algorithm previously developed for mining weighted
   networks [[41]11]. Different from many graph mining approaches (e.g.,
   [[42]12]) that focus on unweighted graphs, the algorithm of [[43]11]
   targets primarily at identifying dense components (or sub-networks) in
   a weighted graph (i.e., each edge has a weight), although it is called
   Quasi-Clique Merger (QCM). To mine dense-sub-networks in a
   gene-coexpression network, we slightly revise the original QCM
   algorithm by removing the hierarchical construction which does not
   contribute to our dense-sub-network finding, and changing the new
   search start condition from checking vertex coverage to checking edge
   coverage to ensure that each edge with its weight no less than the
   weight threshold (γ times the maximum edge weight) will be covered by
   at least one dense-sub-network. The revised algorithm is sketched
   below:

   Algorithm 1 eQCM (edge-covering Quasi-Clique Merger, a revised version
   of QCM [[44]11]. Input G=(V, E), γ, λ, t, β, Output: C )

   1: Sort edges in E in descending order of their weights;

   2: for i = 1:μ {e[μ ]is the last edge in the above sorted list with
   weight ≥γ·e[1]}

   3: if e[i ]is an edge in any sub-network in C

   4: continue;

   5: endif

   6: C = V(e[i]); U = V \ V(e[i]);

   7: while max[{v∈U}](contribute(v,C)) ≥ threshold

   8: C = C ∪ {v};

   9: U = U \ {v};

   10: endwhile

   11: C = C ∪ {C};

   12: endfor

   13: Merging highly overlapped sub-networks in C with respect to β;

   14: Output C ;

   To be consistent with the original QCM algorithm [[45]11], contribute
   (v, C) is defined as the ratio of the edge weight increase of G(C) on
   adding the vertex v, over the size of C, and threshold is
   [MATH: <mfenced close=")" open="("><mrow><mn>1</mn><mo
   class="MathClass-bin">-</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mn>2</
   mn><mi>λ</mi><mfenced close=")" open="("><mrow><mo
   class="MathClass-rel">|</mo><mi>C</mi><mo
   class="MathClass-rel">|</mo><mo
   class="MathClass-bin">+</mo><mi>t</mi></mrow></mfenced></mrow></mfrac><
   /mrow></mfenced> :MATH]
   *density (G(C)), which is determined by the input parameter λ, t, the
   size of C, and the density of the sub-network induced by C. Readers may
   refer to [[46]11] for additional details. The last second step
   (merging) is the step 4 in the original QCM algorithm. Since we are
   interested in identifying gene sub-networks with potential consistent
   functions, we select only the sub-networks with at least 10 genes to
   facilitate gene function enrichment analysis.

Survival test for identified networks

   For each sub-network, we test if the genes in it can be used as
   potential prognostic markers for predicting GBM survival. For a network
   with k genes, we extract the expression values for them for all
   patients and use them as the feature vectors for patients. The patients
   are then divided into two groups using the unsupervised K-means
   clustering algorithm (K = 2, 100 time replicates, correlation distance
   measure).

   The survival times for the two patient groups are plotted in
   Kaplan-Meier curves and the difference between the two groups is tested
   using log-rank test (code at
   [47]http://www.mathworks.com/matlabcentral/fileexchange/20388).
   P-values for the log-rank tests for all the identified networks are
   recorded.

Select representative sub-networks with significant p-values

   Since many of the identified networks have large overlaps, we cannot
   directly apply multiple test compensation method such as the Bonferroni
   correction as the tests are not independent and such correction would
   be too conservative. Instead, we design a randomized test to determine
   the false discovery ratio (FDR) for selecting significant sub-networks.

   For an N-gene sub-network, we randomly selected a list of genes from
   the entire gene list in the dataset such that the expected length of
   the selected gene list is N. Then we repeat the survival test process
   as described above. Such random test is repeated 1000 times. The lower
   5^th percentile of the 1000 p-values is used as the threshold for
   p-value for selecting sub-networks with significant prognostic power.
   Since we have a large number of sub-networks and cannot carry out 1000
   random tests for every possible N, we do such random tests for N =
   1*10^1 , 2*10^1 ,...9*10^1 , 1*10^2 , 2*10^2 ,... and the p-value
   thresholds are p[10], p[20],..., p[100], p[200],... Our results show
   that the p-value thresholds are close when N are close. Thus for a
   sub-network with size N', its p-value for survival test is compared to
   p[i ]where
   [MATH: <mrow><mi>i</mi><mo class="MathClass-rel">=</mo><mrow><mfenced
   close="⌋"
   open="⌊"><mfrac><mrow><msup><mrow><mi>N</mi></mrow><mrow><mi>′</mi></mr
   ow></msup></mrow><mrow><msup><mrow><mn>10</mn></mrow><mrow><mfenced
   close="⌋"
   open="⌊"><mrow><mtext>lg</mtext><msup><mrow><mi>N</mi></mrow><mrow><mi>
   ′</mi></mrow></msup></mrow></mfenced></mrow></msup></mrow></mfrac></mfe
   nced></mrow><mo
   class="MathClass-bin">*</mo><msup><mrow><mn>10</mn></mrow><mrow><mfence
   d close="⌋"
   open="⌊"><mrow><mtext>lg</mtext><mi>N</mi></mrow><mrow><mi>′</mi></mrow
   ></mfenced></mrow></msup></mrow> :MATH]
   to determine if it is significant. For example, a gene list with 28
   genes compares its p-value to p[20], and a gene list with 250 genes
   compare its p-value to p[200].

   We also noticed that many of the selected significant sub-networks have
   substantial overlaps and they form exclusive clusters. To identify such
   clusters, we iteratively merge networks with substantial overlaps (i.e,
   the overlap ratio r between two networks is larger than 50%) into
   clusters. The overlap ratio between two sub-networks G[1]=(V[1], E[1])
   and G[2]=(V[2], E[2]) is defined as
   [MATH: <mfrac><mrow><mo
   class="MathClass-rel">|</mo><msub><mrow><mi>V</mi></mrow><mrow><mn>1</m
   n></mrow></msub><mo
   class="MathClass-bin">∩</mo><msub><mrow><mi>V</mi></mrow><mrow><mn>2</m
   n></mrow></msub><mo
   class="MathClass-rel">|</mo></mrow><mrow><mi>m</mi><mi>i</mi><mi>n</mi>
   <mrow><mo class="MathClass-open">(</mo><mrow><mfenced close="|"
   open="|"><mrow><msub><mrow><mi>V</mi></mrow><mrow><mn>1</mn></mrow></ms
   ub></mrow></mfenced><mo class="MathClass-punc">,</mo><mo
   class="MathClass-rel">|</mo><msub><mrow><mi>V</mi></mrow><mrow><mn>2</m
   n></mrow></msub><mo class="MathClass-rel">|</mo></mrow><mo
   class="MathClass-close">)</mo></mrow></mrow></mfrac> :MATH]
   Then we pick the sub-network with the lowest p-value in each cluster as
   the representative sub-network for further analysis.

   For the representative sub-networks, we used TOPPGene
   ([48]http://toppgene.cchmc.org) for gene ontology and pathway
   enrichment analysis without Bonferroni correction.

Results

   Using the eQCM algorithm (γ = 0.7, λ = 1, t = 1, β = 0.99999), we
   identified 8,124 sub-networks with at least ten vertices in the WGCN.
   The survival tests were then carried out on them and 866 show p-values
   less than 0.05. In addition, random tests were performed to obtain
   p[10], p[20],..., p[90], p[100],..., p[500 ]and all of them are smaller
   than 0.01. 170 sub-networks with significant p-values were selected and
   their densities range from 0.612 to 0.862. Then sub-networks with
   substantial overlaps (overlap ratio > 50%) were iteratively merged into
   sixteen clusters. The representative sub-networks for every cluster and
   their p-values and enriched GO functions are shown in Table [49]1. For
   cluster 1, the representative sub-network is highly enriched with genes
   involved in extracellular matrix organization (p = 8.22 × 10^-7) which
   also engage in many important biological processes such as cell-cell
   signaling and immune responses. Indeed, the entire set of genes in
   cluster 1 are highly enriched with immune system process genes (p =
   1.01 × 10^-46). Figure [50]1 shows examples of the Kaplan-Meier curves
   for some of the representative sub-networks in separating the patients
   using the unsupervised K-means algorithm, and heatmaps for these
   sub-networks.

Table 1.

   List of representative networks with log-rank test p-values.
   Cluster # # of unique genes Representative network size Log-rank test
   p-value Top enriched GO terms Member genes of the representative
   network
   1 284 11 5.7 × 10^-5 BP: extracellular matrix organization (8.22 ×
   10^-7)
   BP (entire cluster): immune system process (1.01 × 10^-46)
   MF: enzyme inhibitor activity (2.28 × 10^-7)
   CC: proteinacious extracellular matrix (7.32 × 10^-7) CLIC1, ILK,
   LGALS1, LGALS3, ANXA2, TIMP1, ANXA2P2, IQGAP1, EMP3, CAST, HEXB
     __________________________________________________________________

   2 43 22 1.31 × 10^-4 BP: chromatin organization (1.91 × 10^-4)
   MF: deoxycytidyltransferase activity (2.28 × 10^-7)
   CC: nucleoplasm (7.75 × 10^-5) C10orf18, TAF5, SIRT1, FMR1, FBXO11,
   TCERG1, CXorf45, CASP8AP2, ARID4B, JMJD1C, TAF2, ELF2, CENPC1, ZNF131,
   NUP153, SUZ12, SR140, ATAD2B, HISPPD1, REV1, PMS1, ZCCHC11
     __________________________________________________________________

   3 34 16 2.21 × 10^-4 BP: RNA processing (6.60 × 10^-5)
   MF: poly(A)-specific ribonuclease activity (1.50 × 10^-3)
   CC: nuclear speck (9.97 × 10^-5) USP52, ZCCHC11, FNBP4, CROP, NKTR,
   SFRS18, RBM6, RBM5, CCNL2, C21orf66, DMTF1, WSB1, CDK5RAP3, ZNF692,
   LOC440350, LOC339047
     __________________________________________________________________

   4 27 25 4.52 × 10^-4 BP: translation (2.06 × 10^-5), ncRNA metabolic
   process (4.01 × 10^-4)
   MF: structural constituent of ribosome (5.92 × 10^-5)
   CC: ribonucleoprotein complex (2.03 × 10^-7) RPP30, UCK2, BUB3, SMNDC1,
   SAR1A, MRPS16, GLRX3, TIMM23, UTP11L, HCCS, POLR3C, EIF2B3, MRPL9,
   SNRPD1, TFB2M, SUMO1, FASTKD3, HSPA14, DUSP11, ATPBD1C, MRPS15, MED28,
   GTF2B, MRPL22, POLE3
     __________________________________________________________________

   5 15 15 6.19 × 10^-4 BP: RNA processing (2.18 × 10^-10), RNA splicing
   (7.93 × 10^-8)
   MF: RNA binding (2.00 × 10^-12)
   CC: heterogeneous nuclear ribonucleoprotein complex (1.22 × 10^-12)
   JARID1B, RBM12, ADNP, CPSF6, HNRPA3, ILF3, CTCF, HNRPD, HNRNPA0, SART3,
   HNRPDL, SFPQ, HNRNPR, TARDBP, TLK2
     __________________________________________________________________

   6 35 27 0.0016 BP: chromatin modification (1.64 × 10^-9), histone
   acetylation (5.73 × 10^-6)
   MF: transcription activator activity (1.18 × 10^-5)
   CC: nucleolus (1.39 × 10^-8) BAHCC1, CHD7, PHF2, TOP2B, TCF4, MYST3,
   SETD5, POGZ, BRD3, MED13, BPTF, GPATCH8, TARDBP, ILF3, HNRNPR, NASP,
   MDC1, ARID1A, TRIM33, CTCF, HNRPA3, RBM10, YLPM1, SMARCA4, SART3,
   SFRS8, EP400
     __________________________________________________________________

   7 12 12 0.002747 BP: pentose-phosphate shunt , oxidative branch (1.94 ×
   10^-3)
   MF: 6-phosphoglucono-lactonase activity (1.29 × 10^-3)
   CC: ribosome (7.42 × 10^-3) PGLS, TMED1, CD320, MRPL4, RFXANK,
   TMEM161A, CLPP, STX10, TMEM147, EIF3G, C19orf56, UBA52
     __________________________________________________________________

   8 39 32 0.002782 BP: translation (8.30 × 10^-7)
   MF: structural constituent of ribosome (8.92 × 10^-9)
   CC: ribosome (1.59 × 10^-9), mitochondrion (2.47 × 10^-6) COMMD3,
   HSBP1, ZNF32, SUPT4H1, NFU1, LYRM4, RPS3A, RPS7, SNRPG, HAX1, MED28,
   UXT, MRPL22, FAM96B, UQCRQ, HBXIP, UBL5, MRPS15, NDUFA2, GTF2B, DUSP11,
   PSMA5, GTF2A2, PSMB4, ATP5F1, MRPL13, ATPBD1C, MRPL46, MRPL11, MRPS7,
   WDR61, BNIP1
     __________________________________________________________________

   9 25 25 0.003697 BP: RNA processing (1.06 × 10^-3), mitotic cell cycle
   (1.66 × 10^-3)
   MF: eukaryotic initiation factor 4G binding (1.29 × 10^-3), RNA binding
   (4.08 × 10^-3)
   CC: chromosomal part (2.64 × 10^-3) DDX52, PRPSAP2, YWHAQ, ORC4L,
   MOBKL3, MYNN, CENPQ, C11orf73, MIS12, HMGN4, C14orf104, FASTKD3,
   SNRPD1, C4orf27, SFRS3, SUMO1, GIN1, FLJ13611, THAP1, ATPBD1C, DUSP11,
   EIF4E, PIGF, RY1, NIF3L1
     __________________________________________________________________

   10 14 14 0.004707 BP: nuclear-transcribed mRNA catabolic process (1.82
   × 10^-3)
   MF: RNA binding (3.02 × 10^-3)
   CC: BRISC complex (2.94 × 10^-3) DDX50, DIP2C, KIAA0157, KIAA1128,
   KIAA1279, LARP5, PAPD1, RAB11FIP2, SHOC2, TNKS2, UPF2, WAC, WDR37,
   ZMYND11
     __________________________________________________________________

   11 22 16 0.005489 BP: type I interferon-mediated signalling (4.49 ×
   10^-14) pathway, immune system process (1.79 × 10^-11)
   MF: MHC class I receptor activity (4.94 × 10^-10)
   CC: MHC class I protein complex (1.80 × 10^-9) CASP1, CASP4, PLSCR1,
   NMI, SP100, SP110, TRIM22, TRIM6-TRIM34, TRIM21, IFI35, PSMB9, PSMB8,
   HLA-F, HLA-B, HLA-C, HLA-E
     __________________________________________________________________

   12 12 12 0.005663 BP: -
   MF: sequence-specific DNA binding transcription factor activity (2.43 ×
   10^-2)
   CC: - ZNF134, ZNF180, ZNF211, ZNF222, ZNF223, ZNF228, ZNF230, ZNF304,
   ZNF419, ZNF45, ZNF606, ZNF8
     __________________________________________________________________

   13 21 21 0.006774 BP: mitotic cell cycle (6.43 × 10^-5)
   MF: RNA trimethylguanosine synthase activity (1.13 × 10^-3)
   CC: nucleoplasm (4.08 × 10^-4) EED, POLD3, ELF2, CENPC1, HISPPD1,
   ZNF131, RBM12, CEP57, NOL11, COIL, NUP160, CEP76, ZNF140, ZNF143, TDG,
   TAF11, FASTKD3, TGS1, EXOSC9, YTHDF2, SAE2
     __________________________________________________________________

   14 18 18 0.006858 BP: arginine biosynthetic process via orthithine
   (9.69 × 10^-4)
   MF: argininosuccinate lyase activity 9.67 × 10^-4)
   CC: organelle envelope (4.11 × 10^-3) ASL, ZMYM6, RAB32, CD58, MOBKL1B,
   TRAM1, CD164, RER1, CCDC109B, CLIC1, CASP4, SQRDL, SERPINB1, MR1,
   CASP1, CAPG, MGAT4A, ANXA4
     __________________________________________________________________

   15 22 22 0.007708 BP: multicellular organismal movement (4.97 × 10^-4)
   MF: ATP-dependent helicase activity (2.22 × 10^-4)
   CC: endopeptidase Clp complex (1.10 × 10^-3) SEC24A, TMF1, KIAA0372,
   CLCC1, DHX29, SLC30A5, VPS54, CHD1, RPS6KB1, HISPPD1, ETAA1, CENPC1,
   CLPX, C1orf9, ZNF131, KLHL20, REV1, ZC3H7A, DDX46, NUP153, SMCHD1,
   PPWD1
     __________________________________________________________________

   16 15 15 0.008207 BP: ribulose-phosphate 3-epimerase activity (8.069 ×
   10^-4)
   MF: mRNA 3'-end processing (1.429 × 10^-3)
   CC: SPOTS complex (1.574 × 10^-3) C15orf15, SELT, COMMD10, UBE2A,
   TMED2, CNOT8, NMD3, MRPL42, BZW1, NUDT21, SPTLC1, DCTN4, YIPF5, RPE,
   C20orf30
   [51]Open in a new tab

   The p-values associated with the GO terms are based on Fisher's exact
   tests without Bonferroni correction
   ([52]http://toppgene.cchmc.or[53]g).

Figure 1.

   [54]Figure 1
   [55]Open in a new tab

   Kaplan-Meier curves and heatmaps for two groups of patients with
   significantly different survival times identified using three networks.
   Top: Left - The Kaplan-Meier curve for the patients separated using the
   immune response network #1. Right - The heatmap of the gene expression
   values for the genes in the representative network #1. The vertical
   white line indicates the separation between short survival (left to the
   white line) and long survival (right to the white line) groups. The
   density of the representative network is 0.6928 and the p-value is 5.7
   × 10^-5. Middle: The Kaplan-Meier curve and heatmap for the chromatin
   modification network #6. The density of the representative network is
   0.6951 and the p-value is 0.0016. Bottom: The Kaplan-Meier curve and
   heatmap for the mitotic cycle network #13. The density of the
   representative network is 0.6764 and the p-value is 0.006774.

Discussion

   In this paper, we carried out a co-expression analysis on GBM gene
   expression data to screen for biological processes involved in patient
   prognosis. In previous studies, using co-expression analysis based on
   clustering algorithm, ASPM has been identified as an important target
   gene in GBM [[56]9]. ASPM is involved in cell cycle and mitosis
   functions and many networks with ASPM were identified in our study. We
   also identified a mitosis related sub-network with a significant
   p-value in our study (sub-network #13 in Table [57]1). Besides cell
   cycle networks, immune response networks also prove to be critical in
   GBM development as shown in sub-networks #1 and #11, which is
   consistent with the previous report on the importance of immune and
   inflammation genes in GBM [[58]13]. As shown in Figure [59]1, genes in
   sub-network #1 show higher expression levels in the short survival
   group. Since a characteristic of GBM is its high metastasis occurrence
   and extracellular and immune genes play important roles in metastasis,
   the genes in this group may be potential targets for treatment for
   reducing metastasis. An interesting observation is that two
   sub-networks (#2 and #6) related to chromatin modification are
   identified. Particularly in sub-network #6, histone acetylation genes
   are highly enriched including well known chromatin modification genes
   such as CTCF [[60]14] and EP400 [[61]15]. The expression levels of
   these genes show down-regulation in the short survival group which
   indicates a possibly reduced histone acetylation activity. Histone
   acetylation is an important epigenetic event [[62]16] and our findings
   suggest that epigenetics may play an important role in GBM development
   and prognosis and ChIP-seq experiments targeting histone acetylation
   changes associated with GBM development may be necessary. These
   findings are subject to further cross-validation and experimental
   investigation. Besides genes, our approach can be applied to identify
   microRNA modules which show strong association with patient survival
   and the results can also shed light on microRNA transcription
   regulation.

Conclusions

   In this paper, we introduced eQCM algorithm for mining dense network
   clusters in weighted graphs and used this approach to identify 16 gene
   networks associated with GBM prognosis on weighted gene co-expression
   network. Our results not only confirmed previous findings including the
   importance of cell cycle and immune response networks in GBM, but also
   suggested important epigenetic events in GBM development and prognosis.

Competing interests

   The authors declare that they have no competing interests.

Authors' contributions

   YX carried out the development and implementation of eQCM and survival
   tests. CQZ originally proposed and designed the QCM algorithm. KH led
   the project including development of the idea, design of all
   experiments and writing of the manuscript. All authors edited the
   manuscript.

Contributor Information

   Yang Xiang, Email: yxiang@bmi.osu.edu.

   Cun-Quan Zhang, Email: cqzhang@math.wvu.edu.

   Kun Huang, Email: kun.huang@osumc.edu.

Acknowledgements