Abstract

   The rapid accumulation of cancer-related data owing to high-throughput
   technologies has provided unprecedented choices to understand the
   progression of cancer and discover functional networks in multiple
   cancers. Establishment of co-expression networks will help us to
   discover the systemic properties of carcinogenesis features and
   regulatory mechanisms of multiple cancers. Here, we proposed a
   computational workflow to identify differentially co-expressed gene
   modules across 8 cancer types by using combined gene differential
   expression analysis methods and a higher-order generalized singular
   value decomposition. Four co-expression modules were identified; and
   oncogenes and tumor suppressors were significantly enriched in these
   modules. Functional enrichment analysis demonstrated the significantly
   enriched pathways in these modules, including ECM-receptor interaction,
   focal adhesion and PI3K-Akt signaling pathway. The top-ranked miRNAs
   (mir-199, mir-29, mir-200) and transcription factors (FOXO4, E2A, NFAT,
   and MAZ) were identified, which play an important role in deregulating
   cellular energetics; and regulating angiogenesis and cancer immune
   system. The clinical significance of the co-expressed gene clusters was
   assessed by evaluating their predictability of cancer patients’
   survival. The predictive power of different clusters and subclusters
   was demonstrated. Our results will be valuable in cancer-related gene
   function annotation and for the evaluation of cancer patients’
   prognosis.

   Keywords: co-expression network, prognosis, HO-GSVD, gene module,
   cancer

INTRODUCTION

   The rapid accumulation of cancer-related data owing to high-throughput
   technologies has provided unprecedented choices to understand the
   progression of cancer and discover functional networks in multiple
   cancers. The vast majority of cancer-related studies have focused on a
   single cancer, but always ignored the common traits across different
   cancer types. Different cancers usually share common hallmarks, such as
   evading growth suppressors, resisting cell death and inducing
   angiogenesis. Moreover, the methods based on biological networks
   including gene co-expression networks, metabolic networks,
   protein-protein interaction networks and genetic regulatory networks
   can infer regulatory mechanisms related to biological processes. The
   network-based methods to search biological processes related to cancer
   hallmarks will help us in identifying the characterizations of tumor
   biology. Few studies focus on genome-scale networks across different
   cancer types. Thus, the network analysis may help us to unveil common
   traits involved in multiple cancers.

   The majority of the network-based methods are performed to identify
   distinct patterns within one cancer [[34]1, [35]2]. Compared with only
   analyzing one cancer, several methods have been used to identify common
   patterns shared by two or more cancers. Zhang et al. used a network
   mining algorithm to build tightly connected gene co-expression networks
   from the microarray datasets spanning 33 cancer types. Their results
   indicate that the commonly recognized characteristics of cancers are
   supported by highly coordinated transcriptomic activities [[36]3]. Yang
   et al. did the weighted correlation network analysis (WGCNA) to
   highlight common properties of prognostic genes in four cancer types
   [[37]4]. Li et al. described a network method to analyze the driver
   mutation by integrating both cancer genomes and transcriptomes and
   identified a significant correlation of genotype to phenotype in six
   solid tumors [[38]5].

   Several computationally efficient methods have been developed for
   construction of the networks, such as Generalized Singular Value
   Decomposition (GSVD) [[39]6] and higher-order GSVD (HO-GSVD) [[40]7].
   GSVD is used for identifying common structures across two conditions
   [[41]6]. The analysis based on HO-GSVD can extract common gene modules
   in two or more conditions [[42]7–[43]9]. Ponnapalli et al. first
   proposed HO-GSVD to analyze common structures shared by multiple
   datasets [[44]7]. Xiao et al. developed a new HO-GSVD method for
   analyzing common and tissue-specific modules from seven rat tissues and
   four human brain regions [[45]8]. Wang et al. applied a simple
   mathematical framework of HO-GSVD for analysis of multiple tissues
   [[46]9]. These studies indicate that HO-GSVD is a valuable method in
   common gene pattern discovery among different tissues. Motivated by
   this approach, we applied HO-GSVD to pan-cancer analysis in this study.
   To our best of knowledge, this approach has not been used for
   pan-cancer research.

   Differential gene expression analysis has been widely used for
   identifying differentially expressed genes between conditions. The
   commonly used methods include edgeR [[47]10], limma [[48]11], SAMseq
   [[49]12] and DESeq [[50]13]. Most studies only use one method to
   analyze gene expression patterns. Soneson et al. compared the commonly
   used methods for differential expression analysis and found that no
   single method is optimal under all circumstances and the method of
   choice in a particular situation depends on the experimental conditions
   [[51]14].

   In this study, we applied the above four methods for differential
   expression analysis to get common genes in eight types of cancers
   through a three-step procedure. First, the differentially expressed
   gene set shared by four caner types was selected using one method.
   Second, the commonly expressed gene set selected by four methods was
   set as a candidate gene set. We then converted candidate gene IDs to
   the corresponding DAVID gene IDs and removed the non-mapped genes. The
   gene expression matrices of multiple cancers were decomposed by the
   HO-GSVD method for identifying the common modules across different
   cancers. We chose the vectors with top eigenvalues in the right basis
   matrix as candidates of co-expression genes. The co-expression genes
   were selected based on the assumption that a small proportion of genes
   in candidate vectors is highly similar. We used the DAVID tools to
   validate the functional significance of the modules. The gene modules
   involved in the significant pathways were retained for further
   analysis. Multiple types of enrichment analysis were performed,
   including gene ontology terms, KEGG pathways, cell type enrichment,
   disease association and miRNA and transcription factor enrichment
   analysis [[52]15–[53]17]. The functional interaction networks were
   constructed for the identified modules. The survival analysis was then
   applied for the prognostic properties of modules across cancers. By
   following the procedure shows in the Figure [54]1, we found that the
   co-expression gene modules are enriched with oncogenes and tumor
   suppressors, which play an important regulatory role in multiple
   cancers. The genes in the main- and sub-modules are closely associated
   with the prognosis of multiple cancers.

Figure 1. Overview of the workflow.

   [55]Figure 1
   [56]Open in a new tab

   There are three main steps including gene differential expression
   analysis, identification of co-expression modules and significant
   enrichment analysis.

RESULTS

Identifying co-expression gene modules

   We analyzed differentially expressed genes using the raw count data and
   constructed co-expression networks using the FPKM count data. 5229
   differentially expressed genes were detected in breast invasive
   carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), prostate
   adenocarcinoma (PRAD), thyroid carcinoma (THCA), lung adenocarcinoma
   (LUAD), bladder urothelial carcinoma (BLCA), colon adenocarcinoma
   (COAD) and stomach adenocarcinoma (STAD) using limma, edgeR, DESeq, and
   SAMseq methods (Figure [57]2A). 4973 genes were used for construction
   of co-expression networks and pathway enrichment analysis. Five
   clusters with significant pathways were identified ([58]Supplementary
   Table 1). The significant pathways enriched in the smallest cluster 10
   were almost identical to the pathways in the cluster 5. Then five
   clusters except cluster 10 were kept for further analysis. The highly
   specific genes were included in each cluster. But there were a few
   genes shared by these clusters (Figure [59]2B). The results showed that
   these co-expression modules with specific genes may share similar
   biological information. To comprehensively investigate the
   characterization of the identified co-expression modules, we applied
   multiple types of enrichment analysis.

Figure 2. Identification of differentially expressed genes and co-expression
modules.

   [60]Figure 2
   [61]Open in a new tab

   (A) Venn diagram showing the overlap between differentially expressed
   genes selected by the four methods. (B) Venn diagram showing the
   overlap between the genes in the four co-expression modules.

   We applied functional enrichment analysis to identify the KEGG pathways
   (Figure [62]3A–3D) and Gene Ontology terms ([63]Supplementary Table 2)
   in the four clusters. For cluster 5, there are only six significant
   KEGG pathways as shown in the Figure [64]3B. ECM-receptor interaction,
   focal adhesion and PI3K-Akt signaling pathways were presented in the
   cluster 8 and cluster 5, consistent with the previous results obtained
   by pan-cancer analysis [[65]18]. Moreover, these pathways are related
   to the deregulation of cellular energetics [[66]19]. Abnormal ECM
   affects cancer progression by directly promoting cellular
   transformation and metastasis [[67]20]. Focal adhesion kinase, a
   protein tyrosine kinase, regulates cellular adhesion, motility,
   proliferation and survival in cancer cells, thereby promoting cancer
   progression and metastasis [[68]21]. PI3K-Akt signaling pathway has
   been reported as one of the most important pathways in cancer
   metabolism and growth [[69]22]. Figure [70]3C and Figure [71]3D show
   the enriched pathways in the cluster 7 and 8, respectively. Only focal
   adhesion was shared by the cluster 2, 5 and 8 (Figure [72]3A, 3B, 3D).
   These results indicated that the enriched pathways in the clusters are
   associated with tumorigenesis. To further evaluate whether the
   functional features in these clusters are also important in other
   cancers, we did analysis for another four cancers, including uterine
   corpus endometrial carcinoma (UCEC), head and neck squamous cell
   carcinoma (HNSC), rectum adenocarcinoma (READ) and liver hepatocellular
   carcinoma (LIHC) ([73]Supplementary Figure 1). 8619 genes were
   identified. We converted the candidate gene IDs to the corresponding
   DAVID gene IDs. If two IDs were corresponding to the same gene, we
   removed the redundant one. Therefore, 490 redundant genes were removed
   and 8129 genes used for further analysis. We got two clusters with
   significant pathways ([74]Supplementary Table 3) using HO-GSVD
   approach. We found that cluster 6 in the four cancers was highly
   enriched in focal adhesion
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>6</mn></msub><mo>=</mo><mn>5
   </mn><mo>.</mo><mn>7</mn><mtext>e</mtext><mo>−</mo><mn>11</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   , ECM-receptor interaction
   [MATH: <mrow><mtext> </mtext><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>6</mn></msub><mo>=</mo><mn>2
   .6</mn><mtext>e</mtext><mo>−</mo><mn>10</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   and PI3K-Akt signaling
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>6</mn></msub><mo>=</mo><mn>2
   .7</mn><mtext>e</mtext><mo>−</mo><mn>5</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   pathways. For the eight cancers, these clusters partially shared some
   KEGG pathways. We observed some similar patterns in the following
   results of the GO biological process (BP) enrichment analysis. The
   top-ranked five enriched BPs in the cluster 5 included collagen fibril
   organization
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>3
   .5</mn><mtext>e</mtext><mo>−</mo><mn>6</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   , extracellular matrix organization
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>4
   .5</mn><mtext>e</mtext><mo>−</mo><mn>6</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   , collagen catabolic process
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>4
   </mn><mo>.</mo><mn>9</mn><mtext>e</mtext><mo>−</mo><mn>5</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   , skeletal system development
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>3
   </mn><mo>.</mo><mn>2</mn><mtext>e</mtext><mo>−</mo><mn>3</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   and cell adhesion
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>1
   </mn><mo>.</mo><mn>3</mn><mtext>e</mtext><mo>−</mo><mn>2</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   . The top three enriched BPs in the cluster 8 were the same as those in
   the cluster 5. It was reported that the collagen fibrils reorganization
   within an extracellular matrix facilitated tumorigenesis and invasion
   [[75]23]. We found that the BPs in the cluster 7 are mainly involved in
   T cell and immune response [[76]24]. The highly enriched BPs in the
   cluster 2 are related to angiogenesis. In addition to cluster 7, these
   commonly enriched BPs are associated with extracellular matrix
   organization and cell adhesion. Similarity to the KEGG pathways, the
   corresponding enriched BPs in these clusters also play an important
   role in cancers. As expected, all the above results indicated that the
   identified clusters with biological functions related to cellular
   energetics, angiogenesis and anti-cancer immunity are important to the
   cancer development.

Figure 3. Enrichment of co-expression modules.

   [77]Figure 3
   [78]Open in a new tab

   The four bar charts display the pathway enrichment results of cluster 2
   (A), cluster 5 (B), cluster 7 (C) and cluster 8 (D). The pathway shared
   by at least two cluster are colored with light green. The cell-type
   enrichment analysis of cluster 8 (E) and other clusters
   ([79]Supplementary Figure 2) is shown as –log10 (Benjamini-Hochberg
   corrected p-value). The overlap of the members in the top 5-ranked
   miRNA families is shown as the Venn diagram (F).

   To further explore the cell specificity and disease association of the
   differentially expressed genes, we analyzed cell type enrichment (Cten)
   and disease association for the genes in four clusters
   ([80]Supplementary Tables 4, [81]5). Enriched cell types in these
   clusters were closely associated with all eight cancer types, including
   BRCA, KIRC, PRAD, THCA, LUAD, BLCA, COAD, STAD. The top three cell
   types for the cluster 8 included cardiac myocytes (score as –log10
   (Benjamini-Hochberg corrected p-value), score = 57), smooth muscle
   (score = 55) and adipocyte (score = 38) (Figure [82]3E). Adipocytes can
   support tumorigenesis by secreting adipokines and producing energy
   [[83]25]. The immune cells in the cluster 7 and cluster 8 included
   CD14+ Monocytes, CD33+ Myeloid and BDCA4+ Dentritic Cells. CD14+
   Monocytes (score = 4.88 in the cluster 7 and score = 7.81 in cluster 8)
   has been reported to affect dendritic cell differentiation and T-cell
   function in cancer patients [[84]26]. These results indicated that the
   enriched cell types in the clusters were associated with tumor
   progression. Moreover, disease association analysis suggested that the
   genes in the four clusters were mainly enriched in cancer-related
   diseases with high significance ([85]Supplementary Table 5,
   [MATH:
   <mrow><mi>p</mi><mo>≤</mo><mn>8</mn><mo>.</mo><mn>39</mn><mtext>e</mtex
   t><mo>−</mo><mn>8</mn></mrow> :MATH]
   ). All the above results further suggested that the four clusters which
   play important roles in biological processes related to tumor
   progression and immunity are associated with multiple cancers.

   To further demonstrate the potential regulatory mechanisms, we also
   analyzed miRNAs and transcription factors using the WebGestalt tool
   ([86]Supplementary Tables 6, [87]7). Three of the top ranked miRNA
   families were identical in the cluster 2, 5 and 8, including mir-199,
   mir-29 and mir-200 (Figure [88]3F). Mir-199a was only presented in the
   cluster 2 and cluster 5
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>2</mn></msub><mo>=</mo><mn>0
   </mn><mo>.</mo><mn>004</mn><mo>,</mo><mtext>  </mtext><msub><mi>p</mi><
   mn>5</mn></msub><mo>=</mo><mn>0</mn><mo>.</mo><mn>0035</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   . Mir-29a, mir-29b and mir-29c
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>0
   </mn><mo>.</mo><mn>0009</mn><mo>,</mo><mtext>  </mtext><msub><mi>p</mi>
   <mn>8</mn></msub><mo>=</mo><mn>5</mn><mo>.</mo><mn>88</mn><mtext>e</mte
   xt><mo>−</mo><mn>5</mn><mo stretchy="false">)</mo></mrow> :MATH]
   in the cluster 5 and 8 are the mature members of the mir-29 family.
   Three of the main mir-200 family members
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>2</mn></msub><mo>=</mo><mn>0
   </mn><mo>.</mo><mn>0004</mn><mo>,</mo><mtext>  </mtext><msub><mi>p</mi>
   <mn>8</mn></msub><mo>=</mo><mn>5</mn><mo>.</mo><mn>88</mn><mtext>e</mte
   xt><mo>−</mo><mn>5</mn><mo stretchy="false">)</mo></mrow> :MATH]
   in the cluster 2 and 8 have been reported to associate with tumor
   progression [[89]27, [90]28]. The let-7 family of miRNAs
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>0
   </mn><mo>.</mo><mn>0009</mn><mo stretchy="false">)</mo></mrow> :MATH]
   in the cluster 5 is involved in tumorigenesis [[91]29]. We also found
   that some clusters were high significantly enriched with transcription
   factors. FOXO4 gene was presented in the cluster 2, 5 and 7
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>2</mn></msub><mo>=</mo><mn>2
   </mn><mo>.</mo><mn>82</mn><mtext>e</mtext><mo>−</mo><mn>9</mn><mo>,</mo
   ><mtext>  </mtext><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>4</mn>
   <mo>.</mo><mn>29</mn><mtext>e</mtext><mo>−</mo><mn>5</mn><mo>,</mo><mte
   xt>  </mtext><msub><mi>p</mi><mn>7</mn></msub><mo>=</mo><mn>0</mn><mo>.
   </mo><mn>0027</mn><mo stretchy="false">)</mo></mrow> :MATH]
   [[92]30]. The clusters 2, 5 and 8 shared three transcription factors,
   including E2A
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>2</mn></msub><mo>=</mo><mn>3
   </mn><mo>.</mo><mn>66</mn><mtext>e</mtext><mo>−</mo><mn>8</mn><mo>,</mo
   ><mtext>  </mtext><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>1</mn>
   <mo>.</mo><mn>06</mn><mtext>e</mtext><mo>−</mo><mn>9</mn><mo>,</mo><mte
   xt>  </mtext><msub><mi>p</mi><mn>8</mn></msub><mo>=</mo><mn>7</mn><mo>.
   </mo><mn>38</mn><mtext>e</mtext><mo>−</mo><mn>15</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   , NFAT
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>2</mn></msub><mo>=</mo><mn>8
   </mn><mo>.</mo><mn>79</mn><mtext>e</mtext><mo>−</mo><mn>11</mn><mo>,</m
   o><mtext>  </mtext><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>2</mn
   ><mo>.</mo><mn>38</mn><mtext>e</mtext><mo>−</mo><mn>5</mn><mo>,</mo><mt
   ext>  </mtext><msub><mi>p</mi><mn>8</mn></msub><mo>=</mo><mn>3</mn><mo>
   .</mo><mn>51</mn><mtext>e</mtext><mo>−</mo><mn>17</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   , and MAZ
   [MATH: <mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mn>2</mn></msub><mo>=</mo><mn>2
   </mn><mo>.</mo><mn>47</mn><mtext>e</mtext><mo>−</mo><mn>8</mn><mo>,</mo
   ><mtext>  </mtext><msub><mi>p</mi><mn>5</mn></msub><mo>=</mo><mn>2</mn>
   <mo>.</mo><mn>40</mn><mtext>e</mtext><mo>−</mo><mn>6</mn><mo>,</mo><mte
   xt>  </mtext><msub><mi>p</mi><mn>8</mn></msub><mo>=</mo><mn>2</mn><mo>.
   </mo><mn>01</mn><mtext>e</mtext><mo>−</mo><mn>15</mn><mo
   stretchy="false">)</mo></mrow> :MATH]
   . E2A, FOXO4, NFAT and MAZ play important roles in tumor cell growth
   and metastasis in multiple cancers, such as colorectal cancer, breast
   cancer, prostate cancer and lung cancer [[93]31–[94]33]. Taken
   together, these results revealed that common regulators shared by the
   four clusters may cooperate with each other to regulate the biological
   processes of multiple cancers.

Functional network analysis of individual co-expression modules

   To further analyze the characterization of single identified modules,
   we constructed the functional interaction networks. These networks were
   built based on human PPIs, fly PPIs, worm PPIs, yeast PPIs, domain
   interaction, Lee's Gene Expression, Prieto's Gene Expression, GO BP
   sharing and PPIs from GeneWays [[95]34]. We found that the
   corresponding proportions of the genes linked to the functional
   interaction (FI) networks in the four modules were 80.63%, 83.78%,
   76.60% and 87.20%, respectively ([96]Supplementary Table 1). This
   result indicated that the four clusters were highly conserved at the
   functional protein level. According to the results from the above
   subsections, all clusters were closely associated with the eight
   cancers. Among of them, most of the enriched pathways in the cluster 5
   were consistent with the previous pan-cancer outcomes [[97]18],
   therefore, we did further functional analysis of the cluster 5-based FI
   network as shown in the Figure [98]4.

Figure 4. Network visualization of the cluster 5.

   [99]Figure 4
   [100]Open in a new tab

   The functional interaction network consists of eight sub-modules marked
   with different colors. The genes and link genes in the modules are
   represented as circles and diamonds, respectively.

   The FI network was clustered into small modules and eight modules were
   annotated as M1~M8 ([101]Supplementary Table 8). We performed the
   enrichment analysis for GO BP and found that some high significantly
   BPs (FDR < 0.001) were consistent with the above results. Extracellular
   matrix disassembly, collagen catabolic process, extracellular matrix
   organization, collagen fibril organization, skeletal system development
   and cell adhesion were enriched in the M2. The BPs in the M7 are
   associated with the negative regulation of extracellular matrix
   disassembly and endothelial cell migration. Endothelial cell migration
   has been reported to contribute the entry of cancer cells into the
   circulatory system [[102]35]. There were two BPs in the M5, including
   O-glycan processing and protein O-linked glycosylation. It has been
   reported that alterations in glycosylation impacted cell cycle and may
   support neoplastic progression [[103]36]. Interestingly, M1 had only
   one enriched pathway, that is, the notch signaling pathway. Notch
   signaling pathway plays an important in regulating stem cell
   self-renewal and the pathogenesis of breast cancer [[104]37]. The most
   enriched pathway in the M2 was ECM-receptor interaction. Phenylalanine
   metabolism was the top enriched pathway in the M3. Nicotinic
   acetylcholine receptor signaling pathway in M4 and amino acid
   metabolism pathway in M6 are associated with cancer growth
   [[105]38–[106]40]. These results indicated that distinct BPs enriched
   in different sub-modules contribute to the development of cancers.

   We then defined an individual gene in the modules with at least ten
   neighbors as one hub gene, and obtained 20 hub genes, including 15
   linker genes and 5 module genes. Over half of the hub genes were
   enriched in the M2, including FN1, COL1A1, COL1A2, COL3A1 and COL6A3.
   FN1 is a FDA-approved drug target gene against cancer [[107]41,
   [108]42]. COL1A1, COL1A2, COL3A1 and COL6A3 are members of the collagen
   family. These five genes were enriched in pathways such as ECM-receptor
   interaction, protein digestion and absorption, integrin signaling
   pathway and PI3K-Akt signaling pathway. DES, THBS2, PLAU and POSTN in
   the M2 have been associated with multiple cancers [[109]43–[110]50].
   DES encodes a muscle-specific class III intermediate filament and is
   related with colorectal cancer, breast cancer, prostate cancer, kidney
   cancer and lung cancer [[111]43–[112]46]. As the member of THBS family,
   THBS2 plays an important role in cancer progression [[113]47]. PLAU is
   involved in cancer cell migration [[114]48]. The protein encoded by
   POSTN has been reported to function in cancer stem cell [[115]49,
   [116]50].

Survival analysis of gene co-expression modules

   To further test the clinical significance of the co-expressed gene
   clusters, we did survival analysis for the eight types of cancers. The
   patients were clustered into different groups using non-negative matrix
   factorization (NMF). The cluster 2 and cluster 5 were able to predict
   patient survival of various cancers, including KIRC, THCA, LUAD, BLCA,
   COAD and STAD (Figure [117]5A, [118]Supplementary Figure 3). The
   cluster 7 and cluster 8 could predict survival of LUAD, KIRC and BLCA
   cancer patients ([119]Supplementary Figures 4, [120]5). The cox models
   were used to identify prognostic genes in various cancers. The distinct
   proportions of prognostic genes in the eight cancers were observed in
   the four clusters (Figure [121]5B). Among all of the clusters, KIRC
   contains the largest proportion of prognostic genes. We also carried
   out the survival analysis on the eight sub-modules in the cluster 5, as
   well as the module 9 containing the genes that weren't connected to the
   networks in the cluster 5 ([122]Supplementary Tables 8, [123]9). The
   patients were grouped using the k-means method when the number of genes
   in the sub-module was one. In the survival analysis of individual
   sub-modules, all sub-modules showed statistically significant
   differences in survival probabilities (Figure [124]5C). M1, M2, M4 and
   M9 were significantly associated with the patients’ survival in at
   least three cancers. But only one sub-module showed a significant
   difference in the survival analysis of BRCA and none of the sub-modules
   was significant in PRAD. Analysis of STAD showed that individual
   sub-modules couldn't predict these patients’ survival but the
   corresponding cluster 5 had the predictive power. The results of the
   sub-modules in the survival analysis were mostly consistent with the
   cluster 5. All these results revealed that the sub-modules associated
   with known biological processes cooperate with each other to contribute
   to the prognosis in multiple cancers.

Figure 5. Survival analysis of gene co-expression modules for 8 types of
cancers.

   [125]Figure 5
   [126]Open in a new tab

   (A) Survival analysis of the cluster 5 using Kaplan-Meier curves. The
   calculation of the log-rank p-values is based on the split of patient
   groups using non-negative matrix factorization. The number of patients
   in each group is also labeled in each panel. (B) The distributions of
   prognostic genes for the eight cancer types in each cluster. The y-axis
   represents the gene proportion of each cancer in the corresponding
   cluster. The eight cancers are marked with different colors. (C) The
   differential significance distributions of nine sub-modules for each
   cancer. The y-axis represents the log-rank p-value and the x-axis is
   annotated with the sub-modules grouped by the network clustering. The
   solid line represents –log[2](0.05).

   In order to further validate the prognosis of cluster 5 in multiple
   cancers, we performed survival analysis on HNSC, READ, LIHC and UCEC.
   The results showed that the cluster 5 was able to predict patient
   survival in HNSC, LIHC and UCEC (Figure [127]6A), but not in READ. The
   corresponding sub-modules obtained from the functional network analysis
   had statistically significant differences in survival analysis of the
   four cancers (Figure [128]6B). M2 had predictive power for patient
   survival in three cancers. We also applied survival analysis on other
   eight cancers, including glioblastoma multiforme (GBM), brain lower
   grade glioma (LGG), ovarian serous cystadenocarcinoma (OV), skin
   cutaneous melanoma (SKCM), adrenocortical carcinoma (ACC), cervical
   squamous cell carcinoma and endocervical adenocarcinoma (CESC), kidney
   renal papillary cell carcinoma (KIRP) and lung squamous cell carcinoma
   (LUSC). As shown in the [129]Supplementary Figure 6, the cluster 5 can
   predict patients’ survival of LGG, SKCM, ACC and KIRP cancers. These
   results confirmed the prognostic ability of cluster 5 in 7 cancer
   types. We also applied survival analysis on the above twelve cancer
   types for the cluster 2, 7 and 8. All of the three clusters can predict
   patients’ survival of LIHC, UCEC, LGG, SKCM, ACC and KIRP cancers.
   Additionally, cluster 2 can predict the survival of HNSC and CESC
   cancer patients.

Figure 6. Survival analysis of gene co-expression modules for HNSC, LIHC,
READ and UCEC.

   [130]Figure 6
   [131]Open in a new tab

   (A) Survival analysis of the cluster 5 using Kaplan-Meier curves for
   HNSC, LIHC, READ and UCEC. (B) The differential significance
   distributions of nine sub-modules for each cancer.

DISCUSSION

   In this study, we applied four methods to analyze the data and obtained
   4973 differentially expressed genes shared by the eight cancers. We
   constructed the co-expression network using HO-GSVD and identified
   modules and associated pathways. Some of these pathways have been
   reported in the pan-cancer analysis [[132]18], such as focal adhesion,
   PI3K-Akt signaling pathway, ECM-receptor interaction and Systemic lupus
   erythematosus. We constructed the functional interaction networks for
   further analyzing individual co-expression modules. These sub-modules
   were enriched with different pathways and cooperated with each other
   across cancers. We found that these modules are associated with
   patients’ survival in multiple cancers. In addition, the individual
   sub-modules in various cancers had different prognostic capability.

   Gene differential expression analysis is a commonly used method for
   identifying differentially expressed genes without considering the
   relationship between genes. There is no consensus on the best method
   for differential expression analysis [[133]14, [134]51]. Limma, edgeR,
   DESeq, and SAMseq have been commonly used for gene differential
   expression analysis. Each of them has advantages and disadvantages. The
   limma method, based on linear models and the voom transformation, is
   developed for analyzing RNA-seq data [[135]52]. The negative binomial
   model and empirical Bayes methods in edgeR are used to detect
   differential gene expression, and then the gene-wise dispersions is
   estimated by conditional maximum likelihood [[136]10]. DESeq models the
   count data using the negative binomial distribution and estimates the
   mean-variance relationship of each gene. In contrast to edgeR, it
   allows a widely data-driven parameter in the statistical test
   [[137]13]. SAMseq, a non-parametric method, uses the Wilcoxon rank
   statistic and resampling procedure to identify differential expressed
   genes [[138]12]. edgeR becomes liberal for small sample sizes with
   default settings to a certain extent and keeps a better balance between
   speed and accuracy than DESeq [[139]14, [140]51]. Limma and DESeq are
   described as the safest choices in some cases in terms of the
   consistency of differentially expressed genes when analyzing the
   complete data. SAMseq can identify a large number of genes that are
   usually not detected with the other methods [[141]53]. The combination
   of these methods can compensate for the shortcomings of a single
   method. The results of differential expression analysis obtained using
   these four methods are robust. We chose the co-expression networks to
   analysis cancer-related genes on the system level. Then we applied the
   simple framework of HO-GSVD to identify co-expression modules, which
   has been proved to be a simple, parameter-free and reproducible method.
   The results in the functional enrichment analysis showed the all the
   identified modules may play regulatory roles in multiple cancers. We
   found that the enriched BPs and pathways in the three clusters are
   associated with the cancer hallmarks including deregulating cellular
   energetics and inducing angiogenesis [[142]54]. The enriched pathways
   in the cluster 5 and 8 are related to deregulating cellular energetics,
   including focal adhesion, PI3K-Akt signaling pathway and ECM-receptor
   interaction. Focal Adhesion Kinase through ECM promotes the activation
   of PI3K-AKT signaling [[143]55]. Then P13K-AKT signaling pathway
   increases glycolysis in metabolic processes [[144]19]. The enriched BPs
   in the cluster 2 are related to inducing angiogenesis, such as platelet
   degranulation, positive regulation of angiogenesis, leukocyte migration
   and angiogenesis. Angiogenesis plays an important role during
   macroscopic and microscopic neoplastic progression [[145]54]. The
   microenvironment-related BPs in the cluster 7 are closely associated
   with anti-cancer immunity. Not surprisingly, we found that the genes in
   the modules were implicated in diseases, such as neoplastic processes,
   immune system diseases, cancer or viral infections and
   neovascularization. Some miRNA and transcription factors were also
   enriched in the modules. The mir-200 family has been identified as a
   biomarker in cancer [[146]56]. Mir-199a has been associated with
   various cancers, including kidney, breast, bladder, bronchial squamous
   and stomach cancers [[147]57–[148]61]. Studies indicate that mir-29
   family members may cooperatively or separately contribute to the
   development of breast and colon cancer [[149]62]. FOXO4 is reported to
   suppress tumor proliferation and metastasis in stomach carcinoma, and
   its clinical significance is observed in multiple cancers
   [[150]63–[151]65]. The NFAT-related roles have been studied in the
   tumor microenvironment [[152]33]. MAZ promotes the tumor progression in
   glioblastoma, breast cancer, prostate cancer and hepatocellular cancer
   [[153]31, [154]32]. The regulatory and clinical significance of E2A are
   studied in colorectal cancer [[155]66]. Our study indicates that the
   genes in the identified modules play a cooperate role in multiple
   cancers through miRNA and transcription factors.

   The sub-modules in the cluster 5 were analyzed for further
   understanding the regulatory mechanism in multiple cancers. We found
   that some genes in sub-modules play vital roles in multiple cancers,
   such as tumor suppressors and oncogenes. There was only one pathway in
   the largest sub-module M1 but the sub-module was highly enriched in
   cancer-related genes. Hub genes enriched in M2 have been identified to
   play a prognostic role in multiple cancers. Genes in M3 are involved in
   metastasis and prognosis. These sub-modules may help in discover new
   genes related with multiple cancers.

   The identified modules showed different predictive power in prognosis
   of cancers. Some modules were able to predict survival in six cancers,
   such as cluster 2 and cluster 5. These modules were not significantly
   related to survival in BRCA and PRAD. Sample size and the number of
   deaths are two important factors in survival analysis. There exists the
   unbalanced number of genes involved in multiple cancers. These factors
   may result in the poor performance of prognostic capability. Some
   sub-modules in the cluster 5 showed the statistical significance of
   survival analysis in at least three cancers, such as M1, M2, M4 and M9.
   Other sub-modules also revealed the prognostic capability in different
   cancers. The prognosis of cluster 5 was validated in other cancers.
   Importantly, the cluster 5 could be prognostic in 12 cancer types in
   total.

   In summary, we identified the functional modules and co-expression
   networks for the systematic analysis of the carcinogenic properties and
   regulatory mechanisms of multiple cancer. Further analysis indicates
   that these co-expression modules have a strong ability in predicting
   the survival of cancer patients. The results will be helpful in
   identifying new targets associated with cancer treatment. Our results
   will be valuable in cancer-related gene function annotation, and for
   the evaluation of cancer patients’ prognosis.

MATERIALS AND METHODS

Differential gene expression analysis

   We downloaded TCGA gene expression data from the Gene Expression
   Omnibus under accession number [156]GSE62944. Eight types of cancers
   were used to construct the co-expression network, including BRCA, KIRC,
   PRAD, THCA, LUAD, BLCA, COAD and STAD. After the construction of the
   co-expression network, we also analyzed UCEC, HNSC, READ and LIHC for
   further assessing the functional significance of the modules
   ([157]Supplementary Table 9).

   Four methods were used for differential gene expression analysis,
   including limma, edgeR, DESeq, and SAMseq. The sets of differentially
   expressed genes from each cancer were pooled together to increase the
   statistical power.

   The genes with multiple test corrected p-value < 0.05 were considered
   to be significantly differentially expressed. For SAMseq, the number of
   permutations used to estimate false discovery rates was set to 200 and
   the number of resamples used to construct test statistic was set to
   100. For limma, we used the TMM method of the edgeR package and the
   voom transformation.

   We proposed a three-step procedure for selecting the common genes
   shared by the eight cancer types (BRCA, KIRC, PRAD, THCA, LUAD, BLCA,
   COAD, STAD). First, genes which were significantly differentially
   expressed in at least four types of cancers analyzed by one method were
   clustered into one gene set. Second, genes shared by all four gene sets
   obtained by four methods were selected as a candidate gene set. At
   last, the genes that were not mapped to the corresponding DAVID genes
   were removed. We detected genes shared by UCEC, HNSC, READ, and LIHC
   cancers by following the rules similar to the above procedure, but with
   modification. Three methods (edgeR, SAMseq, and Limma) were used and
   significant genes shared by at least two cancer types with one method
   were clustered in one gene set.

Co-expression network construction

   The gene expression data expressed as fragments per kilobase of exon
   per million reads mapped (FPKM) and normalized on log2 scale were
   represented by matrices. We then used HO-GSVD to extract common modules
   through matrix decomposition. The basic idea behind this approach is
   using spectral decomposition to identify common structures
   (subnetworks) in multiple datasets.

   For the tth cancer
   [MATH:
   <mrow><mrow><mo>(</mo><mrow><mi>t</mi><mo>=</mo><mn>1</mn><mo>,</mo><mt
   ext> </mtext><mn>2</mn><mo>,</mo><mtext> </mtext><mo>…</mo><mo>,</mo><m
   text> </mtext><mi>T</mi></mrow><mo>)</mo></mrow></mrow> :MATH]
   , the input data
   [MATH:
   <mrow><msub><mi>D</mi><mi>t</mi></msub><mo>∈</mo><msup><mi>R</mi><mrow>
   <msub><mi>n</mi><mi>t</mi></msub><mo>×</mo><mi>p</mi></mrow></msup><mte
   xt> </mtext></mrow> :MATH]
   is the real matrix represented by the rows denoted by the samples
   [MATH: <mrow><msub><mi>n</mi><mi>t</mi></msub></mrow> :MATH]
   and the columns denoted by the genes
   [MATH: <mrow><mi>p</mi><mo>.</mo></mrow> :MATH]
   The mathematical form of the HO-GSVD framework of
   [MATH: <mi>T</mi> :MATH]
   real matrices is given by
   [MATH:
   <mrow><msub><mi>D</mi><mn>1</mn></msub><mo>=</mo><msub><mi>U</mi><mn>1<
   /mn></msub><msub><mi>Σ</mi><mn>1</mn></msub><msup><mi>V</mi><mi>T</mi><
   /msup><mo>,</mo><mtext> </mtext><msub><mi>D</mi><mn>2</mn></msub><mo>=<
   /mo><msub><mi>U</mi><mn>2</mn></msub><msub><mi>Σ</mi><mn>2</mn></msub><
   msup><mi>V</mi><mi>T</mi></msup><mo>,</mo><mtext> </mtext><mo>…</mo><mo
   >,</mo><msub><mi>D</mi><mi>T</mi></msub><mo>=</mo><msub><mi>U</mi><mi>T
   </mi></msub><msub><mi>Σ</mi><mi>T</mi></msub><msup><mi>V</mi><mi>T</mi>
   </msup><mtext> </mtext></mrow> :MATH]
   (1)

   where
   [MATH:
   <mrow><msub><mi>U</mi><mi>t</mi></msub><mo>∈</mo><msup><mi>R</mi><mrow>
   <msub><mi>n</mi><mi>t</mi></msub><mo>×</mo><mi>p</mi></mrow></msup></mr
   ow> :MATH]
   is composed of normalized left basis vectors,
   [MATH:
   <mrow><msub><mi>Σ</mi><mi>t</mi></msub><mo>∈</mo><msup><mi>R</mi><mrow>
   <mi>p</mi><mo>×</mo><mi>p</mi></mrow></msup><mtext> </mtext></mrow>
   :MATH]
   is a non-negative diagonal matrix consisted of the higher-order
   generalized singular values and
   [MATH:
   <mrow><mi>V</mi><mo>∈</mo><msup><mi>R</mi><mrow><mi>p</mi><mo>×</mo><mi
   >p</mi></mrow></msup></mrow> :MATH]
   is composed of normalized right basis vectors. As the previous method
   [[158]8], the right basis vectors
   [MATH: <mi>V</mi> :MATH]
   were defined as the solution of the eigensystem of the matrix
   [MATH: <mi>S</mi> :MATH]
   :
   [MATH: <mrow><mi>S</mi><mo>=</mo><mfrac><mn>1</mn><mrow><mi>T</mi><mo
   stretchy="false">(</mo><mi>T</mi><mo>−</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></mfrac><munderover><mstyle
   mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mrow><mi>t</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mi>T</mi><mo>−</mo><mn>1</mn></mrow></munderover><mun
   derover><mstyle mathsize="140%"
   displaystyle="true"><mo>∑</mo></mstyle><mrow><mi>r</mi><mo>=</mo><mi>t<
   /mi><mo>+</mo><mn>1</mn></mrow><mi>T</mi></munderover><mo
   stretchy="false">(</mo><msub><mi>E</mi><mi>t</mi></msub><msubsup><mi>E<
   /mi><mi>r</mi><mrow><mo>−</mo><mn>1</mn></mrow></msubsup><mo>+</mo><msu
   b><mi>E</mi><mi>r</mi></msub><msubsup><mi>E</mi><mi>t</mi><mrow><mo>−</
   mo><mn>1</mn></mrow></msubsup><mo
   stretchy="false">)</mo><mtext> </mtext></mrow> :MATH]
   (2)

   where the covariance matrix
   [MATH:
   <mrow><msub><mi>E</mi><mi>t</mi></msub><mo>=</mo><msubsup><mi>D</mi><mi
   >t</mi><mi>T</mi></msubsup><msub><mi>D</mi><mi>t</mi></msub></mrow>
   :MATH]
   can be treated as the co-expression matrix. Importantly, the common
   HO-GSVD subspace is spanned by the right basis vectors
   [MATH: <mi>V</mi> :MATH]
   . Then we used the right basis vectors to select common structures
   shared by all cancer types. The advantage of this approach for
   identification of co-expressed structures across cancer types is able
   to reproduce accurately common structures without relying on any
   predefined parameters.

   Based on the eigenvalues of the eigen-decomposition of
   [MATH: <mi>S</mi> :MATH]
   , we chose the top ten eigenvectors to identify co-expression gene
   modules. A small part of the genes is supposed to have significantly
   similarity with each other and these genes can be regarded as the
   co-expression genes. Similarly to the strategy used in [[159]67], the
   selected eigenvectors were decomposed into two components and modeled
   using Gaussian Mixture Model (GMM). In addition, the small weight
   component of bimodal distribution can identify the small proportional
   genes with high similarity. We calculated the tail area-based false
   discover rate (q-value) using the R package fdrtool to identify
   co-expression genes. Then we clustered the genes into the co-expression
   gene module through the cut-off of q-value (0.001).

Enrichment analysis

   Functional enrichment analysis of the co-expressions gene modules was
   performed using DAVID [[160]15]. Benjamini-Hochberg method was used for
   the multiple test correction of p-values. KEGG pathways and Gene
   Ontology terms with Benjamini-Hochberg corrected p-value less than 0.05
   were considered as significantly enriched. To validate the functional
   significance of modules, only gene modules involved in significant
   pathways were retained for further analysis. The cell type enrichment
   was analyzed using Cten [[161]17]. Disease association analysis was
   conducted using the gene set analysis toolkit WebGestalt. In addition,
   the enrichment of miRNAs and transcription factors was performed using
   WebGestalt [[162]16].

Functional network analysis

   We used module genes to construct the functional interaction network
   within the Cytoscape FI plugin [[163]34]. Linker genes were used to
   maximize the number of connected genes in the module. Then we clustered
   the FI network into small modules. The functions of small modules were
   analyzed for GO terms and pathway enrichment.

Survival analysis

   We downloaded overall survival data from the Firehose. The patients
   with multiple repeated samples were not included in the survival
   analysis. The R package ‘survival’ was used for the Cox proportional
   hazards (PH) model and Kaplan-Meier survival analysis. To detect
   prognostic genes, we calculated P-values based on the raw Wald test for
   the Cox PH model. Based on the expression of module genes, we
   classified tumor samples into clusters through NMF [[164]68]. We then
   estimated the significance of differences between different clusters
   with patients’ survival using log-rank test of the Kaplan-Meier method.
   We also performed survival analysis on small modules generated from the
   FI network.

SUPPLEMENTARY MATERIALS FIGURES AND TABLES

   [165]oncotarget-08-112928-s001.pdf^ (4.1MB, pdf)
   [166]oncotarget-08-112928-s002.xlsx^ (12.2KB, xlsx)
   [167]oncotarget-08-112928-s003.xlsx^ (20.4KB, xlsx)
   [168]oncotarget-08-112928-s004.xlsx^ (11.5KB, xlsx)
   [169]oncotarget-08-112928-s005.xlsx^ (14.7KB, xlsx)

Abbreviations

   DAVID
          the Database for Annotation, Visualization, and Integrated
          Discovery

   TCGA
          The Cancer Genome Atlas

   KEGG
          Kyoto Encyclopedia of Genes and Genomes

Footnotes

   Author contributions

   WY collected and analyzed the data, and wrote the manuscript; XZ and WY
   conceived the idea. BZ participated in this work and analyzed the data.
   SZ participated in this work, supervised the whole work and revised the
   manuscript. YW, WZ and XZ revised the manuscript extensively. All
   authors read and approved the final manuscript.

   CONFLICTS OF INTEREST

   The authors declare no potential conflicts of interest.

   FUNDING

   This work was supported in part by the National Basic Research Program
   of China (973 Program) under Grant 2014CB340404, in part by the
   National Science Foundation of China under Grant 61471267, Grant
   61373105 and Grant 61672422, in part by the Fundamental Research Funds
   for the Central Universities, and in part by National Institutes of
   Health [1U01CA166886, 1U01AR069395 and 1R01GM123037].

REFERENCES