Abstract
The rapid accumulation of cancer-related data owing to high-throughput
technologies has provided unprecedented choices to understand the
progression of cancer and discover functional networks in multiple
cancers. Establishment of co-expression networks will help us to
discover the systemic properties of carcinogenesis features and
regulatory mechanisms of multiple cancers. Here, we proposed a
computational workflow to identify differentially co-expressed gene
modules across 8 cancer types by using combined gene differential
expression analysis methods and a higher-order generalized singular
value decomposition. Four co-expression modules were identified; and
oncogenes and tumor suppressors were significantly enriched in these
modules. Functional enrichment analysis demonstrated the significantly
enriched pathways in these modules, including ECM-receptor interaction,
focal adhesion and PI3K-Akt signaling pathway. The top-ranked miRNAs
(mir-199, mir-29, mir-200) and transcription factors (FOXO4, E2A, NFAT,
and MAZ) were identified, which play an important role in deregulating
cellular energetics; and regulating angiogenesis and cancer immune
system. The clinical significance of the co-expressed gene clusters was
assessed by evaluating their predictability of cancer patients’
survival. The predictive power of different clusters and subclusters
was demonstrated. Our results will be valuable in cancer-related gene
function annotation and for the evaluation of cancer patients’
prognosis.
Keywords: co-expression network, prognosis, HO-GSVD, gene module,
cancer
INTRODUCTION
The rapid accumulation of cancer-related data owing to high-throughput
technologies has provided unprecedented choices to understand the
progression of cancer and discover functional networks in multiple
cancers. The vast majority of cancer-related studies have focused on a
single cancer, but always ignored the common traits across different
cancer types. Different cancers usually share common hallmarks, such as
evading growth suppressors, resisting cell death and inducing
angiogenesis. Moreover, the methods based on biological networks
including gene co-expression networks, metabolic networks,
protein-protein interaction networks and genetic regulatory networks
can infer regulatory mechanisms related to biological processes. The
network-based methods to search biological processes related to cancer
hallmarks will help us in identifying the characterizations of tumor
biology. Few studies focus on genome-scale networks across different
cancer types. Thus, the network analysis may help us to unveil common
traits involved in multiple cancers.
The majority of the network-based methods are performed to identify
distinct patterns within one cancer [[34]1, [35]2]. Compared with only
analyzing one cancer, several methods have been used to identify common
patterns shared by two or more cancers. Zhang et al. used a network
mining algorithm to build tightly connected gene co-expression networks
from the microarray datasets spanning 33 cancer types. Their results
indicate that the commonly recognized characteristics of cancers are
supported by highly coordinated transcriptomic activities [[36]3]. Yang
et al. did the weighted correlation network analysis (WGCNA) to
highlight common properties of prognostic genes in four cancer types
[[37]4]. Li et al. described a network method to analyze the driver
mutation by integrating both cancer genomes and transcriptomes and
identified a significant correlation of genotype to phenotype in six
solid tumors [[38]5].
Several computationally efficient methods have been developed for
construction of the networks, such as Generalized Singular Value
Decomposition (GSVD) [[39]6] and higher-order GSVD (HO-GSVD) [[40]7].
GSVD is used for identifying common structures across two conditions
[[41]6]. The analysis based on HO-GSVD can extract common gene modules
in two or more conditions [[42]7–[43]9]. Ponnapalli et al. first
proposed HO-GSVD to analyze common structures shared by multiple
datasets [[44]7]. Xiao et al. developed a new HO-GSVD method for
analyzing common and tissue-specific modules from seven rat tissues and
four human brain regions [[45]8]. Wang et al. applied a simple
mathematical framework of HO-GSVD for analysis of multiple tissues
[[46]9]. These studies indicate that HO-GSVD is a valuable method in
common gene pattern discovery among different tissues. Motivated by
this approach, we applied HO-GSVD to pan-cancer analysis in this study.
To our best of knowledge, this approach has not been used for
pan-cancer research.
Differential gene expression analysis has been widely used for
identifying differentially expressed genes between conditions. The
commonly used methods include edgeR [[47]10], limma [[48]11], SAMseq
[[49]12] and DESeq [[50]13]. Most studies only use one method to
analyze gene expression patterns. Soneson et al. compared the commonly
used methods for differential expression analysis and found that no
single method is optimal under all circumstances and the method of
choice in a particular situation depends on the experimental conditions
[[51]14].
In this study, we applied the above four methods for differential
expression analysis to get common genes in eight types of cancers
through a three-step procedure. First, the differentially expressed
gene set shared by four caner types was selected using one method.
Second, the commonly expressed gene set selected by four methods was
set as a candidate gene set. We then converted candidate gene IDs to
the corresponding DAVID gene IDs and removed the non-mapped genes. The
gene expression matrices of multiple cancers were decomposed by the
HO-GSVD method for identifying the common modules across different
cancers. We chose the vectors with top eigenvalues in the right basis
matrix as candidates of co-expression genes. The co-expression genes
were selected based on the assumption that a small proportion of genes
in candidate vectors is highly similar. We used the DAVID tools to
validate the functional significance of the modules. The gene modules
involved in the significant pathways were retained for further
analysis. Multiple types of enrichment analysis were performed,
including gene ontology terms, KEGG pathways, cell type enrichment,
disease association and miRNA and transcription factor enrichment
analysis [[52]15–[53]17]. The functional interaction networks were
constructed for the identified modules. The survival analysis was then
applied for the prognostic properties of modules across cancers. By
following the procedure shows in the Figure [54]1, we found that the
co-expression gene modules are enriched with oncogenes and tumor
suppressors, which play an important regulatory role in multiple
cancers. The genes in the main- and sub-modules are closely associated
with the prognosis of multiple cancers.
Figure 1. Overview of the workflow.
[55]Figure 1
[56]Open in a new tab
There are three main steps including gene differential expression
analysis, identification of co-expression modules and significant
enrichment analysis.
RESULTS
Identifying co-expression gene modules
We analyzed differentially expressed genes using the raw count data and
constructed co-expression networks using the FPKM count data. 5229
differentially expressed genes were detected in breast invasive
carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), prostate
adenocarcinoma (PRAD), thyroid carcinoma (THCA), lung adenocarcinoma
(LUAD), bladder urothelial carcinoma (BLCA), colon adenocarcinoma
(COAD) and stomach adenocarcinoma (STAD) using limma, edgeR, DESeq, and
SAMseq methods (Figure [57]2A). 4973 genes were used for construction
of co-expression networks and pathway enrichment analysis. Five
clusters with significant pathways were identified ([58]Supplementary
Table 1). The significant pathways enriched in the smallest cluster 10
were almost identical to the pathways in the cluster 5. Then five
clusters except cluster 10 were kept for further analysis. The highly
specific genes were included in each cluster. But there were a few
genes shared by these clusters (Figure [59]2B). The results showed that
these co-expression modules with specific genes may share similar
biological information. To comprehensively investigate the
characterization of the identified co-expression modules, we applied
multiple types of enrichment analysis.
Figure 2. Identification of differentially expressed genes and co-expression
modules.
[60]Figure 2
[61]Open in a new tab
(A) Venn diagram showing the overlap between differentially expressed
genes selected by the four methods. (B) Venn diagram showing the
overlap between the genes in the four co-expression modules.
We applied functional enrichment analysis to identify the KEGG pathways
(Figure [62]3A–3D) and Gene Ontology terms ([63]Supplementary Table 2)
in the four clusters. For cluster 5, there are only six significant
KEGG pathways as shown in the Figure [64]3B. ECM-receptor interaction,
focal adhesion and PI3K-Akt signaling pathways were presented in the
cluster 8 and cluster 5, consistent with the previous results obtained
by pan-cancer analysis [[65]18]. Moreover, these pathways are related
to the deregulation of cellular energetics [[66]19]. Abnormal ECM
affects cancer progression by directly promoting cellular
transformation and metastasis [[67]20]. Focal adhesion kinase, a
protein tyrosine kinase, regulates cellular adhesion, motility,
proliferation and survival in cancer cells, thereby promoting cancer
progression and metastasis [[68]21]. PI3K-Akt signaling pathway has
been reported as one of the most important pathways in cancer
metabolism and growth [[69]22]. Figure [70]3C and Figure [71]3D show
the enriched pathways in the cluster 7 and 8, respectively. Only focal
adhesion was shared by the cluster 2, 5 and 8 (Figure [72]3A, 3B, 3D).
These results indicated that the enriched pathways in the clusters are
associated with tumorigenesis. To further evaluate whether the
functional features in these clusters are also important in other
cancers, we did analysis for another four cancers, including uterine
corpus endometrial carcinoma (UCEC), head and neck squamous cell
carcinoma (HNSC), rectum adenocarcinoma (READ) and liver hepatocellular
carcinoma (LIHC) ([73]Supplementary Figure 1). 8619 genes were
identified. We converted the candidate gene IDs to the corresponding
DAVID gene IDs. If two IDs were corresponding to the same gene, we
removed the redundant one. Therefore, 490 redundant genes were removed
and 8129 genes used for further analysis. We got two clusters with
significant pathways ([74]Supplementary Table 3) using HO-GSVD
approach. We found that cluster 6 in the four cancers was highly
enriched in focal adhesion
[MATH: (p6=5
.7e−11) :MATH]
, ECM-receptor interaction
[MATH: (p6=2
.6e−10) :MATH]
and PI3K-Akt signaling
[MATH: (p6=2
.7e−5) :MATH]
pathways. For the eight cancers, these clusters partially shared some
KEGG pathways. We observed some similar patterns in the following
results of the GO biological process (BP) enrichment analysis. The
top-ranked five enriched BPs in the cluster 5 included collagen fibril
organization
[MATH: (p5=3
.5e−6) :MATH]
, extracellular matrix organization
[MATH: (p5=4
.5e−6) :MATH]
, collagen catabolic process
[MATH: (p5=4
.9e−5) :MATH]
, skeletal system development
[MATH: (p5=3
.2e−3) :MATH]
and cell adhesion
[MATH: (p5=1
.3e−2) :MATH]
. The top three enriched BPs in the cluster 8 were the same as those in
the cluster 5. It was reported that the collagen fibrils reorganization
within an extracellular matrix facilitated tumorigenesis and invasion
[[75]23]. We found that the BPs in the cluster 7 are mainly involved in
T cell and immune response [[76]24]. The highly enriched BPs in the
cluster 2 are related to angiogenesis. In addition to cluster 7, these
commonly enriched BPs are associated with extracellular matrix
organization and cell adhesion. Similarity to the KEGG pathways, the
corresponding enriched BPs in these clusters also play an important
role in cancers. As expected, all the above results indicated that the
identified clusters with biological functions related to cellular
energetics, angiogenesis and anti-cancer immunity are important to the
cancer development.
Figure 3. Enrichment of co-expression modules.
[77]Figure 3
[78]Open in a new tab
The four bar charts display the pathway enrichment results of cluster 2
(A), cluster 5 (B), cluster 7 (C) and cluster 8 (D). The pathway shared
by at least two cluster are colored with light green. The cell-type
enrichment analysis of cluster 8 (E) and other clusters
([79]Supplementary Figure 2) is shown as –log10 (Benjamini-Hochberg
corrected p-value). The overlap of the members in the top 5-ranked
miRNA families is shown as the Venn diagram (F).
To further explore the cell specificity and disease association of the
differentially expressed genes, we analyzed cell type enrichment (Cten)
and disease association for the genes in four clusters
([80]Supplementary Tables 4, [81]5). Enriched cell types in these
clusters were closely associated with all eight cancer types, including
BRCA, KIRC, PRAD, THCA, LUAD, BLCA, COAD, STAD. The top three cell
types for the cluster 8 included cardiac myocytes (score as –log10
(Benjamini-Hochberg corrected p-value), score = 57), smooth muscle
(score = 55) and adipocyte (score = 38) (Figure [82]3E). Adipocytes can
support tumorigenesis by secreting adipokines and producing energy
[[83]25]. The immune cells in the cluster 7 and cluster 8 included
CD14+ Monocytes, CD33+ Myeloid and BDCA4+ Dentritic Cells. CD14+
Monocytes (score = 4.88 in the cluster 7 and score = 7.81 in cluster 8)
has been reported to affect dendritic cell differentiation and T-cell
function in cancer patients [[84]26]. These results indicated that the
enriched cell types in the clusters were associated with tumor
progression. Moreover, disease association analysis suggested that the
genes in the four clusters were mainly enriched in cancer-related
diseases with high significance ([85]Supplementary Table 5,
[MATH:
p≤8.39e−8 :MATH]
). All the above results further suggested that the four clusters which
play important roles in biological processes related to tumor
progression and immunity are associated with multiple cancers.
To further demonstrate the potential regulatory mechanisms, we also
analyzed miRNAs and transcription factors using the WebGestalt tool
([86]Supplementary Tables 6, [87]7). Three of the top ranked miRNA
families were identical in the cluster 2, 5 and 8, including mir-199,
mir-29 and mir-200 (Figure [88]3F). Mir-199a was only presented in the
cluster 2 and cluster 5
[MATH: (p2=0
.004, p<
mn>5=0.0035) :MATH]
. Mir-29a, mir-29b and mir-29c
[MATH: (p5=0
.0009, p
8=5.88e−5) :MATH]
in the cluster 5 and 8 are the mature members of the mir-29 family.
Three of the main mir-200 family members
[MATH: (p2=0
.0004, p
8=5.88e−5) :MATH]
in the cluster 2 and 8 have been reported to associate with tumor
progression [[89]27, [90]28]. The let-7 family of miRNAs
[MATH: (p5=0
.0009) :MATH]
in the cluster 5 is involved in tumorigenesis [[91]29]. We also found
that some clusters were high significantly enriched with transcription
factors. FOXO4 gene was presented in the cluster 2, 5 and 7
[MATH: (p2=2
.82e−9, p5=4
.29e−5, p7=0.
0027) :MATH]
[[92]30]. The clusters 2, 5 and 8 shared three transcription factors,
including E2A
[MATH: (p2=3
.66e−8, p5=1
.06e−9, p8=7.
38e−15) :MATH]
, NFAT
[MATH: (p2=8
.79e−11, p5=2.38e−5, p8=3
.51e−17) :MATH]
, and MAZ
[MATH: (p2=2
.47e−8, p5=2
.40e−6, p8=2.
01e−15) :MATH]
. E2A, FOXO4, NFAT and MAZ play important roles in tumor cell growth
and metastasis in multiple cancers, such as colorectal cancer, breast
cancer, prostate cancer and lung cancer [[93]31–[94]33]. Taken
together, these results revealed that common regulators shared by the
four clusters may cooperate with each other to regulate the biological
processes of multiple cancers.
Functional network analysis of individual co-expression modules
To further analyze the characterization of single identified modules,
we constructed the functional interaction networks. These networks were
built based on human PPIs, fly PPIs, worm PPIs, yeast PPIs, domain
interaction, Lee's Gene Expression, Prieto's Gene Expression, GO BP
sharing and PPIs from GeneWays [[95]34]. We found that the
corresponding proportions of the genes linked to the functional
interaction (FI) networks in the four modules were 80.63%, 83.78%,
76.60% and 87.20%, respectively ([96]Supplementary Table 1). This
result indicated that the four clusters were highly conserved at the
functional protein level. According to the results from the above
subsections, all clusters were closely associated with the eight
cancers. Among of them, most of the enriched pathways in the cluster 5
were consistent with the previous pan-cancer outcomes [[97]18],
therefore, we did further functional analysis of the cluster 5-based FI
network as shown in the Figure [98]4.
Figure 4. Network visualization of the cluster 5.
[99]Figure 4
[100]Open in a new tab
The functional interaction network consists of eight sub-modules marked
with different colors. The genes and link genes in the modules are
represented as circles and diamonds, respectively.
The FI network was clustered into small modules and eight modules were
annotated as M1~M8 ([101]Supplementary Table 8). We performed the
enrichment analysis for GO BP and found that some high significantly
BPs (FDR < 0.001) were consistent with the above results. Extracellular
matrix disassembly, collagen catabolic process, extracellular matrix
organization, collagen fibril organization, skeletal system development
and cell adhesion were enriched in the M2. The BPs in the M7 are
associated with the negative regulation of extracellular matrix
disassembly and endothelial cell migration. Endothelial cell migration
has been reported to contribute the entry of cancer cells into the
circulatory system [[102]35]. There were two BPs in the M5, including
O-glycan processing and protein O-linked glycosylation. It has been
reported that alterations in glycosylation impacted cell cycle and may
support neoplastic progression [[103]36]. Interestingly, M1 had only
one enriched pathway, that is, the notch signaling pathway. Notch
signaling pathway plays an important in regulating stem cell
self-renewal and the pathogenesis of breast cancer [[104]37]. The most
enriched pathway in the M2 was ECM-receptor interaction. Phenylalanine
metabolism was the top enriched pathway in the M3. Nicotinic
acetylcholine receptor signaling pathway in M4 and amino acid
metabolism pathway in M6 are associated with cancer growth
[[105]38–[106]40]. These results indicated that distinct BPs enriched
in different sub-modules contribute to the development of cancers.
We then defined an individual gene in the modules with at least ten
neighbors as one hub gene, and obtained 20 hub genes, including 15
linker genes and 5 module genes. Over half of the hub genes were
enriched in the M2, including FN1, COL1A1, COL1A2, COL3A1 and COL6A3.
FN1 is a FDA-approved drug target gene against cancer [[107]41,
[108]42]. COL1A1, COL1A2, COL3A1 and COL6A3 are members of the collagen
family. These five genes were enriched in pathways such as ECM-receptor
interaction, protein digestion and absorption, integrin signaling
pathway and PI3K-Akt signaling pathway. DES, THBS2, PLAU and POSTN in
the M2 have been associated with multiple cancers [[109]43–[110]50].
DES encodes a muscle-specific class III intermediate filament and is
related with colorectal cancer, breast cancer, prostate cancer, kidney
cancer and lung cancer [[111]43–[112]46]. As the member of THBS family,
THBS2 plays an important role in cancer progression [[113]47]. PLAU is
involved in cancer cell migration [[114]48]. The protein encoded by
POSTN has been reported to function in cancer stem cell [[115]49,
[116]50].
Survival analysis of gene co-expression modules
To further test the clinical significance of the co-expressed gene
clusters, we did survival analysis for the eight types of cancers. The
patients were clustered into different groups using non-negative matrix
factorization (NMF). The cluster 2 and cluster 5 were able to predict
patient survival of various cancers, including KIRC, THCA, LUAD, BLCA,
COAD and STAD (Figure [117]5A, [118]Supplementary Figure 3). The
cluster 7 and cluster 8 could predict survival of LUAD, KIRC and BLCA
cancer patients ([119]Supplementary Figures 4, [120]5). The cox models
were used to identify prognostic genes in various cancers. The distinct
proportions of prognostic genes in the eight cancers were observed in
the four clusters (Figure [121]5B). Among all of the clusters, KIRC
contains the largest proportion of prognostic genes. We also carried
out the survival analysis on the eight sub-modules in the cluster 5, as
well as the module 9 containing the genes that weren't connected to the
networks in the cluster 5 ([122]Supplementary Tables 8, [123]9). The
patients were grouped using the k-means method when the number of genes
in the sub-module was one. In the survival analysis of individual
sub-modules, all sub-modules showed statistically significant
differences in survival probabilities (Figure [124]5C). M1, M2, M4 and
M9 were significantly associated with the patients’ survival in at
least three cancers. But only one sub-module showed a significant
difference in the survival analysis of BRCA and none of the sub-modules
was significant in PRAD. Analysis of STAD showed that individual
sub-modules couldn't predict these patients’ survival but the
corresponding cluster 5 had the predictive power. The results of the
sub-modules in the survival analysis were mostly consistent with the
cluster 5. All these results revealed that the sub-modules associated
with known biological processes cooperate with each other to contribute
to the prognosis in multiple cancers.
Figure 5. Survival analysis of gene co-expression modules for 8 types of
cancers.
[125]Figure 5
[126]Open in a new tab
(A) Survival analysis of the cluster 5 using Kaplan-Meier curves. The
calculation of the log-rank p-values is based on the split of patient
groups using non-negative matrix factorization. The number of patients
in each group is also labeled in each panel. (B) The distributions of
prognostic genes for the eight cancer types in each cluster. The y-axis
represents the gene proportion of each cancer in the corresponding
cluster. The eight cancers are marked with different colors. (C) The
differential significance distributions of nine sub-modules for each
cancer. The y-axis represents the log-rank p-value and the x-axis is
annotated with the sub-modules grouped by the network clustering. The
solid line represents –log[2](0.05).
In order to further validate the prognosis of cluster 5 in multiple
cancers, we performed survival analysis on HNSC, READ, LIHC and UCEC.
The results showed that the cluster 5 was able to predict patient
survival in HNSC, LIHC and UCEC (Figure [127]6A), but not in READ. The
corresponding sub-modules obtained from the functional network analysis
had statistically significant differences in survival analysis of the
four cancers (Figure [128]6B). M2 had predictive power for patient
survival in three cancers. We also applied survival analysis on other
eight cancers, including glioblastoma multiforme (GBM), brain lower
grade glioma (LGG), ovarian serous cystadenocarcinoma (OV), skin
cutaneous melanoma (SKCM), adrenocortical carcinoma (ACC), cervical
squamous cell carcinoma and endocervical adenocarcinoma (CESC), kidney
renal papillary cell carcinoma (KIRP) and lung squamous cell carcinoma
(LUSC). As shown in the [129]Supplementary Figure 6, the cluster 5 can
predict patients’ survival of LGG, SKCM, ACC and KIRP cancers. These
results confirmed the prognostic ability of cluster 5 in 7 cancer
types. We also applied survival analysis on the above twelve cancer
types for the cluster 2, 7 and 8. All of the three clusters can predict
patients’ survival of LIHC, UCEC, LGG, SKCM, ACC and KIRP cancers.
Additionally, cluster 2 can predict the survival of HNSC and CESC
cancer patients.
Figure 6. Survival analysis of gene co-expression modules for HNSC, LIHC,
READ and UCEC.
[130]Figure 6
[131]Open in a new tab
(A) Survival analysis of the cluster 5 using Kaplan-Meier curves for
HNSC, LIHC, READ and UCEC. (B) The differential significance
distributions of nine sub-modules for each cancer.
DISCUSSION
In this study, we applied four methods to analyze the data and obtained
4973 differentially expressed genes shared by the eight cancers. We
constructed the co-expression network using HO-GSVD and identified
modules and associated pathways. Some of these pathways have been
reported in the pan-cancer analysis [[132]18], such as focal adhesion,
PI3K-Akt signaling pathway, ECM-receptor interaction and Systemic lupus
erythematosus. We constructed the functional interaction networks for
further analyzing individual co-expression modules. These sub-modules
were enriched with different pathways and cooperated with each other
across cancers. We found that these modules are associated with
patients’ survival in multiple cancers. In addition, the individual
sub-modules in various cancers had different prognostic capability.
Gene differential expression analysis is a commonly used method for
identifying differentially expressed genes without considering the
relationship between genes. There is no consensus on the best method
for differential expression analysis [[133]14, [134]51]. Limma, edgeR,
DESeq, and SAMseq have been commonly used for gene differential
expression analysis. Each of them has advantages and disadvantages. The
limma method, based on linear models and the voom transformation, is
developed for analyzing RNA-seq data [[135]52]. The negative binomial
model and empirical Bayes methods in edgeR are used to detect
differential gene expression, and then the gene-wise dispersions is
estimated by conditional maximum likelihood [[136]10]. DESeq models the
count data using the negative binomial distribution and estimates the
mean-variance relationship of each gene. In contrast to edgeR, it
allows a widely data-driven parameter in the statistical test
[[137]13]. SAMseq, a non-parametric method, uses the Wilcoxon rank
statistic and resampling procedure to identify differential expressed
genes [[138]12]. edgeR becomes liberal for small sample sizes with
default settings to a certain extent and keeps a better balance between
speed and accuracy than DESeq [[139]14, [140]51]. Limma and DESeq are
described as the safest choices in some cases in terms of the
consistency of differentially expressed genes when analyzing the
complete data. SAMseq can identify a large number of genes that are
usually not detected with the other methods [[141]53]. The combination
of these methods can compensate for the shortcomings of a single
method. The results of differential expression analysis obtained using
these four methods are robust. We chose the co-expression networks to
analysis cancer-related genes on the system level. Then we applied the
simple framework of HO-GSVD to identify co-expression modules, which
has been proved to be a simple, parameter-free and reproducible method.
The results in the functional enrichment analysis showed the all the
identified modules may play regulatory roles in multiple cancers. We
found that the enriched BPs and pathways in the three clusters are
associated with the cancer hallmarks including deregulating cellular
energetics and inducing angiogenesis [[142]54]. The enriched pathways
in the cluster 5 and 8 are related to deregulating cellular energetics,
including focal adhesion, PI3K-Akt signaling pathway and ECM-receptor
interaction. Focal Adhesion Kinase through ECM promotes the activation
of PI3K-AKT signaling [[143]55]. Then P13K-AKT signaling pathway
increases glycolysis in metabolic processes [[144]19]. The enriched BPs
in the cluster 2 are related to inducing angiogenesis, such as platelet
degranulation, positive regulation of angiogenesis, leukocyte migration
and angiogenesis. Angiogenesis plays an important role during
macroscopic and microscopic neoplastic progression [[145]54]. The
microenvironment-related BPs in the cluster 7 are closely associated
with anti-cancer immunity. Not surprisingly, we found that the genes in
the modules were implicated in diseases, such as neoplastic processes,
immune system diseases, cancer or viral infections and
neovascularization. Some miRNA and transcription factors were also
enriched in the modules. The mir-200 family has been identified as a
biomarker in cancer [[146]56]. Mir-199a has been associated with
various cancers, including kidney, breast, bladder, bronchial squamous
and stomach cancers [[147]57–[148]61]. Studies indicate that mir-29
family members may cooperatively or separately contribute to the
development of breast and colon cancer [[149]62]. FOXO4 is reported to
suppress tumor proliferation and metastasis in stomach carcinoma, and
its clinical significance is observed in multiple cancers
[[150]63–[151]65]. The NFAT-related roles have been studied in the
tumor microenvironment [[152]33]. MAZ promotes the tumor progression in
glioblastoma, breast cancer, prostate cancer and hepatocellular cancer
[[153]31, [154]32]. The regulatory and clinical significance of E2A are
studied in colorectal cancer [[155]66]. Our study indicates that the
genes in the identified modules play a cooperate role in multiple
cancers through miRNA and transcription factors.
The sub-modules in the cluster 5 were analyzed for further
understanding the regulatory mechanism in multiple cancers. We found
that some genes in sub-modules play vital roles in multiple cancers,
such as tumor suppressors and oncogenes. There was only one pathway in
the largest sub-module M1 but the sub-module was highly enriched in
cancer-related genes. Hub genes enriched in M2 have been identified to
play a prognostic role in multiple cancers. Genes in M3 are involved in
metastasis and prognosis. These sub-modules may help in discover new
genes related with multiple cancers.
The identified modules showed different predictive power in prognosis
of cancers. Some modules were able to predict survival in six cancers,
such as cluster 2 and cluster 5. These modules were not significantly
related to survival in BRCA and PRAD. Sample size and the number of
deaths are two important factors in survival analysis. There exists the
unbalanced number of genes involved in multiple cancers. These factors
may result in the poor performance of prognostic capability. Some
sub-modules in the cluster 5 showed the statistical significance of
survival analysis in at least three cancers, such as M1, M2, M4 and M9.
Other sub-modules also revealed the prognostic capability in different
cancers. The prognosis of cluster 5 was validated in other cancers.
Importantly, the cluster 5 could be prognostic in 12 cancer types in
total.
In summary, we identified the functional modules and co-expression
networks for the systematic analysis of the carcinogenic properties and
regulatory mechanisms of multiple cancer. Further analysis indicates
that these co-expression modules have a strong ability in predicting
the survival of cancer patients. The results will be helpful in
identifying new targets associated with cancer treatment. Our results
will be valuable in cancer-related gene function annotation, and for
the evaluation of cancer patients’ prognosis.
MATERIALS AND METHODS
Differential gene expression analysis
We downloaded TCGA gene expression data from the Gene Expression
Omnibus under accession number [156]GSE62944. Eight types of cancers
were used to construct the co-expression network, including BRCA, KIRC,
PRAD, THCA, LUAD, BLCA, COAD and STAD. After the construction of the
co-expression network, we also analyzed UCEC, HNSC, READ and LIHC for
further assessing the functional significance of the modules
([157]Supplementary Table 9).
Four methods were used for differential gene expression analysis,
including limma, edgeR, DESeq, and SAMseq. The sets of differentially
expressed genes from each cancer were pooled together to increase the
statistical power.
The genes with multiple test corrected p-value < 0.05 were considered
to be significantly differentially expressed. For SAMseq, the number of
permutations used to estimate false discovery rates was set to 200 and
the number of resamples used to construct test statistic was set to
100. For limma, we used the TMM method of the edgeR package and the
voom transformation.
We proposed a three-step procedure for selecting the common genes
shared by the eight cancer types (BRCA, KIRC, PRAD, THCA, LUAD, BLCA,
COAD, STAD). First, genes which were significantly differentially
expressed in at least four types of cancers analyzed by one method were
clustered into one gene set. Second, genes shared by all four gene sets
obtained by four methods were selected as a candidate gene set. At
last, the genes that were not mapped to the corresponding DAVID genes
were removed. We detected genes shared by UCEC, HNSC, READ, and LIHC
cancers by following the rules similar to the above procedure, but with
modification. Three methods (edgeR, SAMseq, and Limma) were used and
significant genes shared by at least two cancer types with one method
were clustered in one gene set.
Co-expression network construction
The gene expression data expressed as fragments per kilobase of exon
per million reads mapped (FPKM) and normalized on log2 scale were
represented by matrices. We then used HO-GSVD to extract common modules
through matrix decomposition. The basic idea behind this approach is
using spectral decomposition to identify common structures
(subnetworks) in multiple datasets.
For the tth cancer
[MATH:
(t=1, 2, …, T) :MATH]
, the input data
[MATH:
Dt∈R
nt×p :MATH]
is the real matrix represented by the rows denoted by the samples
[MATH: nt :MATH]
and the columns denoted by the genes
[MATH: p. :MATH]
The mathematical form of the HO-GSVD framework of
[MATH: T :MATH]
real matrices is given by
[MATH:
D1=U1<
/mn>Σ1VT<
/msup>, D2=<
/mo>U2Σ2<
msup>VT, …,DT=UT
ΣTVT
:MATH]
(1)
where
[MATH:
Ut∈R
nt×p :MATH]
is composed of normalized left basis vectors,
[MATH:
Σt∈R
p×p
:MATH]
is a non-negative diagonal matrix consisted of the higher-order
generalized singular values and
[MATH:
V∈Rp×p :MATH]
is composed of normalized right basis vectors. As the previous method
[[158]8], the right basis vectors
[MATH: V :MATH]
were defined as the solution of the eigensystem of the matrix
[MATH: S :MATH]
:
[MATH: S=1T(T−1)∑t=1<
/mn>T−1∑r=t<
/mi>+1T(EtE<
/mi>r−1+ErEt−
mo>1) :MATH]
(2)
where the covariance matrix
[MATH:
Et=DtTDt
:MATH]
can be treated as the co-expression matrix. Importantly, the common
HO-GSVD subspace is spanned by the right basis vectors
[MATH: V :MATH]
. Then we used the right basis vectors to select common structures
shared by all cancer types. The advantage of this approach for
identification of co-expressed structures across cancer types is able
to reproduce accurately common structures without relying on any
predefined parameters.
Based on the eigenvalues of the eigen-decomposition of
[MATH: S :MATH]
, we chose the top ten eigenvectors to identify co-expression gene
modules. A small part of the genes is supposed to have significantly
similarity with each other and these genes can be regarded as the
co-expression genes. Similarly to the strategy used in [[159]67], the
selected eigenvectors were decomposed into two components and modeled
using Gaussian Mixture Model (GMM). In addition, the small weight
component of bimodal distribution can identify the small proportional
genes with high similarity. We calculated the tail area-based false
discover rate (q-value) using the R package fdrtool to identify
co-expression genes. Then we clustered the genes into the co-expression
gene module through the cut-off of q-value (0.001).
Enrichment analysis
Functional enrichment analysis of the co-expressions gene modules was
performed using DAVID [[160]15]. Benjamini-Hochberg method was used for
the multiple test correction of p-values. KEGG pathways and Gene
Ontology terms with Benjamini-Hochberg corrected p-value less than 0.05
were considered as significantly enriched. To validate the functional
significance of modules, only gene modules involved in significant
pathways were retained for further analysis. The cell type enrichment
was analyzed using Cten [[161]17]. Disease association analysis was
conducted using the gene set analysis toolkit WebGestalt. In addition,
the enrichment of miRNAs and transcription factors was performed using
WebGestalt [[162]16].
Functional network analysis
We used module genes to construct the functional interaction network
within the Cytoscape FI plugin [[163]34]. Linker genes were used to
maximize the number of connected genes in the module. Then we clustered
the FI network into small modules. The functions of small modules were
analyzed for GO terms and pathway enrichment.
Survival analysis
We downloaded overall survival data from the Firehose. The patients
with multiple repeated samples were not included in the survival
analysis. The R package ‘survival’ was used for the Cox proportional
hazards (PH) model and Kaplan-Meier survival analysis. To detect
prognostic genes, we calculated P-values based on the raw Wald test for
the Cox PH model. Based on the expression of module genes, we
classified tumor samples into clusters through NMF [[164]68]. We then
estimated the significance of differences between different clusters
with patients’ survival using log-rank test of the Kaplan-Meier method.
We also performed survival analysis on small modules generated from the
FI network.
SUPPLEMENTARY MATERIALS FIGURES AND TABLES
[165]oncotarget-08-112928-s001.pdf^ (4.1MB, pdf)
[166]oncotarget-08-112928-s002.xlsx^ (12.2KB, xlsx)
[167]oncotarget-08-112928-s003.xlsx^ (20.4KB, xlsx)
[168]oncotarget-08-112928-s004.xlsx^ (11.5KB, xlsx)
[169]oncotarget-08-112928-s005.xlsx^ (14.7KB, xlsx)
Abbreviations
DAVID
the Database for Annotation, Visualization, and Integrated
Discovery
TCGA
The Cancer Genome Atlas
KEGG
Kyoto Encyclopedia of Genes and Genomes
Footnotes
Author contributions
WY collected and analyzed the data, and wrote the manuscript; XZ and WY
conceived the idea. BZ participated in this work and analyzed the data.
SZ participated in this work, supervised the whole work and revised the
manuscript. YW, WZ and XZ revised the manuscript extensively. All
authors read and approved the final manuscript.
CONFLICTS OF INTEREST
The authors declare no potential conflicts of interest.
FUNDING
This work was supported in part by the National Basic Research Program
of China (973 Program) under Grant 2014CB340404, in part by the
National Science Foundation of China under Grant 61471267, Grant
61373105 and Grant 61672422, in part by the Fundamental Research Funds
for the Central Universities, and in part by National Institutes of
Health [1U01CA166886, 1U01AR069395 and 1R01GM123037].
REFERENCES