Abstract
The stratification of cancer into subtypes that are significantly
associated with clinical outcomes is beneficial for targeted prognosis
and treatment. In this study, we integrated somatic mutation and gene
expression data to identify clusters of patients. In contrast to
previous studies, we constructed cancer-type-specific significant
co-expression networks (SCNs) rather than using a fixed gene network
across all cancers, such as the network-based stratification (NBS)
method, which ignores cancer heterogeneity. For each type of cancer,
the gene expression data were used to construct the SCN network, while
the gene somatic mutation data were mapped onto the network,
propagated, and used for further clustering. For the clustering, we
adopted an improved network-regularized non-negative matrix
factorization (netNMF) (netNMF_HC) for a more precise classification.
We applied our method to various datasets, including ovarian cancer
(OV), lung adenocarcinoma (LUAD) and uterine corpus endometrial
carcinoma (UCEC) cohorts derived from the TCGA (The Cancer Genome
Atlas) project. Based on the results, we evaluated the performance of
our method to identify survival-relevant subtypes and further compared
it to the NBS method, which adopts priori networks and netNMF
algorithm. The proposed algorithm outperformed the NBS method in
identifying informative cancer subtypes that were significantly
associated with clinical outcomes in most cancer types we studied. In
particular, our method identified survival-associated UCEC subtypes
that were not identified by the NBS method. Our analysis indicated
valid subtyping of patient could be applied by mutation data with
cancer-type-specific SCNs and netNMF_HC for individual cancers because
of specific cancer co-expression patterns and more precise clustering.
Introduction
Cancer is a heterogeneous disease which is formed by various subtypes.
In an organ, different tumour subtypes is a reflection of certain
molecular oncogenic processes and different clinical outcomes, which
means these subtypes are supposed be regarded as different kinds of
cancers in treatment design [[36]1]. As cancer genomic, transcriptomic
and epigenomic information is becoming increasingly available, the
stratification of tumours into valid clinic subtypes according to
molecular data is crucial for guiding better treatment and prognosis.
Due to the growth of mass high-throughput omics data, standard
unsupervised clustering methods are used to cluster samples
[[37]2–[38]4], such as non-negative matrix factorization (NMF) and
hierarchical clustering. Computational methods can be used to identify
tumour subtypes which have different survival rates, tumour levels or
stage, histological types, and responses of treatment.
Key point of previous studies is utilizing messenger RNA (mRNA)
expression data [[39]2–[40]3,[41]5] to successfully group patients
based on similarities in gene expression into clinically relevant
phenotypes. While somatic mutations can make mutated genes lose
function and offer more clinical guidance [[42]6–[43]8], these
mutations are not universal phenomenon among patients [[44]9–[45]10];
thus, it is impossible to directly measure the similarity among tumours
using mutated genes. The most advanced state-of-the-art integrative
method for cancer analysis, network-based stratification (NBS),
considers the sparsity of mutations by searching for mutational
consistencies at the network level rather than at the individual gene
level. This method can be used to identify patients’ subgroups with
similar molecular-network patterns by propagating mutation labels on a
prior gene interaction network.
As the molecular network can be regarded as an obvious sign for
interactions and relationships between molecules, it is adopted for the
biological discovery of complex diseases [[46]11–[47]12]. NBS and
current studies based on NBS all utilized the same prior networks
across cancers [[48]13–[49]16]. However, we recognize that the
regulation of gene expression levels among genes might be related to
the type of cancer, and different cancers correspond to different
regulations. Yang Y et al. constructed 4 cancer-type-specific
co-expression networks (CNs) and revealed that, the hub genes which can
be found in specific cancer networks are merely slightly overlap
[[50]17]. Thus, for a distinct cancer, the mutations should be mapped
onto a network that was derived specifically from that cancer. A CN is
constructed based on the correlations among the quantitative gene
expression levels in tumours, and the CN is presented as a graph in
which the nodes correspond to genes, and the edges correspond to
co-expressions among the genes. A CN contains more precise information
regarding the connectivity among genes in individual cancer types than
prior networks. CNs have been shown to be helpful for describing the
pair wise relationships among gene transcripts [[51]18–[52]19]. Genes
with similar or correlated expression patterns might contribute to the
same regulatory function, and gene co-expression patterns in a CN may
lead to more insightful discoveries regarding the underlying regulatory
mechanisms [[53]20–[54]21].
Thus, we propose a method based on NBS for the stratification of cancer
by constructing a gene network for each cancer. In this study, we
integrated somatic mutation data and gene expression data. For each
cancer, the gene expression data were used to construct a CN, and the
somatic mutations in the genes were mapped into the network and
propagated, which was useful for further clustering.
Furthermore, network-regularized NMF (netNMF) clustering using the NBS
method has been shown to be better than the standard NMF method. netNMF
first maps the samples with smoothed mutations into a lower
k-dimensional feature vector space using netNMF matrix factorization,
which constrains the genes with respect to the gene interaction
network; then, the sample category is determined by the column with the
largest value among the k feature vectors. This clustering is
reasonable because it is possible to select the class that has the
largest weight concerning the most relevant feature vector [[55]7], but
the class cannot be easily estimated if there are two similarly large
values. The class of the samples may be more precisely identified by
clustering the factorized low rank feature vectors of the samples using
a clustering method, such as hierarchical clustering.
In this study, we propose an improved stratification method based on
NBS through combining gene expression data and somatic mutation. First,
we constructed a significant co-expression network (SCN) using gene
expression profiles for each cancer type. Then, for patients, we
projected the profile of mutation onto the cancer-type-specific SCN
network. Network propagation was applied to diffuse the effect of
mutation over its network neighbourhood. Finally, the matrix of the
‘network-smoothed’ was stratified into different subtypes with numbers
ranging from 2 to 8 via the netNMF_HC clustering method. The
effectiveness of our method was evaluated based on the relevance of our
subtypes and clinical outcomes and compared with NBS, which used prior
gene networks and the netNMF clustering method. The results showed that
our method outperformed NBS for the three cancers and identified the
survival-relevant subtypes of uterine endometrial carcinoma (UCEC),
which were not identified by NBS. The workflow of our method is
described in [56]Fig 1.
Fig 1. Workflow of our method.
[57]Fig 1
[58]Open in a new tab
Materials and methods
Data and pre-processing
TCGA somatic mutation data and mRNA expression data
For comparing with NBS fairly, the somatic mutation profiles for serous
ovarian cancer (OV), UCEC, and lung adenocarcinoma (LUAD) from TCGA and
two prior gene-gene interaction networks STRING and Humannet were
collected from materials of Hofree et al [[59]4]. The RNAseqv2 gene
expression profiles for the three cancers were downloaded from the
Synapse database ([60]https://www.synapse.org/#!Synapse:syn300013) in
which the expression levels are normalized using MAS5. The expression
data were used to construct an SCN for further analysis. As shown in
[61]Table 1, the ovarian carcinoma dataset contained somatic mutations
in 9,850 genes from 356 samples and gene expression profiles for 20,534
genes from 430 samples. The LUAD dataset contained somatic mutations in
15,967 genes from 381samples and gene expression profiles for 20,199
genes from 576 samples. The UCEC dataset contained somatic mutations in
17,968 genes from 248 samples and gene expression profiles for 20,531
genes from 381 samples.
Table 1. Sample sizes, somatic mutations and gene expression profiles for
three cancers.
Somatic mutation data Gene expression profiles
Sample size Number of genes Sample size Number of genes
OV 356 9,850 430 20,534
LUAD 381 15,967 576 20,199
UCEC 248 17,968 381 20,531
[62]Open in a new tab
First, the mutation matrix F was generated based on samples with
somatic mutations. F is binary as follows: if any gene mutates (a
single nucleotide base change or an insertion or deletion of bases) in
a certain sample relative to the germ line, the mutation is marked the
number 1; otherwise, 0 is assigned. The expression matrix E is a real
matrix, and each of its entries indicates a normalized given gene
expression in a given sample. In all matrices, the samples and genes
are represented by the rows and columns. We filtered samples with fewer
than 10 mutations in mutation matrix F and genes with 0 expression in
all samples in the gene expression matrix E. The clinical data,
including survival, stage and grade of the three cancers, were gotten
from the Synapse database
([63]https://www.synapse.org/#!Synapse:syn300013) and were applied in
evaluation of the relevance of the identified subtypes and clinical
outcomes.
Gene interaction networks
The patient mutations were projected onto the gene interaction networks
as follows: NBS utilizes STRING v.9 [[64]22] and HumanNet v.1 [[65]23],
and our method employs an SCN that was constructed by gene expression
profiles.
Method
Construction of the SCN
For each cancer type, an SCN was constructed to represent the
significant correlations between a pair of genes without a prior
interaction network. First, the absolute Spearman’s rank correlations
and the corresponding p-values of these correlations were calculated
for all gene pairs based on the expression profile matrix of each
cancer type. The original co-expression network (CN) was constructed
based on the absolute correlations as the weight of the edges. Then,
the expression of gene pairs was considered significantly correlated if
the q-value (Bonferroni corrected p-value) of their correlation was
smaller than 0.05. The SCN was obtained by filtering the edges with
correlations whose corresponding q-values were greater than 0.05. Thus,
the SCN is an unweighted undirected network in which each vertex
denotes a gene used in our work. Spearman’s rank correlation was chosen
as a measure to evaluate the relevance of two genes because it can
detect nonlinear relationships, and it has been verified to have better
performance than other measures, such as Euclidean distances and
Pearson’s correlations, in measuring the similarity between genes
[[66]24]. Spearman’s rank correlations and the corresponding p-values
were calculated using the function corr() in MATLAB.
Finally, three cancer-type-specific SCNs were obtained, in which two
genes are connected if they are estimated to be significantly
co-expressed.
Network smoothing
For each patient of each cancer, we mapped the mutation profile onto
the constructed cancer-type-specific SCN. Network propagation [[67]25]
was used to propagate the mutation signal among networks. The key is to
spread the mutation information of every gene to its neighbours
iteratively until a stable state is achieved. The algorithm used is as
follows:
* Step1: Construct the degree-normalized matrix W′ = D^-1/2WD^-1/2,
where D is a diagonal matrix, and its columns sums W on the
diagonal; W is the adjacency matrix of the SCN network; and the
diagonal elements of W are set to zero. The normalized matrix W′ is
utilized for the following smooth process.
* Step2: Iterate F[t+1] = αF[t]W′+ (1−α)F[0] until convergence is
achieved (the matrix norm of F[t+1] −F[t] <1×10^−7), where F[0] is
the patient-by-gene somatic mutation matrix, and the parameter α is
a tuning parameter in (0, 1) that governs the relative amount of
the information from the gene's neighbours and its initial mutation
information. It should be noted that self-reinforcement should be
avoided because the diagonal elements of the adjacency matrix W are
set to zero at the beginning. The optimal value of α is set as
0.7and is represented in NBS [[68]4]. Each row in F[t] represents
the smoothed mutation of genes in a sample after it is influenced
by its neighbours.
* Step3: For F[t], quantile normalization is regarded as the
guarantee for patient to follow the same smoothed mutation profile
distribution. F was applied to show the transpose of the final
normalized and smoothed mutation matrix.
Improved netNMF_HC
NMF is a matrix factorization algorithm which can resolve a matrix into
two lower rank non-negative matrices [[69]26]. netNMF can be regarded
as an evolution of NMF which rules NMF to keep the gene interaction
network structure. The objective is producing two non-negative
matrices, W and H, to minimize the following function using an
iterative method [[70]27]:
[MATH:
minW,
H>0
‖F−WH‖
mrow>F2+λtr
ace(WtLW
) :MATH]
(1)
Where
[MATH:
‖.‖
F2 :MATH]
denotes the matrix Frobenius norm, W and H are decompositions of the
smoothed m by n matrix F. W is an m by k basis matrix or "metagenes",
and H is a k by n coefficient matrix. The reduced dimension is
controlled by the value k and k is set 2~8.
L is the graph Laplacian of the p-nearest-neighbour network derived
from the original weighted gene co-expression network (CN). If v[i] and
v[j] are two connected vertices in the original network, the weight of
edge w[ij] of the p-nearest neighbour network w is as follows:
[MATH:
wij={1,if vi
msub>∈Np(vj) or vj∈Np(vi)0,otherwise :MATH]
(2)
where N[p](v[i]) indicates the set of p-nearest neighbours of v[i]. The
graph Laplacian of w is L = D(w) − w, where D is a diagonal matrix with
the sums of a column (or row as w is symmetrical) of w D[ii] =
Σ[j]w[ij]on the diagonal line. We set the number of nearest neighbours
p = 11, and the regularization parameter λ was set to 200, which are on
the same scale in NBS [[71]4]. The function was run iteratively until
it converged (||F–WH||<1×10^−3).
Finally, in previous works, the class of the samples was obtained from
the coefficient matrix H. H is a k by n coefficient matrix, indicating
n samples with k feature vectors. For a sample n, class
[MATH:
cn=argmaxk(Hkn) :MATH]
; therefore, the sample belongs to the column number of the feature
vector with the largest value.
It is not precise enough to use the netNMF method for determining the
category of n samples; for example, if two vector values are almost the
same, it may be difficult to determine the final class. Therefore, we
propose the improved method netNMF_HC, which considers the k by n
coefficient matrix H from netNMF to be a low dimension feature space of
the patients, and then, H is utilized to group samples by hierarchical
clustering (HC) to obtain the class of the patients.
Consensus clustering
Robust clustering was achieved by applying consensus clustering
[[72]28] to produce the final subtypes. Precisely, 80% of the patients
and genes were sampled randomly for network smoothing and netNMF_HC was
used to perform the clustering. The process was repeated 100 times. The
results of the 100 clustering made up the patient-patient similarity
matrix. The matrix recorded the frequency of the sampling of each pair
of patients and the rates at which the pairs were clustered in same
group among all replicates. According to the similarity matrix,
hierarchical clustering with average linkage can be produced.
Clinical analysis
The analysis of survival was generated by the R “survival” package.
Kaplan-Meier survival curves of subtypes and log-rank test p-values
were utilized to evaluate the association between the subtypes and
patient 10 year survival time. Pearson’s chi-squared test was aimed to
evaluate the relationship between the subtypes and tumour grade or
stage.
Identification of differentially mutated genes in subgroups
The significantly mutated genes of each subtype relative to the
remaining subtypes were identified using the significance analysis of
microarrays (SAM) method [[73]29] with the network-smoothed mutation
data. The q-values were estimated by SAM with the Wilcoxon-rank
statistic and 1,000 permutations. Then, the differentially mutated
genes with q-values<0.05 for each subtype were selected for further
analysis.
We performed a biological processes and pathway enrichment analysis for
the significantly differentially mutated genes in each subtype using
DAVID 6.8 ([74]https://david.ncifcrf.gov/). Only enriched annotation
terms whose q-values were lower than 0.05 were retained.
Results
We tested our method and the NBS method in ovarian, uterine and lung
cancer. In our method, the somatic mutations in these cancer types were
propagated onto an SCN, and then, the smoothed mutation profiles were
clustered with consensus clustering based on netNMF_HC. In NBS, the
somatic mutations in these cancer types were mapped onto the gene
interaction networks STRING and Humannet, and then, the smoothed
mutation profiles were clustered with consensus clustering based on
netNMF. To determine whether the SCN network or the improved clustering
method netNMF_HC contribute to our method, we also performed
experiments in which only changing the network or the clustering
method. The clustering outcomes (cluster number k = 2~8) for each kind
of cancer are shown in the [75]S1 File. Finally, different outcomes
were observed for a given cancer type and a number of clusters k.
Clinical analysis
The relationship between subtypes and 10 year survival was investigated
in the first time. All three cancer subtypes derived from our method
were significantly associated with survival in certain cluster numbers
(log-rank test p-value < 0.05) as shown in [76]Table 2 and [77]Fig 2.
Compared with NBS, when only the network was changed, the
cancer-specific SCN performed better than prior network STRING in
discovering clinically relevant subtypes for OV and LUAD; when only
changing the clustering method, the improved clustering method
netNMF_HC performed better than netNMF for OV. When both were changed
simultaneously, our method performed better than the NBS method for all
three cancers. Additionally, using our method (SCN + netNMF_HC), the
OV, LUAD and UCEC samples were divided into 4, 6 and 3 clusters
respectively having most significant association with the survival
time, each cluster was independent and differed in survival ([78]Fig
3), while the NBS method using the STRING network was less effective
(p-values were not significant). Especially, the survival-relevant UCEC
subtypes could not be obtained with NBS based on the STRING or Humannet
network, which is consistent with a previous work [[79]4]. We then
measured the relationship between the subtypes and the tumour grade or
stage. Among three kinds of cancer, there was no relationship between
the clusters and the tumour grade or stage, except for UCEC.
Table 2. Critical relationship between subtypes and survival in 3 tumour
types using NBS and improved methods.
Survival p-value Cluster number k NBS Only changing network Only
changing clustering method Our method
STRING +netNMF Humannet +netNMF SCN+ netNMF STRING +netNMF_HC Humannet
+netNMF_HC SCN+NetNMF_HC
OV 4 0.058222 0.040828 0.007334 0.00792 0.003891 7.70E-07
LUAD 6 0.177624 0.014901 0.033739 0.076693 0.07448 0.024237
UCEC 3 0.260764 0.913596 0.080808 0.248289 0.491311 0.000314
[80]Open in a new tab
Fig 2. Survival p-values in three cancers with distinct methods.
[81]Fig 2
[82]Open in a new tab
(A) Significance with -Log10(p-value) association between 10 year
survival and subtypes obtained by distinct combination of networks
(STRING(blue), Humannet(green) and Significant co-expression
network(SCN)(red)) and clustering methods (netNMF and netNMF_HC) for
ovarian cancer (OV) with cluster number k = 4. (B) for lung
cancer(LUAD) with cluster number k = 6. (C) for uterine cancer(UCEC)
with cluster number k = 3. Dashed lines represent the -log10(P = 0.05)
threshold.
Fig 3. Survival curves of three cancers with our method.
Fig 3
[83]Open in a new tab
(A) Kaplan-Meier survival plots of subtypes obtained by our method that
combining the significant co-expression network(SCN) and clustering
method netNMF_HC for ovarian cancer (OV) with cluster number k = 4. (B)
for lung cancer(LUAD) with cluster number k = 6. (C) for uterine
cancer(UCEC) with cluster number k = 3.
Overall, the different networks applied in the work influenced the
stratification results. The SCN network performed better than the prior
networks STRING and Humannet in most of the studied cancer types.
Although the clustering method improvement did not contribute much, the
combination of SCN network and netNMF_HC clustering method achieved
better performance than NBS for all three cancers.
Identifying differentially mutated genes in subgroups
We further identified the significantly differentially mutated genes in
each subtype of UCEC for instance. The overlap of these gene sets is
very few, which means these genes are specific to certain subtype
([84]Fig 4). In addition, the enriched biological processes and
pathways of these genes are also distinct for different subtypes
([85]Fig 5).
Fig 4. Overlap of genes distinguishing the three subtypes of UCEC.
[86]Fig 4
[87]Open in a new tab
Fig 5. Top enriched biological processes and pathways of the differentially
mutated genes in each subtype of UCEC.
[88]Fig 5
[89]Open in a new tab
Bars indicate the significance with -log10(Benjamini correction
p-value) (blue) and the frequency of the mapped genes (red) of the
corresponding function.
Mutation pattern analysis
As shown in [90]Fig 3C, three UCEC subtypes were obtained. Cluster 1
(red) has the worst survival and cluster 2 (blue) has the best
survival. The mutation patterns (before the network smoothing) of the
three UCEC subtypes which are predictive of survival was analysed. As
shown in [91]Fig 6, the three subtypes have different mutation schemas,
and cluster 3 harboured more mutations than the other clusters. Both
PIK3CA and PTEN alterations have been reported to have strong
relationships with UCEC [[92]30].
Fig 6. Somatic mutation patterns in three UCEC subtypes.
[93]Fig 6
[94]Open in a new tab
Each row indicates a highly frequent gene, and each column denotes a
sample. Dark colours in the figure indicate that a mutation occurred in
a gene in a sample.
Discussion
Exome and whole genome sequencing have provided a large amount of
genomic and transcriptome data. These data enable the stratification of
patients into clinically relevant subtypes, making molecular-driven
diagnoses and therapy feasible. Network-based methods have integrated
mutations and prior gene interaction networks to identify the
clinically relevant subtypes. In this study, we showed that due to
tumour heterogeneity, combination of a cancer-type-specific SCN and
improved clustering method for each cancer type can achieve a superior
stratification compared to using the prior fixed gene network for all
cancers and often also has a better predictive performance of survival.
This finding indicated that the cancer-type-specific gene SCNs can
offer useful individual cancer biological knowledge for effective
subtyping.
Clinically relevant tumour subtypes of some cancer types may be driven
by various mechanisms, such as copy number aberration or methylation,
besides somatic mutations and gene expression levels. Integrating
multiple types of molecular data to discover truly predictive subtypes
is essential for the future. Based on our analysis, the top 50 genes
that are mutated frequently across tumour samples (generally called
mutation drivers) can distinguish the UCEC subtypes by the mutation
patterns. Driver genes may be beneficial for stratification because
they are signatures that can capture the differences among the
subtypes. In conclusion, clinically relevant subtyping performance may
be further improved by integrating clinical driver genes obtained from
integrated molecular data and cancer-specific networks.
Supporting information
S1 File. Significant p-value of association between subtypes and
survival for three cancers.
(PDF)
[95]Click here for additional data file.^ (70.4KB, pdf)
Acknowledgments