Abstract

   The stratification of cancer into subtypes that are significantly
   associated with clinical outcomes is beneficial for targeted prognosis
   and treatment. In this study, we integrated somatic mutation and gene
   expression data to identify clusters of patients. In contrast to
   previous studies, we constructed cancer-type-specific significant
   co-expression networks (SCNs) rather than using a fixed gene network
   across all cancers, such as the network-based stratification (NBS)
   method, which ignores cancer heterogeneity. For each type of cancer,
   the gene expression data were used to construct the SCN network, while
   the gene somatic mutation data were mapped onto the network,
   propagated, and used for further clustering. For the clustering, we
   adopted an improved network-regularized non-negative matrix
   factorization (netNMF) (netNMF_HC) for a more precise classification.
   We applied our method to various datasets, including ovarian cancer
   (OV), lung adenocarcinoma (LUAD) and uterine corpus endometrial
   carcinoma (UCEC) cohorts derived from the TCGA (The Cancer Genome
   Atlas) project. Based on the results, we evaluated the performance of
   our method to identify survival-relevant subtypes and further compared
   it to the NBS method, which adopts priori networks and netNMF
   algorithm. The proposed algorithm outperformed the NBS method in
   identifying informative cancer subtypes that were significantly
   associated with clinical outcomes in most cancer types we studied. In
   particular, our method identified survival-associated UCEC subtypes
   that were not identified by the NBS method. Our analysis indicated
   valid subtyping of patient could be applied by mutation data with
   cancer-type-specific SCNs and netNMF_HC for individual cancers because
   of specific cancer co-expression patterns and more precise clustering.

Introduction

   Cancer is a heterogeneous disease which is formed by various subtypes.
   In an organ, different tumour subtypes is a reflection of certain
   molecular oncogenic processes and different clinical outcomes, which
   means these subtypes are supposed be regarded as different kinds of
   cancers in treatment design [[36]1]. As cancer genomic, transcriptomic
   and epigenomic information is becoming increasingly available, the
   stratification of tumours into valid clinic subtypes according to
   molecular data is crucial for guiding better treatment and prognosis.
   Due to the growth of mass high-throughput omics data, standard
   unsupervised clustering methods are used to cluster samples
   [[37]2–[38]4], such as non-negative matrix factorization (NMF) and
   hierarchical clustering. Computational methods can be used to identify
   tumour subtypes which have different survival rates, tumour levels or
   stage, histological types, and responses of treatment.

   Key point of previous studies is utilizing messenger RNA (mRNA)
   expression data [[39]2–[40]3,[41]5] to successfully group patients
   based on similarities in gene expression into clinically relevant
   phenotypes. While somatic mutations can make mutated genes lose
   function and offer more clinical guidance [[42]6–[43]8], these
   mutations are not universal phenomenon among patients [[44]9–[45]10];
   thus, it is impossible to directly measure the similarity among tumours
   using mutated genes. The most advanced state-of-the-art integrative
   method for cancer analysis, network-based stratification (NBS),
   considers the sparsity of mutations by searching for mutational
   consistencies at the network level rather than at the individual gene
   level. This method can be used to identify patients’ subgroups with
   similar molecular-network patterns by propagating mutation labels on a
   prior gene interaction network.

   As the molecular network can be regarded as an obvious sign for
   interactions and relationships between molecules, it is adopted for the
   biological discovery of complex diseases [[46]11–[47]12]. NBS and
   current studies based on NBS all utilized the same prior networks
   across cancers [[48]13–[49]16]. However, we recognize that the
   regulation of gene expression levels among genes might be related to
   the type of cancer, and different cancers correspond to different
   regulations. Yang Y et al. constructed 4 cancer-type-specific
   co-expression networks (CNs) and revealed that, the hub genes which can
   be found in specific cancer networks are merely slightly overlap
   [[50]17]. Thus, for a distinct cancer, the mutations should be mapped
   onto a network that was derived specifically from that cancer. A CN is
   constructed based on the correlations among the quantitative gene
   expression levels in tumours, and the CN is presented as a graph in
   which the nodes correspond to genes, and the edges correspond to
   co-expressions among the genes. A CN contains more precise information
   regarding the connectivity among genes in individual cancer types than
   prior networks. CNs have been shown to be helpful for describing the
   pair wise relationships among gene transcripts [[51]18–[52]19]. Genes
   with similar or correlated expression patterns might contribute to the
   same regulatory function, and gene co-expression patterns in a CN may
   lead to more insightful discoveries regarding the underlying regulatory
   mechanisms [[53]20–[54]21].

   Thus, we propose a method based on NBS for the stratification of cancer
   by constructing a gene network for each cancer. In this study, we
   integrated somatic mutation data and gene expression data. For each
   cancer, the gene expression data were used to construct a CN, and the
   somatic mutations in the genes were mapped into the network and
   propagated, which was useful for further clustering.

   Furthermore, network-regularized NMF (netNMF) clustering using the NBS
   method has been shown to be better than the standard NMF method. netNMF
   first maps the samples with smoothed mutations into a lower
   k-dimensional feature vector space using netNMF matrix factorization,
   which constrains the genes with respect to the gene interaction
   network; then, the sample category is determined by the column with the
   largest value among the k feature vectors. This clustering is
   reasonable because it is possible to select the class that has the
   largest weight concerning the most relevant feature vector [[55]7], but
   the class cannot be easily estimated if there are two similarly large
   values. The class of the samples may be more precisely identified by
   clustering the factorized low rank feature vectors of the samples using
   a clustering method, such as hierarchical clustering.

   In this study, we propose an improved stratification method based on
   NBS through combining gene expression data and somatic mutation. First,
   we constructed a significant co-expression network (SCN) using gene
   expression profiles for each cancer type. Then, for patients, we
   projected the profile of mutation onto the cancer-type-specific SCN
   network. Network propagation was applied to diffuse the effect of
   mutation over its network neighbourhood. Finally, the matrix of the
   ‘network-smoothed’ was stratified into different subtypes with numbers
   ranging from 2 to 8 via the netNMF_HC clustering method. The
   effectiveness of our method was evaluated based on the relevance of our
   subtypes and clinical outcomes and compared with NBS, which used prior
   gene networks and the netNMF clustering method. The results showed that
   our method outperformed NBS for the three cancers and identified the
   survival-relevant subtypes of uterine endometrial carcinoma (UCEC),
   which were not identified by NBS. The workflow of our method is
   described in [56]Fig 1.

Fig 1. Workflow of our method.

   [57]Fig 1
   [58]Open in a new tab

Materials and methods

Data and pre-processing

TCGA somatic mutation data and mRNA expression data

   For comparing with NBS fairly, the somatic mutation profiles for serous
   ovarian cancer (OV), UCEC, and lung adenocarcinoma (LUAD) from TCGA and
   two prior gene-gene interaction networks STRING and Humannet were
   collected from materials of Hofree et al [[59]4]. The RNAseqv2 gene
   expression profiles for the three cancers were downloaded from the
   Synapse database ([60]https://www.synapse.org/#!Synapse:syn300013) in
   which the expression levels are normalized using MAS5. The expression
   data were used to construct an SCN for further analysis. As shown in
   [61]Table 1, the ovarian carcinoma dataset contained somatic mutations
   in 9,850 genes from 356 samples and gene expression profiles for 20,534
   genes from 430 samples. The LUAD dataset contained somatic mutations in
   15,967 genes from 381samples and gene expression profiles for 20,199
   genes from 576 samples. The UCEC dataset contained somatic mutations in
   17,968 genes from 248 samples and gene expression profiles for 20,531
   genes from 381 samples.

Table 1. Sample sizes, somatic mutations and gene expression profiles for
three cancers.

        Somatic mutation data       Gene expression profiles
        Sample size Number of genes Sample size Number of genes
   OV   356         9,850           430         20,534
   LUAD 381         15,967          576         20,199
   UCEC 248         17,968          381         20,531
   [62]Open in a new tab

   First, the mutation matrix F was generated based on samples with
   somatic mutations. F is binary as follows: if any gene mutates (a
   single nucleotide base change or an insertion or deletion of bases) in
   a certain sample relative to the germ line, the mutation is marked the
   number 1; otherwise, 0 is assigned. The expression matrix E is a real
   matrix, and each of its entries indicates a normalized given gene
   expression in a given sample. In all matrices, the samples and genes
   are represented by the rows and columns. We filtered samples with fewer
   than 10 mutations in mutation matrix F and genes with 0 expression in
   all samples in the gene expression matrix E. The clinical data,
   including survival, stage and grade of the three cancers, were gotten
   from the Synapse database
   ([63]https://www.synapse.org/#!Synapse:syn300013) and were applied in
   evaluation of the relevance of the identified subtypes and clinical
   outcomes.

Gene interaction networks

   The patient mutations were projected onto the gene interaction networks
   as follows: NBS utilizes STRING v.9 [[64]22] and HumanNet v.1 [[65]23],
   and our method employs an SCN that was constructed by gene expression
   profiles.

Method

Construction of the SCN

   For each cancer type, an SCN was constructed to represent the
   significant correlations between a pair of genes without a prior
   interaction network. First, the absolute Spearman’s rank correlations
   and the corresponding p-values of these correlations were calculated
   for all gene pairs based on the expression profile matrix of each
   cancer type. The original co-expression network (CN) was constructed
   based on the absolute correlations as the weight of the edges. Then,
   the expression of gene pairs was considered significantly correlated if
   the q-value (Bonferroni corrected p-value) of their correlation was
   smaller than 0.05. The SCN was obtained by filtering the edges with
   correlations whose corresponding q-values were greater than 0.05. Thus,
   the SCN is an unweighted undirected network in which each vertex
   denotes a gene used in our work. Spearman’s rank correlation was chosen
   as a measure to evaluate the relevance of two genes because it can
   detect nonlinear relationships, and it has been verified to have better
   performance than other measures, such as Euclidean distances and
   Pearson’s correlations, in measuring the similarity between genes
   [[66]24]. Spearman’s rank correlations and the corresponding p-values
   were calculated using the function corr() in MATLAB.

   Finally, three cancer-type-specific SCNs were obtained, in which two
   genes are connected if they are estimated to be significantly
   co-expressed.

Network smoothing

   For each patient of each cancer, we mapped the mutation profile onto
   the constructed cancer-type-specific SCN. Network propagation [[67]25]
   was used to propagate the mutation signal among networks. The key is to
   spread the mutation information of every gene to its neighbours
   iteratively until a stable state is achieved. The algorithm used is as
   follows:
     * Step1: Construct the degree-normalized matrix W′ = D^-1/2WD^-1/2,
       where D is a diagonal matrix, and its columns sums W on the
       diagonal; W is the adjacency matrix of the SCN network; and the
       diagonal elements of W are set to zero. The normalized matrix W′ is
       utilized for the following smooth process.
     * Step2: Iterate F[t+1] = αF[t]W′+ (1−α)F[0] until convergence is
       achieved (the matrix norm of F[t+1] −F[t] <1×10^−7), where F[0] is
       the patient-by-gene somatic mutation matrix, and the parameter α is
       a tuning parameter in (0, 1) that governs the relative amount of
       the information from the gene's neighbours and its initial mutation
       information. It should be noted that self-reinforcement should be
       avoided because the diagonal elements of the adjacency matrix W are
       set to zero at the beginning. The optimal value of α is set as
       0.7and is represented in NBS [[68]4]. Each row in F[t] represents
       the smoothed mutation of genes in a sample after it is influenced
       by its neighbours.
     * Step3: For F[t], quantile normalization is regarded as the
       guarantee for patient to follow the same smoothed mutation profile
       distribution. F was applied to show the transpose of the final
       normalized and smoothed mutation matrix.

Improved netNMF_HC

   NMF is a matrix factorization algorithm which can resolve a matrix into
   two lower rank non-negative matrices [[69]26]. netNMF can be regarded
   as an evolution of NMF which rules NMF to keep the gene interaction
   network structure. The objective is producing two non-negative
   matrices, W and H, to minimize the following function using an
   iterative method [[70]27]:
   [MATH:
   <mrow><munder><mrow><mtext>min</mtext></mrow><mrow><mi>W</mi><mo>,</mo>
   <mi>H</mi><mo>></mo><mn>0</mn></mrow></munder><msubsup><mrow><mrow><mo>
   ‖</mo><mrow><mi>F</mi><mo>−</mo><mi>W</mi><mi>H</mi></mrow><mo>‖</mo></
   mrow></mrow><mi>F</mi><mn>2</mn></msubsup><mo>+</mo><mi>λ</mi><mtext>tr
   ace</mtext><mo
   stretchy="false">(</mo><msup><mi>W</mi><mi>t</mi></msup><mi>L</mi><mi>W
   </mi><mo stretchy="false">)</mo></mrow> :MATH]
   (1)

   Where
   [MATH:
   <mrow><msubsup><mrow><mrow><mo>‖</mo><mo>.</mo><mo>‖</mo></mrow></mrow>
   <mi>F</mi><mn>2</mn></msubsup></mrow> :MATH]
   denotes the matrix Frobenius norm, W and H are decompositions of the
   smoothed m by n matrix F. W is an m by k basis matrix or "metagenes",
   and H is a k by n coefficient matrix. The reduced dimension is
   controlled by the value k and k is set 2~8.

   L is the graph Laplacian of the p-nearest-neighbour network derived
   from the original weighted gene co-expression network (CN). If v[i] and
   v[j] are two connected vertices in the original network, the weight of
   edge w[ij] of the p-nearest neighbour network w is as follows:
   [MATH:
   <msub><mi>w</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>=</mo><mrow
   ><mo>{</mo><mrow><mtable columnalign="left"><mtr
   columnalign="left"><mtd
   columnalign="left"><mrow><mn>1</mn><mo>,</mo></mrow></mtd><mtd
   columnalign="left"><mrow><mtext>if </mtext><msub><mi>v</mi><mi>i</mi></
   msub><mo>∈</mo><msub><mi>N</mi><mi>p</mi></msub><mo
   stretchy="false">(</mo><msub><mi>v</mi><mi>j</mi></msub><mo
   stretchy="false">)</mo><mtext> </mtext><mi>o</mi><mi>r</mi><mtext> </mt
   ext><msub><mi>v</mi><mi>j</mi></msub><mo>∈</mo><msub><mi>N</mi><mi>p</m
   i></msub><mo
   stretchy="false">(</mo><msub><mi>v</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow></mtd></mtr><mtr columnalign="left"><mtd
   columnalign="left"><mrow><mn>0</mn><mo>,</mo></mrow></mtd><mtd
   columnalign="left"><mrow><mtext>otherwise</mtext></mrow></mtd></mtr></m
   table></mrow></mrow> :MATH]
   (2)

   where N[p](v[i]) indicates the set of p-nearest neighbours of v[i]. The
   graph Laplacian of w is L = D(w) − w, where D is a diagonal matrix with
   the sums of a column (or row as w is symmetrical) of w D[ii] =
   Σ[j]w[ij]on the diagonal line. We set the number of nearest neighbours
   p = 11, and the regularization parameter λ was set to 200, which are on
   the same scale in NBS [[71]4]. The function was run iteratively until
   it converged (||F–WH||<1×10^−3).

   Finally, in previous works, the class of the samples was obtained from
   the coefficient matrix H. H is a k by n coefficient matrix, indicating
   n samples with k feature vectors. For a sample n, class
   [MATH:
   <mrow><msub><mi>c</mi><mi>n</mi></msub><mo>=</mo><mtext>arg</mtext><mun
   der><mrow><mtext>max</mtext></mrow><mi>k</mi></munder><mo
   stretchy="false">(</mo><msub><mi>H</mi><mrow><mi>k</mi><mi>n</mi></mrow
   ></msub><mo stretchy="false">)</mo></mrow> :MATH]
   ; therefore, the sample belongs to the column number of the feature
   vector with the largest value.

   It is not precise enough to use the netNMF method for determining the
   category of n samples; for example, if two vector values are almost the
   same, it may be difficult to determine the final class. Therefore, we
   propose the improved method netNMF_HC, which considers the k by n
   coefficient matrix H from netNMF to be a low dimension feature space of
   the patients, and then, H is utilized to group samples by hierarchical
   clustering (HC) to obtain the class of the patients.

Consensus clustering

   Robust clustering was achieved by applying consensus clustering
   [[72]28] to produce the final subtypes. Precisely, 80% of the patients
   and genes were sampled randomly for network smoothing and netNMF_HC was
   used to perform the clustering. The process was repeated 100 times. The
   results of the 100 clustering made up the patient-patient similarity
   matrix. The matrix recorded the frequency of the sampling of each pair
   of patients and the rates at which the pairs were clustered in same
   group among all replicates. According to the similarity matrix,
   hierarchical clustering with average linkage can be produced.

Clinical analysis

   The analysis of survival was generated by the R “survival” package.
   Kaplan-Meier survival curves of subtypes and log-rank test p-values
   were utilized to evaluate the association between the subtypes and
   patient 10 year survival time. Pearson’s chi-squared test was aimed to
   evaluate the relationship between the subtypes and tumour grade or
   stage.

Identification of differentially mutated genes in subgroups

   The significantly mutated genes of each subtype relative to the
   remaining subtypes were identified using the significance analysis of
   microarrays (SAM) method [[73]29] with the network-smoothed mutation
   data. The q-values were estimated by SAM with the Wilcoxon-rank
   statistic and 1,000 permutations. Then, the differentially mutated
   genes with q-values<0.05 for each subtype were selected for further
   analysis.

   We performed a biological processes and pathway enrichment analysis for
   the significantly differentially mutated genes in each subtype using
   DAVID 6.8 ([74]https://david.ncifcrf.gov/). Only enriched annotation
   terms whose q-values were lower than 0.05 were retained.

Results

   We tested our method and the NBS method in ovarian, uterine and lung
   cancer. In our method, the somatic mutations in these cancer types were
   propagated onto an SCN, and then, the smoothed mutation profiles were
   clustered with consensus clustering based on netNMF_HC. In NBS, the
   somatic mutations in these cancer types were mapped onto the gene
   interaction networks STRING and Humannet, and then, the smoothed
   mutation profiles were clustered with consensus clustering based on
   netNMF. To determine whether the SCN network or the improved clustering
   method netNMF_HC contribute to our method, we also performed
   experiments in which only changing the network or the clustering
   method. The clustering outcomes (cluster number k = 2~8) for each kind
   of cancer are shown in the [75]S1 File. Finally, different outcomes
   were observed for a given cancer type and a number of clusters k.

Clinical analysis

   The relationship between subtypes and 10 year survival was investigated
   in the first time. All three cancer subtypes derived from our method
   were significantly associated with survival in certain cluster numbers
   (log-rank test p-value < 0.05) as shown in [76]Table 2 and [77]Fig 2.
   Compared with NBS, when only the network was changed, the
   cancer-specific SCN performed better than prior network STRING in
   discovering clinically relevant subtypes for OV and LUAD; when only
   changing the clustering method, the improved clustering method
   netNMF_HC performed better than netNMF for OV. When both were changed
   simultaneously, our method performed better than the NBS method for all
   three cancers. Additionally, using our method (SCN + netNMF_HC), the
   OV, LUAD and UCEC samples were divided into 4, 6 and 3 clusters
   respectively having most significant association with the survival
   time, each cluster was independent and differed in survival ([78]Fig
   3), while the NBS method using the STRING network was less effective
   (p-values were not significant). Especially, the survival-relevant UCEC
   subtypes could not be obtained with NBS based on the STRING or Humannet
   network, which is consistent with a previous work [[79]4]. We then
   measured the relationship between the subtypes and the tumour grade or
   stage. Among three kinds of cancer, there was no relationship between
   the clusters and the tumour grade or stage, except for UCEC.

Table 2. Critical relationship between subtypes and survival in 3 tumour
types using NBS and improved methods.

   Survival p-value Cluster number k NBS Only changing network Only
   changing clustering method Our method
   STRING +netNMF Humannet +netNMF SCN+ netNMF STRING +netNMF_HC Humannet
   +netNMF_HC SCN+NetNMF_HC
   OV 4 0.058222 0.040828 0.007334 0.00792 0.003891 7.70E-07
   LUAD 6 0.177624 0.014901 0.033739 0.076693 0.07448 0.024237
   UCEC 3 0.260764 0.913596 0.080808 0.248289 0.491311 0.000314
   [80]Open in a new tab

Fig 2. Survival p-values in three cancers with distinct methods.

   [81]Fig 2
   [82]Open in a new tab

   (A) Significance with -Log10(p-value) association between 10 year
   survival and subtypes obtained by distinct combination of networks
   (STRING(blue), Humannet(green) and Significant co-expression
   network(SCN)(red)) and clustering methods (netNMF and netNMF_HC) for
   ovarian cancer (OV) with cluster number k = 4. (B) for lung
   cancer(LUAD) with cluster number k = 6. (C) for uterine cancer(UCEC)
   with cluster number k = 3. Dashed lines represent the -log10(P = 0.05)
   threshold.

Fig 3. Survival curves of three cancers with our method.

   Fig 3
   [83]Open in a new tab

   (A) Kaplan-Meier survival plots of subtypes obtained by our method that
   combining the significant co-expression network(SCN) and clustering
   method netNMF_HC for ovarian cancer (OV) with cluster number k = 4. (B)
   for lung cancer(LUAD) with cluster number k = 6. (C) for uterine
   cancer(UCEC) with cluster number k = 3.

   Overall, the different networks applied in the work influenced the
   stratification results. The SCN network performed better than the prior
   networks STRING and Humannet in most of the studied cancer types.
   Although the clustering method improvement did not contribute much, the
   combination of SCN network and netNMF_HC clustering method achieved
   better performance than NBS for all three cancers.

Identifying differentially mutated genes in subgroups

   We further identified the significantly differentially mutated genes in
   each subtype of UCEC for instance. The overlap of these gene sets is
   very few, which means these genes are specific to certain subtype
   ([84]Fig 4). In addition, the enriched biological processes and
   pathways of these genes are also distinct for different subtypes
   ([85]Fig 5).

Fig 4. Overlap of genes distinguishing the three subtypes of UCEC.

   [86]Fig 4
   [87]Open in a new tab

Fig 5. Top enriched biological processes and pathways of the differentially
mutated genes in each subtype of UCEC.

   [88]Fig 5
   [89]Open in a new tab

   Bars indicate the significance with -log10(Benjamini correction
   p-value) (blue) and the frequency of the mapped genes (red) of the
   corresponding function.

Mutation pattern analysis

   As shown in [90]Fig 3C, three UCEC subtypes were obtained. Cluster 1
   (red) has the worst survival and cluster 2 (blue) has the best
   survival. The mutation patterns (before the network smoothing) of the
   three UCEC subtypes which are predictive of survival was analysed. As
   shown in [91]Fig 6, the three subtypes have different mutation schemas,
   and cluster 3 harboured more mutations than the other clusters. Both
   PIK3CA and PTEN alterations have been reported to have strong
   relationships with UCEC [[92]30].

Fig 6. Somatic mutation patterns in three UCEC subtypes.

   [93]Fig 6
   [94]Open in a new tab

   Each row indicates a highly frequent gene, and each column denotes a
   sample. Dark colours in the figure indicate that a mutation occurred in
   a gene in a sample.

Discussion

   Exome and whole genome sequencing have provided a large amount of
   genomic and transcriptome data. These data enable the stratification of
   patients into clinically relevant subtypes, making molecular-driven
   diagnoses and therapy feasible. Network-based methods have integrated
   mutations and prior gene interaction networks to identify the
   clinically relevant subtypes. In this study, we showed that due to
   tumour heterogeneity, combination of a cancer-type-specific SCN and
   improved clustering method for each cancer type can achieve a superior
   stratification compared to using the prior fixed gene network for all
   cancers and often also has a better predictive performance of survival.
   This finding indicated that the cancer-type-specific gene SCNs can
   offer useful individual cancer biological knowledge for effective
   subtyping.

   Clinically relevant tumour subtypes of some cancer types may be driven
   by various mechanisms, such as copy number aberration or methylation,
   besides somatic mutations and gene expression levels. Integrating
   multiple types of molecular data to discover truly predictive subtypes
   is essential for the future. Based on our analysis, the top 50 genes
   that are mutated frequently across tumour samples (generally called
   mutation drivers) can distinguish the UCEC subtypes by the mutation
   patterns. Driver genes may be beneficial for stratification because
   they are signatures that can capture the differences among the
   subtypes. In conclusion, clinically relevant subtyping performance may
   be further improved by integrating clinical driver genes obtained from
   integrated molecular data and cancer-specific networks.

Supporting information

   S1 File. Significant p-value of association between subtypes and
   survival for three cancers.

   (PDF)
   [95]Click here for additional data file.^ (70.4KB, pdf)

Acknowledgments