Abstract

   With the rapid progress of high-throughput sequencing technology,
   diverse genomic data have become easy to obtain. Integrative methods
   are urgently needed to exploit the value of those data effectively.
   In this paper, building on SNF (Similarity Network Fusion) [1], we
   propose a new integrative method named ndmaSNF (network diffusion
   model assisted SNF), which can be used for cancer subtype discovery
   and has the advantage of making use of somatic mutation data and
   other discrete data. First, we apply a network diffusion model to
   the mutation data to make them smoothed and adaptive. Then, the
   mutation data, along with the other data types, are used in the SNF
   framework by constructing a patient-by-patient similarity network
   for each data type. Finally, a fused patient network containing the
   information from all input data types is obtained by a nonlinear
   iterative method. The fused network can then be used for cancer
   subtype discovery with a clustering algorithm. Experimental results
   on four cancer datasets show that ndmaSNF can find subtypes with
   significant differences in survival profile and other clinical
   features.

   Keywords: cancer subtyping, integrative method, network diffusion,
   somatic mutation data

INTRODUCTION

   Cancer is considered a complicated and heterogeneous disease because
   it is driven by different combinations of mutated genes rather than
   by individual genes, and those mutations vary among tumor samples.
   Several large-scale projects, such as The Cancer Genome Atlas (TCGA)
   [2], the International Cancer Genome Consortium (ICGC) [3], and the
   Cancer Cell Line Encyclopedia (CCLE) [4], have generated a large
   volume of data from multiple genomic platforms. Integrative methods
   are therefore urgently needed to use those molecular data
   simultaneously to identify tumor subsets with different clinical
   and biological meaning.

   Many studies on integrative frameworks for cancer subtype
   identification have been published. For instance, Liu et al. [5]
   proposed a method using regularized non-negative matrix
   factorization for gene expression analysis. Liu et al. [6] also
   proposed an approach for integrated analysis via block-constraint
   robust principal component analysis. Gu et al. [7, 8] proposed
   approaches that made progress in classification and regression.
   Shen et al. [9] proposed a joint latent variable model named
   iCluster, which performs data integration and dimensionality
   reduction simultaneously; the clustering result is obtained by
   applying a standard K-means algorithm to the joint latent variable.
   Though pioneering and effective, iCluster relies to a great extent
   on a feature preselection step. Wang et al. [1] introduced a
   distinct integrative approach called SNF, which consists of a few
   steps. First, for each data type, a sample-by-sample similarity
   network is constructed using the Euclidean distance and a scaled
   exponential similarity kernel. These similarity networks are then
   fused into a single network by a nonlinear iterative method.
   Finally, the fused network is partitioned by spectral clustering
   into several tumor groups. In SNF, diverse data such as DNA
   methylation, mRNA expression and miRNA expression were used to
   identify meaningful cancer subtypes. However, those data types are
   continuous-valued, for which the Euclidean metric is suitable; the
   metric is of little use for discrete profiles such as somatic
   mutation data. For discrete data, the SNF authors do propose the
   chi-squared distance (Supplementary Note-Chi-squared distance) to
   calculate the similarity between patients, but with it we could not
   obtain a satisfactory result.

   There are intrinsic differences between mutation data and data
   types with quantitative values: (i) mutation data are binary, so
   Euclidean measurement is not suitable; (ii) their high
   dimensionality makes typical binary similarity measures hard to
   use; (iii) their sparseness (fewer than 100 mutated genes out of
   nearly ten thousand) makes them so heterogeneous that clinically
   identical patients rarely share more than a single mutation.
   Traditional distance-based similarity measures therefore cannot be
   applied directly. Nevertheless, somatic mutation data are valuable,
   since they provide information about the relationships between
   genes and biochemical pathways and a comprehensive insight into
   tumor progression [10]. To deal with this problem, Hofree et al.
   [11] proposed a method named NBS (network-based stratification),
   which integrates somatic mutation data with gene networks using a
   network diffusion model and performs clustering within a consensus
   clustering framework to make the result robust. Their work shows
   that somatic mutation data are a promising source for cancer
   subtype identification. However, NBS does not use other levels of
   information, such as the epigenome or transcriptome.

   In this paper, we propose a method named ndmaSNF (network diffusion
   model assisted SNF), based on the integrative framework of SNF [1],
   for identifying cancer subtypes using the somatic mutation profile
   and data from other platforms simultaneously. Figure 1 shows a
   schematic overview of our method. We roughly divided the data
   sources into two categories: continuous data and discrete data
   (Figure 1A). Discrete data (e.g. mutation status) were made to fit
   the SNF framework by applying a network diffusion model together
   with a gene interaction network (Figure 1B). The discrete data were
   thereby smoothed and could be used in the SNF framework together
   with the continuous data (Figure 1C). By combining the similarity
   matrices from the two kinds of data, a fused patient-by-patient
   similarity matrix was obtained through the nonlinear combination
   method of the SNF framework (Figure 1D). A clustering result can
   then be obtained by applying a clustering algorithm, such as
   spectral clustering, to this fused matrix. We applied ndmaSNF to
   several human cancer data sets consisting of various data types and
   obtained biologically and clinically relevant patient cohorts, with
   better P values and, on most data sets, better silhouette values
   than SNF. The clustering result broadly agreed with the PAM50
   classification, which indicates clinical value for treatment.

Figure 1. The flow chart of ndmaSNF.

   (A) Dividing data into two main categories. (B) Pre-processing of
   discrete-valued data types via a network diffusion model
   incorporating a gene interaction network. (C) “Smoothed” mutation
   data. (D) The patient similarity matrices derived from the various
   data types are combined into one fused patient similarity matrix
   through the integrative framework of SNF.

   Moreover, existing methods generally identify network modules common
   to all tumors, which may ignore the heterogeneity between subtypes.
   In this study, we first applied ndmaSNF to various data sources to
   obtain cancer subtypes, and then used DriverNet [12] to find
   potential driver genes for each subtype. We performed pathway
   enrichment analysis on those genes per subtype, and used the top 60
   potential driver genes from DriverNet for subtype-specific network
   module discovery with the software GenRev [13]. The experimental
   results indicate that ndmaSNF can find distinct cancer subtypes
   associated with different clinical outcomes and network modules.

RESULTS

Performance comparison

   We evaluated the performance of ndmaSNF by comparing it with two
   state-of-the-art methods, SNF [1] and LRAcluster [14], using the
   silhouette value and the P value as metrics on four cancer datasets
   (BIC: breast invasive carcinoma; KRCCC: kidney renal clear cell
   carcinoma; LSCC: lung squamous cell carcinoma; COAD: colon
   adenocarcinoma). The results are listed in Tables 1 and 2 (for the
   P value, lower is better; for the silhouette value, higher is
   better).

Table 1. Comparison of ndmaSNF with other methods on four cancer datasets
using P value.

                               LSCC    KRCCC     BIC      COAD
   SNF without mutation data 1.16E-03 8.76E-04 5.74E-06 3.38E-04
   SNF                       9.86E-04 1.45E-03 1.59E-06 3.56E-04
   LRAcluster                4.30E-02 3.24E-02 5.70E-02 9.90E-03
   ndmaSNF                   2.83E-04 3.43E-04 2.46E-08 1.40E-04

Table 2. Comparison of ndmaSNF with other methods on four cancer datasets
using silhouette value.

                             LSCC KRCCC BIC  COAD
   SNF without mutation data 0.46 0.34  0.43 0.50
   SNF                       0.46 0.33  0.34 0.51
   LRAcluster                0.50 0.32  0.46 0.35
   ndmaSNF                   0.52 0.39  0.45 0.43

   In Table 1, the row “SNF without mutation data” gives the results
   obtained with the 3 continuous data types (DNA methylation, mRNA
   expression, miRNA expression). The other rows (SNF, LRAcluster and
   ndmaSNF) give the results with 4 data types, including mutation
   data. Comparing “SNF without mutation data” with ndmaSNF shows that
   the somatic mutation profile is a promising data source for
   identifying cancer subtypes. However, this value was not realized
   by the original SNF: comparing “SNF without mutation data” with SNF
   shows that SNF did not exploit the mutation data well, and adding
   them even worsened the result on two data sets (KRCCC, COAD).
   LRAcluster [14] is another integrative method, a fast one that
   finds the shared principal subspace across multiple data types.
   However, it did not perform as well as the original SNF here.
   Because of its speed, we think that LRAcluster is better suited to
   large-scale analyses such as pan-cancer analysis.

   In terms of silhouette value, the improvement of our method over
   the other methods was slight (Table 2). For the COAD data set, the
   performance of our method even decreased slightly. We attribute
   this to the fact that COAD has at least one subtype with few
   patients, which makes the silhouette value very sensitive and
   unstable. Nevertheless, we can at least conclude that including
   the mutation data did not destroy the combination of the original
   3 data types (DNA methylation, mRNA expression, miRNA expression)
   used in SNF [1].

A case study: breast invasive carcinoma

   To further validate that ndmaSNF can identify subtypes with
   biological and clinical differences, we performed an in-depth
   analysis of breast invasive carcinoma (BIC), a common breast cancer
   that grows into normal, healthy tissue.

   We identified 5 subtypes of BIC in total, with a log-rank P value of
   2.46E-08. To show the extent to which these subtypes correspond to
   the established PAM50 classification, we tabulated the distribution
   of PAM50 classes per subtype. C1-C5 in Figure 2A represent the
   subtypes identified by our method. C3 and C4 fit the PAM50
   classification closely: Basal-like for C3 and Luminal A for C4. C1
   consists mostly of Basal-like cases, C2 mostly of Luminal B cases,
   and C5 mostly of Basal-like and Her2-enriched cases. The Luminal A
   subtype is more likely to have a good prognosis, whereas the
   Basal-like subtype is aggressive and has a poorer prognosis. This
   is reflected in the Kaplan-Meier plot in Figure 2B, which shows a
   significant survival difference: patients in C3 have significantly
   shorter overall survival than those in C4. Although C1 and C3 are
   both Basal-like subtypes, they differ in survival probability
   (P = 0.036), as can be seen in Figure 2B; C1 is more aggressive
   than C3, as the survival curves show.

Figure 2.

   (A) Distribution of PAM50 samples in the identified subtypes. (B)
   Kaplan-Meier survival curves of 5 subtypes identified.

   In Figure 3A, we can see that C1 and C3 are mainly triple negative,
   while C2 and C4 are largely ER positive, PR positive and HER2
   negative; the situation of C5 is more complicated. Basal-like
   breast cancer is usually triple negative, which is consistent with
   the observation in Figure 2A that C1 and C3 consist mostly of
   Basal-like cases.

Figure 3.

   (A) Clinical features with ER/PR/HER2 per subtype. (B) Comparison of
   the mutation frequencies among the identified subtypes.

   Furthermore, we turned to mutation frequencies for additional
   validation. Focusing on the genes with high mutation frequency, we
   found evident differences between the clusters (Figure 3B): ZNF670,
   SMYD3, CNST and TFB2M for cluster 1; CCND1, MAP3K1 and ERBB2 for
   cluster 2; TP53, CTSS, NLRP3 and SH3BP5L for cluster 3; PIK3CA and
   GATA3 for cluster 4; and TP53, PIK3CA, ERBB2 and CCND1 for cluster
   5. Each identified subtype thus has a different combination of
   highly mutated genes, corresponding to various biological
   processes.

Driver gene analysis per subtype identified in breast invasive carcinoma

   To study which gene combination drives each subtype, and whether
   the different driver gene combinations correspond to different
   biological pathways, we applied DriverNet [12] to find important
   genes using gene expression data, mutation data and a gene-by-gene
   network.

   As Table 3 shows, TP53 is of great importance in all subtypes, but
   the overall combination of top driver genes is distinct in each
   subtype. To show the difference clearly, we used the top 60 driver
   genes identified in each subtype for further study, including
   pathway enrichment analysis and network module identification. The
   aim is to find out which biological processes and important
   pathways the driver genes of the different subtypes participate in.

Table 3. Top 10 driver genes per subtype obtained by DriverNet [12].

   C1        C2      C3     C4     C5
   TP53     TP53    TP53   TP53   TP53
   CSNK2A1   MYC    MYC   PIK3CA ERBB2
   EP300    CCND1  CDKN2A  MYC    MYC
   PRKCA    PAK1    RB1   IGF1R  PIK3R1
   UBQLN4  CSNK2A1 STAT5A MAP2K4 SMAD3
   SHC1     ERBB2   MCL1   LRP2  ACTL6A
   MYC      IGF1R  IGF1R  GATA3   TTN
   CCDC85B  MAPT   TUBG1   MCL1  U2AF2
   RELA     RELA   IKBKB   CDH1  SMAD2
   PAK1    PIK3R1  BRCA1   TTN   CDKN2A

   We performed a KEGG pathway enrichment analysis per subtype and
   selected the pathways related to breast cancer. Figure 4 shows
   differences between the subtypes at the enrichment level. Not
   surprisingly, all subtypes show a clear enrichment in hsa05200:
   Pathways in cancer. Apoptosis, a programmed cell death mechanism,
   is also enriched in C2, C3 and C5.

Figure 4. Pathway enrichment analysis for the top 60 driver genes per
subtype.

   C1 is typically enriched in the Wnt and ErbB signaling pathways.
   The Wnt signaling pathways are a group of signal transduction
   pathways made of proteins that pass signals into a cell through
   cell surface receptors. Wnt signaling is known for its role in
   carcinogenesis, and its clinical importance is demonstrated by
   mutations that lead to various diseases, including breast cancer
   [15]. Furthermore, excessive ErbB signaling is associated with the
   development of a wide variety of solid tumors [16].

   C2 is typically enriched in the MAPK signaling pathway. The MAPK
   pathway is a chain of proteins in the cell that communicates a
   signal from a receptor on the cell surface to the DNA in the
   nucleus. When one of the proteins in the pathway is mutated, it can
   become stuck in the “on” or “off” position, which is a necessary
   step in the development of many cancers. Components of the MAPK/ERK
   pathway were discovered when they were found in cancer cells. Drugs
   that reverse the “on” or “off” switch are being investigated as
   cancer treatments [17].

   C3 is typically enriched in the p53 signaling pathway. In breast
   cancer, p53 mutation is associated with more aggressive disease and
   worse overall survival. Molecular pathological analysis of the
   structure and expression of constituents of the p53 pathway is
   likely to have value in the diagnosis, prognostic assessment and
   treatment of breast cancer [18].

   C4 is typically enriched in Cell cycle and Adherens junction. The
   cell cycle is the series of events in a cell that leads to its
   division and duplication. Regulation of the cell cycle involves
   processes crucial to the survival of a cell, including the
   detection and repair of genetic damage and the prevention of
   uncontrolled cell division. Adherens junctions, the most common
   type of intercellular adhesion, are important for maintaining
   tissue architecture and cell polarity and can limit cell movement
   and proliferation.

   C5 is enriched in many of the pathways represented in C1-C4, such as
   Cell cycle, p53 signaling pathway, Adherens junction, ErbB signaling
   pathway and Apoptosis. This is consistent with Figure 2A, in which
   C5 is a mixture of different PAM50 subtypes.

Network module identification and analysis per subtype identified in breast
invasive carcinoma

   To get a clearer understanding of the combinations of driver genes,
   we examined their significance at the network module level, using
   them as seed genes to find network modules per subtype.

   For subtype C1, we used the top 60 driver genes as seed genes in
   GenRev [13], and 42 genes were connected to each other on the gene
   interaction network. We found 10 network modules in total, with a
   division modularity of 0.53; 5 of them are connected to each other
   and have more than 4 genes each. Those 5 modules make up the
   largest sub-network (Figure 5). The densest module is the TP53
   module, which contains many important genes related to breast
   cancer. TP53 is a well-known tumor suppressor gene associated with
   various cancers, including breast invasive carcinoma. Its mutation
   status and gene-expression-based groups are important survival
   markers of breast cancer, and these molecular markers may provide
   prognostic information that complements clinical variables [19].
   The TP53 module also contains SMARCA4, which can inhibit the
   ability of cells to migrate and invade; it is therefore relevant to
   the pathogenesis of breast cancer as a prognostic marker and a
   possible selective therapeutic target [20]. HDAC2 is another
   important gene related to breast cancer, which tends to be strongly
   expressed in aggressive breast cancer subgroups [21]. We also
   discovered a SMAD4 module. Current research shows that SMAD4 plays
   a key role in both tumor suppression and progression of breast
   cancer cells [22]. Another critical gene in this module is EP300,
   which encodes the transcriptional cofactor p300, a protein highly
   expressed in diverse human cancers. In particular, overexpression
   of p300 in breast cancer predicts tumor recurrence and adverse
   prognosis [23]. The remaining three modules contain other important
   genes: PIK3CA, TYK2 and APOA1, respectively. PIK3CA is a well-known
   oncogene in human cancers. Accumulating evidence suggests that
   mutation of PIK3CA is an early event in breast cancer and is more
   likely to play a role in breast tumor initiation than in invasive
   progression [24]. The role of TYK2 in suppressing the growth and
   metastasis of breast cancer has been confirmed by biological
   experiments [25]. APOA1 is one of the most significant genes
   correlated with the proteomic profile closely related to breast
   cancer, and may be involved in robust detection of disease
   progression [26].

Figure 5. Network modules discovered in subtype 1.

   The green nodes represent genes we input, and the yellow nodes
   represent linker genes connecting those genes we input.

   For subtype C2, after the top 60 driver genes were input as seeds,
   41 genes were retained and we obtained 8 modules in total, with a
   division modularity of 0.50. The most densely connected sub-network
   is shown in Figure 6. The ESR1 module contains important genes such
   as ESR1 itself. Recent studies suggest that activating mutations in
   ESR1 are a key mechanism of acquired endocrine resistance in breast
   cancer therapy [27]. The PIK3CA module contains important genes
   such as STAT3 and PIK3R1. Current findings show that activated
   STAT3 signaling contributes to breast cancer progression and
   resistance to chemotherapy, in part by inducing expression of the
   antiapoptotic protein Survivin [28]. PIK3CA mutations and PIK3R1
   underexpression show opposite effects on patient outcome and could
   become useful prognostic and predictive factors in breast cancer
   [29]. We also identified a CDC42 module, which includes important
   genes such as CDC42 and PAK1. Inhibition of the growth and motility
   of breast cancer cells by epidermal growth factor receptor
   degradation is correlated with inactivation of CDC42 [30], and
   associations have been reported between PAK1 expression and
   subcellular localization in tumor cells and tamoxifen resistance
   [31].

Figure 6. Network modules discovered in subtype 2.

   We also performed network module analysis for C3, C4 and C5; the
   results are given in the Supplementary Materials (Supplementary
   Results-network module analysis for C3, C4 and C5).

DISCUSSION

   Integrative methods are urgently needed to exploit data from
   multiple genomic platforms simultaneously and to gain insight into
   human neoplasia, for example through the identification of cancer
   subtypes. In this work, we proposed a method named ndmaSNF, which
   extends the integrative framework SNF to make full use of somatic
   mutation data. With a network diffusion model, the somatic mutation
   data are “smoothed” so that their value can be exploited to a large
   extent. The experimental results on several cancer data sets
   indicate that our method outperformed the compared methods in
   identifying patient cohorts with biological and clinical meaning.
   For example, we found 5 subtypes, C1-C5, in BIC with different
   biological and clinical features. C3 consists mostly of Basal-like
   cases, whereas C4 consists mostly of Luminal A cases, and the
   prognosis of C4 is better than that of C3. Interestingly, C5 is a
   mixture of different PAM50 subtypes and is enriched in many of the
   pathways represented in C1-C4. Based on those subtypes, we carried
   out a deeper analysis including pathway enrichment analysis and
   network module identification. The results show that our method can
   capture biological and clinical features effectively. Our research
   also demonstrates the value of mutation data in giving insight into
   tumorigenesis. In the future, we will use other discrete data, such
   as copy number variations, to make our method more widely
   applicable.

MATERIALS AND METHODS

Datasets

   The data (DNA methylation, mRNA expression, miRNA expression) used
   in this paper comprise four cancer data sets from the TCGA website
   (https://cancergenome.nih.gov/), which were processed and provided
   by Wang et al. [1]. The mutation data for those four cancer data
   sets were obtained from the UCSC data portal
   (http://genome.ucsc.edu/). We restricted our analysis to the 85
   TCGA LSCC cases, 75 TCGA COAD cases, 101 TCGA KRCCC cases and 105
   TCGA BIC cases for which DNA methylation, mRNA expression, miRNA
   expression and somatic mutation data were all available. As the
   gene interaction network, we used the processed PPI
   (protein-protein interaction) network obtained from NBS [11], which
   contains 11491 genes.

SNF integrative framework

   Suppose we have n samples (X[1], X[2], …, X[n]), each described by
   several data sources at multiple scales (e.g. mutation data,
   expression data). We want to use these data simultaneously to
   identify cancer subtypes. The SNF framework can be described as
   follows.

   First, for each data type, an n × n patient similarity matrix W is
   constructed, whose entry W(i, j) represents the similarity between
   patient X[i] and patient X[j]. W is calculated as follows:

   \[
   W(i,j) = \exp\left( -\frac{\rho^{2}(x_i, x_j)}{\mu\,\varepsilon_{i,j}} \right)
   \tag{1}
   \]

   Here \(\rho(x_i, x_j)\) represents the Euclidean distance between
   the feature vectors x_i and x_j of patients X[i] and X[j], and μ is
   an empirical hyperparameter that is recommended to be set in the
   range [0.3, 0.8]. Furthermore, ε[i, j] is defined as follows:

   \[
   \varepsilon_{i,j} = \frac{\operatorname{mean}(\rho(x_i, N_i)) + \operatorname{mean}(\rho(x_j, N_j)) + \rho(x_i, x_j)}{3}
   \tag{2}
   \]

   Here \(\operatorname{mean}(\rho(x_i, N_i))\) is the average distance
   between x_i and each of its neighbors. The Euclidean distance is
   suitable for continuous variables; for discrete data, the
   chi-squared distance is proposed (Supplementary Note-Chi-squared
   distance). Two matrices are derived from W, namely matrix P and
   matrix S. Matrix P carries the full information about the
   similarity of each patient to all others and is obtained by
   normalizing W:

   \[
   P(i,j) =
   \begin{cases}
   \dfrac{W(i,j)}{2 \sum_{k \neq i} W(i,k)}, & j \neq i \\[1ex]
   1/2, & j = i
   \end{cases}
   \tag{3}
   \]
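
   As an illustration of Eqs. (1)-(3), the following is a minimal
   NumPy sketch, not the authors' implementation. The function names
   affinity_matrix and full_kernel, the default K = 20 and the default
   μ = 0.5 are assumptions made for this example; only the formulas
   themselves come from the text.

```python
import numpy as np

def affinity_matrix(X, K=20, mu=0.5):
    """Scaled exponential similarity kernel of Eqs. (1)-(2).

    X  : (n_patients, n_features) array of continuous values.
    K  : number of nearest neighbours for the local scale (assumed default).
    mu : hyperparameter, recommended in [0.3, 0.8].
    """
    # Pairwise Euclidean distances rho(x_i, x_j).
    diff = X[:, None, :] - X[None, :, :]
    rho = np.sqrt((diff ** 2).sum(axis=-1))
    # Mean distance from each patient to its K nearest neighbours
    # (column 0 of each sorted row is the patient itself, at distance 0).
    mean_nn = np.sort(rho, axis=1)[:, 1:K + 1].mean(axis=1)
    # Eq. (2): local scale epsilon_{i,j}.
    eps = (mean_nn[:, None] + mean_nn[None, :] + rho) / 3.0
    # Eq. (1): W(i, j) = exp(-rho^2 / (mu * eps)).
    return np.exp(-rho ** 2 / (mu * eps))

def full_kernel(W):
    """Eq. (3): off-diagonal entries of each row sum to 1/2, diagonal is 1/2."""
    W_off = W.copy()
    np.fill_diagonal(W_off, 0.0)
    P = W_off / (2.0 * W_off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P
```

   With this normalization every row of P sums to 1, so P can be read
   as a transition matrix over patients.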

   Matrix S encodes only the similarity of each patient to its K most
   similar patients, found by K nearest neighbors (KNN):

   \[
   S(i,j) =
   \begin{cases}
   \dfrac{W(i,j)}{\sum_{k \in N_i} W(i,k)}, & j \in N_i \\[1ex]
   0, & \text{otherwise}
   \end{cases}
   \tag{4}
   \]

   where N[i] represents the set of X[i]’s neighbors, including X[i]
   itself. Using P to capture the global structure and S to capture
   the local structure, a nonlinear iterative procedure is proposed:

   \[
   P^{(v)} = S^{(v)} \times \left( \frac{\sum_{k \neq v} P^{(k)}}{m-1} \right) \times \left( S^{(v)} \right)^{T}, \quad v = 1, 2, \ldots, m
   \tag{5}
   \]

   where P^(v) represents P calculated from the v-th data profile and
   m is the number of data types. This procedure updates every P^(v)
   at each step through m parallel interchanging diffusion processes.
   After t steps, the fused matrix P^(C) is obtained by averaging all
   P^(v).
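
   The local kernel of Eq. (4) and the cross-diffusion of Eq. (5) can
   be sketched as follows. This is a simplified illustration under
   stated assumptions: the names local_kernel and fuse and the defaults
   K = 20 and t = 20 are hypothetical, and the update is applied
   exactly as written in Eq. (5), without the renormalization between
   steps that a practical implementation may add for numerical
   stability.

```python
import numpy as np

def local_kernel(W, K=20):
    """Eq. (4): keep each patient's K most similar patients (itself included)."""
    n = W.shape[0]
    S = np.zeros_like(W, dtype=float)
    # Column indices of the K largest similarities in each row.
    nn = np.argsort(-W, axis=1)[:, :K]
    rows = np.arange(n)[:, None]
    S[rows, nn] = W[rows, nn]
    # Normalise over the retained neighbours so that each row sums to 1.
    return S / S.sum(axis=1, keepdims=True)

def fuse(P_list, S_list, t=20):
    """Eq. (5): cross-diffusion over m data types, then average the results."""
    m = len(P_list)
    P = [p.copy() for p in P_list]
    for _ in range(t):
        P_next = []
        for v in range(m):
            # Average of the status matrices of all other data types.
            others = sum(P[k] for k in range(m) if k != v) / (m - 1)
            P_next.append(S_list[v] @ others @ S_list[v].T)
        P = P_next  # all m matrices are updated in parallel
    # Fused matrix P^(C): average over the m data types.
    return sum(P) / m
```

   Here P_list and S_list hold one matrix per data type, for example
   the three continuous data types plus the smoothed mutation data.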

Network diffusion model

   We propose to apply a network diffusion model [32], incorporating a
   gene interaction network, to the mutation profile and other
   discrete data. With this method, the discrete data are “smoothed”
   and carry information about the similarity of tumor samples at the
   pathway level rather than the individual gene level, which allows
   the SNF integrative framework to work effectively on discrete data.

   We first mapped each patient's mutation profile onto a gene
   interaction network. The network diffusion model was then applied
   to diffuse the effect of each mutated gene over this network for
   each patient, according to the function:

   \[
   F_{t+1} = \alpha F_t A + (1 - \alpha) F_0
   \tag{6}
   \]

   F[0] is the binary patient-by-gene mutation matrix (Figure 7A), and
   A is a degree-normalized adjacency matrix of the gene interaction
   network (Figure 7B). α is a tuning parameter that controls the
   distance over which the mutation signal can propagate in the
   network; its optimal range is [0.5, 0.8]. The diffusion function is
   run iteratively until F[t+1] converges
   (\(\lVert F_{t+1} - F_t \rVert < 1 \times 10^{-6}\)). The resulting
   F[t+1] is a “smoothed” mutation profile with quantitative values
   indicating the influence of each mutation per patient after network
   diffusion (Figure 7C). In this way, not only mutated genes receive
   high influence scores, but also genes that are close to the mutated
   genes in the network. From this “smoothed” matrix, we measure
   patient similarity as mutational consistency at the pathway level
   rather than the individual gene level. The benefit is twofold: (i)
   “smoothing” reduces the sparseness, so traditional distance
   measures become feasible; (ii) because the network diffusion model
   looks for mutational consistency at the pathway level rather than
   the individual gene level, it gives a more comprehensive insight
   into the similarity between patients. Since tumor progression is
   driven by a combination of mutated genes whose influence
   propagates through the gene interaction network, tumor similarity
   at the pathway level is more biologically meaningful and can
   improve the identification of cancer subtypes.
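
   The diffusion of Eq. (6) can be sketched in a few lines of NumPy.
   This is an illustrative sketch under assumptions: the function names
   are hypothetical, the adjacency matrix is normalized by row degree
   (the text only says "degree-normalized"), the Frobenius norm is used
   for the convergence check, and α = 0.7 is simply one value inside
   the stated range [0.5, 0.8].

```python
import numpy as np

def degree_normalise(adj):
    """Divide each row of a 0/1 adjacency matrix by the node degree.

    Row normalisation is an assumption made for this sketch.
    """
    deg = adj.sum(axis=1).astype(float)
    deg[deg == 0] = 1.0            # isolated genes keep an all-zero row
    return adj / deg[:, None]

def diffuse(F0, A, alpha=0.7, tol=1e-6, max_iter=1000):
    """Iterate Eq. (6) until the norm of the update falls below tol."""
    F0 = np.asarray(F0, dtype=float)
    F = F0.copy()
    for _ in range(max_iter):
        F_next = alpha * (F @ A) + (1.0 - alpha) * F0
        if np.linalg.norm(F_next - F) < tol:
            return F_next
        F = F
        F = F_next
    return F

# Toy example: 4 genes, 2 patients with one mutated gene each.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]])
F0 = np.array([[1, 0, 0, 0],
               [0, 0, 0, 1]])
smoothed = diffuse(F0, degree_normalise(adj))
```

   In the toy example every gene ends up with a non-zero score for
   both patients, which is the reduction in sparseness described
   above. At convergence the iteration agrees with the fixed point
   F = (1 − α) F[0] (I − αA)^(−1), which can serve as a check on small
   networks.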

Figure 7. Simple presentation on network diffusion model.

   (A) Somatic mutation data. (B) Gene interaction network. (C) “Smoothed”
   mutation data via network diffusion model.

Evaluation metrics

   To compare the performance of our method with established methods,
   we chose two evaluation metrics. First, we used the P value of the
   log-rank test in survival analysis. The P value measures how
   significantly the survival data of the different subtypes differ:
   the lower the P value, the clearer the difference between subtypes.
   For the survival analysis we followed the method used in SNF [1],
   using the number of days to the last follow-up and the vital
   status. For COAD, however, because of many missing values, these
   were combined with the number of days to last known alive.
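
   To make the log-rank P value concrete, the sketch below implements
   the standard two-group log-rank test in plain Python. It is a
   simplified illustration, not the procedure used for the tables: the
   reported P values compare all subtypes of a data set at once, which
   requires the multi-group form of the test, and the function name
   logrank_two_groups is hypothetical.

```python
import math

def logrank_two_groups(times, events, groups):
    """Two-sample log-rank test.

    times  : follow-up time per patient (e.g. days to last follow-up or death)
    events : 1 if the patient died, 0 if censored
    groups : 0 or 1, the subtype label of each patient
    Returns (chi-square statistic, P value with 1 degree of freedom).
    """
    observed = expected = variance = 0.0
    # Loop over the distinct times at which at least one death occurred.
    for t in sorted({tt for tt, e in zip(times, events) if e}):
        at_risk = [g for tt, g in zip(times, groups) if tt >= t]
        n = len(at_risk)                 # patients still at risk
        n1 = sum(at_risk)                # of these, patients in group 1
        d = sum(1 for tt, e in zip(times, events) if tt == t and e)
        d1 = sum(1 for tt, e, g in zip(times, events, groups)
                 if tt == t and e and g == 1)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0                  # no information to separate the groups
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of the chi-square distribution with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

   A pairwise comparison such as C1 versus C3 would call this function
   with the survival times, vital status and subtype labels of the
   patients in those two subtypes only.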

   We also used the silhouette value [33] to measure the quality of
   the clustering result. The silhouette value ranges from -1 to 1,
   where a high value indicates that a patient is well matched to its
   own cluster and poorly matched to other clusters. The mean
   silhouette value over all patients was used as a measure of the
   compactness within clusters and the separation among clusters.
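
   The mean silhouette value can be computed directly from a pairwise
   distance matrix and the cluster labels, as in the sketch below. The
   function name mean_silhouette is hypothetical, and how the distance
   matrix D is derived from the patient similarity network is left
   open here, since the text does not specify it. At least two
   clusters are assumed.

```python
import numpy as np

def mean_silhouette(D, labels):
    """Mean silhouette value from a pairwise distance matrix and labels."""
    labels = np.asarray(labels)
    n = len(labels)
    scores = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False                      # exclude the patient itself
        if not same.any():
            continue                         # singleton cluster: score 0
        a = D[i, same].mean()                # mean distance within own cluster
        b = min(D[i, labels == c].mean()     # mean distance to nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()
```

   Because each patient in a very small cluster contributes a score
   based on only a few distances, the mean can shift noticeably when
   such a cluster is present.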

SUPPLEMENTARY MATERIALS FIGURES

   oncotarget-08-89021-s001.pdf (1.8MB, pdf)

Footnotes

   CONFLICTS OF INTEREST

   The authors declare that they have no competing interests.

   FUNDING

   This study was supported by the National Natural Science Foundation of
   China (No. 61672037), the Key project of Anhui Provincial Education
   Department (No. KJ2017ZD01), and the Anhui Provincial Natural Science
   Foundation (Nos. 1508085QF135 and 1608085MF136).

REFERENCES