Abstract

   Discovering the common modules that are co-expressed across various
   stages can lead to an improved understanding of the underlying
   molecular mechanisms of cancers. There is a shortage of efficient tools
   for integrative analysis of gene expression and protein interaction
   networks for discovering common modules associated with cancer
   progression. To address this issue, we propose a novel regularized
   multi-view subspace clustering (rMV-spc) algorithm to obtain a
   representation matrix for each stage and a joint representation matrix
   that balances the agreement across various stages. To handle the
   heterogeneity of the data, the protein interaction network is
   incorporated into the objective of rMV-spc via regularization. The
   resulting optimization problem is solved with the interior point
   algorithm to obtain the common modules. Using artificial networks, we
   demonstrate that the proposed algorithm is more accurate than
   state-of-the-art methods. Furthermore, rMV-spc discovers common
   modules in breast cancer networks based on breast cancer data, and
   these modules serve as biomarkers to predict stages of breast cancer.
   The proposed model and algorithm effectively integrate heterogeneous
   data for the discovery of dynamic modules.

   Keywords: conserved modules, network analysis, subspace clustering,
   regularization, protein interaction networks

1. Introduction

   Advances in biological technologies, such as RNA-seq, make it
   possible to generate genome-wide high-throughput data on various
   platforms. International consortia, such as The Cancer Genome Atlas
   (TCGA, https://cancergenome.nih.gov/) and the Encyclopedia of DNA
   Elements (ENCODE, https://www.encodeproject.org/), have generated
   large-scale heterogeneous data on, for example, gene expression, DNA
   methylation, and mutation for various cancers or tissues (cells). The
   accumulated biological data provide a great opportunity to investigate
   the mechanisms of cancers.

   Among these genomic data, gene expression has received particular
   attention because the regulation of gene expression controls the
   amount and timing of appearance of the functional product of a gene.
   Control of expression is vital to allow a cell to produce the gene
   products it needs when it needs them; in turn, this gives cells the
   flexibility to adapt to a variable environment, external signals,
   damage to the cell, and other stimuli [1,2,3]. The genes that are
   differentially expressed between two cohorts shed light on the
   regulation mechanisms of cells. For example, Li et al. [4]
   demonstrated that PE1 inhibits stem cell self-renewal in human chronic
   myelocytic leukemia. To investigate the high-order relations among
   genes, network-based analysis has been applied to gene expression,
   extracting many interesting patterns that differ from lists of
   differentially expressed genes. For instance, Langfelder et al. [5]
   proposed the weighted gene co-expression network analysis tool (WGCNA)
   to mine co-expression modules.

   Furthermore, biological networks have proven to be powerful for
   describing and analyzing profile data, where each vertex represents a
   gene and each edge corresponds to an interaction between a pair of
   genes. There are many biological networks, such as gene regulation
   networks [6], signal transduction networks [7], protein–protein
   interaction (PPI) networks [8], disease networks [9], and gene
   regulation networks [10,11,12,13,14,15]. The accumulated biological
   networks provide an opportunity to explore the mechanisms of cells via
   mining the graph patterns. Great efforts have been devoted to network
   analysis, where the graph patterns shed light on the
   structure–function relations in biology. For example, Taylor et al.
   [16] analyzed the PPI network and demonstrated that the genes with
   large degrees (hub genes) play a critical role in the prognosis of
   breast cancer. Furthermore, Chuang et al. [17] showed that the
   pathways where genes are differentially expressed between two cohorts
   of cancer patients serve as biomarkers for predicting cancer
   metastasis.

   However, the vast majority of analyses ignore the dynamics of data.
   Complex diseases, such as cancers, are dynamic and involve a continuum
   of molecular events associated with disease progression, from early
   warning events to catastrophic end-stage events [18]. Extracting
   modules associated with cancer progression is critical for
   discovering the mechanisms of cancers because these patterns provide
   clues to biologists for further research [19,20]. However, it is
   non-trivial to detect dynamic modules associated with cancer
   progression because it is difficult to characterize and extract the
   dynamics of modules. Thus, the available algorithms for dynamic
   modules differ greatly in how they define dynamic modules and in the
   strategies they use to discover the predefined patterns. Ma et al.
   [21] designed the M-Module algorithm to identify the common modules
   across various stages of breast cancer, and demonstrated that the
   dynamics of interaction strength are critical for the acceleration of
   heart failure [22]. Similar efforts have also been devoted to common
   and specific modules for breast cancer [23,24]. However, these
   algorithms only focus on extracting the common and specific modules
   associated with cancer progression. In [25], the authors developed the
   NMF-DM algorithm to investigate how a pathway dynamically recruits
   genes, for example, in cancer progression.

   However, these algorithms are based only on gene expression or DNA
   methylation data and do not integrate any other data. In fact,
   integrative analysis of omic data has been extensively studied since it
   identifies interesting patterns that cannot be obtained by analysis of
   a single type of data [26]. Compared to the gene co-expression
   network, the protein interaction network is more reliable since a
   large co-expression value between a pair of genes does not imply a
   physical interaction. Thus, the protein interaction network should be
   integrated with gene expression data to extract dynamic modules. Even
   though many algorithms have been developed to integrate protein
   interaction and gene expression data, no attempt has been made to
   identify modules associated with cancer progression. The reason is that
   the integrative analysis of these data is difficult because it involves
   both cancer progression and the heterogeneity of data.

   In this study, we address the integration of gene expression data and a
   protein interaction network to mine the dynamic modules associated with
   cancer progression. As done in [21,22], the dynamic modules are
   defined as common modules that are co-expressed across various stages.
   To analyze cancer gene expression data, we adopt the multi-view
   subspace clustering algorithm with sparsity constraints to obtain a
   representation matrix for each view and a consensus matrix, as shown in
   Figure 1 (Supplementary Materials). By effectively integrating the
   protein interaction network, we expect the joint representation matrix
   C not only to balance the agreement across various stages but also to
   preserve the topological structure of the protein interaction network.
   Therefore, the protein interaction network is incorporated into
   multi-view subspace clustering via regularization. In this way, the
   common module detection problem is transformed into a convex
   optimization problem, which is solved with the interior point
   algorithm. The experimental results demonstrate that the proposed
   algorithm is more accurate than the state of the art. The modules
   obtained by our algorithm are more enriched in known pathways and
   serve as biomarkers to predict cancer stages.

Figure 1.


   Overview of the rMV-spc algorithm, which comprises two major
   components: the regularized subspace clustering procedure, which
   obtains the subspaces for gene expression data of each clinical stage
   under regularization by the protein interaction network, and the
   module discovery procedure, which identifies common communities across
   cancer stages based on the consensus space.

   The rest of the paper is organized as follows: Section 2 presents
   the mathematical model and algorithm. The related materials are
   presented in Section 3. The experimental results are provided in
   Section 4. The conclusion is given in Section 5.

2. Methods

   The objective function and optimization procedure of the proposed
   algorithm, and the algorithm analysis, are presented in this section.
   The rMV-spc algorithm comprises two major components, as shown in
   Figure 1.

2.1. Preliminaries

   Before describing the rMV-spc procedure in detail, we introduce some
   terminology that is used throughout the following sections.

   The protein interaction network can be modeled by an unweighted and
   undirected graph $G = (V, E)$, where the vertex set
   $V = \{v_1, v_2, \ldots, v_n\}$ contains all the genes (proteins) and
   the edge set $E = \{(v_i, v_j)\}$ denotes the interactions between
   pairs of genes. The protein interaction network $G$ can be represented
   by an $n \times n$ adjacency matrix $A$, where $a_{ij} = 1$ if vertices
   $v_i$ and $v_j$ are connected, and 0 otherwise. The degree of vertex
   $v_i$ is the number of edges connected to it, i.e.,
   $d_i = \sum_j a_{ij}$. The degree matrix $D$ is the diagonal matrix
   formed from the degree sequence of $G$, i.e.,
   $D = \mathrm{diag}(d_1, \ldots, d_n)$. The trace of a matrix $W$ is the
   sum of the diagonal elements of $W$, i.e.,
   $\mathrm{trace}(W) = \sum_i w_{ii}$.
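
   The definitions above can be illustrated with a minimal sketch. The
   function names `graph_matrices` and `trace` and the toy edge list are
   hypothetical and chosen only for illustration; the matrix $L = D - A$
   computed here is the graph Laplacian that is used later for
   regularization.

```python
def graph_matrices(n, edges):
    """Return (A, D, L) as nested lists for an undirected, unweighted graph.

    n     -- number of vertices, labeled 0 .. n-1
    edges -- iterable of (i, j) vertex pairs
    """
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            continue  # self-loops are ignored in this sketch
        A[i][j] = 1
        A[j][i] = 1  # undirected: the adjacency matrix is symmetric
    degrees = [sum(row) for row in A]  # d_i = sum_j a_ij
    D = [[degrees[i] if i == j else 0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]
    return A, D, L


def trace(W):
    """Sum of the diagonal elements w_ii of a square matrix."""
    return sum(W[i][i] for i in range(len(W)))


# Toy path graph 0 - 1 - 2 - 3
A, D, L = graph_matrices(4, [(0, 1), (1, 2), (2, 3)])
```

   Every row of $L$ sums to zero by construction, which gives a quick
   sanity check when the matrices are built from a real interaction list.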

   Let $\{1, 2, \ldots, m\}$ be a finite set of cancer clinical stages,
   and let the subscript $s$ attached to a variable denote its value at
   the $s$-th stage. The gene expression for cancer with various clinical
   stages is denoted by $\mathcal{X} = \{X_1, X_2, \ldots, X_m\}$, where
   each $X_s$ is the gene expression for stage $s$. The gene expression
   data $X_s$ is an $n_s \times n$ matrix, where each row corresponds to a
   gene, each column represents a sample (patient), and element $x_{ijs}$
   denotes the expression level of the $i$-th gene in the $j$-th patient
   at stage $s$.

2.2. Procedure of Algorithm

   In single-view clustering, sparse subspace clustering (SSC) [27,28]
   represents each data point using a small number of data points from
   its own subspace. Given the data $X$, it amounts to the minimization
   problem

   $$\min_{C} \|C\|_1, \quad \text{s.t.} \quad X = XC, \quad \mathrm{diag}(C) = 0 \qquad (1)$$

   where $\|C\|_1$ is the $\ell_1$ norm, and the constraint
   $\mathrm{diag}(C) = 0$ is used to avoid trivial solutions in which a
   data point is represented as a linear combination of itself. In the
   case of corrupted data, the above problem can be rewritten as

   $$\min_{C} \|C\|_1 + \frac{\lambda_z}{2} \|Z\|_F^2, \quad \text{s.t.} \quad X = XC + Z, \quad \mathrm{diag}(C) = 0 \qquad (2)$$

   where the $\ell_1$ norm promotes sparsity of the columns of $C$, while
   the Frobenius norm favors small entries in the columns of $Z$.

   Given gene expression associated with cancer progression
   $\mathcal{X} = \{X_1, X_2, \ldots, X_m\}$, multi-view clustering finds
   representation matrices $C_1, \ldots, C_m$ across different stages and
   a joint representation matrix $C$ that balances the agreement across
   various stages [29]. According to [30], we use the centroid-based
   strategy to obtain the consensus matrix $C$ for the subspace
   clustering. Therefore, Equation (2) becomes

   $$\begin{aligned}
   \min_{C_1, \ldots, C_m, C} \quad & \sum_{s=1}^{m} \left( \|C_s\|_1 + \frac{\lambda_z}{2} \|Z_s\|_F^2 + \frac{\lambda_c}{2} \|C_s - C\|^2 \right) \\
   \text{s.t.} \quad & X_s = X_s C_s + Z_s, \quad \mathrm{diag}(C_s) = 0, \quad s = 1, \ldots, m.
   \end{aligned} \qquad (3)$$

   We present the regularized multi-view sparse subspace clustering
   (rMV-spc) algorithm to discover the common modules in multiple views of
   gene expression for cancers. However, common modules based solely on
   gene expression data assume that the genes within a module are
   co-expressed. In fact, protein interactions between genes are more
   reliable than the co-expression relation. Thus, it is promising to
   integrate the gene expression data and the protein interaction network
   to discover the common modules across cancer stages. However, the
   protein interaction network is sparse. Therefore, we also expect that
   the joint representation matrix $C$ not only balances the agreement
   across various stages but also preserves the topological structure of
   the protein interaction network $G$. According to [31], the
   local-structure-preserved embedding can be formulated in the trace
   form, which is defined as

   $$O(C, G) = \mathrm{Trace}(C' L_G C) \qquad (4)$$

   where $L_G$ is the Laplacian matrix of graph $G$, i.e.,
   $L_G = D - A$. By imposing the topology-preserving constraint, the
   model in Equation (3) is formulated as

   $$\begin{aligned}
   \min_{C_1, \ldots, C_m, C} \quad & \sum_{s=1}^{m} \left( \|C_s\|_1 + \frac{\lambda_z}{2} \|Z_s\|_F^2 + \frac{\lambda_c}{2} \|C_s - C\|^2 \right) + \lambda_G \mathrm{Trace}(C' L_G C) \\
   \text{s.t.} \quad & X_s = X_s C_s + Z_s, \quad \mathrm{diag}(C_s) = 0, \quad s = 1, \ldots, m.
   \end{aligned} \qquad (5)$$
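
   To make the terms of Equation (5) concrete, the following sketch
   evaluates the objective for given matrices. It is an illustration
   under stated assumptions rather than the implementation used in this
   study: the residual is substituted as $Z_s = X_s - X_s C_s$, the
   consensus term is taken as a squared Frobenius norm, the trace term is
   counted once, and the columns of each $X_s$ are assumed to index the
   $n$ genes so that $C_s$ is $n \times n$ and compatible with $L_G$. All
   function names are hypothetical.

```python
def matmul(A, B):
    """Plain nested-list matrix product A @ B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def sub(A, B):
    """Element-wise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def l1_norm(A):
    """Entry-wise l1 norm."""
    return sum(abs(a) for row in A for a in row)


def fro_sq(A):
    """Squared Frobenius norm."""
    return sum(a * a for row in A for a in row)


def trace_quadratic(C, L):
    """Trace(C' L C) for a square Laplacian L."""
    LC = matmul(L, C)
    Ct = [list(col) for col in zip(*C)]
    P = matmul(Ct, LC)
    return sum(P[i][i] for i in range(len(P)))


def rmv_spc_objective(X_views, C_views, C, L, lam_z=1.0, lam_c=1.0, lam_g=1.0):
    """Value of the Equation (5) objective under the assumptions above."""
    total = 0.0
    for X_s, C_s in zip(X_views, C_views):
        Z_s = sub(X_s, matmul(X_s, C_s))  # residual implied by the constraint
        total += (l1_norm(C_s)
                  + 0.5 * lam_z * fro_sq(Z_s)
                  + 0.5 * lam_c * fro_sq(sub(C_s, C)))
    return total + lam_g * trace_quadratic(C, L)


# Toy case: two genes joined by one edge, a single stage with one sample.
L_toy = [[1, -1], [-1, 1]]
X_toy = [[1.0, 1.0]]
C_toy = [[0.0, 1.0], [1.0, 0.0]]
value = rmv_spc_objective([X_toy], [C_toy], C_toy, L_toy)
```

   Tracking this value across iterations is one simple way to check that
   an alternating update scheme is not increasing the objective.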

   To solve the model in Equation (5), we adopt an alternating two-step
   procedure. Specifically, we update $C_i$ $(1 \le i \le m)$ by fixing
   $C$, and we update $C$ by fixing $C_i$ $(1 \le i \le m)$. In each
   step, the problem in Equation (5) is a convex optimization problem,
   which can be solved using convex programming algorithms [32,33], and
   the sparsity of solutions is also preferred [34,35]. In this study, we
   adopt the interior-point algorithm [32] to obtain matrix $C$.
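
   For intuition about the second step, note that once the stage
   matrices are fixed, the subproblem in $C$ is an unconstrained smooth
   quadratic. Under two assumptions that the text does not state (the
   consensus term is a squared Frobenius norm, and the trace term is
   counted once rather than once per stage), its minimizer can be written
   in closed form. The derivation below is a sketch for illustration and
   is not the interior-point procedure adopted in this study.

```latex
\begin{aligned}
f(C) &= \frac{\lambda_c}{2} \sum_{s=1}^{m} \|C_s - C\|_F^2
        + \lambda_G \,\mathrm{Trace}(C' L_G C), \\
\nabla f(C) &= \lambda_c \sum_{s=1}^{m} (C - C_s) + 2 \lambda_G L_G C = 0, \\
C &= \lambda_c \,\bigl(m \lambda_c I + 2 \lambda_G L_G\bigr)^{-1} \sum_{s=1}^{m} C_s .
\end{aligned}
```

   Because $L_G$ is symmetric and positive semidefinite, the matrix being
   inverted is positive definite whenever $\lambda_c > 0$, so this
   minimizer is unique.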

   After obtaining the consensus matrix $C$, we construct the affinity
   matrix $W$ as

   $$W = C + C'. \qquad (6)$$

   The spectral clustering algorithm is used to obtain the final modules.
   The procedure is depicted in Algorithm 1.

   Algorithm 1 The rMV-spc algorithm
   Input:
     $\mathcal{X}$: Gene expression data
     $G = (V, E)$: Protein interaction network
   Output:
     $\{V_i\}_{i=1}^{k}$: Common modules
     * 1: Update $C_s$ by fixing $C$ and $C_i$ $(i \ne s)$ based on the
       interior point algorithm [32];
     * 2: Update $C$ by fixing $C_s$ $(1 \le s \le m)$ based on the
       interior point algorithm [32];
     * 3: Normalize the columns of consensus matrix $C$;
     * 4: Construct the affinity matrix $W = C + C'$;
     * 5: Apply spectral clustering to obtain modules based on matrix $W$;
     * 6: return common modules.
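
   Steps 3 and 4 of Algorithm 1 can be sketched as follows. The text does
   not specify which norm is used to normalize the columns of $C$; this
   sketch assumes scaling by the largest absolute entry of each column,
   which is a common choice in sparse subspace clustering code. The
   function names are hypothetical.

```python
def normalize_columns(C):
    """Scale each column by its largest absolute entry (assumed norm)."""
    n_rows, n_cols = len(C), len(C[0])
    out = [row[:] for row in C]
    for j in range(n_cols):
        scale = max(abs(C[i][j]) for i in range(n_rows))
        if scale > 0:  # all-zero columns are left unchanged
            for i in range(n_rows):
                out[i][j] = C[i][j] / scale
    return out


def affinity(C):
    """Affinity matrix W = C + C' from Equation (6)."""
    n = len(C)
    return [[C[i][j] + C[j][i] for j in range(n)] for i in range(n)]


# Toy consensus matrix with zero diagonal
C = [[0.0, 0.5, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, 0.25, 0.0]]
W = affinity(normalize_columns(C))
```

   The resulting symmetric matrix can then be passed to any spectral
   clustering routine that accepts a precomputed affinity, for example
   scikit-learn's `SpectralClustering` with `affinity="precomputed"`.
   Such routines generally expect non-negative affinities, so if $C$
   contains negative entries, taking absolute values before
   symmetrization is a common variation on Equation (6).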


3. Materials

3.1. Statistical Significance of Modules

   The statistical significance of common modules is computed based on the
   null score distribution of modules generated using random
   permutation. Each gene expression profile is randomized 1000 times by
   sample shuffling. The average Pearson coefficient among the gene
   pairs within a module is used as the module score. To construct the
   null distribution of module scores, we run the proposed algorithm on
   the randomized gene expression data. Using the null distribution, the
   empirical p-value of a module is calculated as the probability of the
   module having the observed score or greater by chance. p-values are
   corrected for multiple testing using the method of Benjamini–Hochberg
   [36]. An adjusted p-value threshold of 0.05 is used for significance.
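
   The two calculations described above, the empirical p-value and the
   Benjamini–Hochberg adjustment, can be sketched in a few lines. The
   function names and the toy numbers are hypothetical; the null scores
   would in practice come from running the algorithm on the permuted
   data.

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are at least as large as the observed one."""
    hits = sum(1 for score in null_scores if score >= observed)
    return hits / len(null_scores)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values are monotone non-decreasing in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: one module score against four null scores,
# and four raw p-values to adjust.
p_single = empirical_p_value(0.8, [0.1, 0.9, 0.5, 0.8])
p_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in p_adjusted]
```

   With 1000 permutations, the smallest non-zero empirical p-value this
   estimator can return is 0.001, which bounds the resolution of the test.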

3.2. Module-Based Features for a Support Vector Machine (SVM)

   Given a module $C$, we normalize the expression level of each gene
   across all samples using the z-score transformation [17], denoted by
   $Exp_{ij}$ for the $i$-th gene and $j$-th patient. For each sample $j$,
   the activity score of the module is defined as the sum of the
   normalized expression of all genes within the module, scaled by the
   square root of the module size, i.e.,

   $$e_C = \sum_{i \in C} Exp_{ij} / \sqrt{|C|} \qquad (7)$$

   where $|C|$ is the number of genes in $C$. For each patient sample, a
   feature vector is constructed from the activity scores of all modules.
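
   Equation (7) can be illustrated with a small sketch. The expression
   matrix is assumed here to have genes as rows and samples as columns,
   and the z-score uses the population standard deviation; both are
   assumptions of this example, as are the function names.

```python
import math


def zscore_rows(expr):
    """Z-score each gene (row) across samples; constant rows map to zeros."""
    out = []
    for row in expr:
        mean = sum(row) / len(row)
        std = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        out.append([(x - mean) / std if std > 0 else 0.0 for x in row])
    return out


def module_activity(z, module):
    """Per-sample activity score: sum of z-scores over the module / sqrt(|C|)."""
    k = len(module)
    n_samples = len(z[0])
    return [sum(z[i][j] for i in module) / math.sqrt(k)
            for j in range(n_samples)]


# Toy data: 4 genes (rows) measured in 2 samples (columns)
expr = [[1.0, 3.0],
        [2.0, 6.0],
        [5.0, 5.0],
        [0.0, 4.0]]
z = zscore_rows(expr)
scores = module_activity(z, module=[0, 1, 2, 3])
```

   Stacking the scores of all modules for one sample yields the feature
   vector that is passed to the SVM.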

3.3. Normalized Mutual Information

   The normalized mutual information (NMI) [37] is based on the
   confusion matrix $N$, whose rows correspond to the real modules in the
   standard partition $P^*$ and whose columns correspond to the modules in
   the obtained partition $P$. The element $N_{ij}$ is the number of
   vertices shared by the $i$-th real module and the $j$-th predicted
   module. The NMI is defined as

   $$NMI(P, P^*) = \frac{-2 \sum_{i=1}^{|P^*|} \sum_{j=1}^{|P|} N_{ij} \log\left(\frac{N_{ij} N}{N_{i.} N_{.j}}\right)}{\sum_{i=1}^{|P^*|} N_{i.} \log\left(\frac{N_{i.}}{N}\right) + \sum_{j=1}^{|P|} N_{.j} \log\left(\frac{N_{.j}}{N}\right)}$$

   where $|P|$ is the number of modules in $P$, $N_{i.}$ is the sum of
   the $i$-th row of the matrix, $N_{.j}$ is the sum of the $j$-th
   column, and $N$ inside the logarithms denotes the sum of all entries
   of the matrix, i.e., the total number of vertices.
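
   The NMI formula can be computed directly from two label vectors
   without building the confusion matrix explicitly. The sketch below is
   a minimal illustration; the function name `nmi` and the convention of
   returning 1.0 when both partitions consist of a single module are
   choices made for this example.

```python
import math
from collections import Counter


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two partitions of the same vertices."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))  # non-zero N_ij entries
    row = Counter(true_labels)                      # N_i. (real modules)
    col = Counter(pred_labels)                      # N_.j (predicted modules)
    numerator = -2.0 * sum(
        nij * math.log(nij * n / (row[i] * col[j]))
        for (i, j), nij in joint.items()
    )
    denominator = (sum(ni * math.log(ni / n) for ni in row.values())
                   + sum(nj * math.log(nj / n) for nj in col.values()))
    if denominator == 0:
        return 1.0  # both partitions are a single module (assumed convention)
    return numerator / denominator


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])         # identical up to relabeling
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # partitions share no information
```

   Entries with $N_{ij} = 0$ contribute nothing to the numerator, which
   is why iterating only over the observed label pairs is sufficient.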

3.4. Artificial Networks

   The GN benchmark, in which each network consists of 128 nodes
   grouped into 4 clusters of equal size, was introduced in [38]. Every
   node has an average degree of 16 and shares $Z_{out}$ edges with nodes
   outside of the module to which it belongs. As parameter $Z_{out}$
   increases from 1 to 8, the detection of clusters in the networks
   becomes increasingly difficult. In this study, we combine three GN
   networks to construct the artificial networks used to test the
   performance of the proposed algorithm, where the first two networks
   are used as the multiple views and the last network is used for the
   regularization.
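
   A GN-style network can be generated with a short script. The sketch
   below uses a common construction in which edges are placed
   independently, with probabilities chosen so that each node has on
   average $16 - Z_{out}$ neighbors inside its module and $Z_{out}$
   outside. The function name, the seed handling, and this particular
   construction are assumptions of the example and may differ from the
   generator used in this study.

```python
import random


def gn_benchmark(z_out, seed=0, n_nodes=128, n_groups=4, avg_degree=16):
    """Return (labels, edges) for one GN-style benchmark network."""
    rng = random.Random(seed)
    size = n_nodes // n_groups                 # 32 nodes per module
    labels = [v // size for v in range(n_nodes)]
    z_in = avg_degree - z_out
    p_in = z_in / (size - 1)                   # edge probability inside a module
    p_out = z_out / (n_nodes - size)           # edge probability between modules
    edges = []
    for u in range(n_nodes):
        for v in range(u + 1, n_nodes):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return labels, edges


# One network with Z_out = 6; different seeds give independent networks
# that share the same planted modules.
labels, edges = gn_benchmark(z_out=6, seed=1)
```

   Calling the function three times with different seeds gives two views
   and one regularization network with identical ground-truth labels.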

3.5. Breast Cancer Gene Expression Data

   The gene expression data for breast cancer is downloaded from the TCGA
   Data Portal, where the clinical stage information for patients is also
   available. The RPKM values (RNA-seq IlluminaHiSeq_RNASeq with level 3)
   are used. There are 809 samples across four stages (Stage I: 129, Stage
   II: 458, Stage III: 209, Stage IV: 13).

3.6. Protein Interaction Network

   The protein interaction network is downloaded from the BioGrid
   database (https://thebiogrid.org/) and comprises 22,365 proteins
   (genes) and 437,751 interactions among genes, of which 435,543 are
   physical interactions and 2208 are genetic interactions.

4. Results

   To validate the performance of the proposed algorithm, three
   state-of-the-art algorithms are selected for comparison on both
   artificial data and breast cancer data. The compared algorithms are the
   M-Module algorithm [21], multi-view clustering (MV-NMF) [39], and
   spectral clustering [40]. Note that spectral clustering cannot be
   applied to multiple networks directly. Thus, we apply spectral
   clustering to each network and then combine the results across
   networks using consensus clustering (CSC).

   Two types of datasets, including both the artificial and real breast
   cancer data, are employed for a comparison between various algorithms.
   The artificial networks are adopted to test the accuracy of the rMV-spc
   algorithm, and the breast cancer data are used to determine the
   applicability of the proposed algorithm in discovering common modules
   in real networks with strong backgrounds.

4.1. Benchmarking Performance on the Artificial Networks

   In the artificial networks, we combine three GN networks, where the
   first two networks are used for multiple views and the remaining one is
   used for regularization (Materials). To increase the difficulty in
   discovering the common modules, we increase the parameter $Z_{out}$
   from 1 to 8 while we fix $Z_{out}$ as 6. To quantify the performance of
   algorithms, the normalized mutual information (NMI) is adopted since
   the community structure is known in the artificial networks
   (Materials).

   Prior to giving the performance of algorithms, we first investigate how
   the parameter affects the performance of the proposed algorithm. Notice
   that there are three involved parameters: parameter
   [MATH: <mrow><msub><mi>λ</mi><mi>Z</mi></msub></mrow> :MATH]
   controls the importance of the regularizer of factorization, parameter
   [MATH: <mrow><msub><mi>λ</mi><mi>C</mi></msub></mrow> :MATH]
   determines the tradeoff between the consensus matrix among multiple
   views, and parameter
   [MATH: <mrow><msub><mi>λ</mi><mi>G</mi></msub></mrow> :MATH]
   denotes the importance of the network for regularization. Similar to
   [[90]41], we assume that these parameters are equal since we
   hypothesize that all items for regularization are equally important. By
   setting parameter
   [MATH:
   <mrow><mrow><mi>λ</mi><mo>∈</mo><mo>{</mo><msup><mn>10</mn><mrow><mo>−<
   /mo><mn>2</mn></mrow></msup><mo>,</mo><msup><mn>10</mn><mrow><mo>−</mo>
   <mn>1</mn></mrow></msup><mo>,</mo><msup><mn>10</mn><mn>0</mn></msup><mo
   >,</mo><msup><mn>10</mn><mn>1</mn></msup><mo>,</mo><msup><mn>10</mn><mn
   >2</mn></msup><mo>}</mo></mrow></mrow> :MATH]
   , we check how the accuracy of the proposed algorithm changes as
   parameter
   [MATH:
   <mrow><msub><mi>Z</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow></msub
   ></mrow> :MATH]
   increases from 1 to 8 in terms of NMI, which is shown in [91]Figure 2A.
   As
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   increases from
   [MATH:
   <mrow><msup><mn>10</mn><mrow><mo>−</mo><mn>2</mn></mrow></msup></mrow>
   :MATH]
   to
   [MATH: <mrow><msup><mn>10</mn><mn>0</mn></msup></mrow> :MATH]
   , the accuracy of the rMV-spc algorithm increases and achieves the best
   performance at
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   = 1. The reason is that, when
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   is small, the objective function is dominated by subspace clustering,
   and the contribution of the regularization terms is negligible. As
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   increases, the contribution of the regularization terms becomes
   increasingly important, which improves the accuracy of rMV-spc. As
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   increases from
   [MATH: <mrow><msup><mn>10</mn><mn>0</mn></msup></mrow> :MATH]
   to
   [MATH: <mrow><msup><mn>10</mn><mn>2</mn></msup></mrow> :MATH]
   , the accuracy of the proposed algorithm decreases dramatically. The
   reason is that, as
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   continues to increase, the objective function of rMV-spc is dominated
   by the regularization, which degrades the performance of the
   algorithm. Furthermore, the proposed algorithm is robust since its
   accuracy is stable for a wide range of
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   values. In all experiments, we set
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   = 1.

Figure 2.

   Parameter effect and performance of the compared algorithms on
   artificial data. (A) Parameter effect: how the NMI changes as parameter
   [MATH: <mrow><mi>λ</mi></mrow> :MATH]
   increases from
   [MATH:
   <mrow><msup><mn>10</mn><mrow><mo>−</mo><mn>2</mn></mrow></msup></mrow>
   :MATH]
   to
   [MATH: <mrow><msup><mn>10</mn><mn>2</mn></msup></mrow> :MATH]
   . (B) Performance of the compared algorithms on the simulated data as a
   function of parameter
   [MATH:
   <mrow><msub><mi>Z</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow></msub
   ></mrow> :MATH]
   , where NMI is used as the performance measure.

   We compare the MV-NMF, CSC, M-Module, and rMV-spc algorithms on the
   artificial networks in terms of accuracy, which is shown in [94]Figure
   2B. The panel shows that the proposed algorithm achieves the
   best performance, followed by M-Module, MV-NMF, and CSC. While the
   M-Module is inferior to the rMV-spc algorithm, it is much better than
   the others. There are two possible reasons why the proposed algorithm
   outperforms the other methods. First, the subspaces are more precise in
   characterizing the module structure in multiple view data compared with
   the data in the original space. Second, the proposed algorithm
   incorporates both the subspace and topological information, which
   provides a better way to characterize the structure of common modules.
   Moreover, the performance of all algorithms decreases dramatically
   as
   [MATH:
   <mrow><msub><mi>Z</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow></msub
   ></mrow> :MATH]
   increases from 1 to 8 because the module structure becomes fuzzy as
   [MATH:
   <mrow><msub><mi>Z</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow></msub
   ></mrow> :MATH]
   increases. For example, the NMI is about 1 when
   [MATH:
   <mrow><mrow><msub><mi>Z</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow>
   </msub><mo>≤</mo><mn>4</mn></mrow></mrow> :MATH]
   . When
   [MATH:
   <mrow><mrow><msub><mi>Z</mi><mrow><mi>o</mi><mi>u</mi><mi>t</mi></mrow>
   </msub><mo>></mo><mn>4</mn></mrow></mrow> :MATH]
   , the NMI value decreases dramatically.

4.2. Benchmarking Performance on the Breast Cancer Networks

   The artificial data is used to test the accuracy of the proposed
   algorithm in detecting the common modules. To check whether the
   proposed algorithm can also identify common modules across various
   clinical stages in data with a biological background, we apply it to
   the TCGA breast cancer data.

   Because the true modules are unknown, multiple reference pathway
   annotations, including Gene Ontology [[95]42], KEGG [[96]43], and
   BioCarta [[97]44], are used to determine the effectiveness of the
   algorithms through enrichment analysis (Materials). To evaluate
   the performance, we use specificity and sensitivity to quantify the
   accuracy, where specificity is defined as the fraction of the predicted
   modules that significantly overlap with at least one reference
   pathway, while sensitivity is defined as the fraction of the reference
   pathways that significantly overlap with at least one predicted
   module. [98]Figure 3A,B shows that the rMV-spc algorithm achieves
   higher specificity than the other methods while maintaining comparable
   sensitivity. Specifically, the specificity values of rMV-spc are
   76.9%, 80.3%, and 81.7% for the GO, KEGG, and BioCarta pathways,
   respectively, while those of the M-Module algorithm are 72.4%, 74.4%,
   and 76.5%. The results demonstrate that the common modules obtained
   by the proposed method are more enriched in the known pathways than
   those obtained by the others. Notice that the rMV-spc algorithm is
   inferior to M-Module in terms of sensitivity. We check the significance
   of the differences between rMV-spc and M-Module in specificity and
   sensitivity using Fisher’s exact test with a cutoff of 0.05. The
   results demonstrate that the difference in specificity is significant,
   while the difference in sensitivity is not.
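
   The two measures defined above can be sketched as follows. This is an
   illustration only: it assumes that "significant overlap" is decided by
   a one-sided hypergeometric test on uncorrected p-values with a cutoff
   of 0.05, whereas the actual enrichment procedure is the one described
   in Materials. The function names and the toy gene sets are
   hypothetical.

```python
from math import comb


def hypergeom_tail(overlap, module_size, pathway_size, n_genes):
    """P(X >= overlap) when module_size genes are drawn without
    replacement from n_genes, of which pathway_size are in the pathway."""
    upper = min(module_size, pathway_size)
    total = comb(n_genes, module_size)
    hits = sum(
        comb(pathway_size, k) * comb(n_genes - pathway_size, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total


def specificity_sensitivity(modules, pathways, n_genes, alpha=0.05):
    """modules, pathways: lists of gene sets over a universe of n_genes.

    Specificity: fraction of modules enriched in at least one pathway.
    Sensitivity: fraction of pathways enriched in at least one module.
    Assumption: no multiple-testing correction is applied.
    """
    enriched_modules, enriched_pathways = set(), set()
    for i, module in enumerate(modules):
        for j, pathway in enumerate(pathways):
            overlap = len(module & pathway)
            if overlap == 0:
                continue
            p_value = hypergeom_tail(overlap, len(module), len(pathway), n_genes)
            if p_value < alpha:
                enriched_modules.add(i)
                enriched_pathways.add(j)
    return (len(enriched_modules) / len(modules),
            len(enriched_pathways) / len(pathways))


# Toy usage: one of two modules matches one of three pathways.
toy_modules = [set(range(0, 10)), set(range(50, 60))]
toy_pathways = [set(range(0, 12)), set(range(90, 100)), set(range(20, 30))]
toy_spec, toy_sens = specificity_sensitivity(toy_modules, toy_pathways, n_genes=100)
```

   Note that specificity in this sense is a property of the predicted
   modules and sensitivity a property of the reference pathways, so an
   algorithm that returns few, highly enriched modules can score high on
   the former and low on the latter.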

Figure 3.

   Performance of the compared algorithms on the TCGA breast cancer data.
   (A) Specificity of the modules obtained by various algorithms in the
   known pathway enrichment analysis. (B) Sensitivity of the modules
   obtained by various algorithms in the known pathway enrichment
   analysis. (C) Specificity of modules obtained by the proposed
   algorithm with and without the regularization of the protein
   interaction network. (D) Sensitivity of modules obtained by the
   proposed algorithm with and without the regularization of the
   protein interaction network. The * denotes that the difference is
   significant using Fisher’s exact test with a cutoff of
   [MATH: <mrow><mrow><mn>0.05</mn></mrow></mrow> :MATH]
   .

   The proposed algorithm integrates both the gene expression and protein
   interaction networks. We then ask what the difference is if the protein
   interaction network is not integrated. The specificity and sensitivity
   of modules are shown in [101]Figure 3C,D. The panels show
   that the integration of the protein interaction network increases the
   percentage of modules that are enriched in known pathways. The results
   demonstrate that the integration is promising in identifying the common
   modules associated with cancer progression.

4.3. Common Modules Serve as Biomarkers to Predict Breast Cancer Stages

   It has been shown that the hub genes [[102]16] and modules
   [[103]17,[104]21] are predictive for the breast cancer diagnosis. Thus,
   we hypothesize that the common modules can also be used to predict the
   stages of breast cancer. Following [[105]17], we construct module-based
   features to predict the stages of breast cancer (Materials). For each
   module, we construct a feature vector that is the average expression
   of the genes within the module. Based on the feature vectors, we use
   an SVM to predict the stage of cancer.
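
   The module-based features can be sketched as below. The data layout (a
   dictionary from gene name to per-sample expression values) and the
   function name are assumptions made for illustration, as is the choice
   to skip genes that are absent from the expression data.

```python
def module_features(expression, modules):
    """Average the expression of each module's genes, per sample.

    expression: dict mapping gene name -> list of per-sample values.
    modules: list of gene-name lists.
    Returns one row per sample with one column per module.
    Assumption: genes missing from `expression` are ignored, and a
    module with no measured gene yields a zero column.
    """
    n_samples = len(next(iter(expression.values())))
    features = [[0.0] * len(modules) for _ in range(n_samples)]
    for m, genes in enumerate(modules):
        present = [g for g in genes if g in expression]
        if not present:
            continue
        for s in range(n_samples):
            total = sum(expression[g][s] for g in present)
            features[s][m] = total / len(present)
    return features


# Toy usage with two samples and two modules.
toy_expression = {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [10.0, 20.0]}
toy_features = module_features(toy_expression, [["A", "B"], ["C", "missing"]])
```

   Each row of the result is the input for one patient sample, so the
   number of features equals the number of modules rather than the number
   of genes.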

   For a baseline comparison, we compare the classification accuracy
   obtained with the following feature sets: modules generated by other
   algorithms, size-matched differentially expressed genes, and randomly
   selected genes. We trained a support vector machine (SVM) classifier
   to perform multi-class classification, and used accuracy (the
   percentage of patients that are correctly classified) to measure
   performance. The results on the TCGA breast cancer data using five-fold
   cross validation are presented in [106]Figure 4A. The modules obtained
   by our algorithm are more discriminative than the others.
   Specifically, the rMV-spc algorithm has significantly higher accuracy
   than the M-Module (74.5% vs. 71.3%). These results demonstrate that the
   common modules obtained by rMV-spc capture the specificity of pathways
   as breast cancer progresses.
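
   The evaluation protocol can be sketched as follows. To keep the
   example free of dependencies, a nearest-centroid classifier is swapped
   in for the SVM used in the study; any classifier with the same
   `fit_predict` interface (a hypothetical convention introduced here)
   could be substituted. The fold assignment and the random seed are also
   assumptions.

```python
import random


def k_fold_accuracy(features, labels, fit_predict, k=5, seed=0):
    """Mean accuracy over k folds (requires at least k samples).

    fit_predict(train_x, train_y, test_x) must return one predicted
    label per row of test_x.
    """
    indices = list(range(len(labels)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in indices if i not in held_out]
        predictions = fit_predict(
            [features[i] for i in train],
            [labels[i] for i in train],
            [features[i] for i in fold],
        )
        correct = sum(p == labels[i] for p, i in zip(predictions, fold))
        accuracies.append(correct / len(fold))
    return sum(accuracies) / k


def nearest_centroid(train_x, train_y, test_x):
    """Stand-in classifier (not the SVM of the study): assign each test
    row to the class whose mean training vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def closest(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    return [closest(x) for x in test_x]


# Toy usage: two well-separated stages described by one module feature.
toy_x = [[0.0], [0.1], [0.2], [0.3], [0.4], [10.0], [10.1], [10.2], [10.3], [10.4]]
toy_y = ["I"] * 5 + ["II"] * 5
toy_accuracy = k_fold_accuracy(toy_x, toy_y, nearest_centroid, k=5)
```

   Repeating the call with different seeds gives independent
   cross-validation runs, from which a mean and standard deviation of the
   accuracy can be reported.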

Figure 4.

   Subtype-specific methylation modules improve the accuracy of breast
   cancer stage classification using 50 independent 5-fold cross
   validations. (A) Classification accuracy of breast cancer stages using
   different feature sets, including the stage-specific modules obtained
   by various algorithms. Accuracy is defined as the percentage of patient
   samples correctly classified. The Y-axis is the accuracy and the error
   bars denote the standard deviation. (B) External validation by training
   on TCGA data and testing on the external data.

   To further validate the performance of various algorithms, we evaluated
   the performance of the SVM classifiers by using external data
   ([109]GSE5874). We trained the SVM classifier on the TCGA data and
   tested it on an external microarray dataset. Consistent results
   indicate that the performance is not due to hidden confounding factors
   in the TCGA dataset ([110]Figure 4B). The accuracy of rMV-spc is 51.4%,
   while the accuracies of the M-Module, MV-NMF, CSC, and DGis are 49.8%,
   44.9%, 41.3%, and 38.7%, respectively. The results show that the
   proposed algorithm is better than the available approaches in
   discovering common modules in data integration.

5. Conclusions

   Advances in biological technologies make it possible to generate
   multiple genomic profiles of biological samples under various
   conditions. How to integrate the heterogeneous genomic data to extract
   patterns is critical since these patterns may shed light on the
   mechanisms of cancers. Even though many algorithms have been devoted to
   the integrative analysis of omic data, few attempts have been made to
   simultaneously integrate heterogeneous and time-series gene expression
   data.

   To address this issue, we provide a novel algorithm that considers
   the time and heterogeneity factors at the same time. In this study,
   the gene expression profiles associated with cancer progression are
   projected to subspaces based on subspace clustering. To incorporate
   the protein interaction network, we treat it as a regularizer with
   the immediate purpose of alleviating the effects of heterogeneity.
   The experimental results demonstrate that the proposed
   algorithm is promising in discovering common modules across various
   cancer stages. We see ample opportunities to improve on the basic
   concept of rMV-spc in future work. For example, we can extend the
   algorithm by integrating more heterogeneous data, such as DNA copy
   number variation and methylation.

Acknowledgments