Abstract
Discovering the common modules that are co-expressed across various
stages can lead to an improved understanding of the underlying
molecular mechanisms of cancers. There is a shortage of efficient tools
for integrative analysis of gene expression and protein interaction
networks for discovering common modules associated with cancer
progression. To address this issue, we propose a novel regularized
multi-view subspace clustering (rMV-spc) algorithm to obtain a
representation matrix for each stage and a joint representation matrix
that balances the agreement across various stages. To avoid the
heterogeneity of data, the protein interaction network is incorporated
into the objective of rMV-spc via regularization. Based on the interior
point algorithm, we solve the optimization problem to obtain the common
modules. By using artificial networks, we demonstrate that the proposed
algorithm outperforms state-of-the-art methods in terms of accuracy.
Furthermore, the rMV-spc discovers common modules in breast cancer
networks based on the breast data, and these modules serve as
biomarkers to predict stages of breast cancer. The proposed model and
algorithm effectively integrate heterogeneous data for dynamic modules.
Keywords: conserved modules, network analysis, subspace clustering,
regularization, protein interaction networks
1. Introduction
The advances in biological technologies, such as the RNA-seq, make it
possible to generate genome-wide high-throughput data with various
platforms. The world consortia, such as The Cancer Genome Atlas (TCGA)
[26]https://cancergenome.nih.gov/ and the Encyclopedia of DNA Elements
(ENCODE) [27]https://www.encodeproject.org/, have generated large-scale
heterogeneous data on, for example, gene expression, DNA methylation,
and mutation for various cancers or tissues (cells). The accumulated
biological data provides a great opportunity to investigate the
mechanisms of cancers.
Among these genomic data, great efforts have been devoted to the
analysis of gene expression because regulation of gene expression
refers to the control of the amount and timing of appearance of the
functional product of a gene. Control of expression is vital to allow a
cell to produce the gene products it needs when it needs them; in turn,
this gives cells the flexibility to adapt to a variable environment,
external signals, damage to the cell, and other stimuli
[[28]1,[29]2,[30]3]. The differentially expressed genes between two
cohorts shed light on revealing the regulation mechanisms of cells. For
example, Li et al. [[31]4] demonstrated that PE1 inhibits stem cell
self-renewal in human chronic myelocytic leukemia. To investigate the
high-order relation among genes, network-based analysis has been
devoted to gene expression, which extracts many interesting patterns
that are different from differentially expressed genes. For instance,
Langfelder et al. [[32]5] proposed the weighted gene co-expression
network analysis tool (WGCNA) to mine the co-expression modules.
Furthermore, biological networks have been proven to be powerful for
describing and analyzing profile data, where each vertex represents a
gene and each edge corresponds to an interaction between a pair of
genes. There are many biological networks, such as gene regulation
networks [[33]6], signal transduction networks [[34]7], protein–protein
interaction (PPI) networks [[35]8], disease networks [[36]9], and gene
regulation networks [[37]10,[38]11,[39]12,[40]13,[41]14,[42]15]. The
accumulated biological networks provide an opportunity to explore the
mechanisms of cells via mining the graph patterns. Great efforts have
been devoted to network analysis, where the graph patterns shed light
on the structure–function relations in biology. For example, Taylor et
al. [[43]16] analyzed the PPI network and demonstrated that the genes
with large degrees (hub genes) play a critical role in the prognosis of
breast cancer. Furthermore, Chuang et al. [[44]17] showed that the
pathways where genes are differentially expressed between two cohorts
of cancer patients serve as biomarkers for predicting cancer
metastasis.
However, a vast majority of analysis ignores the dynamics of data.
Complex diseases, such as cancers, are dynamic and involve a continuum
of molecular events associated with disease progression, from early
warning events to catastrophic end-stage events [[45]18]. How to
extract modules associated with cancer progression is critical for
discovering the mechanisms of cancers because these patterns provide
clues for biologists for further research [[46]19,[47]20]. However, it
is non-trivial to detect dynamic modules associated with cancer
progression because it is difficult to characterize and extract
dynamics of modules. Thus, the available algorithms for the dynamic
modules differ greatly in terms of how to define dynamic modules and
the strategies to discover the predefined patterns. Ma et al. [[48]21]
designed the M-Module algorithm to the common modules across various
stages of breast cancer, and demonstrated that the dynamics of
interaction strength is critical for the acceleration of heart failure
[[49]22]. Similar efforts have also been devoted to common and specific
modules for breast cancer [[50]23,[51]24]. However, these algorithms
only focus on extracting the common and specific modules associated
with cancer progression. In [[52]25], the authors developed the NMF-DM
algorithm to investigate how the pathway dynamically recruits genes,
for example, in cancer progression.
However, these algorithms are only based on gene expression or DNA
methylation data and do not integrate any other data. In fact,
integrative analysis of omic data has been extensively studied since it
identifies interesting patterns that cannot be obtained by analysis of
a single type of data [[53]26]. Compared to the gene co-expression
network, the protein interaction network is more reliable since the
large co-expression value between a pair of genes does not imply
physical interaction. Thus, the protein interaction network should be
integrated with gene expression data to extract dynamic modules. Even
though many algorithms have been developed to integrate protein
interaction and gene expression data, no attempt has been made to
identify modules associated with cancer progression. The reason is that
the integrative analysis of these data is difficult because it involves
both the breast progression and heterogeneity of data.
In this study, we address the integration of gene expression data and a
protein interaction network to mine the dynamic modules associated with
cancer progression. As done in [[54]21,[55]22], the dynamic modules are
defined as common modules that are co-expressed across various stages.
To analyze cancer gene expression data, we adopt the multi-view
subspace clustering algorithm with sparsity constraints to obtain a
representation matrix for each view and a consensus matrix, as shown in
[56]Figure 1 ([57]Supplementary Materials). By effectively integrating
the protein interaction networks, we expected that the joint
representation matrix C would not only balance the agreement across
various stages but also preserve the topological structure of the
protein interaction network. Therefore, the protein interaction network
was incorporated into multi-view subspace clustering via
regularization. In this way, the common module detection problem is
transformed into a convex optimization. The interior point algorithm
was used for convex optimization. The experimental results demonstrate
that the proposed algorithm is more accurate than the state of the art.
The modules obtained by our algorithm are more enriched by the known
pathways and serve as biomarkers to predict cancer stages.
Figure 1.
[58]Figure 1
[59]Open in a new tab
Overview of the rMV-s2c algorithm, which comprises two major
components, namely, the regularized subspace clustering procedure,
which obtains the subspaces for gene expression data of each clinical
stage by regularizing the protein interaction network, and the module
discovery procedure, which identifies common communities across cancer
stages based on the consensus space.
The rest of the paper is organized as follows: [60]Section 2 proposes
the mathematical model and algorithm. The related materials are
presented in [61]Section 3. The experimental results are provided in
[62]Section 4. The conclusion is discussed in [63]Section 5.
2. Methods
The objective function and optimization procedure of the proposed
algorithm, and the algorithm analysis, are presented in this section.
The rMV-spc algorithm comprises two major components as shown in
[64]Figure 1.
2.1. Preliminaries
Prior to giving the detailed description of the procedure of rMV-spc,
let us introduce some terminologies that are widely used in the
forthcoming sections.
The protein interaction network can be modeled by an unweighted and
undirected graph
[MATH:
G=(V,E) :MATH]
, where the vertex set
[MATH:
V={v1,v2,…
,vn} :MATH]
contains all the genes (proteins) and the edge set
[MATH:
E={(v
mi>i,vj)} :MATH]
denotes the interaction between a pair of genes. The protein
interaction network G can be represented by an
[MATH: n×n :MATH]
adjacency matrix A, where
[MATH:
aij
:MATH]
=1 if vertex
[MATH: vi :MATH]
and
[MATH: vj :MATH]
are connected, 0 otherwise. The degree of vertex
[MATH: vi :MATH]
is the number of edges connected to it, i.e.,
[MATH:
di=∑
jaij :MATH]
. The degree matrix D is the diagonal matrix with a degree sequence of
G, i.e.,
[MATH:
D=diag(d1,…,
mo>dn) :MATH]
. The trace of a matrix W is the sum of diagonal elements of W, i.e.,
[MATH:
trace(W)=∑i<
/msub>wij
mrow> :MATH]
.
Let
[MATH:
{1,2,…,m} :MATH]
be a finite set of cancer clinical stages and the attached subscript s
be the value of the variable at the s-th stage. The gene expression for
cancer with various clinical stages
[MATH: X={X1,X2,…,Xm}
:MATH]
, where each
[MATH: Xi :MATH]
is the gene expression for the stage S. The gene expression data
[MATH: Xs :MATH]
is an
[MATH:
ns×n :MATH]
matrix, where each row corresponds a gene, each column represents a
sample (patient), and element
[MATH:
xijs :MATH]
denotes the expression level of the j-th patients in the i-th gene at
stage s.
2.2. Procedure of Algorithm
In the single-view clustering, the sparse subspace clustering (SSC)
[[65]27,[66]28] represents each data point using a small number of data
points from its own subspace. Given the data X, it amounts to the
minimization problem as
[MATH:
minC∥C∥1,s.t.X=XC,diag(C)=0
:MATH]
(1)
where
[MATH:
∥C∥1 :MATH]
is the
[MATH: l1 :MATH]
norm, and constraint
[MATH:
diag(C)=0 :MATH]
is used to avoid trivial solutions where a data point is represented as
a linear combination of itself. In the case of the corrupted data, the
above equation can be re-written as
[MATH:
minC∥C∥1+λz2∥Z∥F2,s
mi>.t.X=XC+Z,diag(C)=0
:MATH]
(2)
where the
[MATH: l1 :MATH]
norm promotes sparsity of the columns of C, while the Frobenius norm
favors small entries in the columns of Z.
Given gene expression associated with cancer progression
[MATH: X={X1,X2,…,Xm}
:MATH]
, the multi-view clustering finds representation matrices
[MATH:
C1,…,<
/mo>Cm :MATH]
across different stages and a joint representation matrix C that
balance the agreement across various stages [[67]29]. According to
[[68]30], we use the centroid based strategy to obtain the consensus
matrix C for the subspace clustering. Therefore, Equation ([69]2)
becomes
[MATH: minC1
,…,Cm<
mo>,C∑s=1m∥
Cs∥1+λz2<
mrow>∥Zs<
mo>∥F2+λc2∥
Cs−C∥2
mn>s.t.Xs=XsCs+Zs,diag(Cs)=
0,s=1,…,m. :MATH]
(3)
We present the regularized multi-view sparse subspace clustering
(rMV-spc) algorithm to discover the common modules in multiple views of
gene expression for cancers. However, the common modules solely based
on gene expression data assume that the genes within a module are
co-expressed. In fact, protein interactions between genes are more
reliable than the co-expression relation. Thus, it is promising to
integrate the gene expression and protein interaction network to
discover the common modules across cancer stages. However, the protein
interaction network is sparse. Therefore, we also expect that the joint
representation matrix C not only balances the agreement across various
stages but also preserves the topological structure of protein
interaction network G. According to [[70]31], the
local-structure-preserved embedding can be formulated as the trace
form, which is defined as
[MATH:
O(C,G)=Trace
mi>(C′LGC) :MATH]
(4)
where
[MATH: LG :MATH]
is the Laplacian matrix of graph G, i.e.,
[MATH:
LG=D−<
/mo>A :MATH]
. By imposing the topology preserving constraint, the model in Equation
([71]3) is formulated as
[MATH: minC1
,…,Cm<
mo>,C∑s=1m∥
Cs∥1+λz2<
mrow>∥Zs<
mo>∥F2+λc2∥
Cs−C∥2
mn>+λGTrace(C′LGC)s.t.Xs=XsCs+Zs,diag(Cs)=
0,s=1,…,
m. :MATH]
(5)
To solve the model in Equation ([72]5), we adopt an alternative
two-step procedure. Specifically, we update
[MATH:
Ci(1
≤i≤m)
:MATH]
by fixing C, while we update C by fixing
[MATH:
Ci(1
≤i≤m)
:MATH]
. In each procedure, the problem in Equation ([73]5) is a convex
optimization, which can be solved using the convex programming
algorithms [[74]32,[75]33], and the sparsity of solutions is also
preferred [[76]34,[77]35]. In this study, we adopt the interior-point
algorithm [[78]32] to obtain matrix C.
After obtaining the consensus matrix C, we construct the affinity
matrix W as
[MATH:
W=C+C′. :MATH]
(6)
The spectral clustering algorithm is used to obtain the final modules.
The procedure is depicted in Algorithm 1.
Algorithm 1 The rMV-spc algorithm
Input:
[MATH: X :MATH]
: Gene expression data
[MATH:
G=(V,E) :MATH]
: Protein interaction network
Output:
[MATH:
{Vi}
mo>i=1k :MATH]
: Common modules
* 1:
Update
[MATH: Cs :MATH]
by fixing C and
[MATH:
Ci(i<
/mi>≠s) :MATH]
based on the interior point algorithm [[79]32]
* 2:
Update C by fixing
[MATH:
Cs(1<
/mn>≤m) :MATH]
based on the interior point algorithm [[80]32];
* 3:
Normalize the columns of consensus matrix C;
* 4:
Construct the affinity matrix
[MATH:
W=C+C′ :MATH]
;
* 5:
Apply spectral clustering to obtain modules based on matrix W;
* 6:
return common modules.
[81]Open in a new tab
3. Materials
3.1. Statistical Significance of Modules
The statistical significance of common modules is computed based on the
null score distribution of modules generated using randomized
permutation. Each gene expression is completely randomized 1000 times
by sample shuffling. The average Pearson coefficient among the gene
pair with the module is used as the module score. To construct the null
distribution for module scores, we perform the proposed algorithm on
the randomized gene expression data. Using the null distribution, the
empirical p-value of a module is calculated as the probability of the
module having the observed score or greater by chance. p-values are
corrected for multiple testing using the method of Benjamini–Hochberg
[[82]36]. An adjusted p-value of 0.05 is considered as significant.
3.2. Module-Based Features for a Support Vector Machine (SVM)
Given a module C, we normalize the expression level of each gene across
all samples using z-score transformation [[83]17], denoted by
[MATH:
Expij
mi> :MATH]
for the i-th gene and j-th patient. For each sample j, the activity
score of the k-th module is defined as the average gene expression of
all genes within the module, i.e.,
[MATH:
eC=∑i∈CEx
mi>pij/|C|
:MATH]
(7)
where
[MATH: |C| :MATH]
is the number of genes in C. For each patient sample, a feature vector
is constructed by all modules.
3.3. Normalized Mutual Information
The normalized mutual information (NMI) [[84]37] is based on the
confusion matrix N whose rows correspond to the real modules in
standard partition
[MATH: P* :MATH]
and the columns correspond to the modules in obtained partition P. The
element
[MATH:
Nij
:MATH]
is the number of vertices overlapped by the i-th real and j-th
predicted module. The NMI is defined as
[MATH:
NMI(P,P*)=−2∑i=
mo>1|P|∑j=1<
mrow>|P*|
Nijlog(NijNNi
.N.j
)∑i=1|P|<
/mrow>Ni.log(Ni.N)+∑i<
/mi>=1|P
*|N.jlog(N.jN) :MATH]
where
[MATH: |P| :MATH]
is the number of modules in P and
[MATH:
Ni.
:MATH]
is the sum of the i-th row of the matrix.
3.4. Artificial Networks
The GN benchmark network, where each network consists of 128 nodes that
are grouped into 4 clusters of equal sizes, is introduced in [[85]38].
Every node has an average degree of 16 and shares
[MATH:
Zout :MATH]
edges connecting nodes outside of the module to which it belongs. As
parameter
[MATH:
Zout :MATH]
increases from 1 to 8, the detection of clusters in the networks
becomes increasingly difficult. In this study, we combine three GN
networks to construct the artificial networks to testify the
performance of the proposed algorithms, where the first two networks
are used for the multiple views and the last network is used for the
regularization.
3.5. Breast Cancer Gene Expression Data
The gene expression data for breast cancer is downloaded from the TCGA
Data Portal, where the clinical stage information for patients is also
available. The RPKM values (RNA-seq IlluminaHiSeq_RNASeq with level 3)
are used. There are 809 samples across four stages (Stage I: 129, Stage
II: 458, Stage III: 209, Stage IV: 13).
3.6. Protein Interaction Network
The protein interaction network is downloaded from BioGrid database
[86]https://thebiogrid.org/, which comprises 22,365 proteins (genes)
and 437,751 interactions among genes. There are 435,543 physical
interaction and 2208 genetic interactions.
4. Results
To validate the performance of the proposed algorithm, three
state-of-the-art algorithms are selected to make a comparison of both
artificial data and breast cancer data. The compared algorithms are the
M-Module algorithm [[87]21], multi-view clustering (MV-NMF) [[88]39],
and spectral clustering [[89]40]. Notice that the spectral clustering
cannot be applied to the multiple networks directly. Thus, we apply the
spectral clustering to each network and then combine the results on
each network based on consensus clustering (CSC).
Two types of datasets, including both the artificial and real breast
cancer data, are employed for a comparison between various algorithms.
The artificial networks are adopted to test the accuracy of the rMV-spc
algorithm, and the breast cancer data are used to determine the
applicability of the proposed algorithm in discovering common modules
in real networks with strong backgrounds.
4.1. Benchmarking Performance on the Artificial Networks
In the artificial networks, we combine three GN networks, where the
first two networks are used for multiple views and the remaining one is
used for regularization (Materials). To increase the difficulty in
discovering the common modules, we increase the parameter
[MATH:
Zout :MATH]
from 1 to 8 while we fix
[MATH:
Zout :MATH]
as 6. To quantify the performance of algorithms, the normalized mutual
information (NMI) is adopted since the community structure is known in
the artificial networks (Materials).
Prior to giving the performance of algorithms, we first investigate how
the parameter affects the performance of the proposed algorithm. Notice
that there are three involved parameters: parameter
[MATH: λZ :MATH]
controls the importance of the regularizer of factorization, parameter
[MATH: λC :MATH]
determines the tradeoff between the consensus matrix among multiple
views, and parameter
[MATH: λG :MATH]
denotes the importance of the network for regularization. Similar to
[[90]41], we assume that these parameters are equal since we
hypothesize that all items for regularization are equally important. By
setting parameter
[MATH:
λ∈{10−<
/mo>2,10−
1,100,101,102} :MATH]
, we check how the accuracy of the proposed algorithm changes as
parameter
[MATH:
Zout :MATH]
increases from 1 to 8 in terms of NMI, which is shown in [91]Figure 2A.
As
[MATH: λ :MATH]
increases from
[MATH:
10−2
:MATH]
to
[MATH: 100 :MATH]
, the accuracy of the rMV-spc algorithm increases and achieves the best
performance at
[MATH: λ :MATH]
= 1. The reason is that, when
[MATH: λ :MATH]
is small, the objective function is denominated by subspace clustering,
and the contribution of items of regularization is subtle. As
[MATH: λ :MATH]
increases, the contribution of regularized items becomes increasingly
important, which improves the accuracy of rMV-spc. As
[MATH: λ :MATH]
increases from
[MATH: 100 :MATH]
to
[MATH: 102 :MATH]
, the accuracy of the proposed algorithm decreases dramatically. The
reason is that, as
[MATH:
lambda :MATH]
continues to increasing, the objective function of rMV-spc is dominated
by the regularization, resulting in the decrease in the performance of
the algorithm. Furthermore, the proposed algorithm is robust since its
accuracy is stable for a wide range of
[MATH: λ :MATH]
values. In all experiments, we set
[MATH: λ :MATH]
= 1.
Figure 2.
[92]Figure 2
[93]Open in a new tab
Parameter effect and performance of the compared algorithms on
artificial data. (A) Parameter effect: how the NMI changes as parameter
[MATH: λ :MATH]
increases from
[MATH:
10−2
:MATH]
to
[MATH: 102 :MATH]
. (B) Performance as a function of the amount of parameter
[MATH:
Zout :MATH]
in the simulated data among various algorithms, where NMI is used as
the performance measure.
We compare the MV-NMF, CSC, M-Module, and rMV-spc algorithms on the
artificial networks in terms of accuracy, which is shown in [94]Figure
2B. From the panel, we assert that the proposed algorithm achieves the
best performance, followed by M-Module, MV-NMF, and CSC. While the
M-Module is inferior to the rMV-spc algorithm, it is much better than
the others. There are two possible reasons why the proposed algorithm
outperforms the other methods. First, the subspaces are more precise in
characterizing the module structure in multiple view data compared with
the data in the original space. Second, the proposed algorithm
incorporates both the subspace and topological information, which
provides a better way to characterize the structure of common modules.
Moreover, it is easy to conclude that the performance of algorithms
decreases dramatically as
[MATH:
Zout :MATH]
increases from 1 to 8 because the module structure becomes fuzzy as
[MATH:
Zout :MATH]
increases. For example, the NMI is about 1 when
[MATH:
Zout
≤4 :MATH]
. As
[MATH:
Zout
>4 :MATH]
, the NMI value decreases dramatically.
4.2. Benchmarking Performance on the Breast Cancer Networks
The artificial data is used to test the performance of the proposed
algorithm in detecting the common modules in terms of accuracy. To
check whether the proposed algorithm can identify common modules across
various clinical stages in the data with biological background.
Because the true modules are unknown, multiple reference pathway
annotations, including Gene Ontology [[95]42], KEGG [[96]43], and
Biocart [[97]44], are used to determine the effectiveness of the
algorithms by using the enrichment analysis (Materials). To evaluate
the performance, we use specificity and sensitivity to quantify the
accuracy, where specificity is defined as the fraction of the predicted
modules that significantly overlaps with at least one reference
pathway, while sensitivity is defined as the fraction of the reference
pathways that significantly overlaps with at least one predicted
module. [98]Figure 3A,B shows that the rMV-spc algorithm achieves
higher specificity while maintaining comparable sensitivity than the
other methods. Specifically, the specificity values of rMV-spc are
76.9%, 80.3%, and 81.7% for the GO, KEGG, and BioCart pathways,
respectively, while those of the M-Module algorithm are 72.4%, 74.4%
and 76.5%. The results demonstrate that the common modules obtained
bythe proposed method are more enriched by the known pathways than
those obtained by others. Notice that the rMV-spc algorithm is inferior
to M-Module in terms of sensitivity. We check the significance of the
difference between rMV-spc and M-Module on sensitivity using the Fisher
exact test with a cutoff of 0.05. The results demonstrate that the
difference in specificity is significant, while it is not significant
in terms of sensitivity.
Figure 3.
[99]Figure 3
[100]Open in a new tab
Performance of the compared algorithms on the TCGA breast cancer data.
(A) Specificity of modules obtained by various algorithms in the known
pathway enrichment analysis of various algorithms. (B) Sensitivity of
communities obtained by various algorithms in the known pathway
enrichment analysis of different algorithms. (C) Specificity of modules
obtained by the proposed algorithms with and without the regularization
of the protein interaction network. (D) Sensitivity of modules obtained
by the proposed algorithms with and without the regularization of the
protein interaction network. The * denotes that the difference is
significant using Fisher’s exact test with a cutoff of
[MATH: 0.05 :MATH]
.
The proposed algorithm integrate both the gene expression and protein
interaction networks. Then, we ask what is the different if the protein
interaction network is not integrated. The specificity and sensitivity
of modules are shown in [101]Figure 3C,D. From the panel, we assert
that the integration of the protein interaction network increases the
percentage of modules that are enriched by known pathways. The results
demonstrate that the integration is promising in identifying the common
modules associated with cancer progression.
4.3. Common Modules Serve as Biomarkers to Predict Breast Cancer Stages
It has been shown that the hub genes [[102]16] and modules
[[103]17,[104]21] are predictive for the breast cancer diagnosis. Thus,
we hypothesize that the common modules can also be used to predict the
stages of breast cancer. Following [[105]17], we construct module-based
features to predict the stages of breast cancer (Materials). For each
module, we construct a feature vector that is the average of the gene
expression of the genes within the modules. Based on the feature
vectors, we use the SVM to predict the stage of cancers.
For a baseline comparison, we compare the classification accuracy by
using the following feature sets: modules generated by other
algorithms, size-matched differentially expressed genes, and randomly
selected genes. We trained the support vector machine (SVM) classifier
to perform multi-class classification. This SVM employed accuracy (the
percentage of patients that are corrected classified) to measure
performance. The results on the TCGA breast cancer data using five-fold
cross validation are presented in [106]Figure 4A. The modules obtained
by our algorithms are more discriminative than the others.
Specifically, the rMV-spc algorithm has significantly higher accuracy
than the M-Module (74.5% vs. 71.3%). These results demonstrate that the
common modules obtained by rMV-spc capture the specificity of pathways
as breast cancer progression.
Figure 4.
[107]Figure 4
[108]Open in a new tab
Subtype-specific methylation modules improve the accuracy of breast
cancer stage classification using 50 independent 5-fold cross
validations. (A) Classification accuracy of breast cancer stages using
different feature sets, including the stage-specific modules obtained
by various algorithms. Accuracy is defined as the number of patient
samples correctly classified. The Y-axis is the accuracy and the error
bar is for the standard deviation. (B) External validation by training
on TCGA data and testing on the external data.
To further validate the performance of various algorithms, we evaluated
the performance of the SVM classifiers by using external data
([109]GSE5874). We trained the SVM classifier on the TCGA data and
tested it on an external microarray dataset. Consistent results
indicate that the performance is not due to hidden confounding factors
in the TCGA dataset ([110]Figure 4B). The accuracy of rMV-spc is 51.4%,
while the accuracies of the M-Module, MV-NMF, CSC, and DGis are 49.8%,
44.9%, 41.3%, and 38.7%, respectively. The results show that the
proposed algorithm is better than the available approaches in
discovering common modules in data integration.
5. Conclusions
The advances in biological technologies enable the possibility of
generating multiple genomic profiling of biological samples for various
conditions. How to integrate the heterogeneous genomic data to extract
patterns is critical since these patterns may shed light on the
mechanisms of cancers. Even though many algorithms have been devoted to
the integrative analysis of omic data, few attempts have been made to
simultaneously integrate heterogeneous and time-series gene expression
data.
In order to attack this issue, we provide a novel algorithm by
considering the time and heterogeneity factors at the same time. In
this study, the gene expression associated with cancer progression are
projected to subspaces based on subspace clustering. In order to
incorporate the protein interaction network, we treat it as a
regularizer with an immediate purpose to alleviate the effects of
heterogeneity. The experimental results demonstrate that the proposed
algorithm is promising in discovering common modules across various
cancer stages. We see ample opportunities to improve on the basic
concept of rMV-spc in future work. For example, we can extend the
algorithm by integrating more heterogeneous data, such as DNA copy
number variation and methylation.
Acknowledgments