Abstract
Molecular subtyping of cancer is recognized as a critical and
challenging step towards individualized therapy. Most existing
computational methods solve this problem via multi-classification of
gene-expressions of cancer samples. Although these methods, especially
deep learning, perform well in data classification, they usually
require large amounts of data for model training and have limitations
in interpretability. Besides, as cancer is a complex systemic disease,
the phenotypic difference between cancer samples can hardly be fully
understood by only analyzing single molecules, and differential
expression-based molecular subtyping methods are reportedly not
conserved. To address the above issues, we present here a new framework
for molecular subtyping of cancer through identifying a robust specific
co-expression module for each subtype of cancer, generating network
features for each sample by perturbing correlation levels of specific
edges, and then training a deep neural network for multi-class
classification. When applied to breast cancer (BRCA) and stomach
adenocarcinoma (STAD) molecular subtyping, it has superior
classification performance over existing methods. In addition to
improving classification performance, we consider the specific
co-expressed modules selected for subtyping to be biologically
meaningful, which potentially offers new insight for diagnostic
biomarker design, mechanistic studies of cancer, and individualized
treatment plan selection.
Keywords: molecular subtyping of cancer, specific co-expression module,
network perturbation, multi-classification, machine learning
1 Introduction
Precision cancer medicine aims to characterize the distinct biology of
an individual or a group of cancer patients sharing certain
commonalities and treat them by targeting the specific oncogenic event
shared by such a group ([36]Lipinski et al., 2016; [37]Russnes et al.,
2017; [38]Ozturk et al., 2018; [39]Zhang et al., 2019). Using breast
cancer as an example, the majority of such cancers fall into one of the
three subtypes: estrogen receptor positive (ER+), human epidermal
growth factor receptor 2 positive (HER2+), and triple-negative
([40]Vuong et al., 2014). Distinct treatment plans have been developed
to effectively treat these three subtypes of breast cancer. Patients
with ER+ tumors receive endocrine therapy, supplemented with
chemotherapy for some; patients of HER2+ tumors receive targeted drug
therapy or small-molecule inhibitor therapy combined with chemotherapy;
and patients of triple-negative breast cancer are treated using
chemotherapy only ([41]Waks and Winer, 2019; [42]Yin et al., 2020).
Clearly, the effectiveness of such a treatment plan depends on our
ability to accurately subtype cancer tissues with shared biology,
particularly common druggable targets among subgroups of a specific
cancer type ([43]Chaisaingmongkol et al., 2017). This is the focus of
the current study, specifically to identify distinguishing features,
measured using transcriptomic data, only shared by samples of each
specified subtype of cancer ([44]Valle et al., 2020).
Cancer subtyping through applications of machine learning techniques
has been done by numerous authors on multiple cancer types. Cascianelli
et al. developed a classification method for breast cancer subtyping
that employs several machine learning classifiers to solve the
multi-classification task for breast cancer subtyping ([45]Cascianelli
et al., 2020). Markus et al. modeled and solved the breast cancer
subtyping problem based on integrated analyses of gene expression and
DNA methylation data using a random forest algorithm ([46]List et al.,
2014). Deep-learning algorithms have recently been applied to tackle
the cancer subtyping problem through an end-to-end approach. Guo, et
al. have reported a deep-learning framework to learn the representation
of high-dimensional features derived from gene expression data and
alternative splicing profiles and solve the subtyping problem of breast
cancer ([47]Yang et al., 2018).
While these methods, such as deep learning, have powerful capabilities
in data classification, most of these methods have limitations in
interpretability and tend to require large amounts of data for model
training ([48]Chen et al., 2019), which has clearly limited the
applications of omic-data based subtyping. In addition, these methods
generally rely on gene expression data for classification and have
largely ignored the interaction information among the expressed genes
in cancer, which generally carries more information than the expression
levels of individual genes ([49]Segura-Lepe et al., 2019; [50]Lee et
al., 2020). This is particularly important for modeling genes in cancer
tissues, knowing that considerable metabolic reprogramming has taken
place in cancer tissue cells, as we have previously demonstrated
([51]Sun et al., 2020), which could be captured by co-expression
information. Hence, it is worth the effort to develop
co-expression-based classifiers to capture the distinct reprogrammed
metabolisms and hence the corresponding phenotypes of individual
subtypes of cancer.
A few papers have been published on cancer subtyping based on
co-expression information, which classify cancer samples based on the
general characteristics of the relevant co-expression networks ([52]Liu
et al., 2016; [53]Yu et al., 2020). Jiang et al. developed a
multi-classification method for cancer samples based on differential
co-expression analyses ([54]Jiang et al., 2019), and predicted a
sample’s label through calculating its perturbation on the most
specific edges of each subclass-representing network module. Although
this method performs well in cancer subtyping, there is a lack of
interpretability as the identified edges tend to be unconnected, hence
the lack of functional information.
In this paper, we present a new cancer molecular subtype classification
framework based on a specific co-expression module and a deep neural
network (DNN) named SCM-DNN, which can identify a robust, distinct
co-expression module for each subtype of a cancer. A co-expression
module is a set of genes whose expressions highly correlate with each
other ([55]Wolf et al., 2014), and a distinguishing co-expression
module is a co-expression module that is associated with a specific
subtype but not other subtypes of a cancer. Intuitively, a
distinguishing co-expression module should reflect certain unique
characteristics of a cancer subtype. Specifically, we use the TCGA
transcriptomics data to construct a co-expression network over samples
of each subtype and then apply weighted correlation network analysis
(WGCNA) ([56]Zhang and Horvath, 2005; [57]Langfelder and Horvath, 2008;
[58]Sipko et al., 2018) to partition the network into co-expression
modules. Then we assess the discerning power of each co-expression
module for cancer subtyping by [59](1) identifying the most discerning
modules and their most specific edges between samples of the current
subtype and samples of other subtypes; 2) perturbing the correlation
levels of such edges to generate new samples with co-expression network
features for each sample; and 3) then training the classifier based on
such new samples. When applying this classifier to breast cancer (BRCA)
and stomach adenocarcinoma (STAD), we found it has superior performance
under both macro-average recall (Macro-R) and macro-average f1-score
(Macro-F1) metrics over existing methods. We consider that this
co-expression module-based subtyping not only provides an improved
method for cancer subtyping but also provides meaningful information
about the unique biology of cancer samples of each subtype, hence
potentially offering new information about the underlying mechanism of
the cancer subtype and suggesting new individualized treatment targets.
2 Materials and Methods
We present a new computational framework, SCM-DNN, shown in [60]Figure
1 and [61]Figure 2, for subtyping cancer samples.
FIGURE 1.
[62]FIGURE 1
[63]Open in a new tab
(A) The workflow from data processing to specific edges identification.
Take four-subclass classification as an example. Each subtype is
represented as a gene expression matrix with n genes after data
processing. WGCNA is used to divide whole gene set into different
co-expression modules. The specific edges of one subtype are extracted
from the specific module of their subtype. The perturbation of these
specific edges (gene pairs) is used to generate network features data.
(B) Detailed process of generating one piece of network feature data.
The perturbation values of a sample are the difference of specific
edges between expanded network and the reference network.
FIGURE 2.
FIGURE 2
[64]Open in a new tab
Sufficient network feature data generation for model training and
prediction. One reference sample set consists of T groups of samples
that from T subtype (T: total number of subclass). Network feature data
corresponding to training samples are used for model training.
2.1 Data Processing
RNA-seq data and clinical information of breast cancer and stomach
cancer tissue and normal samples are downloaded from the TCGA database
([65]Weinstein et al., 2013). These cancer samples are pre-labeled with
their subtype information. Overall, 113, 437, 37 and 115 samples are
labeled as control, ER+, HER2+, and triple-negative BRCA tissues
respectively; and 33, 107, 23, 47, and 50 samples are marked as
control, CIN, EBV, MSI, and GS STAD tissues, respectively. The FPKM
value (with log2 transformation) is used to measure the expression
levels in our analysis. For each cancer type, genes whose average
expression levels are less than 10 over all the samples are removed,
and the median absolute deviation (mad) is used to estimate the
variance of a gene’s expression. In a dataset with sample size N, the
‘mad’ value of gene X is calculated as follows:
[MATH:
mad=med<
mi>ianXi−median
mi>Xi=1,2,…,N. :MATH]
(1)
X = (X [1], X [2], … , X [i ], … , X [N ]), X [i ]is the expression
value of gene X of the ith sample. Clearly, the more similar the
expression levels of a gene are across all samples, the closer its
“mad” value is to zero. For our analyses, we only keep the top 90%
genes with the largest “mad” values. Overall, 14,439 and 7,761 genes
are kept for BRCA and STAD, respectively.
2.2 Construction of Co-Expression Networks and Generation of the
Co-Expression Modules
For each cancer type, we first construct gene co-expression networks
for each subtype; that is, for a cancer type with T molecular subtypes,
T co-expression networks need to be constructed. The Spearman
correlation coefficient is used to construct the co-expression
networks. According to ([66]Anglani et al., 2014), although spearman
correlation is an efficient way to construct co-expression networks,
its coefficient and statistical significance depend on the sample size
to some extent. Since the issue of imbalanced sample size always
exists, directly constructing co-expression networks for each category
will lead to incomparability among different categories. To solve this
problem, we perform sampling to construct the co-expression network for
each cancer type.
Given the sample sizes of each subtype {s1, s2, ...sT}, we have
performed F-fold sampling to calculate the correlations for each
subset, with each fold having N [s ]samples. N [s ]should be smaller
than min {s1, s2, ...sT}, and F should be large enough to ensure that
all samples are selected at least one time. For the fth fold in lth
subset,
[MATH:
corfl :MATH]
represents the correlation values matrix for the co-expression network,
and
[MATH:
pfl
:MATH]
represents the corresponding p-values. The final correlation values and
p-values of lth subset are defined as Formula (2) and (3):
[MATH:
corl=1F
∑f=1
mn>Fcorfl. :MATH]
(2)
[MATH:
Pl=∏f=1
mn>Fp
fl<
/mfenced>1/F.
:MATH]
(3)
Furthermore, we have removed gene pairs in the network whose
associations are not significant (i.e., p-value >0.01) and genes that
do not connect with any other genes in the network. In the end, we have
obtained T co-expression networks {MeanNet1, MeanNet2, ...MeanNetT} for
each subtype. For each MeanNet, we apply WGCNA to divide it into
several co-expression modules. We set the soft thresholds according to
the scale free topology fitting index R2 coefficient for each subtype.
It reweights the MeanNet by adjusting the coefficient of each
co-expression pair to make the network satisfy the scale-free property.
All the genes are then hierarchically clustered into different groups
based on the weighted network, and the genes that can’t cluster
together with other genes are stored in Module0.
2.3 Identification of the Specific Co-Expression Modules
A specific co-expression module is defined if the genes of a subtype
are highly correlated in a subtype but weakly correlated within other
subtypes. It is worth noting that we don’t consider Module0 of each
subtype. We identify the specific co-expression module of each subtype
by integrating the following two scores:
Score 1: Specific aggregation score. If genes of one subset are
concentrated in a module of one subtype but they are scattered in many
different modules for all the other subtypes, it indicates that these
genes have a specific co-expression pattern in this subtype. According
to this idea, we perform a cross calculation among all the modules of
different subtypes to evaluate the specificity of each module. For
module
[MATH:
Mis
(i=1,2,
...,Sn) :MATH]
, we first get the gene intersections of
[MATH:
Mis
:MATH]
and
[MATH:
Mjt
:MATH]
. (s: source subtype,
[MATH:
Mis
:MATH]
: the ith module of subtype s, Sn: number of modules in the source
subtype,
[MATH:
Mjt
:MATH]
: the jth module of subtype t, t: target subtype, t ∈ {1, 2, ..T}\s, T
is the total number of subtypes). In order to avoid the bias caused by
the number of genes in each module, we will calculate the overlap ratio
between
[MATH:
Mis
:MATH]
and
[MATH:
Mjt
:MATH]
as:
[MATH:
Overlap<
mi>ratios,t,i,
j=|Mi
s∩M<
/mi>jt||Mi
s|. :MATH]
(4)
If for any t and j, the Overlapratio [(s,t,i,j)] values of
[MATH:
Mis
:MATH]
are small, it indicates that the genes in scarcely cluster together in
other subtypes. So, for a module
[MATH:
Mis
:MATH]
, we define
[MATH: Maxoverl
mi>apratiois<
/msubsup> :MATH]
to represent the maximal overlap between
[MATH:
Mis
:MATH]
and all the other modules of other subtypes. Then, we sort all modules’
Max overlapratio of this subtype in ascending order and the ranking of
[MATH:
Mis
:MATH]
is equal to its score 1. The lower ranking of
[MATH: Maxoverl
mi>apratiois<
/msubsup> :MATH]
, the more likely
[MATH:
Mis
:MATH]
will be identified as a specific co-expression module.
Score 2: Correlation significance score. If co-expression coefficients
of the edges in this module are overall significantly stronger than
their coefficients in other module subtypes, then this module is more
likely to be a specific one.
For a certain module
[MATH:
Mis
:MATH]
, the mean co-expression value of its edges is defined as
[MATH:
edgemea<
msubsup>nMis
s :MATH]
. Meanwhile, the mean co-expression value of these edges on other
subtypes’ co-expression networks is calculated and denoted as
[MATH:
edgemea<
msubsup>nMis
t :MATH]
(t: target subtype, t ∈ {1, 2, ..T}\s). If some edges in
[MATH:
Mis
:MATH]
do not appear in co-expression network of subtype t, their values in
subtype t are recorded as 0. Then the difference between
[MATH:
edgemea<
msubsup>nMis
s :MATH]
and is
[MATH:
edgemea<
msubsup>nMis
t :MATH]
is defined as:
[MATH:
△edgeme<
mi>anMis
s,t=edgemea
nM
iss
mrow>−edgem<
/mi>ean
Mis
msubsup>t. :MATH]
(5)
[MATH: △minmean
Mis :MATH]
represents the smallest
[MATH:
△edgeme<
mi>anMis
s,t :MATH]
of
[MATH:
Mis
:MATH]
. Next we sorted
[MATH: △minmean
Mis(i=1,2,
…Sn
mrow>) :MATH]
in a descending order, their ranking is defined as score 2. Similarly,
the lower rank
[MATH: △minmean
Mis :MATH]
is, the more likely
[MATH:
Mis
:MATH]
is to be a specific co-expression module. Taking the sum of score 1 and
score 2 as final score for each module
[MATH:
Mis
(i=1,2,
…Sn
mrow>) :MATH]
, we rearrange all modules of subtypes in an ascending order, and
select the module with lowest rank as the specific co-expression module
of subtype s.
2.4 Identification of Specific Edges in Specific Modules
As the sizes of specific modules are different and there are many edges
in each specific module, it is necessary for us to select the most
specific edges that are highly co-expressed only in one subtype to
represent the character of each specific module. In addition, selecting
same number of edges for each subclass can improve the comparability.
If we want to select E specific edges for each specific module,
following steps can be taken. For a gene pairs (i, j) in the specific
co-expression module, their correlation values on all subtypes are
denoted as
[MATH: (cor
mi>(i,j)1<
mo>,cor(i,j)2<
mo>,…,cor
mi>(i,j)T<
/mrow>) :MATH]
(T is the number of subtypes), and
[MATH:
maxcor<
mrow>(i,j)x
:MATH]
is the max value of
[MATH:
cor(i,j)x<
mrow>(x=1,2,
…,T)\s :MATH]
. Then, the difference between
[MATH:
cor(i,j)s
:MATH]
and
[MATH: maxcor
mrow>(i,j)x
:MATH]
is defined as:
[MATH:
△cori,j<
mrow>s=co
ri,j<
mrow>s−maxco
mi>ri,j<
mrow>x. :MATH]
(6)
The
[MATH:
△cor(i,j)s
:MATH]
of all gene pairs are sorted in descending order, and the top E gene
pairs are specific edges.
2.5 Generation of Network Feature for Each Cancer Sample
Although specific co-expression modules could capture the prominent
characteristics of each subtype, it is not easy to transfer these
characteristics directly to a single sample. Hence, our method proposes
learning the sample’s network feature by calculating its perturbation
effect when adding it to each specific module. Intuitively, when a
sample is added to the specific co-expression module of its same
subtype, its disturbance to this module is not significant. Otherwise,
when adding this sample to specific modules of other subtypes, their
disturbance is relatively large.
For each subtype, we randomly select 90% of the samples as the training
set and the remaining 10% as the test set. In order to avoid the
classification bias due to imbalanced sample sizes of different
subtypes, we generate and amplify new samples by adding one sample to
multiple reference network sets and ensuring the sample sizes of each
subtype are similar for training.
First, we generate a series of reference network sets covering the
specific co-expression edges of each subtype. Reference network of one
subtype is generated by genes in its specific module, naturally,
specific co-expression edges are covered. The size of samples used for
constructing reference networks is uniformly assigned as P (P is
smaller than the sample size of any subtype). For each subsampling, a
reference network set is generated, including T reference networks
corresponding to T subtypes, and we randomly select P samples from each
subtype several times and generate several reference network sets,
shown in [67]Figure 2.
Then, one cancer sample is added to a reference set, which is T
reference networks, to construct T new co-expression networks, called
expanded networks. The perturbation value of a specific edge is
obtained by calculating the difference between an expanded network and
a reference network.
[MATH:
△corix=|corix′−corix|. :MATH]
(7)
Here, i is the ith specific edge of subtype x,
[MATH:
corix′
mrow> :MATH]
and
[MATH:
corix :MATH]
are the correlation value of ith specific edge of subtype x in the
expanded network and reference network, respectively.
[MATH:
△corix :MATH]
when a sample is added to the reference network, is perturbation value
to ith specific edge. Then, for one cancer sample, it’s T *E
perturbation values are merged into a vector, where E is the number of
specific edges selected for each subtype, generating a piece of network
feature data.
One piece of network feature data shows the characteristics of a sample
at the co-expression network level. In order to augment the sample
size, we add each training sample to several reference network sets.
Hence, we can obtain enough network feature data for model training
even though there are few cancer samples, which guarantees the
classifiers are able to learn sufficient information for each subtype.
For each test sample, it is also randomly added to the reference
network sets to generate its corresponding new sample(s). It is worth
noting that all the reference networks are constructed from samples of
training sets.
2.6 Construction of Cancer Subtype Multi-Classifier
We build a fully connected feed forward neural network classifier with
cross-entropy loss function.
[MATH:
L=−1N
∑i∑c=1
mn>Tyiclogpic
. :MATH]
(8)
Here, the value of y depends on the true label of data i. Let h be a
neural network, in which the activation function of hidden layers and
output layer are ReLu and softmax, respectively. p [ic ]is the
probability of the data i belonging to subtype c. N is the size of the
data. The optimization algorithm is stochastic gradient descent. We
apply an early stop strategy to avoid over-fitting in the training
process and take 10-fold cross validation to verify the performance of
the classification method. In prediction, when adding each testing
sample into different reference networks, it generates several new
samples and then gets multiple prediction labels, voting strategy are
used to obtain final prediction label of this sample.
2.7 Baseline Methods
We compared our method, SCM-DNN with three traditional filter feature
selection methods (Chi-square test, Analysis of Variance, and Mutual
Information), and one state-of-the-art wrapper feature selection
method, (HSIC-Lasso) following with DNN. In addition, we also compared
our method with one of the few co-expression-based cancer subtyping
methods. Moreover, we compared our method with one of the few
co-expression-based cancer subtyping methods (SCP), which predicted a
sample’s label through calculating its perturbation on the most
specific edges of each subclass-representing network. In addition, we
also compared our method with DeepCC, which is a deep learning-based
framework integrating functional spectra quantifying activities of
biological pathways for molecular subtyping of cancer ([68]Gao et al.,
2019).
3 Results
3.1 Statistic of Distinguishing Co-Expression Modules of Each Cancer Subtype
14439 and 7761 genes were used for the construction of co-expression
networks for BRCA and STAD, respectively. We decompose the
co-expression network into several modules for each cancer subtype. The
number of co-expression modules for each cancer subtype, and the number
of genes and edges in each specific co-expression module are shown in
[69]Table 1.
TABLE 1.
Statistics of co-expression modules of each cancer subtype.
Subtypes #Edges of #Modules #Genes in #Edges in
Cancer Co-expression Specific Specific
Types Network Co-expression Co-expression
Module Module
BRCA ER+ 18002953 37 123 7161
BRCA HER2+ 26774509 20 1834 810674
BRCA Triple 17261163 32 1334 322232
Negative
BRCA Control 54354208 65 698 241789
STAD CIN 5155648 43 124 4483
STAD EBV 12952861 52 75 2328
STAD GS 11937803 29 789 262884
STAD MSI 8714653 18 190 9780
STAD Control 17645626 50 105 5330
[70]Open in a new tab
3.2 Evaluation of the Discerning Power of the Co-Expression Module for Each
Subtype
To evaluate the discerning power and stability of each co-expression
module between each subtype and the samples of the other subtypes of a
cancer, we have used accuracy, macro-average recall and macro-average
F1-score to avoid possible issues created by imbalanced sample sizes
among the subtypes, defined as follow.
[MATH:
Accurac<
mi>y=∑i=1<
mrow>TTP
i∑i=1<
mrow>T#
i. :MATH]
(9)
[MATH:
Macro−P<
mo>=1T<
munderover accentunder="false"
accent="false">∑i=1
mn>TTPi<
mi>TPi+FPi
msub>. :MATH]
(10)
[MATH:
Macro−R<
mo>=1T<
munderover accentunder="false"
accent="false">∑i=1
mn>TTPi<
mi>TPi+FNi
msub>. :MATH]
(11)
[MATH:
Macro−F<
mn>1=2*Macro−p*M
acro−RMacro−p+Macro−
R. :MATH]
(12)
Here, #i is the sample size of ith group; TP is for true positives, FP
for false positives, FN for false negatives, and TN for true negatives.
For BRCA subtyping, we have conducted two experiments by selecting the
top 100 and top 200 distinguishing co-expressed edges from each
co-expression module to evaluate their discerning power. Considering
the relatively small sample size and the number of features, a neural
network with two-hidden layers is employed to train a classifier, which
has 50 and 10 nodes on the first and the second layer, respectively. We
have compared the performance of our approach with six other published
classifiers (see Methods), each employing the same number of features
as our approach.
The subtyping performance of our method on BRCA samples along with the
performance by other five methods are shown in [71]Figure 3A. Our
method clearly performs better across all the metrics, especially in
terms of macro-avg recall and macro-avg f1-score. Imbalanced sample
sizes tend to create problems for classification methods, which tend to
give higher weights to subtypes with higher numbers of samples. In
BRCA, the numbers of samples for the four subtypes are 113, 437, 37,
and 115, with HER2+ having the smallest sample size. We note that the
recall values for HER2+ samples are 0.891, 0.675, 0.575, 0.475, 0.650,
and 0.622 by SCM-DNN, HSIC-lasso, ANOVA, Chi-square, mutual information
and DeepCC, respectively.
FIGURE 3.
[72]FIGURE 3
[73]Open in a new tab
Cancer subtyping performance by seven methods: our method
SCM-DNN,HSIC-LASSO, ANOVA, Chi-square mutual information, SCP and
DeepCC (A) BRCA subtyping and (B) STAD subtyping with using top100 and
200 distinguishing co-expressed gene pairs.
For STAD subtyping, we set the same experimental parameters, including
the organization of the neural networks as for BRCA breast cancer
molecular subtyping task. The performance by our method vs. the other
six methods is comparable to that on BRCA, with our method performing
the best as detailed in [74]Figure 3B. It is worth noting that the
DeepCC classified cancer samples according to a large number of genes
which are not suitable for feature selection, so we use all its
features and compared it with our method when selecting 100 features
and 200 features, respectively.
Overall, the results reveal that our method gives the best and stable
subtyping performance, particularly for the subtyping problems with
highly imbalanced sample sizes. We found that our method always
performs best specially in recall and F1-score, the reason is: we
generate sufficient network feature data for neural network model
training, and it avoids the situation that the classifier only learns
sufficient information for the category with largest scale, instead of
categories with small scale. Hence, our method is superior to other
methods when predict the subtype with smallest scale. In addition,
network feature data can reflect the characteristics of each individual
subtypes. It also proves that specific modules with differentiation and
robustness are conducive to improving classification performance. We
display network feature data in the form of heat map and find that the
samples of the same subtype naturally gather into one block. Details
are shown in the [75]Supplementary Material.
3.3 Functional Analyses of the Genes in Each Specific Module
To elucidate the possibly unique biology for each cancer subtype, a
pathway enrichment analysis is conducted over edges of the identified
co-expression module for each subtype. It is worth noting that the
number of genes in specific modules of each molecular subtype is
different. Specifically, there are 171, 86, 281 and 205 genes in the
specific modules of control, ER+, HER2+ and triple negative BRCA
samples, respectively, with detailed gene lists given in
[76]Supplementary Table S1. And their co-expressed gene pairs are
selected for function analyses. The most significantly enriched
biological processes and pathways enriched by each of the four gene
sets are shown in [77]Table 2.
TABLE 2.
The most significantly enriched pathways by the genes belonging to top
200 specific edges of each molecular subtype in BRCA.
Pathway p-Value
Controls
GO:cell-cell adhesion 1.59E-05
KEGG:Regulation of actin cytoskeleton 9.65E-05
GO:leukotriene biosynthetic process 5.71E-04
GO:ephrin receptor signaling pathway 7.36E-04
KEGG:Cyanoamino acid metabolism 1.63E-03
KEGG:T cell receptor signaling pathway 1.97E-03
ER+
GO:Wnt signaling pathway 5.94E-03
GO:negative regulation of Wnt signaling pathway 1.73E-02
GO:lens fiber cell development 2.68E-02
GO:positive regulation of DNA-templated 2.68E-02
transcription, initiation
GO:epithelial cell-cell adhesion 3.43E-02
KEGG:HTLV-I infection 4.56E-02
GO:muscle organ development 4.66E-02
GO:eyelid development in camera-type eye 4.92E-02
HER2+
GO:nitrobenzene metabolic process 1.15E-03
GO:substrate adhesion-dependent cell spreading 1.90E-03
GO:negative regulation of extrinsicapoptotic signaling pathway
1.90E-03
GO:skeletal system development 3.18E-03
GO:glutathione derivative biosynthetic process 3.42E-03
GO:outflow tract septum morphogenesis 3.90E-03
GO:xenobiotic catabolic process 3.91E-03
GO:positive regulation of cell migration 4.44E-03
Triple-negative
GO:signal transduction 8.57E-07
GO:neuron migration 1.06E-04
GO:nervous system development 1.94E-04
GO:positive regulation of signal transduction 3.60E-03
KEGG:Thyroid hormone signaling pathway 4.16E-03
GO:positive regulation of phosphatidylinositol 4.52E-03
3-kinase signaling
GO:cellular amino acid metabolic process 8.03E-03
[78]Open in a new tab
The most enriched pathways in each distinct set of samples shown in
[79]Table 2 are quite informative. For example, pathways enriched by
the control samples revealed key features of control vs. BRCA cancer
samples in terms of their functionalities, namely cell-cell adhesion
(which is altered in all cancer samples), interactions with immune
cells (which is clearly altered in all cancer samples vs. controls).
Similar can be said about neural functions (ephrin receptor signaling),
cell polarity (which is considerably altered in cancer, actin
cytoskeleton) and inflammation signaling (leukotriene biosynthesis).
Similarly, the most enriched pathways for ER+ samples are growth
related (Wnt signaling), muscle development (also including eyelid
development and fiber cell development), and a specific type of immune
response (HTLV-I infection). And the most enriched pathways for HER2+
are related to xenobiotic metabolism (including dealing with
nitrobenzene), oxidative stress (glutathione biosynthesis), and cell
morphogenesis changes. The pathways uniquely enriched by triple
negative samples involve neural systems, a general indicator for the
level of malignancy of a cancer type, and phosphatidylinositol 3-kinase
signaling (a key regulator of cell polarity), also strongly indicating
the level of malignancy of the cancer subtype.
For STAD, 72, 81,67,119, and 217 genes and their co-expressed gene
pairs are selected as distinguishing features for the control, CIN,
EBV, MSI, and GS STAD samples, respectively. The enrichment results by
each gene set are shown in [80]Table 3.
TABLE 3.
The most significantly enriched pathways by the genes belonging to top
200 specific edges of each molecular subtype in STAD.
Pathway p-Value
Controls
GO:protein phosphorylation 8.83E-05
KEGG:Oxytocin signaling pathway 2.67E-04
GO:apoptotic cell clearance 1.64E-03
GO:peptidyl-serine phosphorylation 1.65E-03
KEGG:Endocrine and other factor-regulated 2.51E-03
calcium reabsorption
GO:vesicle-mediated transport 3.35E-03
CIN
GO:apoptotic process 8.53E-03
GO:steroid metabolic process 1.35E-02
GO:intracellular protein transport 1.61E-02
GO:catecholamine metabolic process 3.64E-02
GO:sulfation 4.43E-02
GO:response to toxic substance 4.78E-02
EBV
KEGG:Metabolic pathways 5.84E-03
GO:response to ionizing radiation 1.16E-02
KEGG:Valine, leucine and isoleucine degradation 1.54E-02
GO:methylation 2.47E-02
GO:activation of cysteine-type endopeptidase activity 3.13E-02
involved in apoptotic process
GO:mRNA splicing, via spliceosome 3.79E-02
MSI
GO:immune response 3.05E-04
GO:response to interferon-gamma 4.37E-04
GO:type I interferon signaling pathway 6.88E-04
GO:interferon-gamma-mediated signaling pathway 1.02E-03
GO:inflammatory response 2.60E-03
GS
GO:cell division 3.48E-08
KEGG:Cell cycle 9.42E-08
GO:mitotic nuclear division 2.59E-07
GO:mitotic nuclear envelope disassembly 7.55E-07
GO:sister chromatid cohesion 3.55E-06
GO:G2/M transition of mitotic cell cycle 5.18E-06
[81]Open in a new tab
The distinct biology of each of the four subtypes of STAD samples, as
indicated by their enriched pathways, is striking. For CIN subtype, we
see strong indication of toxicity and detoxification in their cells,
e.g., by response to toxic substance, sulfation, intracellular protein
transport and steroid metabolic process. In EBV samples, the distinct
characteristics are dealing with oxidative stress as shown by response
to ionizing radiation, valine, leucine, and isoleucine degradation,
activation of cysteine-type endopeptidase activity, and upregulation of
spliceosome. In MSI, we see that all signals are related to
inflammation and immune response in immune response, response to
interferon-gamma, type I interferon signaling pathway, and inflammatory
response. In GS, the key distinguishing characteristic is rapid cell
division, as indicated by cell division, cell cycle, nuclear division,
chromatid cohesion and G2/M transition.
3.4 Comparison of Selected Features Between Gene Expression Based and
Co-Expression Based Methods
We have compared the consistency and differences among the top 100
selected features obtained by each of the five methods, including ours,
with results summarized in [82]Figure 4. We note that genes selected
based on gene-expression levels are quite different from the genes
identified based on co-expression levels for both BRCA and STAD. And
there is considerable overlap among the features selected by different
gene-expression level-based methods. For example, genes selected by
ANOVA and the mutual information method have a 60% overlap in both
cancer types. It should be noted that the top 100 network features
obtained by SCM-DNN are 100 gene-pairs, hence the number of genes for
SCM-DNN is larger than 100.
FIGURE 4.
[83]FIGURE 4
[84]Open in a new tab
Venn diagram for overlaps among top 100 (network) features obtained by
SCM-DNN, HSIC-LASSO, ANOVA, Chi-square and mutual information in (A)
BRCA and (B) STAD.
Through further performing differential gene expression analyses on the
genes obtained by SCM-DNN, we find their expression have little changes
among different subtypes of the same cancer type. This result reveals
that differential gene expression-based methods have clear limitations
in characterizing changes in biological systems. Hence
co-expression-based analyses for cancer subtyping and possibly many
other cancer omic data analysis problems could prove to be the way to
go.
We have also analyzed the connectivity of the selected genes in the
co-expression modules. In our subtyping prediction, we used only the
top 100 and 200 co-expressed gene pairs. An interesting observation is
that all the selected genes could be connected using at most two
additional genes in the relevant module, suggesting that the selected
feature genes are strongly functionally associated. However, regarding
the genes selected by traditional gene expression based feature
selection methods, they are generally highly dispersed across a
co-expression module.
Additionally, due to the transmissibility of information in a network,
it’s not hard to control the whole module by managing a few nodes.
Moreover, since these modules are specific to each molecular subtype,
in other words, they are probably the most striking features of this
disease. Hence, they are expected to be the most effective drug targets
for individualized therapy.
4 Discussion
In this paper, we proposed a computational classification method for
cancer molecular subtyping based on co-expression network features of
each cancer sample. It has been recognized that the phenotypic
difference in cancer samples can hardly be fully understood by only
analyzing single molecules, and it is the relevant system or specific
network that is ultimately responsible for such a phenomenon ([85]Liu
et al., 2016). Moreover, network-based biomarkers, e.g. subnetwork
markers ([86]Ideker and Krogan, 2012), network biomarkers ([87]Liu et
al., 2014), and edge biomarkers ([88]Zhang et al., 2015), are
demonstrated superior to traditional single molecule biomarkers for
accurately characterizing disease states. However, it is generally
challenging to construct specific network and obtain individual network
feature for each sample ([89]Liu et al., 2016). Here, we generate a
sample’s network feature by calculating its perturbation effect on each
background class-specific module after adding it to them. Intuitively,
the quality of constructed class-specific networks will direct
influence the generation of network feature and then further guide the
final classification results. Hence, to ensure the robustness of each
subtype specific network, we construct multiple co-expression networks
for each molecular subtype by sampling and then integrate them. Our
previous study had proved that sampling-based co-expression network
construction could avoid the bias caused by both data noise and
imbalanced sample size among different subtypes ([90]Jiang et al.,
2020). Class-specific modules are identified by a top-down approach
(i.e. decomposing the whole co-expression network of each subtype and
making comprehensive comparison across different subtypes), which is
different from some existing specific modules identification method
based on collecting specific co-expression gene pairs. In comparison,
co-expression modules give a relatively complete path of signal
transmission or transcriptional regulation, and provide much more
information for us to understand biological mechanism of each subtype,
and then could help researchers to identify both actionable targets for
drug design as well as biomarkers for response prediction.
The classification performance of our method is superior to
conventional molecule biomarker-based methods, when applied to breast
and stomach cancer molecular subtyping, under several evaluation
indexes. It is a universal framework and is expected to perform well in
molecular subtyping task for other cancer types. Besides, it is also
easy to transfer to other subtyping tasks, such as cancer sample
staging and grading classification. Similarly, through constructing
co-expression networks and extracting specific co-expression modules
for each cancer stage or grade, a sample could be accurately classified
according to its network features generated by calculating the
perturbation effect of this sample on each background class-specific
module. We assume that specific module of each cancer stage (or grade)
can capture the essential distinguishing property of its samples. And
adding a sample of a different class to the specific module will induce
large disturbance, while adding a sample of its same class will not
disturb too much. One of the advantages of this study is that it
doesn’t need too many training samples. Prior knowledge in the basis of
satisfying the statistical significance indicates that the sample
number of each subtype reaching 15 is enough to construct co-expression
networks for each subtype. Then, a large number of new samples with a
network feature can be generated.
Omics data have enabled the unbiased characterization of the molecular
features of multiple human diseases, particularly in cancer.
Multi-omics may provide molecular insights beyond the sum of individual
omics, and it is becoming increasingly common to characterize multiple
omics layers to gain biological insights spanning multiple types of
cellular processes ([91]Vitrinel et al., 2019). Hence, in our further
work, besides transcriptomics data, we will introduce other omics data
to construct heterogeneous correlated networks and extract
heterogeneous specific modules for each subtype. Moreover, this study
provides a general framework with extensible and replaceable executive
function modules. Other machine learning methods could be applied for
the final multi-class classification according to specific task and
data distribution.
5 Conclusion
We present here a new framework, SCM-DNN, to identify each molecular
subtype’s robust, specific co-expression modules that could efficiently
and steadily predict patients’ molecular subtypes of breast and stomach
cancer. Compared with traditional gene expression based feature
selection methods for multi-classification, SCM-DNN performs better
under all the metrics even the sample size of each class is extremely
imbalanced. Additionally, these specific genes identified by SCM-DNN
could probably represent the striking characteristics of individual
subtypes; meanwhile, they are concentrated in the co-expression
network. Hence, they are promised to assist us to better understand the
underlying mechanism of molecular subtyping and potentially guide
individualized medicine.
Multi-omics data and their integration are recognized as an effective
way to explore the biological mechanism. In future studies, we will
make full use of those data to develop a more comprehensive and robust
classification method by integrating multi-omics data to construct
subtype-specific correlation networks for molecular subtyping of
cancers, expecting a deeper mechanism to be discovered.
Acknowledgments