Abstract
Background
The molecular characteristics of human diseases are often represented
by a list of genes termed “signature genes”. A significant challenge
facing this approach is that of reproducibility: signatures developed
on a set of patients may fail to perform well on different sets of
patients. As diseases are resulted from perturbed cellular functions,
irrespective of the particular genes that contribute to the function,
it may be more appropriate to characterize diseases based on these
perturbed cellular functions.
Methods
We proposed a profile-based approach to characterize a disease using a
binary vector whose elements indicate whether a given function is
perturbed based on the enrichment analysis of expression data between
normal and tumor tissues. Using breast cancer and its four primary
clinically relevant subtypes as examples, this approach is evaluated
based on the reproducibility, accuracy and resolution of the resulting
pathway profiles.
Results
Pathway profiles for breast cancer and its subtypes are constructed
based on data obtained from microarray and RNA-Seq data sets provided
by The Cancer Genome Atlas (TCGA), and an additional microarray data
set provided by The European Genome-phenome Archive (EGA). An average
reproducibility of 68% is achieved between different data sets (TCGA
microarray vs. EGA microarray data) and 67% average reproducibility is
achieved between different technologies (TCGA microarray vs. TCGA
RNA-Seq data). Among the enriched pathways, 74% of them are known to be
associated with breast cancer or other cancers. About 40% of the
identified pathways are enriched in all four subtypes, with 4, 2, 4,
and 7 pathways enriched only in luminal A, luminal B, triple-negative,
and HER2+ subtypes, respectively. Comparison of profiles between
subtypes, as well as other diseases, shows that luminal A and luminal B
subtypes are more similar to the HER2+ subtype than to the
triple-negative subtype, and subtypes of breast cancer are more likely
to be closer to each other than to other diseases.
Conclusions
Our results demonstrate that pathway profiles can successfully
characterize both common and distinct functional characteristics of
four subtypes of breast cancer and other related diseases, with
acceptable reproducibility, high accuracy and reasonable resolution.
Keywords: Signature genes, Pathway, Pathway profile, Enrichment
analysis, Breast cancer
Background
Diseases are abnormal conditions of the human body resulted from
significant nonlethal malfunctions that affect the human physiological
system. Traditionally, diseases are characterized by pathology and
observation of clinical phenotypes. Although these methods have proved
successful in many applications, they lack the sensitivity to detect
diseases before the appearance of symptoms and also have a limited
ability to distinguish complex disease classes [[31]1,[32]2] which may
present confusing or overlapping symptoms.
With the development of genomic technology, a promising approach to
overcome limitations of the traditional method is to identify a set of
genes as a genetic signature whose combined expression pattern is the
uniquely characteristic of a given phenotype [[33]3,[34]4]. In the last
decade, some gene signatures have been developed for cancers
[[35]5-[36]9] and other diseases [[37]10,[38]11], indicating that the
use of these signatures can assist in defining disease, predicting
disease recurrence, aiding disease diagnosis and guiding treatment
decision.
There are still obvious obstacles that prevent the application of gene
signatures in clinical practice [[39]12,[40]13]. One major problem is
the low reproducibility between signature genes. The overlaps of gene
signatures derived from different data sets of the same disease are
generally very few compared to the total number of signature genes
[[41]14], while stable gene signatures are crucial to the robustness of
predictors [[42]15]. The reasons for this discrepancy may include
different cohorts of patients, different statistical methods, and
different experimental technologies involved in identifying the
signature genes [[43]16]. It has been further suggested that a large
number of samples are required to achieve a robust gene signature
[[44]14,[45]15]. However, another reason for this discrepancy may lie
in the observation that diseases are directly resulted from perturbed
cellular functions which are generally carried out by groups of genes
in the forms of complexes, modules or pathways [[46]17]. Therefore, it
is reasonable to assume that any gene whose change of expression leads
to the perturbed molecular function may be a potential signature gene.
This assumption is partially evidenced by the fact that both gene
signatures developed in [[47]18] and [[48]19] can capture cell
proliferation related biological processes and pathways [[49]20], and
that dysregulations of functionally related genes result in similar
clinical phenotypes [[50]21]. This assumption may also explain why
sophisticated methods can rarely find much better gene signatures than
simple methods [[51]22]. From this perspective, it may be more
appropriate to characterize diseases at the functional level.
Pathway-based methods have been extensively applied to analyze
large-scale genome-wide data with varied purposes and applications
[[52]23-[53]29]. Some of them classify tumor samples based on
pathway-level measurements [[54]30]; many of them, such as PWEA
[[55]31] and GSEA [[56]32] identify perturbed pathways between two
distinct phenotypes (e.g., tumor vs. normal) using expression data
[[57]28]. These enrichment methods often require a significant number
of samples to achieve a statistically robust analysis. With appropriate
stability and reliability, the resulting pathways of these methods may
serve as reference pathways to be compared against pathways identified
by sample-based analysis [[58]33,[59]34] in clinical applications such
as disease diagnosis and personalized medicine.
In this study, we report a new approach to characterize diseases at the
functional level, with our aim being to both consolidate redundant gene
lists and to generate a list of pathways which is both accurate and
reproducible. For a given disease, a pathway profile is generated based
on the enrichment analysis of differential gene expression data between
normal and tumor tissues: a binary vector whose elements indicate
whether a given function (represented by a KEGG pathway) is perturbed.
Using breast cancer and four clinically-relevant subdivisions (luminal
A, luminal B, triple-negative and HER2+) as examples, we examine the
new approach from three perspectives: to determine whether the pathway
profile can be reproduced from the data generated by different
technologies (Microarray vs. RNA-Seq), as well as from separate cohorts
(The Cancer Genome Atlas (TCGA) vs. The European Genome-phenome Archive
(EGA)), to determine whether the resulting pathways are associated with
the functional perturbation resulted from the breast cancer and its
subtypes, and finally to determine whether the pathway profile can
distinguish different subtypes of breast cancer as well as distinguish
breast cancer from other diseases. Our results indicate that the new
approach achieves 68% average reproducibility between different data
sets (TCGA microarray vs. EGA microarray data) and 67% average
reproducibility between different technologies (TCGA microarray vs.
TCGA RNA-Seq data). Among the enriched pathways, 74% of them are known
to be associated with breast cancer or other cancers by extensive
literature search. Approximately 40% of the pathways are enriched in
all four subtypes and there are 4, 2, 4, and 7 pathways enriched only
in the luminal A, luminal B, triple-negative, and HER2+ subtypes,
respectively, implying that pathway profiles not only reveal shared
mechanisms in the four subtypes but also outline the subtype-specific
operations that may potentially be used as signature pathways to
distinguish them. Comparison of profiles between subtypes, as well as
other diseases including ovarian cancer, glioblastoma multiforme (GBM),
and obesity, reveals that the luminal A and luminal B subtypes of
breast cancer are closer to each other than to other subtypes, luminal
A and luminal B subtypes are closer to the HER2+ subtype than to the
triple-negative subtype, and subtypes of breast cancer are more likely
to be closer to each other than to other diseases.
Methods
Data sources
808 tumor and 106 normal samples of TCGA RNA-Seq data (Illumina HiSeq
2000 RNA Sequencing platform) were downloaded from the TGGA portal on
Oct. 2012. 522 tumor samples with available PAM50 classification
[[60]35] and 63 normal samples of TCGA microarray data (Agilent G4502A
platform) were downloaded on Nov. 2012. 496 tumor and 58 normal samples
overlap between the two TCGA data sets (drawn from the same patients).
Both the “discovery” and “validation” EGA data sets were also
downloaded, which consisted of 997 and 995 tumor samples, respectively,
and 144 normal samples (Illumina HT-12 v 3 platform, accession number
EGA S00000000083) [[61]36]. The EGA discovery data set was used in our
analysis and the EGA validation data set was used to further verify our
major results (See discussion in Additional file [62]1). For brevity,
the “discovery” EGA data is referred to as the EGA data set in the
remainder of this paper unless otherwise stated. 37 tumor and 8 normal
samples of the TCGA ovarian cancer data set (Affymetrix HG-U133A
platform, batch 9) were downloaded on Feb. 2013, and 24 tumor and 10
normal samples of TCGA GBM data set (Affymetrix HG-U133A platform,
batch 8) were download on Nov. 2013. The obesity data set (Affymetrix
HG-U133_Plus_2 platform) of 5 obesity and 6 control samples was
downloaded from GEO [GDS3688] (omental adipose from obese, prepubertal
children).
No major batch effects were observed for the two TCGA breast cancer
data sets [[63]35]. The batch effect of EGA data set has been removed
by a linear model [[64]36]. To eliminate batch effects in the ovarian
cancer data, only batch 9 of TCGA ovarian cancer data set was used
[[65]37]. Similarly, only batch 8 of the TCGA GBM data set was used
because the possibility of batch effects within the GBM data set may
not be ignored
([66]http://bioinformatics.mdanderson.org/main/TCGABatchEffects:Overvie
w).
A total of 269 human pathways were downloaded from KEGG [[67]38] on
Jun. 2013. 175 of these pathways in 30 pathway categories were used in
our analysis after excluding all disease pathways and pathways with
size either smaller than 16 or larger than 350 (to increase statistical
power).
Classification of breast cancer samples
To be consistent with clinical practice, we classified tumor samples
into luminal A, luminal B, triple-negative, and HER2+ subtypes using
the following steps. First, HER2+ samples were identified based on the
test results of immunohistochemistry (IHC) or florescence in situ
hybridization (FISH). The rest of the samples were then split into ER+
and ER- samples based on estrogen receptor (ER) status provided by IHC
test results. ER+ samples were then further classified into luminal A
and luminal B subtypes using PAM50 classification. Finally,
triple-negative samples were extracted from the pool of ER- samples
according to progesterone receptor (PR) status based on IHC test
results. PAM50 classification results alone were used whenever ER or
HER2 status was not available in the clinic data. As EGA data does not
provide the PR status, instead we used the expression-based
classification result for PR status [[68]36]. PAM50 classification
results were downloaded from the UCSC Cancer Genomics Browser [[69]39]
for TCGA data sets and the supplementary materials in its original
publication [[70]36] for the EGA data set. More details on sample
classification can be found in Additional files [71]1 and [72]2.
Pathway enrichment analysis
We focused on the enrichment of pathways abnormally perturbed in the
disease state compared to the normal state in four major subtypes of
breast cancer. PWEA [[73]31] was used in this study to carry out
pathway enrichment analysis, as a comprehensive study previously
indicated that it has a higher sensitivity than other enrichment
analysis methods including GSEA [[74]32] with little or no loss of
specificity [[75]31,[76]40]. The PWEA results of all data sets are
provided in Additional file [77]3.
Definition of reproducibility
Our measure of reproducibility was applied to both pathway and
gene-based profiles. Using pathways as an example, the reproducibility
r can be defined as
[MATH: r=NCN+NCN'/2 :MATH]
where N and N’ are the number of enriched pathways for two different
data sets, and N[ c ]is the number of overlapping of enriched pathways
between two data sets.
Pathway profile
For a given disease, a pathway profile p can be defined as:
[MATH: p=p1,p<
/mi>2,…pN :MATH]
where N is the total number of pathways used in the analysis and p[ n
](1 ≤ n ≤ N) is equal to either 1 or 0 to indicate whether n[th]
pathway is enriched or not, respectively.
Analysis workflow
The sketch of our approach is depicted in Figure [78]1. Our approach
utilized three main data sets which encompassed two separate patient
cohorts and three distinct gene expression measurement platforms. The
samples were first split into luminal A, luminal B, triple-negative,
and HER2+ subgroups on each platform independently. Perturbed functions
(represented by KEGG pathways [[79]38]) were then identified using the
PWEA algorithm [[80]31]. Pathway profiles were constructed thereafter
as binary vectors with length equal to the number of pathways with each
element set to either 1 or 0 to indicate whether the corresponding
pathway is enriched. Finally, correlations between breast cancer
subtypes and other diseases were calculated by comparing their pathway
profiles using hypergeometric statistics. The detail of our results is
described below according to three key factors that may impact the
performance of our analysis: reproducibility, accuracy and resolution.
Figure 1.
Figure 1
[81]Open in a new tab
The workflow of pathway profiling. The tumor samples of three data sets
are classified into luminal A, luminal B, triple-negative, and HER2+
subtypes by clinical data and PAM50 classification results. We then
perform pathway enrichment analysis for each subtype in each data set
individually using the PWEA algorithm. After that we evaluate our
method based on its reproducibility, accuracy, and resolution. Finally,
we build a small disease/subtype network based on the pathway profiles
of each of the four subtypes of breast cancer as well as related
diseases.
Results
Breast cancer is a complex and heterogeneous disease [[82]41] which
consists of four major clinically-relevant subtypes: luminal A, luminal
B, triple-negative, and HER2+, which vary in prognosis and require
distinct treatments [[83]5,[84]42]. Gene signatures play an important
role in the classification of breast cancer subtypes. However, a
comparative study of five sets of gene signatures for breast cancer
indicated that the overlap between gene signatures is still low
[[85]43]. Our goal therefore is to have a stable and accurate
characterization of breast cancer subtypes using pathway profiles so
that the common perturbed pathways among the four subtypes, as well as
specific pathways uniquely perturbed for each subtype, will be
appropriately identified.
The numbers of enriched pathways for each subtype in different data
sets are shown in Table [86]1. It is clear that, on average, the
luminal A subtype has fewer enriched pathways than the other three
subtypes. All of the enriched pathways are identified based on a false
discovery rate (FDR) cut-off of 0.1, calculated using the
Benjamini-Hochberg method [[87]44].
Table 1.
Number of enriched pathways for each subtype of breast cancer and data
set
Data set Luminal A Luminal B Triple-negative HER2+
TCGA RNA-Seq
__________________________________________________________________
66
__________________________________________________________________
84
__________________________________________________________________
55
__________________________________________________________________
85
__________________________________________________________________
TCGA microarray
__________________________________________________________________
48
__________________________________________________________________
66
__________________________________________________________________
65
__________________________________________________________________
69
__________________________________________________________________
EGA microarray 61 87 111 103
[88]Open in a new tab
Reproducibility
Reproducibility is an essential requirement for almost all published
scientific work. We focused on the reproducibility of our method over
varying patient cohorts and platforms, with the intention to show that
our result can be broadly applied in practice.
The reproducibility (See definition in Methods) of our method was
evaluated using the gene expression data generated either from
different expression measurement technology (Microarray vs. RNA-Seq),
different sample sets (TCGA vs. EGA), or both. Only matched tumor
samples (496) and normal samples (58) from each TCGA data set were used
when examining the reproducibility between two technologies, and all
914 TCGA RNA-Seq samples (808 tumor and 106 normal) and 585 TCGA
microarray samples (522 tumor and 63 normal) were used for the rest of
the analysis.
Overall, about 67%, 71%, 67% and 76% average reproducibility were
achieved (Figure [89]2) for luminal A, luminal B, triple-negative, and
HER2+, respectively over different data sets. This result is consistent
with a previous study [[90]45] where RNA samples were analyzed
separately on 4 different microarray platforms and the percentage of
overlapped functional perturbation between any two platforms fell in
the range of 57-70%. On the contrary, it has been shown that any pair
of gene signatures developed for breast cancer share only a few common
genes [[91]43]. For example, there are only 17 overlapping genes in the
two signatures sets [[92]14] (with 456 and 231 genes respectively)
developed for breast cancer survival-related prediction [[93]6,[94]18].
Similarly, only three common genes have been found [[95]46] in two sets
of genetic markers (each has about 70 genes) predicting the metastasis
of breast cancer [[96]7,[97]19] with similar accuracy. To make a more
intuitive comparison with the gene-based method, we calculated the
reproducibility of the top 1500 DEGs between each pair of our data
sets. As shown in Figure [98]2, the reproducibility of each enriched
pathway set is clearly much higher than that of the top DEGs for all
pairs of data sets. These results further verified that pathway
profiling has a much better reproducibility than gene-based methods.
Figure 2.
Figure 2
[99]Open in a new tab
Reproducibility of enriched pathways (the left side) and top DEGs (the
right side) between each pair of data sets for each subtype of breast
cancer. The FDR cut-off is set as 0.1 for enriched pathways. The top
1500 genes are used to calculate reproducibility for DEGs.
Reproducibility across technologies
Recently, RNA-Seq has become a powerful alternative to microarrays due
to advantages such as high resolution, increased dynamic range, lower
background noise, relatively little technical variation, and the
ability to profile the entire transcriptome [[100]47,[101]48]. It is
therefore expected that the TCGA RNA-Seq data set may provide more
reliable results than other two data sets.
We focus on the matched samples between the TCGA RNA-Seq and TCGA
microarray data sets to examine the effect of gene expression
measurement technology on the reproducibility of our approach. The
comparison of enriched pathways generated from the two data sets
results in 60%, 73%, 65% and 70% reproducibility for the luminal A,
luminal B, triple-negative, and HER2+ subtypes respectively, with about
67% on average. We also found that the reproducibility based on matched
samples was slightly smaller than those based on all samples except in
the case of luminal B: 65%, 70%, 66%, and 75% respectively, for luminal
A, luminal B, triple-negative and HER2+ subtypes as shown as blue bars
in the left side of Figure [102]2. This observation implies that
increasing the number of samples may improve reproducibility
[[103]14,[104]15]. Meanwhile, only 45%, 53%, 50%, and 52%
reproducibility was found for the top 1500 DEGs for the corresponding
subtypes with 50% on average. These results indicate that the
functional profile-based method presents advantages over the gene-based
method and can be used to directly compare results that are generated
from data sets produced by different technologies.
Reproducibility across different data sets
We further investigate the reproducibility of our approach across
different data sets. Among the three data sets used in the study, there
is, as expected, the smallest difference between TCGA RNA-Seq and TCGA
microarray data sets because 496 out of 522 samples from the TCGA
microarray data are drawn from the same patient pool as the TCGA
RNA-Seq data set, whereas the largest difference separated the TCGA
RNA-Seq and EGA microarray data (different patients and different
technologies). However, the corresponding average reproducibility over
four subtypes is almost invariable with 69%, 68%, and 74% for the data
set pairs TCGA RNA-Seq vs. TCGA microarray, TCGA microarray vs. EGA
microarray, and TCGA RNA-Seq vs. EGA microarray respectively (left side
of Figure [105]2). On the contrary, there is an obvious difference with
regards to the reproducibility of the top 1500 DEGs between each
dataset pair, and this is shown on the right side of Figure [106]2. For
example, the reproducibility of top DEGs between the two TCGA data sets
is always larger than the reproducibility of the other two data set
pairs in all four subtypes. From this perspective, pathway profiles
would appear to be better suited for comparative studies with different
data sets that may be generated by different technologies.
Determination of the appropriate data set
As mentioned at the beginning of the previous section, RNA-Seq data is
expected to generate more reliable results than two other data sets
[[107]47,[108]48]. However, in general it is challenging to determine
which pathway profiles are better due to the lack of a gold standard
[[109]40,[110]49]. Here we adopt a simple strategy to address this
issue with the assumption that the pathways predicted by multiple data
sets should be more reliable than those predicated by only one data
set. This assumption is partly evidenced by the observation that the
pathways enriched in all three data sets generally have lower p-value
while the pathways enriched in only one data set often have higher
p-values (more discussion can be found in Additional files [111]1 and
[112]4). From these perspectives, we first generated a reference
pathway profile as the benchmark where a pathway is enriched only if it
is enriched in more than one data set. We then calculated the
reproducibility of pathway profiles in each data set against the
reference pathway profiles for each subtype. As shown in Table [113]2,
the pathway profile resulted from the TCGA RNA-Seq data set has the
highest average reproducibility. As a result of these findings, the
analyses in the remainder of this paper are performed against the
RNA-Seq data set unless otherwise stated.
Table 2.
Reproducibility between pathway profiles (PPs) of each data set and
reference pathway profiles (RPPs)
Luminal A Luminal B Triple-negative HER2+ Average
PPs of TCGA RNA-Seq data set vs. RPPs
__________________________________________________________________
88%
__________________________________________________________________
90%
__________________________________________________________________
79%
__________________________________________________________________
91%
__________________________________________________________________
87%
__________________________________________________________________
PPs of TCGA microarray data set vs. RPPs
__________________________________________________________________
76%
__________________________________________________________________
80%
__________________________________________________________________
89%
__________________________________________________________________
85%
__________________________________________________________________
83%
__________________________________________________________________
PPs of EGA microarray data set vs. RPPs 86% 86% 79% 88% 85%
[114]Open in a new tab
Accuracy
The accuracy of the perturbed pathways identified in our analysis is
measured by their biological relevance to breast or other cancers based
on survey of relevant literature sources. About 74% of the enriched
pathways (enriched in at least one subtype) are known to be associated
with breast cancer or cancers such as those shown in Table [115]3 and
[116]4.
Table 3.
Overlaps of common enriched pathways across three data sets
Pathway Pathway category Reference^ a
Glycolysis/Gluconeogenesis
__________________________________________________________________
Carbohydrate metabolism
__________________________________________________________________
[[117]82]
__________________________________________________________________
Sphingolipid metabolism
__________________________________________________________________
Lipid metabolism
__________________________________________________________________
[[118]83]
__________________________________________________________________
Purine metabolism
__________________________________________________________________
Nucleotide metabolism
__________________________________________________________________
[[119]84]
__________________________________________________________________
Pyrimidine metabolism
__________________________________________________________________
Nucleotide metabolism
__________________________________________________________________
[[120]85]
__________________________________________________________________
Arginine and proline metabolism
__________________________________________________________________
Amino acid metabolism
__________________________________________________________________
[[121]86]
__________________________________________________________________
Tyrosine metabolism
__________________________________________________________________
Amino acid metabolism
__________________________________________________________________
[[122]87]
__________________________________________________________________
Phenylalanine metabolism
__________________________________________________________________
Amino acid metabolism
__________________________________________________________________
[[123]88]
__________________________________________________________________
One carbon pool by folate
__________________________________________________________________
Metabolism of cofactors and vitamins
__________________________________________________________________
[[124]89]
__________________________________________________________________
Fanconi anemia pathway
__________________________________________________________________
Replication and repair
__________________________________________________________________
[[125]90]
__________________________________________________________________
PI3K-Akt signaling pathway
__________________________________________________________________
Signal transduction
__________________________________________________________________
[[126]91]
__________________________________________________________________
Regulation of actin cytoskeleton
__________________________________________________________________
Cell motility
__________________________________________________________________
[[127]92]
__________________________________________________________________
Focal adhesion
__________________________________________________________________
Cell communication
__________________________________________________________________
[[128]93]
__________________________________________________________________
Adipocytokine signaling pathway
__________________________________________________________________
Endocrine system
__________________________________________________________________
[[129]94]
__________________________________________________________________
Progesterone-mediated oocyte maturation
__________________________________________________________________
Endocrine system
__________________________________________________________________
-
__________________________________________________________________
Axon guidance Development [[130]95]
[131]Open in a new tab
^aReference shows association between a given pathway and breast cancer
or other cancers.
Table 4.
HER2+ specific pathways
Pathway Pathway category Reference^ a
Tryptophan metabolism
__________________________________________________________________
Amino acid metabolism
__________________________________________________________________
-
__________________________________________________________________
Terpenoid backbone biosynthesis
__________________________________________________________________
Metabolism of terpenoids and polyketides
__________________________________________________________________
-
__________________________________________________________________
Drug metabolism - cytochrome P450
__________________________________________________________________
Xenobiotics biodegradation and metabolism
__________________________________________________________________
-
__________________________________________________________________
mTOR signaling pathway
__________________________________________________________________
Signal transduction
__________________________________________________________________
[[132]63]
__________________________________________________________________
Serotonergic synapse
__________________________________________________________________
Nervous system
__________________________________________________________________
-
__________________________________________________________________
Long-term depression
__________________________________________________________________
Nervous system
__________________________________________________________________
-
__________________________________________________________________
Circadian rhythm Environmental adaptation [[133]64]
[134]Open in a new tab
^aReference shows association between a given pathway and the HER2+
subtype.
Common enriched pathways
As shown in Figure [135]3, there are 28 common enriched pathways (CEPs)
across all four subtypes of breast cancer (See Additional file [136]5
for the whole list), which indicate the shared features of breast
cancer. The CEPs occupy about 42%, 33%, 51%, and 33% of the total
enriched pathways in luminal A, luminal B, triple-negative, and HER2+
subtypes, respectively. To test the reliability of CEPs, we first
performed the enrichment analysis over all tumor samples without the
separation of subtypes, and 86% of these CEPs are enriched in this
case. We also checked enrichment analysis results in the TCGA and EGA
microarray datasets, where we found that 24 of the total 28 CEPs (86%)
are confirmed by at least one other data set and 15 of them are
confirmed in both (Table [137]3). An extensive literature search
indicates that 24 CEPs show relationships with either breast cancer or
cancers (See Table [138]3 for details). These results indicate that
despite the heterogeneity of breast cancer samples, different subtypes
share a relatively high degree of similar molecular mechanisms to
support tumor growth and metastatic dissemination. These CEPs might
also be used as the “signature pathways” of breast cancer when compared
to other diseases.
Figure 3.
Figure 3
[139]Open in a new tab
The Venn diagram of pathway enrichment analysis results from the
RNA-Seq data. There are 15 regions in the Venn diagram including common
enriched pathways and subtype-specific pathways. Common enriched
pathways are pathways enriched in all four subtypes of breast cancer.
Subtype specific pathways are pathways enriched in only one subtype of
breast cancer.
Among 28 CEPs, 9 are relevant to metabolism such as
Glycolysis/gluconeogenesis, Purine metabolism and Tyrosine metabolism.
This is not surprising, as cancer cells must perturb metabolic pathways
to provide energy and building blocks to support aggressive cell growth
and proliferation [[140]50]. Thus, these pathways may account for
higher biosynthesis of nucleic acids (e.g., pathways in the category of
nucleotide metabolism) and proteins (e.g., pathways in the category of
amino acid metabolism) and higher energy demands (e.g., pathways in the
category of carbohydrate metabolism) of tumors.
We also noted that CEPs consist of some well-known cancer related
pathways such as pathways in the categories of replication and repair,
signal transduction, cell growth and death. For example, the CEP “Cell
cycle” appears to be in agreement with the intuition that inappropriate
proliferation is one of the most remarkable characteristics of the
cancer cell. This is expected and can be considered as a benchmark
result for our method, as several other pathway-based studies in cancer
have observed similar results [[141]51-[142]53]. Additionally, the CEPs
also contain the PI3K-Akt signaling pathway. It has been shown that
many breast cancer tumors harbor mutations in the PI3K-Akt signaling
pathway [[143]35]. These mutations are thought to lead to activation of
RHEB, which in turn promotes activation of the mTOR gene [[144]54], one
downstream effect of which is vastly increased protein production and
much larger cancer cells. The notch signaling pathway, on the other
hand, has attracted increasing attention as the potential new
therapeutic targets for cancer patients [[145]55], and our results
indicate that it may also be applicable to the breast cancer.
Two hormonally-related pathways are observed to be enriched in CEPs as
well, including Progesterone-mediated oocyte maturation and the
Estrogen signaling pathway. As expected, we observe that Estrogen
signaling is significantly upregulated in ER+ tumors (luminal A and
luminal B) and significantly downregulated in triple-negative tumors.
Though HER2+ tumors are not known to be directly enriched in
estrogen-related genes, nevertheless we find that the pathway is still
significantly dysregulated in HER2+ tumors. It has been suggested that
ER+/HER2+ tumors utilize crosstalk between the MAPK signaling pathway
and the ER signaling pathway to evade common anti-estrogenic therapies
such as Tamoxifen [[146]56].
Subtype-specific pathways
Subtype-specific pathways are especially interesting, since they are
potential candidates for signature pathways for each subtype of breast
cancer. As shown in Figure [147]3, luminal A, luminal B,
triple-negative, and HER2+ subtypes have 4, 2, 4, and 7
subtype-specific pathways (about 6%, 2%, 7%, and 8% of their enriched
pathways) respectively. These pathways represent the heterogeneity of
breast cancer subsets.
Luminal A specific pathways
Luminal A has 4 subtype-specific pathways: Ubiquitin mediated
proteolysis, Endocytosis, Carbohydrate digestion and absorption, and
Vasopressin-regulated water reabsorption. The enrichment of Ubiquitin
mediated proteolysis pathway is evidenced by the recent work [[148]57]
where it is found that the luminal A subtype had an increased
expression level of cyclin D1 which regulates proteolysis mediated by
ubiquitin [[149]58].
There are two interesting pathways that are not enriched in luminal A
subtype in comparison to the other three subtypes: the Homologous
recombination and the p53 signaling pathways. This observation is
consistent among all three data sets and may therefore be used to aid
in the distinguishability of the luminal A subtype. This observation is
also supported in the literature where it has been reported that a
functional defect in homologous recombination is common in
triple-negative breast cancer and in a subset of high grade ER and/or
HER2 positive breast cancer [[150]59]. Homologous recombination may
also be associated with the luminal B subtype through the BRCA2 gene
which is known to be involved in error-free DNA repair of double-strand
breaks (DSBs) through homologous recombination and BRCA2 mutation
carriers have a predilection for developing breast cancers of the
luminal B subtype [[151]60]. On the other hand, the luminal A subtype
has the lowest mutation frequency among the four subtypes of breast
cancer [[152]35], therefore it is not surprising that the p53 signaling
pathway is not enriched in the luminal A subtype.
Luminal B specific pathways
The luminal B subtype has only two subtype-specific pathways: Histidine
metabolism and Phosphatidylinositol signaling systems, although it has
many more enriched pathways than the luminal A subtype. It should also
be noted that pathways enriched by the luminal B subtype have much more
overlaps with those found enriched in the triple-negative and HER2+
subtypes than in the luminal A subtype (Figure [153]3).
Since ER status is a very important factor in planning breast cancer
treatment, we also outline the pathways that are specific to the ER+
subtype (luminal A and luminal B). Among 6 ER+ specific pathways, four
of them have supporting evidence from previous studies: the Primary
bile acid biosynthesis, Jak-STAT signaling pathway, Complement and
coagulation cascades, and GnRH signaling pathways (More details can be
found in Additional file [154]1).
Triple-negative specific pathways
The triple-negative subtype has 4 subtype-specific pathways: Alanine,
aspartate and Glutamate metabolism, Lysine degradation, Vascular smooth
muscle contraction, and Glutamatergic synapse. Two of these have
supporting evidence in previous studies. Metabotropic glutamate
receptor-1 (GRM1) has been reported as an oncogene in the progression
of triple negative breast cancer [[155]61], whose alteration may affect
the pathway Alanine, aspartate and glutamate metabolism. It has been
observed that the amino acid metabolism is also a major source of
energy and carbon for tumor cell growth and survival in invasive breast
cancer such as the triple-negative subtype [[156]62]. Thus, the
perturbation of the Lysine degradation pathway is very likely resulted
from the changes in the energy metabolism of tumor cells.
HER2+ specific pathway
The HER2+ subtype has the largest number of subtype-specific pathways
among the four subtypes as shown in Table [157]4, which may be due to
the important role of HER2 in promoting cell growth and proliferation.
Among 7 HER2+ specific pathways, mTOR signaling pathway and Circadian
rhythm have supporting evidence from previous studies. For example, it
was pointed out that constitutively activating HER2 and EGFR stimulated
many of the same intracellular signaling proteins and pathways as wild
type receptors, including the mTOR pathway [[158]63]. Additionally, the
deregulated expression of the circadian related genes PER1, PER2 and
PER3 in breast cancers has been studied [[159]64]. It was found that
methylation of the PER gene promoters has a strong correlation with
c-erbB2 expression.
Resolution
The resolution of our approach aims to test whether the pathway
profiles have enough detail not only to distinguish the different
diseases but also to correctly assess the correlations between them
based on the perturbed functions. To achieve this, we evaluate the
disease correlation by calculating the hypergeometric probability of
corresponding pathway profiles: a method that has been successfully
applied to calculate the correlation between phylogenetic profiles in
our previous study [[160]65]. We further generated pathway profiles for
ovarian cancer (TCGA, Affymetrix HG-U133A platform), GBM (TCGA,
Affymetrix HG-U133A platform) and obesity (GEO, GDS3688, HG-U133_Plus_2
platform) in order to compare breast cancer against other diseases.
The resulting profiles are drawn in Figure [161]4A using Gitools
[[162]66] with 11, 85, and 14 enriched pathways for ovarian cancer,
GBM, and obesity respectively according to PWEA results (See Additional
file [163]6). Figure [164]4B shows the correlations between seven
diseases/subtypes as a small network drawn by VisANT [[165]67-[166]71]
where 4 subtypes are encapsulated by the metanode [[167]72-[168]74] of
breast cancer and the edge thickness is roughly proportional to
correlation strength. With a 0.05 p-value cutoff, a total of 17 disease
pairs exhibit significant correlation with varied strength (See
Additional file [169]7). As expected, most of the correlations between
four subtypes are stronger than those between subtypes and the other
three diseases. Among the four subtypes, luminal A and luminal B
subtypes are closer to each other than to the other two subtypes. On
the other hand, the luminal A and luminal B subtypes are closer to the
HER2+ subtype than to the triple-negative subtype, which may not be
surprising because the HER2+ subtype is simply characterized by copy
number variation in the HER2 amplicon, while it has been suspected that
ER+ (luminal A and luminal B) and ER- subtypes may have differing cells
of origin [[170]75]. Similarly, both triple-negative and HER2+ subtypes
have stronger correlation to ovarian cancer than luminal A and luminal
B subtypes have, most likely because of the effect of P53 gene
[[171]35]. Meanwhile, the fact that luminal B, triple-negative, and
HER2+ subtypes are more aggressive tumors than the luminal A subtype is
reflected by the observation that both triple-negative and HER2+
subtypes are closer to the luminal B subtype than the luminal A
subtype. We also find the luminal A subtype is closer to obesity than
to ovarian cancer, indicating the potential connection between breast
cancer and obesity through estrogen [[172]76]. Interestingly, our
results show that GBM is significantly correlated with luminal A,
luminal B, HER2+ breast cancer and obesity. It was pointed out that sex
hormones are important in the growth of breast cancer and are also
important in the development of GBM [[173]77]. Furthermore, the
anti-estrogen drug Tamoxifen has been found to be effective in
decreasing glioblastoma cell proliferation [[174]78]. On the other hand
HER2 is the most frequently expressed tyrosine kinase receptors in GBM
cells [[175]79]. The association between GBM and obesity may be
explained by the important role of leptin in both of these two
diseases. The current model suggests that obesity in human is due to a
desensitization to leptin while within gliomas, there is a correlation
between tumor grade and tumor expression of leptin and its receptor
[[176]80].
Figure 4.
Figure 4
[177]Open in a new tab
Evaluate disease correlations by using pathway profiles. (A) Pathway
profiles for luminal A, luminal B, HER2+, and triple-negative subtypes
of breast cancer, as well as ovarian cancer, glioblastoma multiforme
(GBM), and obesity. Each red rectangle represents an enriched pathway.
From right to left, KEGG pathways are sorted according to pathway
categories and have the same order as the FDR values of enriched
pathways in these 7 diseases/subtypes (Additional file [178]6). Due to
limitation of space, here only pathway categories are labeled. This
figure roughly shows that luminal A and luminal B subtypes are closer
to the HER2+ subtype than the triple-negative subtype. (B)
Disease/subtype correlation networks based on pathway profile. Each
node represents one disease and subtypes are embedded in a metanode
representing breast cancer [[179]72-[180]74]. The width of each edge is
roughly proportional to the correlation between the disease/subtype
pair. Here only the significant correlations (p-value < 0.05) are
displayed (See Additional file [181]7). The network is drawn by the
VisANT system [[182]67-[183]71].
Discussion
All the enriched pathways in this study are identified based on a FDR
cut-off 0.1. This cut-off was optimized to achieve reasonable
reproducibility (Figure [184]5) while maintaining adequate coverage and
accuracy. More detail on the FDR cut-off is addressed in Additional
file [185]1. Another important factor that needed to be taken into
account is the coverage of KEGG pathway genes with the corresponding
gene expression measurement technology. All three data sets used in
this study have a good coverage of KEGG pathway genes (TCGA RNA-Seq,
TCGA microarray and EGA microarray data sets contain about 20360, 17814
and 17621 genes, corresponding to the 97%, 91% and 90% coverage of 5584
total KEGG pathways genes, respectively), which however may not hold
for the large sets of microarray array data available in GEO databases.
More discussion of the pathway coverage can also be found in Additional
file [186]1.
Figure 5.
Figure 5
[187]Open in a new tab
Reproducibility of enriched pathways between different data
sets/platforms vs. FDR cut-off for each subtype of breast cancer. In
general, the reproducibility increases quickly at the beginning and
gradually becomes flat as FDR cut-off increases.
The comparison of the reproducibility resulted from the pathway
profiles and top DEGs are in general conservative. From a
hypergeometric statistics perspective, a larger number of genes
resulted from the different analyses will achieve better
reproducibility. We therefore used the top 1500 DEGs in the
reproducibility comparison instead of known signature genes available
in the literature because the number of the latter is generally much
smaller than 1500. In addition, we also performed the comparison using
the top 6000 DEGs and the enriched pathways still achieves better
reproducibility (See details in Additional file [188]1). The robustness
of our approach is further verified by using the EGA validation data
set [[189]36]. There is little change in the reproducibility between
two TCGA data sets and the EGA data set when substituting the EGA
discovery set with the validation set (in Additional file [190]1 Figure
S5 where more discussion of the EGA validation set can also be found).
Despite the improvement over classical gene-based methods in
characterizing breast cancer and related diseases, this approach has
great potential to be further enhanced. First, it depends on the prior
knowledge of pathways which are far from complete, and identifying new
pathways is a difficult, time-consuming, and labor intensive work. An
alternative approach to partially ameliorate this limit is to replace
pathways with functional modules that may be computationally identified
[[191]81]. Second, current pathway profiles are represented as a simply
binary vector that may be improved by incorporating some additional
information such as the corresponding statistical significance of the
enriched pathway to make each profile more quantitative. For a simple
approach, one may directly set elements of pathway profile as the
p-values or enrichment scores of given pathways.
Conclusion
Reproducibility is one of the main challenges for the identification of
gene signatures. Besides technical factors, the disparity may result
from the fact that diseases are directly caused by the perturbations of
the molecular function that are generally carried out by a set of genes
in the form of modules or pathways. Therefore it may be more
appropriate to characterize diseases at the functional level than at
the gene level. Following this perspective, we developed a novel
approach to characterize diseases using the pathway profiles and
evaluated the approach based on profiles’ reproducibility, accuracy and
resolution. Using four subtypes of breast cancer as an example, the
results of this new approach are promising with 70% average
reproducibility, 74% average accuracy (e.g., references in