Abstract
Background
Delineating the molecular drivers of cancer, i.e. determining cancer
genes and the pathways which they deregulate, is an important challenge
in cancer research. In this study, we aim to identify pathways of
frequently mutated genes by exploiting their network neighborhood
encoded in the protein-protein interaction network. To this end, we
introduce a multi-scale diffusion kernel and apply it to a large
collection of murine retroviral insertional mutagenesis data. The
diffusion strength plays the role of scale parameter, determining the
size of the network neighborhood that is taken into account. As a
result, in addition to detecting genes with frequent mutations in their
genomic vicinity, we find genes that harbor frequent mutations in their
interaction network context.
Results
We identify densely connected components of known and putatively novel
cancer genes and demonstrate that they are strongly enriched for cancer
related pathways across the diffusion scales. Moreover, the mutations
in the clusters exhibit a significant pattern of mutual exclusion,
supporting the conjecture that such genes are functionally linked.
Using multi-scale diffusion kernel, various infrequently mutated genes
are found to harbor significant numbers of mutations in their
interaction network neighborhood. Many of them are well-known cancer
genes.
Conclusions
The results demonstrate the importance of defining recurrent mutations
while taking into account the interaction network context. Importantly,
the putative cancer genes and networks detected in this study are found
to be significant at different diffusion scales, confirming the
necessity of a multi-scale analysis.
Background
Cancer is a complex genetic disease, caused by a multistep process in
which several collaborating mutations allow a cell to evade the
checkpoints that normally safeguard it against DNA damage and other
disruptions of healthy cell growth [[30]1-[31]3]. Among all human genes
more than 1% are implicated in cancer by somatic or germline mutations
[[32]4]. This implies that hundreds of genes may contribute to tumor
formation. This percentage is expected to increase since cancer genome
sequencing of human tumors [[33]5] and forward genetics screens in
model organisms [[34]6-[35]8] are providing evidence for thousands more
[[36]9]. Arguably, cancer genes are thus better considered as
contributing to cancer rather than causal for cancer [[37]10], as their
effect is dependent on many other genetic aberrations.
One of the major challenges in cancer research is to determine all
genes that contribute to the disease as well as delineate the pathways
through which these genes exert their malicious effects on a cellular
process. A potent way of discovering genes driving carcinogenesis is to
perform forward genetic screens in animal models such as retroviral or
transposon insertional mutagenesis (IM). Retroviral insertional
mutagenesis (RIM) is based on integration of a retrovirus into the host
cell’s genome, mutating the host cell’s DNA in the process. Putative
cancer genes in the RIM datasets are determined by searching for genes
in the vicinity of genomic regions that harbor recurrent mutations
across multiple independent tumors [[38]6,[39]7,[40]11]. Such regions
in the genome are called Common Insertion Sites (CIS). Statistical
significance of CIS can be determined by means of permutation
strategies [[41]12].
A major problem in this approach is that, assuming realistic sample
sizes, not all cancer-associated genes harbor sufficient mutations in
multiple tumors to be called significant. This is because mutations in
the vicinity of different genes can exert the same deleterious effect
on the same pathway [[42]10]. This explains that mutations of different
genes represent alternative routes to the acquisition of the same
cancer hallmark [[43]13,[44]14]. A compelling example is given by the
Rb pathway. Mutations in four genes RB1, CDKN2A, CDK4, and CCND1 can be
considered equivalent because their effect on controlling transition to
replication phase is similar [[45]10]. On the other hand, acquisitions
of some cancer hallmarks require mutations of multiple genes. Examples
of this include the frequently observed co-mutation of RAS and
TP53[[46]15]. Co-mutated genes can thus be considered to collaborate in
carcinogenesis [[47]16].
An important implication of this fact is that the definition of a
recurrent gene mutation, defined as a gene that harbors frequent
mutations in its genomic vicinity in more tumors than expected by
chance, is no longer appropriate. Instead, recurrent mutations should
be defined at the scale of a pathway, rather than the scale of a single
gene.
A straightforward approach would be to regard mutations affecting
different genes in the same pathway as recurrent by a summarization
across pathways defined in databases such as KEGG [[48]17] and using
enrichment statistics such as GSEA [[49]18]. Here, we argue against
such an approach for the following reasons. First, such approaches
consider a pathway as a group of genes, and thus, disregard any
topological information in the gene interaction network that describes
a pathway. Edge weights, capturing confidence in a certain relation,
are also not taken into account. Most importantly, however, gene sets
that are defined based on a pathway represent only one of many scales
at which this pathway can be defined. In reality, mutations may target
very specific components of a pathway (i.e. small-scale), whereas in
other cases summarization across a collection of functionally similar
pathways (i.e. large-scale) may be justified.
When a pathway is defined at a scale which is too small, mutations are
split across different distinct pathways. As a result, mutations are
only recurrent across a relatively small number of tumors. This implies
loss of power to detect relevant mutation patterns. On the other hand,
when a pathway is defined at a scale which is too large, many
irrelevant genes are considered that do not harbor mutation. This also
implies reduced power to detect recurrence. Since it is a priori
unknown at which scale it is most appropriate to summarize mutations
and define a pattern of recurrent mutation, a multi-scale approach that
evaluates pathway descriptions of different scales is required.
For this reason, we propose to consider gene mutations in the context
of its pathway neighborhood as encoded in a functional protein-protein
interaction (PPI) network [[50]19]. This network describes a variety of
gene and protein interactions (including experimentally verified
physical interactions, co-expressed genes, interactions mined from
literature, known pathway interactions, interactions predicted from
protein homology, etc.). Importantly, this offers the possibility to
obtain a more continuous definition of the size of the pathway
neighborhood, or, in other words, a less discrete definition of the
scale at which the interaction network context is considered. Using a
multi-scale diffusion kernel [[51]20] (detailed in the Methods
section), this scale can be varied across a range of values. This
enables analysis of the mutations in interaction contexts of different
scale, ranging from small-scale signaling cascades to large-scale
molecular pathways.
The PPI network has been successfully used to define subnetworks with
predictive value for patient prognosis from gene expression data
[[52]21-[53]26]. Other approaches use PPI information to identify
subnetworks with frequent mutations, since these may point to commonly
mutated pathways [[54]27,[55]28]. In both cases, a scoring function is
optimized to identify high-scoring subnetworks. However, these types of
optimizations are NP-hard and therefore require heuristic strategies
that are computationally expensive [[56]25-[57]27].
Here, we propose a different approach, based on diffusion kernel, to
determine genes recurrently mutated in interaction context (ReMIC
genes). ReMIC genes are detected when the gene itself or the
interaction network neighborhood of the gene harbors more mutations
than expected by chance (Figure [58]1). The amount of diffusion
determines the scale of the network neighborhood that will be taken
into account. A permutation analysis is employed to determine the
significance of a ReMIC gene within the context of its network
neighborhood at a particular scale. By varying this scale across a
range of values, ReMIC genes that arise as a result of mutations in
small, well-connected subnetworks (small-scale) as well as those that
arise as a result of mutations in large pathway components
(large-scale) can be identified.
Figure 1.
Figure 1
[59]Open in a new tab
Framework. A) Insertion mutation data (blue lollipops, each occurring
in one tumor) across a set of tumors (not shown) and four genomic
regions. The region on Chr 5 and Chr 7 harbor sufficient mutations to
be called CIS. The region on Chr 11 and Chr 16 do not. B) Gene mutation
scores are obtained by weighted summarization. The weighing function is
a flat-top Gaussian. C) PPI network with genes as nodes. Color denotes
the mutation gene score. D) Illustration of the diffusion kernel.
Conceptually, the diffusion kernel is similar to a heat diffusion
process. In graphs this means that the mutation score at the graph node
is diffused throughout the network dependent on the graph topology that
connects this node to the rest of the network. Its global distance to
the other nodes is thus dependent on the weight and number of paths
connecting them as well as the diffusion strength. The latter parameter
can be regarded as a scale parameter. For low diffusion strength,
scores hardly diffuse and the interaction context of a gene is
determined by itself and a few well-connected neighbors. For high
diffusion strength, scores are almost fully diffused and can thus reach
distant genes in the network, albeit in very small amounts. Using a
permutation approach, it is possible to establish whether the diffused
scores are higher than expected by chance (starred genes). Notably,
genes with few or no mutations can still reach significance due to high
scoring nodes in their neighborhood.
The proposed approach establishes significance estimates per gene,
mitigating the need of excessive correction for multiple testing that
arises when testing many networks of different sizes. It also requires
no prior assumptions on the size of subnetworks. However, the major
benefit is that our method explicitly takes care of the scale at which
a subnetwork is considered.
In the following, we outline the proposed methodology and show results
on a large collection of viral mutation data from two large RIM screens
[[60]6,[61]11]. In particular, we show that different networks of
recurrent mutations arise at different scales. Many genes that do not
harbor sufficient mutations to be called significant in a genome-based
analysis are identified using the proposed approach. A significant
portion of these genes are well-known cancer genes. In addition, we
determine densely connected subnetworks of the significant genes which
are highly enriched for cancer related pathways at different diffusion
strength. These clusters also show significant patterns of mutual
exclusion, indicating a functional relation between the mutated genes.
Methods
Mutation data
Retroviral insertional mutation data is acquired from Mutapedia
(available on [62]http://mutapedia.nki.nl). In total, the dataset
contains 19 997 viral insertions in 933 tumors that have been produced
in mice with various genetic backgrounds [[63]11]. All insertions in
this dataset are mapped to Ensembl genes by employing a flat-top
Gaussian scoring function (Figure [64]1B). This scoring function is
uniform within a gene and has Gaussian tails outside the gene. The
width (standard deviation) of the Gaussian is chosen to be 25kb, based
on observations made in de Jong et al.[[65]29]. Insertions further than
50kb from the gene start or gene termination site are not associated to
that gene. In this way, we obtain a (t × g) mutation matrix M where t
is the number of tumors in the study, g is the number of genes in the
PPI network and M[ij] indicates the mutation score of gene j in tumor
i. The mutation score vector S capturs the insertion frequency in the
vicinity of each gene and is computed as:
[MATH:
Sj=∑∀i
Mij :MATH]
(Figure [66]1).
Interaction graph construction
A PPI graph is obtained from STRING [[67]19]. The combined scores are
used as interaction weights between proteins. To associate proteins in
the PPI network to their corresponding genes, Ensembl gene IDs are
mapped to protein IDs. In case of multiple proteins mapping to the same
gene, interaction weights are collapsed into a single gene-to-gene
interaction weight by averaging. We retain reliable interactions by
removing all links with a weight below threshold T[I]. As a result, we
obtain an undirected and weighted interaction network N = (G, I, S, W),
where G denotes a set of nodes representing the genes, I denotes a set
of edges representing interaction edges between proteins associated
with the Ensembl genes, vector S denotes mutation scores per gene, and
weight matrix W denotes the interaction weights between the genes.
Diffusion kernel
To represent the interaction network context of a gene, it is desirable
to obtain a global similarity metric that captures the local topology
of the network connecting two genes. To this end, we use the diffusion
kernel approach proposed by Kondor and Lafferty [[68]20].
The diffusion kernel of a weighted interaction graph N is described by:
[MATH:
Kβ=<
/mo>eβL :MATH]
where e^βL denotes the matrix exponential and the β >= 0 is the
diffusion parameter that controls the extent of diffusion (i.e. the
scale). Matrix L is equivalent to the Laplacian matrix of the weighted
graph N, and is defined by:
[MATH: Lij=-Wijifi≠j∑l=1
nW
ilifi=j
mtable> :MATH]
The Laplacian matrix L thus represents the graph topology of graph N,
while the diffusion kernel K[β] provides a smooth global similarity
measure on N.
The diffusion parameter β plays an important role as it controls the
extent of the diffusion. For β equal to 0, there is no diffusion (i.e.
K[0] = I) while for β approaching infinity the diffusion reaches
equilibrium where all (connected) nodes receive the same diffusion
contribution. Diffusion Diffusion kernel matrices are created for a
range of β values. Finally, each diffusion kernel matrix is applied to
the gene mutation score vector, to obtain diffused mutation scores, by
calculating SK[β].
Permutation procedure
Significance estimates, establishing whether a gene’s diffusion score
is higher than expected by chance, are calculated on a per gene and per
scale basis by permutation analysis. To this end, we randomize the
association between genes and mutations by permuting the mutation
scores in each tumor independently and recalculating the overall
mutation score vector S. The diffusion kernel at a particular scale is
then applied to the randomized S. This procedure is repeated 10^7
times, resulting in a null-distribution per gene from which an
empirical p-value (including pseudocount) can be calculated.
To reduce the computational burden, we precompute 10^5 random score
vectors S and store these in vector S[rnd]. In every permutation
iteration, g scores are randomly selected from S[rnd]. Moreover, the
permutation procedure for a gene is discontinued when more than 10
permuted diffused scores exceed its non-permuted diffused score. As a
result, the permutation procedure is procedure is carried out for fewer
genes, reducing the computational burden considerably. Correction for
multiple testing is achieved by using the method of Storey and
Tibshirani to calculate a false discovery rate (FDR) per gene [[69]30].
The p-values are adjusted by considering the number of genes in the PPI
network as well as the number of scales at which the diffusion kernel
is employed. Genes exceeding the α-level are called ReMIC genes. The
source code is available in Additional file [70]1.
Graph clustering
In this study, we divide subnetworks of significant genes into densely
connected subgroups by applying the Girvan-Newman graph clustering
method that uses a natural division of the graph into clusters based on
the strength of connections between the nodes [[71]31]. In this
algorithm, a betweenness score of an edge is computed based on the
number of shortest paths that flow through that edge. A high
betweenness score means that this particular edge connects clusters
with densely connected nodes. A dendrogram is constructed by gradually
removing the edges with the highest betweenness score. The root of the
dendrogram represents the original network and the leaves are
individual genes (Additional file [72]2: Figure S6-S11). For each
division of the network into clusters a modularity measure is
calculated which indicates the fraction of edges in the network that
connect nodes in the same clusters. The dendrogram is cut at the point
where the maximum modularity is achieved.
To calculate a modularity measure for a particular division of the
network with n clusters a symmetric (n × n) matrix E is defined. The
matrix entry E[ij]is the fraction of all edges in the network that
connect nodes in cluster i to nodes in cluster j. The trace of E (T[E])
indicates the fraction of edges in the network that link nodes in the
same cluster. Accordingly, a high value of T[E] demonstrates a good
division of the network into densely connected clusters. However, T[E]
itself yields no information about the structure of clusters, since,
for example, a single cluster consisting of all nodes results in a
maximal value of T[E]while it is not a good division of the network.
For this reason the trace of E - ∥E^2∥ is used as a modularity measure
where ∥E^2∥ indicates the sum of the entries of the matrix E^2. This
modularity measure is the fraction of the edges in the network that
link nodes in the same cluster minus the expected value of the same
quantity for randomized network (in which edges are distributed at
random without considering the structure of clusters). A modularity
measure equal to zero means that the number of within-cluster edges is
not better than random connections. A good division of densely
connected clusters results in a modularity measure close to one. The
modularity measure is calculated for each division of the network and
the best cut-level of the dendrogram is determined based on the maximum
modularity. In this work, we use Cytoscape plugin [[73]32] GLay
community structure analysis that implements the Girvan-Newman
algorithm [[74]33].
KEGG pathway enrichment
In order to get a general idea of the function of ReMIC gene clusters,
KOBAS [[75]34] is applied to calculate KEGG pathway enrichments. More
specifically, we use KOBAS to determine overlap of a ReMIC gene cluster
with all known pathways in the KEGG database [[76]17]. For each pathway
that occurs in the input gene set, KOBAS counts the total number of
input genes that are involved in the pathway. It also counts the total
number of genes in the background gene set (all genes in the
interaction network) that are involved in the same pathway. A p-value
of the pathway enrichment is calculated using the Fisher exact test
with Benjamini FDR correction to adjust for multiple testing.
Mutual exclusivity analysis
To determine whether the genes in a subnetwork exhibit more than random
mutual exclusion of mutations (i.e. mutations in different genes of the
subnetwork tend not to co-occur within the same tumor), we employ the
mutual exclusivity module (MEMo) approach [[77]35]. The mutation matrix
M is considered as the adjacency matrix of a graph where M[gt] > 0,
represents an edge connecting gene g to tumor t. The switching
permutation technique is applied to generate 10^4 random mutation
matrices [[78]36].
In the switching permutation method, a randomized mutation matrix is
generated from the original mutation matrix M while the overall
adjacency distribution between genes and tumors is preserved. To do
this, two edges are randomly selected between gene g[1] and tumor
t[1](g[1], t[1]) and gene g[2] and tumor t[2](g[2], t[2]). If these
edges appear in the adjacency graph of M (i.e.
[MATH:
Mg<
mn>1t1 :MATH]
and
[MATH:
Mg<
mn>2t2 :MATH]
are non-zero entries) the ends of the are non-zero entries) the ends of
the edges are swapped to give (g[1], t[2]) and (g[2], t[1]). This
exchange is only allowed if it generates no duplicate edges between a
gene and a tumor. The process is repeated for k×q iterations where k is
the total number of edges in the adjacency graph of M and q is a
constant number. It has been empirically shown that q = 100 is adequate
for many networks [[79]36].
The resulting random mutation matrices are then used to calculate an
empirical p-value for a gene cluster. The p-values are calculated as
the fraction of permutations that lead to a greater or equal number of
mutations than those observed in M. Genes within significant clusters
(with α-level of 0.05) are considered to be more mutually exclusive
than expected by chance.
Results and discussion
Identifying Genes Recurrently Mutated in Interaction Context (ReMIC)
We apply our method to the obtained interaction graph (N) as described
in the Methods section. This graph contains reliable interaction links
with scores more than the median value of the combined scores of all
interactions (T[I] = 500). This is a trade-off between high-confidence
interactions (score > 700) and low-confidence interactions (score <
400) [[80]19]. The resulting interaction graph N with edge weights W,
contains 14 842 nodes and 448 472 edges. The mutation scores S are
computed as described in the Methods section for all genes present in
the interaction network.
The diffusion kernel is applied to the complete PPI graph N for a range
of scales by varying β in the range [0, 0.03]. A permutation procedure
is performed to obtain p-values for each gene and each scale
separately. Genes recurrently mutated in their interaction context
(ReMIC genes) are defined as nodes that are significant at an α-level
of 0.001 (Additional file [81]3). ReMIC genes are presented within the
context of their original protein-protein interactions. The resulting
ReMIC network is clustered using Girvan-Newman clustering technique
(Additional file [82]2: Figure S6-S11). The resulting clusters are
called ReMIC gene clusters.
ReMIC genes are organized in connected components
Inspection of the significant genes at the various diffusion scales
reveals that the vast majority of them are organized in connected
components. This demonstrates that, even though the significance
estimates are performed on a per gene bases, the diffusion process
captures groups of significant genes that are connected in the PPI
network.
For β equal to 0 there is no diffusion. Hence, all significant genes at
this scale harbor sufficient mutations to be called significant
irrespective of their protein interaction context. Indeed, the
collection of significant genes at this scale represents the known CIS
network, since virtually all of these genes have been previously
discovered in the analysis of the original screen using traditional CIS
analysis [[83]11]. A substantial fraction of them are known tumor
suppressors or oncogenes such as Gfi1, Notch1 and Myc[[84]37]. Figure
[85]2A shows these CISs and the protein-protein interactions between
them.
Figure 2.
[86]Figure 2
[87]Open in a new tab
Gene clusters across the diffusion scales. A) The CIS genes shown in
the context of the interaction network which are significant without
any diffusion (β = 0). These genes represent the major sources of
mutation score, and drive low scoring genes in the diffusion process.
B) Overview of the identified ReMIC gene clusters at three scale
levels: small-, medium- and large-scale. The color of the nodes
reflects the mutation score before diffusion. Two major gene clusters,
Runx- and Pik3-cluster are apparent across the scales. C)p-values of
pathway enrichment analysis of three pathways in the KEGG database for
two detected clusters in all diffusion scales. Both clusters are highly
enriched for pathways in cancer across a range of scales. The Runx- and
Pik3-cluster show strongest enrichment for the small- and large-scale,
respectively. D)p-values of mutual exclusivity analysis for the two
ReMIC gene clusters. Both clusters are significantly mutually exclusive
across all scale levels. The minimum p-value is, however, attained for
the scale at which the strongest KEGG enrichment was observed.
When the scale is increased, genes gradually diffuse their scores
across the network. High scoring nodes that are connected to many low
scoring nodes diffuse their score rapidly, whereas low scoring nodes
with few connections diffuse their score more slowly. As a result,
genes with relatively low (pink) or zero (white) mutation score are
more included at higher scales as demonstrated in Figure [88]2B for
three specific diffusion scale levels (β) representing small-, medium-
and large-scale. Therefore, while CIS genes have high initial scores,
the ReMIC genes arise as a result of mutations in their network
neighborhood rather than in their genomic vicinity alone. It should be
noted that the CIS genes do contribute to the significance of ReMIC
genes by virtue of score diffusion through the local graph topology.
ReMIC gene clusters strongly enriched for cancer related pathways
A striking observation from the ReMIC gene clusters across the scales
is that, in addition to some small clusters, two major gene clusters
are apparent (indicated in Figure [89]2B). It is apparent that these
clusters gradually increase size as the scale parameter increases. In
the following, we refer to these clusters as the Runx- and
Pik3-cluster.
To examine if one scale is more relevant than other scales, we
investigate KEGG pathway enrichments of the genes in the Runx- and
Pik3-cluster across different scales (Additional file [90]4). The
results of this analysis are presented in Figure [91]2C, and show that
both clusters are highly enriched for the general pathways in cancer
category. It furthermore shows that the enrichment becomes more
pronounced when the scale is increased and the minimum p-value is
attained at different scales for each cluster. The highest enrichments
occur at the small-scale for the Runx-cluster (Figure [92]3A) and the
large-scale for the Pik3-cluster (Figure [93]3B). Similar observations
can be made for leukemia pathways (dashed lines in Figure [94]2C).
Notably, overlaps between the insertional mutations in mouse lymphomas
and copy number variation in human acute myeloid leukemia and SNPs in
human chronic myeloid leukemia have been observed previously [[95]11].
Figure 3.
[96]Figure 3
[97]Open in a new tab
Prominent ReMIC gene clusters. ReMIC genes in the relevant diffusion
scale of A) Runx-cluster (consisting of 100 ReMIC genes at the
small-scale), B) Pik3-cluster (consisting of 436 ReMIC genes at the
large-scale) and C) Gfi-cluster (consisting of 75 ReMIC genes at the
medium-scale). The clusters are visualized using Organic Cytoscape
layout [[98]32]. The color of the nodes reflects the mutation score
before diffusion.
In addition to the Runx- and Pik3-cluster, the Gfi-cluster is apparent
at the medium-scale. This cluster includes important cancer related
genes such as Gfi1, Myb[[99]11] and Nocht1[[100]10] (Figure [101]3C).
The Gfi-cluster is enriched for pathways in cancer category
(p-value = 0.06) and chronic myeloid leukemia pathway (p-value = 0.04).
Interestingly, this cluster is exclusively apparent at the
medium-scale.
Next, for the Runx- and Pik3-cluster, we select the scales for which
the minimum p-value is attained and determined enrichments in other
KEGG pathways. Table [102]1 summarizes the results by showing the
p-value and rank of the most enriched pathways. Although most prominent
cancer related pathways are enriched for both clusters, the top-ranked
pathways are notably different. Taken together, these results clearly
illustrate the benefit of multi-scale analysis when determining ReMIC
gene clusters.
Table 1.
Pathway enrichments
KEGG pathways
__________________________________________________________________
Pik3-cluster
__________________________________________________________________
Runx-cluster
__________________________________________________________________
Rank p -value Rank p -value
Cell cycle
__________________________________________________________________
60
__________________________________________________________________
0.11
__________________________________________________________________
1
__________________________________________________________________
1.7E-10
__________________________________________________________________
Prostate cancer
__________________________________________________________________
4
__________________________________________________________________
1.4E-07
__________________________________________________________________
2
__________________________________________________________________
1.3E-08
__________________________________________________________________
Chronic myeloid leukemia
__________________________________________________________________
2
__________________________________________________________________
4.1E-10
__________________________________________________________________
3
__________________________________________________________________
3.6E-08
__________________________________________________________________
Pathways in cancer
__________________________________________________________________
8
__________________________________________________________________
3.4E-07
__________________________________________________________________
4
__________________________________________________________________
9.2E-08
__________________________________________________________________
HTLV-I infection
__________________________________________________________________
28
__________________________________________________________________
0.001
__________________________________________________________________
5
__________________________________________________________________
5.2E-07
__________________________________________________________________
Small cell lung cancer
__________________________________________________________________
20
__________________________________________________________________
4.6E-05
__________________________________________________________________
6
__________________________________________________________________
8.9E-07
__________________________________________________________________
Colorectal cancer
__________________________________________________________________
14
__________________________________________________________________
2.1E-06
__________________________________________________________________
7
__________________________________________________________________
8.3E-06
__________________________________________________________________
Glioma
__________________________________________________________________
3
__________________________________________________________________
3.1E-08
__________________________________________________________________
8
__________________________________________________________________
5.2E-05
__________________________________________________________________
p53 signaling pathway
__________________________________________________________________
73
__________________________________________________________________
0.27
__________________________________________________________________
9
__________________________________________________________________
5.3E-05
__________________________________________________________________
Transcriptional misregulation in cancer
__________________________________________________________________
77
__________________________________________________________________
0.36
__________________________________________________________________
10
__________________________________________________________________
5.3E-05
__________________________________________________________________
T cell receptor signaling pathway
__________________________________________________________________
1
__________________________________________________________________
1.1E-11
__________________________________________________________________
35
__________________________________________________________________
0.47
__________________________________________________________________
B cell receptor signaling pathway
__________________________________________________________________
5
__________________________________________________________________
1.5E-07
__________________________________________________________________
34
__________________________________________________________________
0.44
__________________________________________________________________
Non-small cell lung cancer
__________________________________________________________________
6
__________________________________________________________________
1.6E-07
__________________________________________________________________
13
__________________________________________________________________
0.0002
__________________________________________________________________
Fc epsilon RI signaling pathway
__________________________________________________________________
7
__________________________________________________________________
3.2E-07
__________________________________________________________________
65
__________________________________________________________________
0.95
__________________________________________________________________
ErbB signaling pathway
__________________________________________________________________
9
__________________________________________________________________
7.9E-07
__________________________________________________________________
21
__________________________________________________________________
0.05
__________________________________________________________________
Pancreatic cancer 10 8.7E-07 14 0.0003
[103]Open in a new tab
The top ranked pathways of KEGG enrichment analysis for the Runx- and
Pik3-cluster determined at the small- and large-scale, respectively
(Figure [104]3A, B).
ReMIC gene clusters exhibit a pattern of mutual exclusive mutation
A hallmark of cancer is that mutations of many different genes can lead
to alteration of the same pathway. Moreover, once the proliferative
advantaged is conferred, additional mutations in the deregulated
pathway are less likely to occur. Mutually exclusive mutations,
therefore, provide evidence that the altered genes are functionally
linked in a common biological pathway [[105]35]. To test this
conjecture, we employ the test for mutual exclusion presented in the
MEMo algorithm [[106]35]. To this end, the MEMo permutation procedure
is applied to identify whether the clustered ReMIC genes exhibit a
pattern of mutual exclusion that extends beyond one expected by chance.
The results are shown at Figure [107]2D for the Runx- and Pik3-cluster
across the scales.
From these results it follows that both clusters exhibit a significant
pattern of mutual exclusion (p<0.05). It is particularly striking that
the strongest significance levels for both clusters arise at different
scales. For ReMIC genes within the Runx-cluster, mutual exclusion is
strongest at the small diffusion scale (Additional file [108]2: Figure
S4), whereas for the Pik3-cluster this occurs at the large-scale
(Additional file [109]2: Figure S5). These scales coincide with the
scales of maximum functional enrichment (Figure [110]2D). These results
illustrate the importance of a multi-scale definition of the pathway
context in which mutations occur.
ReMIC clusters include many non-CIS genes
For all ReMIC clusters with β > 0 it is apparent that they contain
genes that harbored no or only few mutations (indicated by white or
pink nodes in Figure [111]2B). Therefore, these genes would not have
been discovered in a traditional CIS analysis.
Many of the white ReMIC genes are also known and prominent cancer
genes. Examples include Ctnnb1, Jun, Bcr and Src, which are detected
across all the scales. Other cancer genes are exclusively detected at
larger scales, such as Smad4 and Ntrk1 in the Pik3-cluster. These genes
harbor no mutations in the RIM dataset and are thus undetectable unless
the interaction context is taken into account.
Also, many pink ReMIC genes represent well-characterized cancer genes.
A prominent example is the Rbl1 gene in the Runx-cluster. Rbl1 is a
tumor-suppressor gene found to be dysfunctional in many human tumors
including leukemia [[112]11]. Other examples include Raf1 and Ptk2 in
the Pik3-cluster [[113]38]. In our data, these genes are hit in only
few tumors, and hence, mutation scores are insufficient to reach
significance thresholds.
The question arises whether pink ReMIC genes are selected simply
because they are part of the network neighborhood of important cancer
genes (i.e. the CIS genes) or, more interestingly, that they are
selected due to the combination of local graph topology and
distribution of mutations across this topology.
To examine this hypothesis, we compare the pink ReMIC genes from the
Runx- and Pik3-cluster with sets of genes that are not significant in
our analysis but that are connected to one of the high-scoring genes in
the same cluster. For both groups of genes, that both harbor few
mutations, enrichments in the two leukemia pathways and pathways in
cancer are determined. The results are summarized in Table [114]2. From
these results it follows that pink ReMIC genes from both clusters rank
higher in the enrichment test for two out of three tested relevant
pathways. This is especially apparent for the larger Pik3-cluster, for
which substantially lower p-values are obtained for enrichment in the
two leukemia pathways. We therefore conclude that the local network
topology plays an important role in determining whether genes are
involved in deregulating a pathway.
Table 2.
Pink ReMIC genes
KEGG pathways
__________________________________________________________________
Pink nodes in Pik3-cluster
__________________________________________________________________
Pink nodes in Runx-cluster
__________________________________________________________________
__________________________________________________________________
Significant
__________________________________________________________________
Non-significant
__________________________________________________________________
Significant
__________________________________________________________________
Non-significant
__________________________________________________________________
p -value Rank p -value Rank p -value Rank p -value Rank
AM leukemia
__________________________________________________________________
0.002(17/294)
__________________________________________________________________
26
__________________________________________________________________
0.54(10/933)
__________________________________________________________________
61
__________________________________________________________________
0.01(6/81)
__________________________________________________________________
16
__________________________________________________________________
0.001(11/265)
__________________________________________________________________
3
__________________________________________________________________
CM leukemia
__________________________________________________________________
2.5E-07(31/294)
__________________________________________________________________
2
__________________________________________________________________
0.9(9/933)
__________________________________________________________________
101
__________________________________________________________________
4.5E-06(12/81)
__________________________________________________________________
3
__________________________________________________________________
0.001(10/265)
__________________________________________________________________
9
__________________________________________________________________
Pathways in cancer 8.7E-06(67/294) 11 0.54(45/933) 62 1.3E-05(23/81) 4
2.1E-05(34/265) 1
[115]Open in a new tab
p-values of pathway enrichment for ReMIC genes and non-selected pink
nodes which are connected to the red nodes in the
Runx- and Pik3-cluster. Numbers between parentheses show the number of
implicated genes in that particular pathway / total
number of genes in the set.
ReMIC genes co-localize in leukemia pathways
The overlaps between retroviral mutations in mouse lymphomas and human
myeloid leukemia have been previously observed [[116]11]. Therefore,
most of the selected CIS genes are implicated in leukemia pathways. To
identify whether the white and pink ReMIC genes are also part of these
pathways we superimpose non-CIS ReMIC genes that are detected in the
Runx- and Pik3-cluster at the scales for which the strongest KEGG
enrichments are observed and the acute myeloid leukemia KEGG pathway.
ReMIC genes are labeled according to their appearance in either the
Runx- (red) or Pik3-cluster (blue) cluster (with circle size indicating
the number of ReMIC genes within each KEGG meta-node). The result is
illustrated for the acute myeloid leukemia pathway in Figure [117]4
(similar observations can be made for the chronic myeloid leukemia
pathway, shown in the Additional file [118]2: Figure S1).
Figure 4.
Figure 4
[119]Open in a new tab
Superimposed on the KEGG pathways. The white and pink ReMIC genes are
co-localized in a confined part of the acute myeloid leukemia pathway.
Non-CIS ReMIC genes in the Pik3- and Runx-cluster all co-localize in
the top left (blue) and in bottom right (red), respectively.
This reveals that a clear co-localization of the ReMIC genes in a
confined part of the pathway is found. It is particularly striking that
non-CIS ReMIC genes from the Pik3-cluster all co-localize in the top
left of this KEGG pathway (blue circles in Figure [120]4). These genes
play important roles in the ErbB and Mtor signaling pathways, among
others. The non-CIS ReMIC genes in the Runx-cluster, on the other hand,
co-localize in the bottom right. Similar observations are made in case
CIS genes (red ReMIC genes) are included (shown in the Additional file
[121]2: Figure S2 and S3).
Consequently, although this KEGG pathway is enriched for ReMIC genes,
mutations appear to be clustered in clear mutation hot-spots of the
pathway. Moreover, the ReMIC gene clusters comprise many other genes
that are part of other pathways or not present in any of the pathways.
This demonstrates the importance of a multi-scale approach that is
based on protein-protein interactions as opposed to restricting the
search for recurrent mutations to single genes (CIS analysis) or
enrichment of whole discretely defined pathways.
Conclusions
Recently, several methods in computational cancer biology have been
developed to identify significantly mutated pathways using biological
networks such as a PPI graph [[122]21,[123]24-[124]27]. The main
strategy of these approaches is to search for high-scoring subnetworks
in terms of a specific scoring function. Due to the fact that such
problems are NP-hard, greedy and heuristic approaches are employed
[[125]25-[126]27]. In this study, we propose a graph diffusion
framework to overcome this difficulty.
The multi-scale graph diffusion is applied to detect recurrent gene
mutation in the context of the functional protein-protein interaction
network. Using graph diffusion, it is possible to capture the local
topology of the interaction network and diffuse gene mutation scores
across the interaction graph. The diffusion strength acts as a scale
parameter, determining the size of the network neighborhood across
which recurrence is detected. Recurrently mutated genes in interaction
context (ReMIC genes) are identified by permutation analysis per gene
and scale. This reduces the computational complexity of examining the
virtually limitless number of possible subnetworks within the complete
PPI network.
We apply this method to discover recurrent gene mutations in a large
collection of viral mutation data. We demonstrate that significant
genes (called ReMIC genes) are organized in connected components even
though the significance estimates are performed on a per gene bases.
These clusters of ReMIC genes are apparent across multiple diffusion
scales. In addition to some small clusters, two major gene clusters are
apparent: the Runx- and Pik3-cluster. We demonstrate that these
clusters are strongly enriched for cancer related pathways. Moreover,
genes within these clusters are significantly mutually exclusive at
particular diffusion scales, supporting their functional relation in
tumorigenesis. Many ReMIC genes harbor no or infrequent mutations such
as Ctnnb1, Bcr, Src and are, therefore, not detectable in an ordinary
CIS analysis. Literature provides substantial evidence for their roles
in cancer, particularly in leukemia, illustrating the efficacy of
identifying cancer genes in the context of their functional interaction
network.
Graph diffusion has been explored previously to identify subnetworks
harboring frequent mutations in a method called HotNet [[127]27]. One
of the most important differences with our method is the way in which
the significance estimates are calculated. In HotNet, an approximate
solution for the connected maximum coverage problem is adopted to find
significant mutated subnetworks. This solution is heuristic and
computationally expensive since a myriad of subnetworks of different
sizes need to be evaluated. Importantly, in order for this to become
computationally tractable, white nodes need to be removed and are
thereby excluded from appearing in the resulting subnetworks. In our
method, on the other hand, the whole PPI network is considered.
HotNet furthermore requires the user to predefine the size of the
subnetworks that it will find. Our approach, on the other hand,
mitigates the need of making prior assumptions on the size of the ReMIC
gene clusters. As a case in point, we analyze our insertional
mutagenesis data with the HotNet algorithm and demonstrate that, at a
particular diffusion scale, our method outperforms HotNet to detect
distinct subnetworks that are related to known cancer pathways (see
Additional file [128]2: Supplemental text).
Most importantly, however, we show that for different scale levels
different functional enrichments, mutual exclusion patterns and ReMIC
genes are observed. Moreover, for certain scale levels, clusters of
ReMIC genes exhibit a clear co-localization in specific hot-spots of
the KEGG representation of the leukemia pathway. Taken together, this
demonstrates the importance of analyzing gene mutation in the context
of their interaction network in a multi-scale fashion.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SB implemented the methodologies and analysed the data. MH contributed
to design of the diffusion framework. MR and JdR conceived the
experiments and mentored the project. JdR and SB wrote the manuscript
with contributions from MR. All authors revised, discussed, and amended
the manuscript and approved its final version.
Supplementary Material
Additional file 1
Source code. A Zip file containing the source code of the presented
algorithm for Matlab and R and related files. These files are also
available on
[129]http://bioinformatics.tudelft.nl/users/sepideh-babaei. Note that
the results of this study are calculated by Matlab.
[130]Click here for file^ (118.2KB, zip)
Additional file 2
Supplementary text. A PDF file containing the supplemental figures and
text. It includes a figure of the ReMIC genes superimposed on the
leukemia KEGG pathways, results of the analysis of the insertional
mutagenesis data with the HotNet algorithm, dendrograms extracted by
the graph clustering technique as well as figures of the mutation
profiles of the ReMIC genes.
[131]Click here for file^ (3.1MB, pdf)
Additional file 3
Lists of ReMIC genes across the scales. MS Excel XLS tables contain
ReMIC genes which are selected across the diffusion scales.
[132]Click here for file^ (146.2KB, xlsx)
Additional file 4
KEGG pathway enrichments table. MS Excel XLS tables containing enriched
KEGG pathways for both Runx- and Pik3-cluster across the diffusion
scales.
[133]Click here for file^ (131.6KB, xlsx)
Contributor Information
Sepideh Babaei, Email: S.Babaei@tudelft.nl.
Marc Hulsman, Email: M.Hulsman@tudelft.nl.
Marcel Reinders, Email: M.J.T.Reinders@tudelft.nl.
Jeroen de Ridder, Email: J.deRidder@tudelft.nl.
Acknowledgements