Abstract
Background
In cancer research, the comparison of gene expression or DNA
methylation networks inferred from healthy controls and patients can
lead to the discovery of biological pathways associated to the disease.
As a cancer progresses, its signalling and control networks are subject
to some degree of localised re-wiring. Being able to detect disrupted
interaction patterns induced by the presence or progression of the
disease can lead to the discovery of novel molecular diagnostic and
prognostic signatures. Currently there is a lack of scalable
statistical procedures for two-network comparisons aimed at detecting
localised topological differences.
Results
We propose the dGHD algorithm, a methodology for detecting differential
interaction patterns in two-network comparisons. The algorithm relies
on a statistic, the Generalised Hamming Distance (GHD), for assessing
the degree of topological difference between networks and evaluating
its statistical significance. dGHD builds on a non-parametric
permutation testing framework but achieves computationally efficiency
through an asymptotic normal approximation.
Conclusions
We show that the GHD is able to detect more subtle topological
differences compared to a standard Hamming distance between networks.
This results in the dGHD algorithm achieving high performance in
simulation studies as measured by sensitivity and specificity. An
application to the problem of detecting differential DNA co-methylation
subnetworks associated to ovarian cancer demonstrates the potential
benefits of the proposed methodology for discovering network-derived
biomarkers associated with a trait of interest.
Keywords: Network comparisons, Co-methylation networks
Background
Current efforts at understanding diseases rely on the ability to
identify differences between healthy and affected tissues. A number of
high-throughput platforms are now commonly used to compare genome-wide
molecular profiles collected from large cohorts of healthy and diseased
subjects in search for patterns that differentiate between them. For
instance, in cancer research, gene expression and DNA methylation
profiles from diseased tissues are compared to those extracted from
normal controls in order to identify groups of genes whose expression
or methylation levels are significantly different, and consequently
associated to the trait of interest. From a statistical modelling
standpoint, the primary interest of these studies lies in detecting
statistically significant changes in average gene expression or
methylation values in a two-sample comparison. A number of standard
statistical tests, which are generally applied in a univariate fashion,
have been proposed for this task and generate candidate sets of genes
for further investigation [[29]1]. Statistical methods have also been
developed to assess whether these candidate genes are over-represented
in pre-defined biological pathways or subnetworks within protein
interaction networks [[30]2]. These developments are based upon the
principle that, in order to understand the roles of genes in complex
diseases, genes need to be studied in the context of the regulatory
systems they are involved in [[31]2–[32]4].
An alternative way of analysing genome-wide expression and methylation
levels observed in a random sample consists of studying their
interaction patterns, which are often represented in the form of
networks [[33]5, [34]6]. Network edges quantify the similarity in
transcription activity between two genes [[35]7] or in DNA methylation
between two CpG islands [[36]8], respectively. The notion of similarity
is usually measured by linear correlation, partial correlation or
mutual information coefficients estimated from the sample data [[37]7,
[38]9]. The networks arising in the two-sample setting above can then
be compared to assess whether there are statistically significant
differences in network topology that can be associated to the disease.
The detection of markedly distinct interaction patterns across
conditions may be indicative of local disturbances within known
biological pathways, and can be taken as candidate biomarkers. For
instance, as a cancer progresses, it has been observed that its
signalling and control networks are subjected to re-arrangments which
are advantageous for the cancer [[39]10]. Changes in methylation levels
are believed to be among the earliest and most common alterations in
human cancers [[40]11, [41]12], and topological differences in healthy
and diseased networks can reflect significant dysregulations associated
to the disease [[42]13].
In this paper we discuss the the problem of comparing two labelled
biological networks, each one representing a different population or
condition, with the aim of detecting statistically significant
differences between them. We approach this problem from a hypothesis
testing perspective. This is a challenging statistical problem as only
one random network is observed under each condition. Various
computational methodologies have been developed to compare networks,
including graph matching and graph similarity algorithms [[43]14].
Graph matching algorithms have been used to discover similarities
between molecular pathways across organisms and functions [[44]15,
[45]16], but are typically limited to unlabelled graphs, and are not
concerned with hypothesis testing. Graph similarity algorithms also
assume that the graphs are unlabelled, and the attention has mostly
focused on detecting patterns that are most similar between networks
[[46]17]. For instance, gene modules can be identified separately in
each network first, and then compared across networks [[47]7, [48]18,
[49]19]. More closely related work includes inferential methods for
performing two-sample hypothesis tests where the sampling unit is a
network, and assess whether the two paired networks come from the same
assumed model [[50]20].
We take a non-parametric approach to inference that does not require to
make assumptions about a specific random network model. Our premise is
that any true topological differences between the two networks would
involve only a smaller set of edges, compared to all edges in the
network, which we aim to detect. Our contributions to this problem are
as follows. First, we consider the issue of choosing a distance measure
between two paired networks that is able to capture subtle topological
differences. Second, we discuss how to establish whether large values
of this distance can be deemed statistically significant under a null
hypothesis that the networks are independent. Finally, we ask whether
it is possible to identify a differential subnetwork, starting from two
large networks, in a computationally efficient manner.
The article is organised as follows. In Section [51]Methods we
introduce a distance for labelled networks, the Generalised Hamming
Distance (GHD). Building on this distance, a permutation-based test
statistic for two-sample network comparisons is introduced in Section
[52]A non-parametric test for network comparison. Conditions for
asymptotic normality are provided so that p-values can be obtained in
closed-form without the need to carry out computationally expensive
permutations. In order to verify these results in special cases, in
Section [53]Validation of asymptotic normality on scale-free networks
we argue that the proposed conditions hold true for various random
network models, and provide a sketch proof for the case of scale free
networks. In Section [54]Differential subnetwork detection we describe
an algorithm, dGHD, for the detection of differential subnetworks. In
Section [55]Results we present a number of simulation experiments that
highlight the advantages of the proposed methodology under different
graph models. As an illustrative application of the proposed
methodology, a case-control study involving DNA co-methylation networks
in ovarian cancer is presented in Section [56]Application to
co-methylation networks in ovarian cancer. We conclude with a
discussion in Section [57]Conclusions.
Methods
We assume to have observed two paired biological networks, each
represented by a graph, denoted by
[MATH: A=(V,EA)
:MATH]
and
[MATH: ℬ=(V,EB)
:MATH]
, respectively. Both graphs are defined on a common set, V={1,2,…,N}.
The respective sets E [A] and E [B] of edges indicate the connection
between the nodes in the two graphs. We also let the matrices A=(A
[ij]) and B=(B [ij]) denote the two (N×N) adjacency matrices associated
with graphs
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
, respectively.
The Hamming distance (HD) between
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
provides a commonly used metric to quantify the difference between the
networks, and is defined by
[MATH:
12tr<
/mtext>(A−B)2
:MATH]
, where tr[ ·] denotes the trace of a matrix. This distance takes into
account the number of edges that are in common between the two
networks. Here we propose an extension of this metric, which we call
the Generalised Hamming Distance (GHD), defined as
[MATH: GHD(A,ℬ)=1<
/mrow>N(N−1)∑i,jaij′
−bij′
2, :MATH]
(1)
where a ij′ and b ij′ are mean-centred edge weights defined as
[MATH: aij′
=aij−1N(N−1<
mo>)∑i,j
aij,bij′
=bij−1N(N−1<
mo>)∑i,j
bij :MATH]
and
[MATH:
∑i,j :MATH]
denotes summation over distinct i and j. The edge weights, which depend
on the topology of the networks, provide a measure of connectivity
between every pair of nodes i and j in
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
, respectively. When a [ij] and b [ij] are binary values indicating the
presence or absence of an edge, i.e. are the elements of A and B,
[MATH: GHD(A,ℬ) :MATH]
is related to the HD. The specific node weights we propose here instead
quantify the topological overlap (TO) between a pair of nodes by taking
into account the local neighbourhood structure around those nodes
[[58]21]. In the literature, the TO measure has been successfully
applied for the detection of communities in biological networks, and
there is empirical evidence that it carries biological meaning [[59]7,
[60]22].
We use the one-step TO between nodes i and j indicating whether they
share direct connections to other nodes. The weights are obtained from
the adjacency matrix as follows:
[MATH: aij=∑l≠i,
mo>jAilAlj+
Aijmin<
mfenced close=")" open="("
separators="">∑l
≠iAil−
Aij,∑l≠jAil−
Aij+1, :MATH]
(2)
when i≠j, and otherwise a [ij]=1, and analogously for b [ij]. The GHD
sums the squared differences
[MATH:
(aij′
−bij′
)2 :MATH]
over all pairs of nodes in the network. Note that the term
[MATH:
∑l≠i
,jAilAlj :MATH]
is a count of all vertexes (i,l,j) containing node pair (i,j). This
term measures the connectivity information of each (i,j) pair plus
their common one-step neighbours. The denominator in ([61]2) can be
written as min(d [i],d [j])+1−A [ij], where d [i] and d [j] represent
the node degrees of i and j, respectively. It is roughly equal to the
smaller of (d [i],d [j]) and normalises a [ij] such that 0≤a [ij]≤1. A
large discrepancy between
[MATH: aij′
:MATH]
and
[MATH: bij′
:MATH]
indicates a topological difference localised around that pair of nodes.
By exploring the neighbourhood of each node, the proposed GHD can
detect subtle topological changes with higher sensitivity compared to
the HD. A simple illustration of this is given in Fig. [62]1, where
four simple networks are shown: the network labelled (a) is taken as
reference while the three paired networks (b), (c), and (d) have been
generated by changing the position of a single edge in (a). The two
distances, HD and GHD, have been computed to quantify the difference
between (a) and each of the other three networks. It can be observed
that, whereas the HD is unable to distinguish between the three
networks, the GHD score is more sensitive to subtle topological
variations and can discriminate between them.
Fig. 1.
Fig. 1
[63]Open in a new tab
Networks (b), (c) and (d) are generated from the reference network (a)
by a single edge change. Both HD and GHD between the reference network
and each modified paired network have been computed in each case
A non-parametric test for network comparison
For inferential purposes, we require computing the probability that a
distance as extreme or more extreme than the observed GHD value could
have been observed by chance only. By treating the GHD as a random
variable with unknown sampling distribution, this probability can be
estimated non-parametrically via permutation testing. First, we specify
the null hypothesis as being
[MATH: ℋ0:
mo>networksAandℬare independent. :MATH]
(3)
By taking
[MATH: ℬ :MATH]
as reference, each permutation consists of shuffling the labels of the
nodes in
[MATH: A :MATH]
while keeping the edges unchanged. This generates a permuted network
[MATH: Aπ :MATH]
that is isomorphic to
[MATH: A :MATH]
, and the exchangeability property holds. In turn, this signifies that
the original and permuted networks are generated from the same
underlying, but unspecified, model [[64]23, [65]24]. Since all
permutation networks are isomorphic, permuting the labels of the
network is equivalent to shuffling rows and columns of the adjacency
matrix, an approach that bears some similarity with Mantel’s test
[[66]25] for the comparison of two distance matrices. All the the N!
possible permutations are then collected in a set Π, and for each π∈Π a
permuted GHD value is denoted as
[MATH:
GHDπ(Aπ,
mo>ℬ)=1<
/mrow>N(N−1)∑i,j
aπ(i)π(j)′−b
ij′
2,
:MATH]
and is calculated from the edge weights
[MATH:
aπ(i)π(j)′<
/mrow> :MATH]
after permutation. The exact permutation distribution is obtained by
carrying out an exhaustive calculation of all GHD[π] values, and
p-values can then be evaluated as usual. In practice, however, doing so
is computationally infeasible because the cardinality of Π is generally
extremely large, even for relatively small networks. The exhaustive
evaluation for all permutations in Π could be replaced by a Monte Carlo
approach whereby only a smaller number of random permutations are
explored. Nevertheless, the overall computational costs remain high for
networks of the moderately large sizes observed in applications or when
this procedure has to be repeated several times, for instance when
searching for a differential subnetwork as in Section [67]Differential
subnetwork detection.
In what follows, we propose an alternative approach that removes the
need to carry out computationally expensive permutation testing
altogether. We demonstrate that, under our null hypothesis, the exact
GHD permutation distribution can be approximated well by a normal
distribution with moments that can be obtained analytically, in closed
form. First, we notice that the GHD can be rewritten in an equivalent
form in terms of a generalised correlation coefficient as follows:
[MATH: GHDπ(Aπ,
mo>ℬ)=c−2N(N−1)∑i,jaπ(i)π(<
/mo>j)′bij′
, :MATH]
(4)
where c is a constant that does not change under permutations. By
making use of this alternative representation, we are able to exploit
well-known sufficient conditions for asymptotic normality, which can
also be easily checked in practice. For a generalised correlation
coefficient of this form, the exact permutation distribution is
asymptotically normal under two sufficient conditions [[68]26–[69]28]:
[MATH:
∑i,jaij′
=∑i,
jbij′
=0and
:MATH]
(5a)
[MATH: limN→∞
∑ijkla
mi>ij′
aik′
ail′
2∑ijkaij′
aik′
3=limN→∞
∑ijklb
mi>ij′
bik′
bil′
2∑ijkbij′
bik′
3=0. :MATH]
(5b)
Condition ([70]5a) follows directly from the definition of
[MATH: aij′
:MATH]
and
[MATH: bij′
:MATH]
as being mean-centred. In order to gain some insight into the meaning
of condition ([71]5b) in our context, it is instructive to consider the
case where a [ij] and b [ij] are elements of the two adjacency
matrices, i.e. they indicate the presence of an edge. On defining
[MATH: ai·=∑j≠iaij :MATH]
and
[MATH:
ā=1N∑iai·
:MATH]
, we have
[MATH: ai·′<
/msubsup>=∑j≠
iaij′
=ai·−ā
mi>, :MATH]
(6)
and condition ([72]5b), with reference to network
[MATH: A :MATH]
, can be written as
[MATH: limN→∞
∑i(Na
mrow>i·′<
/msubsup>)32∑i(Na
mrow>i·′<
/msubsup>)23=<
mrow>limN→∞
∑i(ai·−ā
mi>)32∑i(ai·−ā
mi>)23=0, :MATH]
(7)
and analogously for
[MATH: ℬ :MATH]
. It can be observed that, when using the adjacency matrix, a [i·]
represents the degree of the i ^th node. An analogous condition also
applies to
[MATH: ℬ :MATH]
. Therefore, checking ([73]5b) amounts to computing the degree of each
node in the two networks, and assessing the limiting behaviour. When
the TO measure is used instead, as in the GHD, the coefficient a [i·]
represents the overall topological overlap information at node i, and
can also be computed using ([74]6).
When both ([75]5a) and ([76]5b) hold true, under the null hypothesis,
the permutation distribution of
[MATH: GHD(A,ℬ) :MATH]
is approximately normal. We then standardise the GHD value by
mean-centring and normalising it, so that it follows a standard normal
distribution asymptotically,
[MATH:
GHDπ(Aπ,
mo>ℬ)−μ
mrow>πσ<
/mrow>π∼N
(0,1) :MATH]
(8)
where μ [π] and σ [π] are the mean and standard deviation of GHD under
the exact permutation distribution, respectively. These two moments can
be computed precisely and in closed-form by enumerative combinatorics;
the calculations follow developments described in the context of
related permutation-based testing procedures [[77]25], and can also be
found in [[78]29]. Here we provide explicit formula for both μ [π] and
[MATH:
σπ2
:MATH]
as follows. First, we need to define
[MATH: t<
mi>Sa=∑i=1N∑j=1Na
ijt
,t=1,2andTa
=∑i=1N∑j=1Naij2
t<
mi>Sb=∑i=1N∑j=1Nb
ijt
,t=1,2andTb
=∑i=1N∑j=1Nbij2 :MATH]
where
[MATH: aijt
:MATH]
and
[MATH: bijt
:MATH]
are edge weights with power t. Here
[MATH: 1<
mi>SaN(N−1) :MATH]
and
[MATH: 2<
mi>SaN(N−1) :MATH]
are empirical raw moment of edge weight a [ij], and analogously for b
[ij]. Furthermore we need to introduce the following quantities,
[MATH: Aa
=1<
mi>Sa<
mrow>2,Ba=Ta−2<
mi>Sa,andCa=Aa
+22<
mi>Sa−4Ta
Ab
=1<
mi>Sb<
mrow>2,Bb=Tb−2<
mi>Sb,andCb=Ab
+22<
mi>Sb−4Tb
:MATH]
Then, closed-form expressions for the mean μ [π] and variance
[MATH:
σπ2
:MATH]
are,
[MATH:
μπ
mrow>=2<
mi>Sa+2<
mi>SbN(N−1)−21<
mi>Sa1<
mi>Sb<
mrow>N2<
mrow>N−12,<
mi>σπ2=4N
mrow>3(N
−1)3<
/mfrac>22<
mi>Sa<
/mfenced>2<
mi>Sb+4(B<
mi>a)(Bb)N−
mo>2+(Ca)(<
mrow>Cb)(N−2)(N
mi>−3)−
(Aa)(Ab<
/msub>)N(N−1). :MATH]
With the expressions for the first two exact moments, a corresponding
p-value can therefore be efficiently computed from the normal
approximation, even for very large networks. We will exploit the
computational efficiency gained here in Section [79]Differential
subnetwork detection, where we apply the test repeatedly on networks of
increasingly smaller size in order to detect differential subnetworks.
Validation of asymptotic normality on scale-free networks
The closed-form approximation for the computation of p-values only
requires that conditions ([80]5a) and ([81]5b) are satisfied, and does
not need any random network model to be specified. These two conditions
can also be verified analytically in special case when certain random
network models are assumed. For instance, in [[82]29] it was proved
that these conditions hold true for scale-free (SF), random geometric
(RG) and Erdös-Rényi (ER) network models when using both HD and GHD
distances. In this section we provide a simplified proof for the case
of SF networks using the Hamming distance. This proof should serve as
an illustration of how these derivations can be carried out
analytically, and as simple validation of the methodology described in
Section [83]A non-parametric test for network comparison for SF
networks. An analogous proof using the GHD distance can be found in the
Supplementary Material, and we refer the reader to [[84]29] for the
other models.
A SF network is a network whose node degree distribution follows a
power law, at least asymptotically, and has often been used to describe
real biological networks [[85]30–[86]32]. The degree of each node is
assumed to be an independent and identically distributed (IID) random
variable with probability mass function defined as
[MATH:
P(di=k)=c<
mi>k−α,k=m,m
+1…,K, :MATH]
(9)
where m and K are the lower and upper cut-offs for the node degree,
respectively, c is a normalising constant, and α represents a power
exponent. It is generally assumed that α is greater than 1, and the
lower cut-off m is generally be taken to be 1. The upper cut-off K for
α>2 is conventionally specified as
[MATH:
K=N1α−1 :MATH]
[[87]33], and generally K=N−1 for 1<α≤2. Values of α for different
biological networks have been characterised, and mostly vary between
1.4 to 1.7 [[88]30].
On defining the weights a [ij] and b [ij] as elements of A and B,
respectively, ([89]7) becomes
[MATH: limN→∞
∑i(di−d¯)
32
∑i(di−d¯)
23
=0, :MATH]
(10)
where
[MATH: d¯ :MATH]
is the average node degree. In order to study this limiting behaviour,
we exploit the fact that both numerator and denominator are powers of
the centralised empirical moments of the node degree distribution. We
let
[MATH:
μs=c∑d=1Kds−α :MATH]
denote the s ^th theoretical moment and
[MATH:
ms=1N
∑i=1Ndis :MATH]
the corresponding empirical moment of this distribution. In order to
study the limit above we need to characterise the order of m [s], for
s=1,2,3, as N increases. Our strategy here consists of first
characterising the order of μ [s] asymptotically, for the first three
moments, and establishing a correspondence with m [s].
We start by examining the order of μ [s], for s=1,2,3, in the limit.
Since this depends on s, we consider three distinct cases: (a) s−α+1<0,
(b) s−α+1=0 and (c) s−α+1>0. For (a), the order of μ [s] is
[MATH:
∑d=1
K1α−1d<
/mi>−1=O
(1) :MATH]
. For (b), the order of μ [s] is
[MATH:
∑d=1
Kd−1=O(
ln(K)) :MATH]
. Finally, for (c), we need to study how μ [s] increases with K. First,
we apply the Euler-Maclaurin formula,
[MATH: ∑d=1
Kd
s−α=K
s−α+1+(α−s)
∫1K
mi>xxα−s+1
mn>dx+O(1
), :MATH]
where ⌊x⌋ denotes the largest integer that is not greater than x. To
compute the order of
[MATH:
∑d=1
Kds−α :MATH]
, we need to know which one of the two terms in the sum dominates in
order. By applying l’Hospital’s rule we have
[MATH: limK→∞
s∫1Kxxα−s+1
mn>dxKs−α+1=ss−<
/mo>α+1,<
mtext> :MATH]
which is a finite constant, and hence μ [s] has the same order as K
^s−α+1. For a SF network, the condition for asymptotic normality also
depends on the values taken by the exponent. In the case where 1<α<2,
for which K=N−1, the calculation of the s ^th moment falls under case
(c), hence we conclude that the order of the first three theoretical
moments are, respectively, O(N ^2−α),O(N ^3−α) and O(N ^4−α).
We now turn to the direct comparison of the orders of μ [s] and m [s]
in the limit. Specifically, we assess whether the order of each μ [s]
established above also holds true for the corresponding m [s]. This can
be verified by checking that
[MATH: limN→
∞m
sμs=cs, :MATH]
(11)
for s=1,2,3, and for some positive constants c [s]. To study the above
limit, we apply the Weak Law of Large Numbers (WLLN). For the WLLN to
hold, μ [s] must be finite. Hence we first transform d [i] so that μ
[s], after the transformation, is finite. We let
[MATH:
Ns=Ns+1
−αs<
/msup> :MATH]
, and define
[MATH: zsi=diNs :MATH]
. The distribution of z [si] is
[MATH:
P(zsi=z)<
/mo>=c′z−α
,z=1
Ns,2
Ns,..,KN
s, :MATH]
where c ^′=c N [s]. Thus the s ^th theoretical moment of z [si] is
[MATH: μzs=
c′∑zz<
mi>s−α=c<
/mi>′∑ddNssd
−α=μs<
mrow>Ns+1−α,
:MATH]
which is finite. Denoting by m [zs] the s ^th empirical moment of z
[si],i=1,...,N, we have
[MATH: mzs=1N∑i=1Nz
sis
=1N
∑i=1NdiNs
s=msNs+1−α.
:MATH]
Now, since μ [zs] is finite and since d [1],d [2],...,d [N] are assumed
IID, z [s1],z [s2],...z [sN] are also IID, and according to the WLLN, m
[zs] converges to μ [zs] in probability. Hence we have
[MATH: 1=lim<
mrow>N→∞mzs<
mi>μzs=<
msub>limN→∞m<
mi>sN
s+1−α
μsN<
mi>s+1−α<
/mfrac>=limN→∞
msμs,
mtd>
:MATH]
indicating that m [s] and μ [s] are of the same order asymptotically.
Using this result, we are able to approximate the orders of the
numerator and denominator of condition ([90]7):
[MATH:
∑idi−d¯3=Nm3−2m2m1+2m13 :MATH]
is O(N ^4−α+1), and
[MATH:
∑idi−d¯2=N(m2−m12) :MATH]
is O(N ^3−α+1). Substituting into ([91]7), we see that the numerator is
of order O(N ^8−2α+2), the denominator is of order O(N ^9−3α+3), and
therefore the ratio is of order O(N ^α−2). Hence for 1<α<2, the limit
in ([92]10) is 0. By following a similar procedure, it can be proved
that the normality condition is also satisfied when α≥3.
Differential subnetwork detection
In this section we leverage the test statistic of Section [93]A
non-parametric test for network comparison to detect a differential
subnetwork. When comparing the two networks, the expectation is that
only a subset of edges would present altered interaction patterns. This
task is formulated here as the problem of detecting a subset V ^∗⊂V for
which there is no sufficient evidence to reject the null hypothesis
that the corresponding subnetworks
[MATH: A∗(
mo>V∗,EA
∗) :MATH]
and
[MATH: ℬ∗(
mo>V∗,EB
∗) :MATH]
are statistically independent.
An algorithm for the detection of V ^∗ should take into account the
fact that a certain degree of topological difference between
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
is always bound to be observed, even when the two population networks
are the same, due to finite sample variability. The GHD test provides
an efficient way to assess the statistical significance of any observed
discrepancy between two paired networks, and is used as a building
block to derive an algorithm that identifies differential subnetworks.
We indicate by V [K] a subset of V of size K≤N, and define the
centralised GHD test statistic computed by comparing
[MATH: A=(V
mrow>K,EA) :MATH]
and
[MATH: ℬ=(V
mrow>K,EB) :MATH]
by
[MATH:
ΔV<
mi>K=GHD(<
/mo>A(VK,E<
mi>A),ℬ(VK,E<
mi>B))−μ<
/mi>VK
msub>, :MATH]
(12)
where
[MATH:
μV<
mi>K :MATH]
is the mean of the permutation distribution for node set V [K].
Furthermore we define
[MATH:
ΔV<
mi>K|i :MATH]
to be the centralised GHD value computed by comparing the two networks
after removal of node i. The quantity
[MATH:
δi=<
/mo>ΔVK|i−
ΔV<
mi>K, :MATH]
measures the influence that node i has on the mean-centred GHD test
when comparing two subnetworks defined on set V [K]. We propose an
iterative procedure which removes a node or set of nodes at each step,
and generates a sequence of node sets of increasing smaller size, i.e.
[MATH:
VN⊃<
/mo>VN−1
mrow>⊃…⊃V
Nmin, :MATH]
where N [min]A :MATH]
and
[MATH: ℬ :MATH]
of size N=250, with parameters d=0.3 and d=0.15. For each d value,
paired networks were independently generated, and the GHD test was
computed to detect differences between them. As a result of this
process, we obtained an empirical distribution of p-values. Under the
null, this distribution is expected to be uniform on [0,1], and the
resulting QQ plots confirm that the empirical moments of this
distribution agree perfectly with the expected theoretical moments for
a RG model; see Fig. [100]3.
Fig. 3.
Fig. 3
[101]Open in a new tab
QQ plots confirming the asymptotic null distribution of the GHD test.
For RG model, 10,000 paired networks of size 250 were independently
generated and the GHD test was applied to detect differences between
them. An empirical distribution of p-values was obtained through 10,000
comparison for each model, which under the null is expected to be
uniformly distributed on [0,1]. The figure shows that empirical and
theoretical moments agree
In the second study, we compared the ability of the GHD test to detect
differential networks against three competing tests: Mean Absolute
Difference (MAD) [[102]38], Quadratic Assignment Procedure (QAP)
[[103]39] and Conditional Uniform Graph (CUG) [[104]40]. The MAD test
counts the number of different edges in the two networks
[MATH: MAD(A,ℬ)=1<
/mrow>N(N−1)∑i,j|aij−
bij|,
:MATH]
(13)
where a [ij] and b [ij] correspond to the (i,j) elements in the
adjacency matrices of
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
, respectively. The QAP uses edge set product statistics to test for
the independence between networks,
[MATH: QAP(A,ℬ)=1<
/mrow>N(N−1)∑i,jaijbij,
:MATH]
(14)
where a [ij] and b [ij] are again elements of the adjacency matrices.
For both the MAD and QAP tests we also used the traditional permutation
testing approach. We further included in the study the CUG approach.
According to this procedure, random networks are generated with
pre-determined properties, such as size and density, matching the
properties of the observed networks. For each simulated pair of random
networks, a measure of correlation between networks is computed, and
its empirical distribution is built up over many simulations. The
correlation coefficient is defined as:
[MATH: gcor(A,ℬ)=∑
i,jaij−∑i,j
aijN(N−1)
mfenced>bij−∑i,j
bijN(N−1)
mfenced>,
:MATH]
where a [ij] and b [ij] are elements of the adjacency matrices for
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
, respectively [[105]41].
This experiment required the simulation of paired networks with a
pre-specified degree of topological dissimilarity. This was achieved by
generating
[MATH: A :MATH]
first, using one of the two random models as described above. Network
[MATH: ℬ :MATH]
was then obtained by first making an exact copy of
[MATH: A :MATH]
, and then randomly shuffling a fixed proportion γ of edges so that, as
γ increases, the dissimilarity between
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
increases. For each given value of γ, we generated 1,000 pairs of
networks, computed the tests and corresponding p-values, and evaluated
the proportion of tests that rejected the null hypothesis of
independence at a 5 % significance level. The results of this study are
summarised in Fig. [106]4 where the “power” is defined as the
proportion of replications, out of 1,000, when we accept the null
hypothesis of independence. This rises from zero at γ=0.84, when
networks are still associated, to close to 1 when a lot of shuffling
has been carried out, to produce nearly independent networks. This
figure shows that for noise levels as large as γ=0.93, the tests based
on HD consider the two networks to be strongly associated. It is only
when reaching that threshold that their power starts increasing rapidly
away from zero. This suggests that the tests based on HD may be too
stringent for real application and miss importance differential
patterns. By contract, the GHD test is able to detect differences at
lower noise levels compared to other tests and capture more subtle
differences. This is not surprising as GHD is more sensitive to
topological changes, as seen in Fig. [107]1.
Fig. 4.
Fig. 4
[108]Open in a new tab
The “power” is defined as the proportion of replications when the null
hypothesis of independence is not rejected. As the noise level γ
increases, the GHD test has more power to detect true structural
changes compared to competing methods. Simulations are based on 2D RG
networks
In the third simulation study, we carried out an investigation to
assess the behaviour of the differential subnetwork detection
algorithm, and quantify its performance in comparison with other tests.
We report on experiments involving RG networks
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
of size 1,000 and generated as described above using a noise parameter
γ. Two independent subnetworks, denoted here by
[MATH: A∗ :MATH]
and
[MATH: ℬ∗ :MATH]
, were introduced by randomly selecting a subset V ^∗⊂V of size 200,
and replacing the existing edges with connections simulated from two
independent RG networks. For each value of γ, we generated 100 such
paired large networks containing smaller differential subnetworks. We
term a true positive (TP) a node that is correctly identified as
belonging to the differential subnetwork, and a false negative (FN) a
node that belongs to the subnetwork but has not been detected by the
algorithm. Similarly we define false positives (FP) and true negatives
(TN). In Table [109]1 we report the sensitivity or true positive rate
(TPR) computed as TP/(TP + FN), and the specificity (SPC) computed as
TN/(FP + TN). For comparative purposes, we have also implemented an
alternative algorithm, called dHD, which is similar to dGHD but uses
the Hamming distance instead for distance calculations. As can be
observed, both dHD and dGHD maintain high sensitivity and specificity
up to moderately high noise levels. For noise levels at the top end of
the spectrum, dHD has slightly higher sensitivity but much smaller
specificity than dGHD, indicating that it detects a larger number of
incorrect nodes.
Table 1.
Sensitivity (TPR) and specificity (SPC) of the subnetwork detection
algorithms for different values of γ, the noise level. The results are
based on simulated RG networks
γ 0.055 0.11 0.23 0.54 0.79 0.95
dGHD TPR 0.897 0.889 0.855 0.627 0.570 0.789
SPC 0.987 0.984 0.974 0.912 0.768 0.439
dHD TPR 0.914 0.904 0.872 0.725 0.712 0.862
SPC 0.978 0.971 0.956 0.843 0.567 0.201
[110]Open in a new tab
Figure [111]5 provides an example of simulated networks
[MATH: A :MATH]
and
[MATH: ℬ :MATH]
and ground truth differential subnetworks
[MATH: A∗ :MATH]
and
[MATH: ℬ∗ :MATH]
as well as the differential subnetworks
[MATH: A^∗<
/mo> :MATH]
and
[MATH: ℬ^∗<
/mo> :MATH]
detected by dGHD in one of the 100 simulations. The corresponding
sequence of p-values generated by running the dGHD algorithm in this
example is shown in Fig. [112]2. It can be noticed how the null
hypothesis of independence is rejected for all the subnetworks of size
ranging from 1000 down to 200, at which point there is no evidence to
reject the null, and the algorithm produced large p-values for all
sizes smaller than 200.
Fig. 5.
Fig. 5
[113]Open in a new tab
Example of differential subnetworks detected by dGHD using 2D RG
networks.
[MATH: A∗ :MATH]
and
[MATH: ℬ∗ :MATH]
are the true simulated independent subnetworks, and
[MATH: A^∗<
/mo> :MATH]
and
[MATH: ℬ^∗<
/mo> :MATH]
are the subnetworks detected by the algorithm (γ≈0.23). Nodes belongs
to differential subnetwork are coloured in green, and edges colored
red. Please refer to Table [114]1 for full results
Application to co-methylation networks in ovarian cancer
We present an application to a case-control epigenetic study of ovarian
cancer. The dataset for this study was originally presented in
[[115]42]. Methylation profiles for 27,578 CpGs islands were obtained
from whole blood samples in 540 women, of which 266 were samples taken
from postmenopausal women with ovarian cancer and 274 were from
age-matched healthy controls. In our analysis we set out to compare
control and case DNA co-methylation networks in search of a
differential subnetwork.
Raw data files were downloaded from GEO (repos. number [116]GSE19711),
and were obtained from Illumina Infinium 27k Human DNA methylation
Beadchip v1.2. The raw data was pre-processed by using the lumi package
in R [[117]43]. After quantile normalization, PCA applied to the beta
value was used to detect and remove extreme outliers. After quality
control, 243 control samples and 215 case samples remained for further
analysis. The networks was inferred by taking each probe as a node.
Following [[118]44], an adjacency measure was computed as ω
[ij]=|(1+cor(g [i],g [j]))/2|^b where cor(g [i],g [j]) denotes the
Pearson’s correlation coefficient between beta values observed at the i
^th and j ^th CpG sites. The power exponent b was set to a default
value of 12 so as to place more emphasis on higher positive
correlations [[119]7]. Two nodes were linked in the network if ω [ij]
was higher than 0.2 so that the presence of an edge indicates a strong
correlation. This value also yields networks that roughly follow a SF
model (see Fig. [120]6). The number of resulting edges is 48,224 and
75,913 in the control network
[MATH: A :MATH]
and case network
[MATH: ℬ :MATH]
, respectively.
Fig. 6.
Fig. 6
[121]Open in a new tab
Node degree distribution for control and case co-methylation networks.
Both plots shows that the networks roughly follow SF network models
At a significance level of 5 % and after correction for multiple
testing, the dGHD algorithm detected a subnetwork of size 1,642, with
1,954 edges in
[MATH: A∗ :MATH]
and 12,556 edges in
[MATH: ℬ∗ :MATH]
. The two resulting subnetworks are presented in Fig. [122]7. Although
the algorithm does not constrain the differential networks to be
connected, they both comprise a number of connected subgraphs. The
Walktrap community detection algorithm, as implemented in the R package
iGraph [[123]45], was used to identify communities in these two
subnetworks, as shown in the figure. The density of the six largest
communities, which are denoted
[MATH: C1,
mo>…,C6 :MATH]
, differs quite substantially between control and cancer networks. In
almost all communities, the density is much higher in
[MATH: ℬ∗ :MATH]
, with the exception of
[MATH: C6 :MATH]
, where it is higher in
[MATH: A∗ :MATH]
.
Fig. 7.
Fig. 7
[124]Open in a new tab
DNA co-methylation networks: differential subnetworks
[MATH: A∗ :MATH]
(controls) and
[MATH: ℬ∗ :MATH]
(cases) detected by dGHD algorithm. Six main communities within the
subnetworks are characterised by a much higher network density in
cancer patients compared to healthy controls. Differential methylation
is mostly concentrated in
[MATH: C2 :MATH]
,
[MATH: C3 :MATH]
,
[MATH: C5 :MATH]
and
[MATH: C6 :MATH]
(See also Table [125]2 and Fig. [126]8)
To gain initial insight into the biological meaning of the subnetworks
and the communities within them we used the R package GOstat [[127]46]
to identify enriched Gene Ontology (GO) terms within two broad
categories, Biological Processes (BP) and Molecular Functions (MF). At
a 5 % significance level, the hypergeometric test detected 762 BP and
154 MF statistically significant terms enriched in the subnetworks
where most of these terms can be found in 6 communities. For instance,
the top three BPs were response to stimulus, cellular response to
stimulus and response to chemical stimulus, and the top three MFs were
protein binding, collagen binding and RNA polymerase II transcription
cofactor activity. Furthermore, we carried out a pathway enrichment
analysis to identify any significantly enriched KEGG pathways. At a 5 %
significance level, 12 pathways were found to be enriched, including
hematopoietic cell lineage, acute myeloid leukemia, and regulation of
action cytoskeleton.
Probes showing statistically significant changes in mean methylation
levels were detected by a two-sample SAM statistic as implemented in
the R package samr. After Benjamini & Hochberg correction for multiple
testing, 2,770 probes were found to be differentially methylated (DM)
at the 5 % significance level. Of these, 620 were also found in the
differential subnetworks, 90 % of which are concentrated in communities
[MATH: C2,
mo>C3,
mo>C5 :MATH]
and
[MATH: C6 :MATH]
. For example in community
[MATH: C3 :MATH]
, there are 109 probes in total, half of which (54) are differentially
methylated. Figure [128]8 shows the distribution of DM probes in the
subnetworks. These results suggest that a differential analysis based
exclusively on detecting mean levels of differential methylation may
miss important differences that can only be identified by comparing the
interaction networks.
Fig. 8.
Fig. 8
[129]Open in a new tab
Visualization of the distribution of differential methylated probes
(red) in differential subnetworks detected by dGHD in the DNA
co-methylation networks
Table [130]2 provides a breakdown of the number of probes,
differentially methylated probes (q [i]), density ratio between control
and case subnetworks (R [i]), and distribution of enriched GO terms and
KEGG pathways in the 6 communities (see also Fig. [131]7). Replicated
GO terms and pathways involved in different communities were excluded
in the subtotal. In
[MATH: C5 :MATH]
we found that all top 6 ranked significant BP terms were related to
interleukin-3 (IL-3), a cytokine that is made by leukocytes and other
cells in the body. IL-3 can increase the number of leukocytes,
neutrophils, and platelets made by the bone marrow [[132]47]. As
Myelosuppression induced by chemotherapy is closely related to the
effect of IL-3 in blood cells when suppressing a tumor during the
therapy [[133]48], this may offer a possible explanation for the
observed enrichment results. A possible explanation for the observed
difference in the
[MATH: C6 :MATH]
cluster may be related to hypermethylation being linked to cancer
[[134]49, [135]50].
Table 2.
DNA co-methylation networks: a summary for different communities
[MATH: C1 :MATH]
[MATH: C2 :MATH]
[MATH: C3 :MATH]
[MATH: C4 :MATH]
[MATH: C5 :MATH]
[MATH: C6 :MATH]
Subtotal Overall
# of probes 418 66 109 34 347 200 1174 1642
q [i] 4 66 54 1 338 97 560 620
R [i] .181 .013 .012 0 .002 23.4 .145 .156
BP 320 25 38 22 236 54 568 762
MF 54 4 15 3 43 27 125 154
KEGG 5 0 1 1 0 1 8 12
[136]Open in a new tab
Conclusions
The comparison of DNA methylation or gene expression profiles across
conditions is enabling the discovery of novel biomarkers for diagnosis
or prognosis, and holds the promise to identify novel targets for
therapeutical intervention. In this paper we have discussed the problem
of comparing two labelled networks that are representative of two
conditions (e.g. healthy and diseased tissues) and detecting
statistically significant differences in their topology. Identifying
disrupted interaction patterns in two labelled network comparisons is a
challenging problem requiring novel statistical tools, and three
contributions have been made here in this direction. Firstly, we have
proposed the GHD, a distance between two labelled networks that detects
more subtle differences compared to the traditional Hamming distance.
Secondly, we have demonstrated that the GHD can be used as a
non-parametric test to assess whether two paired networks are
statistically independent, and have described how p-values can be
computed in closed-form without requiring computationally expensive
permutation procedures. The plausibility of the conditions underpinning
our derivations has been discussed using scale-free random network
models as an example. Thirdly, we have proposed a fast subnetwork
detection procedure, the dGHD algorithm, to detect localized
topological differences between two paired networks. This methodology
provides a useful addition to standard two-sample tests that are
commonly used for biomarker discovery. An initial evaluation has been
carried out by comparing co-methylation networks inferred from healthy
and cancer patients, and detecting differential subnetworks. Further
experimental evaluation on independent datasets will be required to
validate these results. In future work, the methodology could be
extended to the case of more than two conditions.
Acknowledgements