Abstract
Background
Detecting and visualizing nonlinear interaction effects of single
nucleotide polymorphisms (SNPs) or epistatic interactions are important
topics in bioinformatics since they play an important role in
unraveling the mystery of “missing heritability”. However, related
studies are almost limited to pairwise epistatic interactions due to
their methodological and computational challenges.
Results
We develop CINOEDV (Co-Information based N-Order Epistasis Detector and
Visualizer) for the detection and visualization of epistatic
interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed
of two stages, namely, detecting stage and visualizing stage. In
detecting stage, co-information based measures are employed to quantify
association effects of n-order SNP combinations to the phenotype, and
two types of search strategies are introduced to identify n-order
epistatic interactions: an exhaustive search and a particle swarm
optimization based search. In visualizing stage, all detected n-order
epistatic interactions are used to construct a hypergraph, where a real
vertex represents the main effect of a SNP and a virtual vertex denotes
the interaction effect of an n-order epistatic interaction. By deeply
analyzing the constructed hypergraph, some hidden clues for better
understanding the underlying genetic architecture of complex diseases
could be revealed.
Conclusions
Experiments of CINOEDV and its comparison with existing
state-of-the-art methods are performed on both simulation data sets and
a real data set of age-related macular degeneration. Results
demonstrate that CINOEDV is promising in detecting and visualizing
n-order epistatic interactions. CINOEDV is implemented in R and is
freely available from R CRAN: [35]http://cran.r-project.org and
[36]https://sourceforge.net/projects/cinoedv/files/.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-016-1076-8)
contains supplementary material, which is available to authorized
users.
Keywords: Epistatic interactions, Co-information, Single nucleotide
polymorphisms, Particle swarm optimization, Hypergraph
Background
Following the development of high-throughput sequencing and genotyping
technologies, there has been a rapid increase in the availability of
single nucleotide polymorphisms (SNPs). Hence genome-wide association
studies (GWAS) have become a routine tool in investigating the genetic
architectures of complex diseases, such as cancer, heart disease,
diabetes and many others. With these studies, hundreds of thousands of
SNPs speculated to associate with complex diseases have been
identified. However, these SNPs have been shown to explain only a small
proportion of underlying genetic variance of complex diseases, leaving
the question of “missing heritability” open for further investigation
[[37]1, [38]2].
Some plausible explanations should be taken into account to reveal the
gaps between expectations and realities of GWAS. Firstly, GWAS require
p-values (or other similar measures) of disease-associated SNPs to
reach a genome-wide significance level after several stringent multiple
testing corrections, for example, Bonferroni correction, which may
exclude many genuinely associated SNPs that have moderate or weak
association signals [[39]3]. Secondly, rare SNPs (i.e., minor allele
frequency of each is < 5 %) are difficult to be detected, and sometimes
even be ignored in GWAS, though they may play an important role in
explaining “missing heritability” [[40]1, [41]4, [42]5]. Thirdly,
besides SNPs, other types of biological data, for instance, copy number
variation, DNA methylation, and gene expression, also provide
different, partly independent and complementary, views for unraveling
the mystery of “missing heritability” [[43]6, [44]7]. Fourthly, it is
widely believed that nonlinear interaction effects of multiple SNPs or
epistatic interactions could unveil a large portion of unexplained
heritability of complex diseases [[45]8–[46]11]. In fact, detection of
epistatic interactions has already been a compelling step in GWAS
[[47]12].
In general, detection of epistatic interactions is of great challenge.
The first challenge is the intensive computational burden mainly
imposed by the “curse of dimensionality” and the “combinatorial
explosion”, which has significant implications for GWAS with millions
of SNPs. For instance, search space of a 100 K SNP data set with
maximum order of three is an astronomical number ∑^3[k = 1]C^k[100000],
where the order refers to the number of SNPs in a SNP combination. The
second challenge is the complexity of genetic architecture of a
disease. It may involve multiple epistatic interactions interacting
with other causative factors in a complicated way, each displaying
strong association with the phenotype as a whole but the contained SNPs
possibly having small or even no main effects. Limited prior knowledge
available for a disease, such as the number of epistatic interactions,
the order and the effect magnitude of an epistatic interaction, makes
their detection difficult. The third is the association measure that
determines how well a SNP combination contributes to the phenotype. A
suitable association measure is required to be efficient in
computational cost and insensitive to both SNP combination order and
effect type, and more importantly, it can truly capture causative
epistatic interactions. Though several association measures have been
widely used for the detection of epistatic interactions, such as
permutation test and chi-squared test, developing new association
measures that can effectively and efficiently capture epistatic
interactions is still a direction. All the above are the great
challenges in genome-wide interaction analysis.
Though methodological and computational perplexities of the detection
of epistatic interactions have been well recognized, the algorithmic
development is still ongoing. Exhaustive methods, e.g., MDR [[48]13],
show their successes on small scale data sets. However, for large scale
data sets, especially those for GWAS, the detection of epistatic
interactions becomes a needles-in-a-haystack problem [[49]14] and
exhaustive methods are no longer feasible. Recently, heuristic methods
are gaining increasing favor since they can retain as many informative
SNPs as possible while largely reducing computational complexity. For
instance, Zhang et al. developed TEAM [[50]15] to identify epistatic
interactions, which updates contingency tables by utilizing a minimum
spanning tree. Wan et al. presented an epistatic interaction detection
method BOOST [[51]16], which involves only Boolean values and allows
the use of fast logic operations to obtain contingency tables. They
also proposed another method SNPRuler [[52]17] based on predictive rule
inference. Wang et al. used AntEpiSeeker [[53]18] to identify epistatic
interactions, which is a two-stage ant colony optimization algorithm.
Zhang and Liu developed a Bayesian partition approach BEAM to find
groups of genotypes with large posterior probability [[54]19]. Tang et
al. introduced the concept of epistatic module and designed a Gibbs
sampling approach epiMODE [[55]20] to detect such modules, which is a
generalization of BEAM.
Besides these methods, co-information based methods appear promising in
detecting epistatic interactions since they have a well-developed
theory, and can measure multivariate dependence without any complex
modeling. Chanda et al. [[56]21] developed a co-information based
metric called the interaction index for prioritizing interacting SNPs.
They also proposed another three co-information based methods: AMBIENCE
[[57]22] and KWII [[58]23] for detecting epistatic interactions
associated with the binary phenotype, CHORUS [[59]24] for identifying
associations with quantitative traits. Sucheston et al. [[60]25]
demonstrated that co-information based methods are flexible and have
excellent power to detect epistatic interactions under a variety of
conditions that characterize complex diseases.
Although many methods for detecting epistatic interactions have been
performed, most of them were constrained to pairwise epistatic
interactions, easily ignoring the broader epistasis landscape [[61]26].
Furthermore, these methods usually output identified epistatic
interactions, as well as their significance levels, in flat file
formats. Hence, the correct understanding of them is sometimes a
challenge for researchers, especially for biologists and non-expert
users, who are unfamiliar with the methods. A desirable strategy is to
develop an effective visualization tool not only to intuitively
visualize the detected interactions but also to discover several hidden
patterns [[62]27].
However, there have been few studies focused on their visualization.
Moore et al. [[63]28] built an interaction graph to visualize detected
epistatic interactions. McKinney et al. [[64]29] constructed a genetic
association interaction network (GAIN) to characterize detailed
interactions, whose edges quantify the synergy between pair SNPs with
respect to the phenotype. GAIN has been successfully used for
identifying modulators of antibody response to smallpox vaccine
[[65]30], GWAS of bipolar disorder [[66]31], and analyzing exome data
for systemic lupus erythematous cases and controls [[67]32]. Hu et al.
[[68]33] proposed a statistical epistasis network (SEN) approach, which
has been proven to be able to discover pairwise epistatic interactions
of bladder cancer [[69]34, [70]35] and prostate cancer [[71]36]. They
also demonstrated that SEN supervised search is able to infer several
3-order epistatic interactions with significantly high associations at
a substantially reduced computational cost [[72]37]. Though these
network-assisted methods can provide a global map of pairwise epistatic
interactions, can indirectly capture higher order epistatic
interactions on the basis of observing topology structures of networks,
and can be exported for visualization in existing tools, such as
Cytoscape and Graphviz, they could not directly detect and visualize
n-order epistatic interactions, for example, n = 3 or larger. More
recently, Hu et al. [[73]38] presented a visualization tool ViSEN,
which can show both pairwise and 3-order epistatic interactions, in
addition to main effects, in one network. To the best of our knowledge,
it is the first visualization tool that shows three orders of effects
simultaneously. Nevertheless, different orders of effects in ViSEN are
difficult to be fairly and intuitively compared. Wu et al. [[74]27]
designed another visualization tool EINVis to analyze and explore
genetic interactions, which utilizes a tree ring view to simultaneously
visualize the hierarchical interactions between SNPs, genes, and
chromosomes. However, EINVis is limited in detecting and visualizing
high order epistatic interactions.
In the light of above observations, we develop CINOEDV (Co-Information
based NOrder Epistasis Detector and Visualizer) for the detection and
visualization of epistatic interactions of their orders from 1 to n
(n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and
visualizing stage. In detecting stage, co-information based measures
are employed to quantify association effects of n-order SNP
combinations to the phenotype, and two types of search strategies are
introduced to identify n-order epistatic interactions: an exhaustive
search for lower order epistatic interactions and/or small scale data
sets, a particle swarm optimization (PSO) based search for higher order
epistatic interactions and/or large scale data sets. In visualizing
stage, all detected n-order epistatic interactions are used to
construct a hypergraph, where a real vertex represents the main effect
of a SNP and a virtual vertex denotes the interaction effect of an
n-order epistatic interaction. By deeply analyzing the constructed
hypergraph, some hidden clues for better understanding the underlying
genetic architecture of complex diseases could be revealed, for
instance, higher order epistatic interactions, hub SNPs and connected
subgraphs. Experiments of CINOEDV and its comparison with
state-of-the-art methods are performed on lots of simulation data sets
under the evaluation measures of both detection power and computational
complexity. Results demonstrate that CINOEDV is promising in detecting
and visualizing n-order epistatic interactions. In addition, CINOEDV is
also applied on a real data set of age-related macular degeneration
(AMD), and results of which provide several new clues for the
exploration of causative factors of AMD. CINOEDV might be an
alternative to existing methods for the detection and visualization of
n-order epistatic interactions.
Methods
Co-information based association measures
Before introducing the measures, several terms and notations are
described. At present, the generally accepted way of mapping SNPs is to
collect them as a matrix, where a row represents genotypes of an
individual and a column represents a SNP. Genotypes of a SNP are coded
as {0, 1, 2}, corresponding to homozygous common genotype, heterozygous
genotype, and homozygous minor genotype. The label of an individual is
a binary phenotype being either 0 (control) or 1 (case). Based on this
numerical mapping, let N and M be the number of SNPs and the number of
individuals in the data respectively. Below we will discuss the
definitions of co-information based association measures between n SNPs
S[1], ⋯, S[n], that randomly sampled from N SNPs, and the phenotype C.
Co-information is one of several generalizations of mutual information,
and can measure multivariate dependence without any complex modeling
[[75]39]. Co-information among n SNPs and the phenotype C is defined as
an alternating sum of the joint entropies of all possible subsets T of
V using the difference operator notation of Han [[76]40],
[MATH: CIS1⋯S
nC=−∑T⊆V−1n+1−THT, :MATH]
where V = {S[1], ⋯, S[n], C}, T represents all possible subsets of V,
and H(T) is the joint entropy of T, which can be written as
[MATH: HT=−∑t∈Tptlogpt, :MATH]
and p(t) is the probability mass function.
It is seen that co-information is a parsimonious, multivariate measure
quantifying interactions that cannot be obtained without observing all
variables at the same time [[77]24], and it seems promising for
detecting n-order epistatic interactions. However, it also has two
confusing properties retarded its wider adoption as an association
measure.
The first is its value. In the bivariate case, co-information is
equivalent to mutual information and its value is always positive. But
in the multivariate case, its value can be positive or negative, the
interpretation of which is generally intuitive [[78]26]: a positive
value is an evidence of interactions among variables; a negative value
indicates the presence of redundancy; and a value of zero denotes that
variables are independent or, more likely, interact with a mixture of
synergy and redundancy. Almost all existing applications of
co-information depend upon this intuitive explanation [[79]21, [80]24,
[81]28, [82]29, [83]33, [84]38], and it is also the basis of our
association measures.
The second is its sensitivity to the SNP combination order. This
property leads to difficulty in ranking SNP combinations of different
orders. As yet, it still lacks the widely accepted normalization
method. In this study, we make use of the order-fixed averages of
co-information values to normalize them, defined as n-order interaction
effect,
[MATH:
NCI(S1;⋯;Sn;C)=CI(S1;⋯;Sn;C)H(C)⋅CI1<
mo stretchy="true">¯CIn<
mo stretchy="true">¯, :MATH]
where H(C) is the entropy of the phenotype,
[MATH: CIn<
mo stretchy="true">¯ :MATH]
and
[MATH: CI1<
mo accent="false">¯ :MATH]
are respective averages of all considered n-order and 1-order
co-information values. The first part of the formula provides the
percentage of explaining the phenotype by giving the knowledge of n
SNPs, and the second part is a coefficient that balances the
contributions of SNP combinations of different orders to the phenotype.
However, NCI only measures contribution of a SNP combination itself,
not containing contributions of its subsets. In fact, the effect of an
n-order SNP combination to the phenotype consists of main effects of
all involved SNPs, as well as interaction effects of itself and its all
subsets. To quantify the total contribution of a SNP combination to the
phenotype, another co-information based association measure is
presented, defined as the summation of all involved contributions,
including its contribution, and contributions of its subsets whose NCI
values reach the user-specified thresholds. The formula can be written
as
[MATH: CCIS1⋯S
nC=∑Z⊆CS∩Z⊆S1⋯S<
/mi>nNCIZ′C, :MATH]
where Z^′ represents all SNPs in the set Z, C represents the phenotype,
and CS is a set of SNP combinations that their NCI values pass the
user-specified thresholds.
Search strategies
CINOEDV supports two types of search strategies, with NCI as its
association measure, to simultaneously detect epistatic interactions of
their orders from 1 to n (n ≥ 2). One is an exhaustive search for lower
order epistatic interactions and/or small scale data sets. With genome
wide SNPs from thousands of individuals, it is difficult to search high
order epistatic interactions exhaustively because of their heavy
computational burden. CINOEDV provides a PSO based search for higher
order epistatic interactions and/or large scale data sets.
The PSO is a popular member of swarm intelligence algorithms inspired
by the collective behaviors of organisms, like birds (viewed as
particles), which can jointly perform many complex tasks though each
individual is very limited in its capability [[85]41]. In PSO, the
position of a particle represents a possible solution which is adjusted
according to its velocity, and estimated by a fitness function at each
generation. A higher fitness value implies a better position. The
velocity of a particle is updated according to three factors: its
previous velocity, its individual experience, and the common knowledge
of the swarm. The individual experience of a particle is the best
position that it has travelled. The common knowledge of the swarm is
the best one among individual experiences of all particles. This
feedback strategy leads the swarm gradually converge to an optimal
solution [[86]42].
In our PSO based search, NCI is applied as its fitness function. That
is to say, a higher NCI value indicates a stronger association between
the SNP combination and the phenotype. Compared with the PSO, our PSO
based search has its own highlights: detecting multiple epistatic
interactions with different orders at the same time, dynamic inertia
weight, and opposition based learning.
Suppose
[MATH: Positiongq=Sq1<
mi>g⋯Sqk
mi>g⋯Sq<
/mi>Kqg :MATH]
is the position of the q[th] particle at iteration g, where
q ∊ {1, ⋯, Q}, g ∊ {1, ⋯, G}, k ∊ {1, ⋯, K[q]}, S[qk]^g is the selected
k[th] SNP of the q[th] particle at iteration g, Q is the number of
particles, G is the number of iterations, and K[q] is the considered
order of epistatic interactions of the q[th] particle, which is
randomly specified within [1, n] at the initialization stage. The
velocity of the q[th] particle at iteration g is denoted as
[MATH: Velocitygq=vq1<
mi>g⋯vqk
mi>g⋯vq<
/mi>Kqg :MATH]
, where v[qk]^g is the velocity of S[qk]^g. The individual experience
of the q[th] particle is written as
[MATH: Pbestgq=PSq1<
/mrow>g,⋯,PSqkg
,⋯,PSq
Kqg :MATH]
. The common knowledge of the swarm is redefined as the best ones among
individual experiences of particles with the same considered orders,
i.e., Gbest[g]^K = (GS[1]^g, ⋯, GS[k]^g, ⋯, GS[K]^g), where K ∊ [1, n].
During the initialization stage, Position[1](q), Velocity[1](q),
Pbest[1](q) and Gbest[1]^K are randomly initialized in their respective
domains.
The PSO based search detects epistatic interactions by continuously
updating velocity and position of each particle at all iterations. The
velocity of S[qk]^g is updated according to the following two
equations,
[MATH: v˜qkg+1=W
mi>qkg⋅vqkg
+c1⋅r1<
/mn>⋅PSqk<
/mrow>g−Sq
kg+<
mi>c2⋅r2⋅GSkg−Sqkg<
/mi>, :MATH]
[MATH:
vqkg+1=v˜qkg+1v˜qkg+1∈1−N,N−<
/mo>1rand1−N,N−<
/mo>1v˜qkg+1∉1−N,N−<
/mo>1,<
/mtext> :MATH]
where acceleration factors c[1] and c[2] control how far a particle
moves in a single iteration, r[1] and r[2] are random values in (0, 1),
GS[k]^g is the k[th] SNP of
[MATH: GbestgK<
/mi>q :MATH]
(K = K[q]) at iteration g, W[qk]^g is the inertia weight regulating the
impact of the previous velocity of a particle on its current velocity.
For the inertia weight, a large weight facilitates the global
exploration and thus enables the method to execute a search over
various regions, while a small weight facilitates the local
exploitation, which helps to search a promising region. In order to
balance the global exploration and the local exploitation, a dynamical
inertia weight is introduced, defined as
[MATH:
Wqkg=maxcountg<
/mfenced>−countgPSqk<
/mrow>gmaxcountg<
/mfenced>−mincountg<
/mfenced>, :MATH]
where count[g] = (ct[1]^g, ⋯, ct[m]^g, ⋯, ct[N]^g), and ct[m]^g is a
counter that counts the number of SNP m presented in Pbest from
iteration 1 to iteration g. This strategy allows particles to cover a
wider search space while the considered SNP is likely to be a random
one, and to converge on a promising region of the search space while
capturing a highly suspected SNP.
Based on v[qk]^g+1, the position of S[qk]^g is updated to S[qk]^g+1
using the following two equations,
[MATH: S˜qkg+1=S
mi>qkg+vqkg+1, :MATH]
[MATH:
Sqkg+1=intS˜qkg+1S˜qkg+1∈1Nintrand1NS˜qkg+1∉1N, :MATH]
where rand( ⋅ ) is the random function and int(⋅) is the rounding
function. Random functions used for updating both v[qk]^g+1 and
S[qk]^g+1 help to increase the diversity of the search.
Another highlight introduced to the PSO based search is the opposition
based learning, basic principle of which is the consideration of a
solution and its corresponding opposite solution simultaneously to
approximate the global optima [[87]43]. In our PSO based search, if the
solution is Position[g](q), its corresponding opposite solution is
defined as
[MATH: Positiong'q=1+N−Positiongq, :MATH]
which not only expands the search space and enhances the global
explorative ability, but also accelerates the convergence and avoids
premature convergence.
By comparing NCI values of Position[g]^’(q), Position[g](q) and
Pbest[g](q), the individual experience of the q[th] particle at
iteration g + 1, i.e., Pbest[g+1](q), is updated to the best one among
them. Similarly, whether the common knowledge of the swarm at iteration
g + 1, e.g., Gbest[g+1]^K, is updated or maintained as Gbest[g]^K
depends on individual experiences of particles with the same order K.
Specifically, Gbest[g+1]^K is updated to Pbest[g+1](q) while NCI value
of Pbest[g+1](q) is the highest one among those of individual
experiences of particles with the order K, and is also higher than that
of Gbest[g]^K. When completing the iteration process, the PSO based
search reports the sorted Pbest[G] according to their descending NCI
values as its detected epistatic interactions.
Epistasis hypergraph and deep analyses
In visualizing stage, lists of all epistatic interactions identified in
detecting stage, or similar lists generated by other methods, are
exported to construct an undirected epistasis hypergraph for refining
the interpretation of genetic basis of disease susceptibility and
disease etiology, by capturing and visualizing broader epistasis
landscape. The hypergraph is composed of weighted vertices and
unweighted edges. For weighted vertices, two types of them are
presented, that is, real vertices and virtual vertices. A real vertex
represents a SNP and its weight is the NCI value between the SNP and
the phenotype, corresponding to the main effect of the SNP to the
phenotype. A virtual vertex denotes the n-order interaction effect of
the combination of linking SNPs to the phenotype, and its weight is the
NCI value between the SNP combination and the phenotype. In the
hypergraph, each red circle is a real vertex in which SNP name or index
is labeled, each non-red circle is a virtual vertex. Sizes of vertices
are respectively in proportion to their weights. For unweighted edges,
each of them links a SNP and an effect of this SNP to the phenotype.
From the hypergraph, effects of different orders, especially high
orders, as well as topology structures of SNPs, can be intuitively
visualized and compared.
For deep analyses of the constructed hypergraph, several useful tools
are also provided in CINOEDV. For example, epistatic interactions can
be visually displayed according to their descending effects, either NCI
values or CCI values; penetrance of an epistatic interaction can be
estimated and visualized; degree of a real vertex in the hypergraph and
connectivity of the hypergraph can be further analyzed. More details
about these tools are available in its user manual. With the help of
these tools, some hidden clues for better understanding the underlying
genetic architecture of complex diseases could be revealed.
Results and Discussion
Experiments on simulation data
Detection power analysis for pairwise epistatic interactions
Six commonly used models of epistatic interactions are simulated for
the study. Details of these models are given in Fig. [88]1. For each
model, 200 data sets are generated by the simulator epiSIM [[89]44],
which describes the simulation steps of SNP data sets as well as true
epistatic interactions of SNPs in detail, and has been used in several
references [[90]26, [91]45, [92]46]. Among them, each data set contains