Abstract Background Detecting and visualizing nonlinear interaction effects of single nucleotide polymorphisms (SNPs) or epistatic interactions are important topics in bioinformatics since they play an important role in unraveling the mystery of “missing heritability”. However, related studies are almost limited to pairwise epistatic interactions due to their methodological and computational challenges. Results We develop CINOEDV (Co-Information based N-Order Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and visualizing stage. In detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search and a particle swarm optimization based search. In visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By deeply analyzing the constructed hypergraph, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. Conclusions Experiments of CINOEDV and its comparison with existing state-of-the-art methods are performed on both simulation data sets and a real data set of age-related macular degeneration. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. CINOEDV is implemented in R and is freely available from R CRAN: [35]http://cran.r-project.org and [36]https://sourceforge.net/projects/cinoedv/files/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1076-8) contains supplementary material, which is available to authorized users. Keywords: Epistatic interactions, Co-information, Single nucleotide polymorphisms, Particle swarm optimization, Hypergraph Background Following the development of high-throughput sequencing and genotyping technologies, there has been a rapid increase in the availability of single nucleotide polymorphisms (SNPs). Hence genome-wide association studies (GWAS) have become a routine tool in investigating the genetic architectures of complex diseases, such as cancer, heart disease, diabetes and many others. With these studies, hundreds of thousands of SNPs speculated to associate with complex diseases have been identified. However, these SNPs have been shown to explain only a small proportion of underlying genetic variance of complex diseases, leaving the question of “missing heritability” open for further investigation [[37]1, [38]2]. Some plausible explanations should be taken into account to reveal the gaps between expectations and realities of GWAS. Firstly, GWAS require p-values (or other similar measures) of disease-associated SNPs to reach a genome-wide significance level after several stringent multiple testing corrections, for example, Bonferroni correction, which may exclude many genuinely associated SNPs that have moderate or weak association signals [[39]3]. Secondly, rare SNPs (i.e., minor allele frequency of each is < 5 %) are difficult to be detected, and sometimes even be ignored in GWAS, though they may play an important role in explaining “missing heritability” [[40]1, [41]4, [42]5]. Thirdly, besides SNPs, other types of biological data, for instance, copy number variation, DNA methylation, and gene expression, also provide different, partly independent and complementary, views for unraveling the mystery of “missing heritability” [[43]6, [44]7]. Fourthly, it is widely believed that nonlinear interaction effects of multiple SNPs or epistatic interactions could unveil a large portion of unexplained heritability of complex diseases [[45]8–[46]11]. In fact, detection of epistatic interactions has already been a compelling step in GWAS [[47]12]. In general, detection of epistatic interactions is of great challenge. The first challenge is the intensive computational burden mainly imposed by the “curse of dimensionality” and the “combinatorial explosion”, which has significant implications for GWAS with millions of SNPs. For instance, search space of a 100 K SNP data set with maximum order of three is an astronomical number ∑^3[k = 1]C^k[100000], where the order refers to the number of SNPs in a SNP combination. The second challenge is the complexity of genetic architecture of a disease. It may involve multiple epistatic interactions interacting with other causative factors in a complicated way, each displaying strong association with the phenotype as a whole but the contained SNPs possibly having small or even no main effects. Limited prior knowledge available for a disease, such as the number of epistatic interactions, the order and the effect magnitude of an epistatic interaction, makes their detection difficult. The third is the association measure that determines how well a SNP combination contributes to the phenotype. A suitable association measure is required to be efficient in computational cost and insensitive to both SNP combination order and effect type, and more importantly, it can truly capture causative epistatic interactions. Though several association measures have been widely used for the detection of epistatic interactions, such as permutation test and chi-squared test, developing new association measures that can effectively and efficiently capture epistatic interactions is still a direction. All the above are the great challenges in genome-wide interaction analysis. Though methodological and computational perplexities of the detection of epistatic interactions have been well recognized, the algorithmic development is still ongoing. Exhaustive methods, e.g., MDR [[48]13], show their successes on small scale data sets. However, for large scale data sets, especially those for GWAS, the detection of epistatic interactions becomes a needles-in-a-haystack problem [[49]14] and exhaustive methods are no longer feasible. Recently, heuristic methods are gaining increasing favor since they can retain as many informative SNPs as possible while largely reducing computational complexity. For instance, Zhang et al. developed TEAM [[50]15] to identify epistatic interactions, which updates contingency tables by utilizing a minimum spanning tree. Wan et al. presented an epistatic interaction detection method BOOST [[51]16], which involves only Boolean values and allows the use of fast logic operations to obtain contingency tables. They also proposed another method SNPRuler [[52]17] based on predictive rule inference. Wang et al. used AntEpiSeeker [[53]18] to identify epistatic interactions, which is a two-stage ant colony optimization algorithm. Zhang and Liu developed a Bayesian partition approach BEAM to find groups of genotypes with large posterior probability [[54]19]. Tang et al. introduced the concept of epistatic module and designed a Gibbs sampling approach epiMODE [[55]20] to detect such modules, which is a generalization of BEAM. Besides these methods, co-information based methods appear promising in detecting epistatic interactions since they have a well-developed theory, and can measure multivariate dependence without any complex modeling. Chanda et al. [[56]21] developed a co-information based metric called the interaction index for prioritizing interacting SNPs. They also proposed another three co-information based methods: AMBIENCE [[57]22] and KWII [[58]23] for detecting epistatic interactions associated with the binary phenotype, CHORUS [[59]24] for identifying associations with quantitative traits. Sucheston et al. [[60]25] demonstrated that co-information based methods are flexible and have excellent power to detect epistatic interactions under a variety of conditions that characterize complex diseases. Although many methods for detecting epistatic interactions have been performed, most of them were constrained to pairwise epistatic interactions, easily ignoring the broader epistasis landscape [[61]26]. Furthermore, these methods usually output identified epistatic interactions, as well as their significance levels, in flat file formats. Hence, the correct understanding of them is sometimes a challenge for researchers, especially for biologists and non-expert users, who are unfamiliar with the methods. A desirable strategy is to develop an effective visualization tool not only to intuitively visualize the detected interactions but also to discover several hidden patterns [[62]27]. However, there have been few studies focused on their visualization. Moore et al. [[63]28] built an interaction graph to visualize detected epistatic interactions. McKinney et al. [[64]29] constructed a genetic association interaction network (GAIN) to characterize detailed interactions, whose edges quantify the synergy between pair SNPs with respect to the phenotype. GAIN has been successfully used for identifying modulators of antibody response to smallpox vaccine [[65]30], GWAS of bipolar disorder [[66]31], and analyzing exome data for systemic lupus erythematous cases and controls [[67]32]. Hu et al. [[68]33] proposed a statistical epistasis network (SEN) approach, which has been proven to be able to discover pairwise epistatic interactions of bladder cancer [[69]34, [70]35] and prostate cancer [[71]36]. They also demonstrated that SEN supervised search is able to infer several 3-order epistatic interactions with significantly high associations at a substantially reduced computational cost [[72]37]. Though these network-assisted methods can provide a global map of pairwise epistatic interactions, can indirectly capture higher order epistatic interactions on the basis of observing topology structures of networks, and can be exported for visualization in existing tools, such as Cytoscape and Graphviz, they could not directly detect and visualize n-order epistatic interactions, for example, n = 3 or larger. More recently, Hu et al. [[73]38] presented a visualization tool ViSEN, which can show both pairwise and 3-order epistatic interactions, in addition to main effects, in one network. To the best of our knowledge, it is the first visualization tool that shows three orders of effects simultaneously. Nevertheless, different orders of effects in ViSEN are difficult to be fairly and intuitively compared. Wu et al. [[74]27] designed another visualization tool EINVis to analyze and explore genetic interactions, which utilizes a tree ring view to simultaneously visualize the hierarchical interactions between SNPs, genes, and chromosomes. However, EINVis is limited in detecting and visualizing high order epistatic interactions. In the light of above observations, we develop CINOEDV (Co-Information based NOrder Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and visualizing stage. In detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search for lower order epistatic interactions and/or small scale data sets, a particle swarm optimization (PSO) based search for higher order epistatic interactions and/or large scale data sets. In visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By deeply analyzing the constructed hypergraph, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed, for instance, higher order epistatic interactions, hub SNPs and connected subgraphs. Experiments of CINOEDV and its comparison with state-of-the-art methods are performed on lots of simulation data sets under the evaluation measures of both detection power and computational complexity. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. In addition, CINOEDV is also applied on a real data set of age-related macular degeneration (AMD), and results of which provide several new clues for the exploration of causative factors of AMD. CINOEDV might be an alternative to existing methods for the detection and visualization of n-order epistatic interactions. Methods Co-information based association measures Before introducing the measures, several terms and notations are described. At present, the generally accepted way of mapping SNPs is to collect them as a matrix, where a row represents genotypes of an individual and a column represents a SNP. Genotypes of a SNP are coded as {0, 1, 2}, corresponding to homozygous common genotype, heterozygous genotype, and homozygous minor genotype. The label of an individual is a binary phenotype being either 0 (control) or 1 (case). Based on this numerical mapping, let N and M be the number of SNPs and the number of individuals in the data respectively. Below we will discuss the definitions of co-information based association measures between n SNPs S[1], ⋯, S[n], that randomly sampled from N SNPs, and the phenotype C. Co-information is one of several generalizations of mutual information, and can measure multivariate dependence without any complex modeling [[75]39]. Co-information among n SNPs and the phenotype C is defined as an alternating sum of the joint entropies of all possible subsets T of V using the difference operator notation of Han [[76]40], [MATH: CIS1S nC=TV1n+1THT, :MATH] where V = {S[1], ⋯, S[n], C}, T represents all possible subsets of V, and H(T) is the joint entropy of T, which can be written as [MATH: HT=tTptlogpt, :MATH] and p(t) is the probability mass function. It is seen that co-information is a parsimonious, multivariate measure quantifying interactions that cannot be obtained without observing all variables at the same time [[77]24], and it seems promising for detecting n-order epistatic interactions. However, it also has two confusing properties retarded its wider adoption as an association measure. The first is its value. In the bivariate case, co-information is equivalent to mutual information and its value is always positive. But in the multivariate case, its value can be positive or negative, the interpretation of which is generally intuitive [[78]26]: a positive value is an evidence of interactions among variables; a negative value indicates the presence of redundancy; and a value of zero denotes that variables are independent or, more likely, interact with a mixture of synergy and redundancy. Almost all existing applications of co-information depend upon this intuitive explanation [[79]21, [80]24, [81]28, [82]29, [83]33, [84]38], and it is also the basis of our association measures. The second is its sensitivity to the SNP combination order. This property leads to difficulty in ranking SNP combinations of different orders. As yet, it still lacks the widely accepted normalization method. In this study, we make use of the order-fixed averages of co-information values to normalize them, defined as n-order interaction effect, [MATH: NCI(S1;;Sn;C)=CI(S1;;Sn;C)H(C)CI1< mo stretchy="true">¯CIn< mo stretchy="true">¯, :MATH] where H(C) is the entropy of the phenotype, [MATH: CIn< mo stretchy="true">¯ :MATH] and [MATH: CI1< mo accent="false">¯ :MATH] are respective averages of all considered n-order and 1-order co-information values. The first part of the formula provides the percentage of explaining the phenotype by giving the knowledge of n SNPs, and the second part is a coefficient that balances the contributions of SNP combinations of different orders to the phenotype. However, NCI only measures contribution of a SNP combination itself, not containing contributions of its subsets. In fact, the effect of an n-order SNP combination to the phenotype consists of main effects of all involved SNPs, as well as interaction effects of itself and its all subsets. To quantify the total contribution of a SNP combination to the phenotype, another co-information based association measure is presented, defined as the summation of all involved contributions, including its contribution, and contributions of its subsets whose NCI values reach the user-specified thresholds. The formula can be written as [MATH: CCIS1S nC=ZCSZS1S< /mi>nNCIZC, :MATH] where Z^′ represents all SNPs in the set Z, C represents the phenotype, and CS is a set of SNP combinations that their NCI values pass the user-specified thresholds. Search strategies CINOEDV supports two types of search strategies, with NCI as its association measure, to simultaneously detect epistatic interactions of their orders from 1 to n (n ≥ 2). One is an exhaustive search for lower order epistatic interactions and/or small scale data sets. With genome wide SNPs from thousands of individuals, it is difficult to search high order epistatic interactions exhaustively because of their heavy computational burden. CINOEDV provides a PSO based search for higher order epistatic interactions and/or large scale data sets. The PSO is a popular member of swarm intelligence algorithms inspired by the collective behaviors of organisms, like birds (viewed as particles), which can jointly perform many complex tasks though each individual is very limited in its capability [[85]41]. In PSO, the position of a particle represents a possible solution which is adjusted according to its velocity, and estimated by a fitness function at each generation. A higher fitness value implies a better position. The velocity of a particle is updated according to three factors: its previous velocity, its individual experience, and the common knowledge of the swarm. The individual experience of a particle is the best position that it has travelled. The common knowledge of the swarm is the best one among individual experiences of all particles. This feedback strategy leads the swarm gradually converge to an optimal solution [[86]42]. In our PSO based search, NCI is applied as its fitness function. That is to say, a higher NCI value indicates a stronger association between the SNP combination and the phenotype. Compared with the PSO, our PSO based search has its own highlights: detecting multiple epistatic interactions with different orders at the same time, dynamic inertia weight, and opposition based learning. Suppose [MATH: Positiongq=Sq1< mi>gSqkgSq< /mi>Kqg :MATH] is the position of the q[th] particle at iteration g, where q ∊ {1, ⋯, Q}, g ∊ {1, ⋯, G}, k ∊ {1, ⋯, K[q]}, S[qk]^g is the selected k[th] SNP of the q[th] particle at iteration g, Q is the number of particles, G is the number of iterations, and K[q] is the considered order of epistatic interactions of the q[th] particle, which is randomly specified within [1, n] at the initialization stage. The velocity of the q[th] particle at iteration g is denoted as [MATH: Velocitygq=vq1< mi>gvqkgvq< /mi>Kqg :MATH] , where v[qk]^g is the velocity of S[qk]^g. The individual experience of the q[th] particle is written as [MATH: Pbestgq=PSq1< /mrow>g,,PSqkg ,,PSq Kqg :MATH] . The common knowledge of the swarm is redefined as the best ones among individual experiences of particles with the same considered orders, i.e., Gbest[g]^K = (GS[1]^g, ⋯, GS[k]^g, ⋯, GS[K]^g), where K ∊ [1, n]. During the initialization stage, Position[1](q), Velocity[1](q), Pbest[1](q) and Gbest[1]^K are randomly initialized in their respective domains. The PSO based search detects epistatic interactions by continuously updating velocity and position of each particle at all iterations. The velocity of S[qk]^g is updated according to the following two equations, [MATH: v˜qkg+1=Wqkgvqkg +c1r1< /mn>PSqk< /mrow>gSq kg+< mi>c2r2GSkgSqkg< /mi>, :MATH] [MATH: vqkg+1=v˜qkg+1v˜qkg+11N,N−< /mo>1rand1N,N−< /mo>1v˜qkg+11N,N−< /mo>1,< /mtext> :MATH] where acceleration factors c[1] and c[2] control how far a particle moves in a single iteration, r[1] and r[2] are random values in (0, 1), GS[k]^g is the k[th] SNP of [MATH: GbestgK< /mi>q :MATH] (K = K[q]) at iteration g, W[qk]^g is the inertia weight regulating the impact of the previous velocity of a particle on its current velocity. For the inertia weight, a large weight facilitates the global exploration and thus enables the method to execute a search over various regions, while a small weight facilitates the local exploitation, which helps to search a promising region. In order to balance the global exploration and the local exploitation, a dynamical inertia weight is introduced, defined as [MATH: Wqkg=maxcountg< /mfenced>countgPSqk< /mrow>gmaxcountg< /mfenced>mincountg< /mfenced>, :MATH] where count[g] = (ct[1]^g, ⋯, ct[m]^g, ⋯, ct[N]^g), and ct[m]^g is a counter that counts the number of SNP m presented in Pbest from iteration 1 to iteration g. This strategy allows particles to cover a wider search space while the considered SNP is likely to be a random one, and to converge on a promising region of the search space while capturing a highly suspected SNP. Based on v[qk]^g+1, the position of S[qk]^g is updated to S[qk]^g+1 using the following two equations, [MATH: S˜qkg+1=Sqkg+vqkg+1, :MATH] [MATH: Sqkg+1=intS˜qkg+1S˜qkg+11Nintrand1NS˜qkg+11N, :MATH] where rand( ⋅ ) is the random function and int(⋅) is the rounding function. Random functions used for updating both v[qk]^g+1 and S[qk]^g+1 help to increase the diversity of the search. Another highlight introduced to the PSO based search is the opposition based learning, basic principle of which is the consideration of a solution and its corresponding opposite solution simultaneously to approximate the global optima [[87]43]. In our PSO based search, if the solution is Position[g](q), its corresponding opposite solution is defined as [MATH: Positiong'q=1+NPositiongq, :MATH] which not only expands the search space and enhances the global explorative ability, but also accelerates the convergence and avoids premature convergence. By comparing NCI values of Position[g]^’(q), Position[g](q) and Pbest[g](q), the individual experience of the q[th] particle at iteration g + 1, i.e., Pbest[g+1](q), is updated to the best one among them. Similarly, whether the common knowledge of the swarm at iteration g + 1, e.g., Gbest[g+1]^K, is updated or maintained as Gbest[g]^K depends on individual experiences of particles with the same order K. Specifically, Gbest[g+1]^K is updated to Pbest[g+1](q) while NCI value of Pbest[g+1](q) is the highest one among those of individual experiences of particles with the order K, and is also higher than that of Gbest[g]^K. When completing the iteration process, the PSO based search reports the sorted Pbest[G] according to their descending NCI values as its detected epistatic interactions. Epistasis hypergraph and deep analyses In visualizing stage, lists of all epistatic interactions identified in detecting stage, or similar lists generated by other methods, are exported to construct an undirected epistasis hypergraph for refining the interpretation of genetic basis of disease susceptibility and disease etiology, by capturing and visualizing broader epistasis landscape. The hypergraph is composed of weighted vertices and unweighted edges. For weighted vertices, two types of them are presented, that is, real vertices and virtual vertices. A real vertex represents a SNP and its weight is the NCI value between the SNP and the phenotype, corresponding to the main effect of the SNP to the phenotype. A virtual vertex denotes the n-order interaction effect of the combination of linking SNPs to the phenotype, and its weight is the NCI value between the SNP combination and the phenotype. In the hypergraph, each red circle is a real vertex in which SNP name or index is labeled, each non-red circle is a virtual vertex. Sizes of vertices are respectively in proportion to their weights. For unweighted edges, each of them links a SNP and an effect of this SNP to the phenotype. From the hypergraph, effects of different orders, especially high orders, as well as topology structures of SNPs, can be intuitively visualized and compared. For deep analyses of the constructed hypergraph, several useful tools are also provided in CINOEDV. For example, epistatic interactions can be visually displayed according to their descending effects, either NCI values or CCI values; penetrance of an epistatic interaction can be estimated and visualized; degree of a real vertex in the hypergraph and connectivity of the hypergraph can be further analyzed. More details about these tools are available in its user manual. With the help of these tools, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. Results and Discussion Experiments on simulation data Detection power analysis for pairwise epistatic interactions Six commonly used models of epistatic interactions are simulated for the study. Details of these models are given in Fig. [88]1. For each model, 200 data sets are generated by the simulator epiSIM [[89]44], which describes the simulation steps of SNP data sets as well as true epistatic interactions of SNPs in detail, and has been used in several references [[90]26, [91]45, [92]46]. Among them, each data set contains