Abstract

Background

   Community detection algorithms are fundamental tools to uncover
   important features in networks. There are several studies focused on
   social networks but only a few deal with biological networks. Directly
   or indirectly, most of the methods maximize modularity, a measure of
   the density of links within communities as compared to links between
   communities.

Results

   Here we analyze six different community detection algorithms, namely,
   Combo, Conclude, Fast Greedy, Leading Eigen, Louvain and Spinglass, on
   two important biological networks to find their communities and
   evaluate the results in terms of topological and functional features
   through Kyoto Encyclopedia of Genes and Genomes pathway and Gene
   Ontology term enrichment analysis. At a high level, the main assessment
   criteria are 1) appropriate community size (neither too small nor too
   large), 2) representation within the community of only one or two broad
   biological functions, 3) most genes from the network belonging to a
   pathway should also belong to only one or two communities, and 4)
   performance speed. The first network in this study is a network of
   Protein-Protein Interactions (PPI) in Saccharomyces cerevisiae (Yeast)
   with 6532 nodes and 229,696 edges and the second is a network of PPI in
   Homo sapiens (Human) with 20,644 nodes and 241,008 edges. All six
   methods perform well, i.e., find reasonably sized and biologically
   interpretable communities, for the Yeast PPI network but the Conclude
   method does not find reasonably sized communities for the Human PPI
   network. Louvain method maximizes modularity by using an agglomerative
   approach, and is the fastest method for community detection. For the
   Yeast PPI network, the results of Spinglass method are most similar to
   the results of Louvain method with regard to the size of communities
   and core pathways they identify, whereas for the Human PPI network,
   Combo and Spinglass methods yield the most similar results, with
   Louvain being the next closest.

Conclusions

   For Yeast and Human PPI networks, Louvain method is likely the best
   method to find communities in terms of detecting known core pathways in
   a reasonable time.

Electronic supplementary material

   The online version of this article (10.1186/s12859-019-2746-0) contains
   supplementary material, which is available to authorized users.

   Keywords: Biological networks, Community detection, Modularity,
   Biological function, Pathways

Background

   The use of networks to study complex interacting systems has been
   applied to many domains during the last two decades, including
   sociology, physics, computer science and biology. An important task in
   the analysis of networks lies in the identification of communities or
   modules whose membership share one or more common features of the
   system. The problem that community detection attempts to solve is the
   identification of groups of nodes with more and/or better interactions
   amongst its members than between its members and the remainder of the
   network [[29]1, [30]2]. For example, in social networks, a community
   may correspond to groups of friends who attend the same school or live
   in the same neighborhood; while in a biological network, communities
   may represent functional modules of interacting proteins.

   Edges in a biological network may represent various types of direct
   interactions and indirect effects. Examples of direct interactions
   include protein-protein interactions as part of signaling pathways or
   as part of protein complexes and substrate-enzyme interactions.
   Indirect effects may include transport processes and regulatory
   effects, which, in most cases, can be substituted with a subnetwork of
   several direct interactions when modeled at a finer granularity.
   Examples of the latter are cholesterol and ion transport across the
   plasma membrane and protein-DNA interactions in gene-regulatory
   networks. Thus, in the context of a cell or tissue, subnetworks or
   communities may correspond to various cellular processes, pathways and
   functions, in which its components (nodes) exhibit a higher-degree of
   interaction as compared to those from outside the pathway.

   Majority of the methods for community detection in networks are based
   on maximization of modularity. While the modularity metric Q, of a
   network, is defined in the Methods section, intuitively, given a
   network, if it can be partitioned in such a way that only a few
   connections exist between the nodes of different partitions and most
   connections are among the nodes within the partitions, then the
   modularity will be high. It is interesting to note that the modularity
   of a sparse network of fully connected subnetworks is higher than that
   of a fully connected network, which is zero. Any partition of a fully
   connected network results in Q < 0. Brandes et al. have carried out
   extensive theoretical analysis of properties of modularity and
   complexity of its maximization [[31]3].

   One of the most important objectives of any large-scale omics study is
   to identify mechanisms for specific functions and phenotypes in a
   chosen context. Biological networks derived from genome-scale
   experimental data and/or legacy knowledge are generally large and
   complex with thousands of nodes and many thousands of connections.
   Associating meaningful biological functions and interpretations to such
   networks is impossible. However, these large networks can be broken
   down into smaller (sub) networks (also called as modules or
   communities) which are more amenable to biological interpretation. Such
   communities are expected to represent one or a few biological functions
   and they may facilitate discovery of mechanisms relating the causes or
   perturbations to the observed phenotypes. Thus, community detection can
   provide valuable biological insights.

   Several methods have been developed to find communities in networks
   using tools and techniques from different disciplines such as applied
   mathematics or statistical physics [[32]4]. All these methods try to
   identify meaningful communities, while keeping the computational
   complexity of the underlying algorithm low [[33]5]. Although these
   methods have proven to be successful in some cases, there is no
   guarantee that the resulting communities provide the best functional
   description of the system. Hence, selecting a suitable method to detect
   communities in a network is challenging. While there have been some
   studies comparing different methods for community detection [[34]5],
   their focus has been on Lancichinetti, Fortunato, Radicchi (LFR)
   benchmark networks (artificial networks that have heterogeneity in the
   distributions of degree of nodes and the size of communities) [[35]6];
   comparisons with respect to biological networks are lacking.

   Classical community detection algorithms initially divide networks into
   communities according to some network features such as edge
   betweenness. One of the most popular and prominent algorithms that uses
   edge betweenness is the Girvan-Newman algorithm [[36]1, [37]7]. In this
   method edges are progressively removed from the original network till
   the modularity reaches its maximum value, making it an optimization
   problem. The connected nodes of the remaining network are the
   communities. The Girvan-Newman algorithm has been successfully applied
   to a variety of networks, including networks of email messages.
   However, its computational complexity, O(m^2n) for a network with n
   nodes and m edges, practically restricts its use to networks of at most
   a few thousand nodes. There are other optimization-based algorithms
   with different objective functions that provide different approaches to
   solve the community detection problem. For example, Leading Eigen
   [[38]8] algorithm also tries to maximize modularity but the modularity
   is expressed in the form of the eigenvalues and eigenvectors of a
   matrix called the modularity matrix. Spinglass method minimizes the
   Hamiltonian of the network [[39]9].

   Since the early 2000s, several methods have been developed that divide
   networks into communities based on the modularity [[40]10–[41]15]. The
   modularity criterion was revisited in 2005 when Duch and Arenas
   proposed a divisive algorithm [[42]16] that optimizes the modularity
   using a heuristic search based on the Extremal Optimization (EO)
   algorithm proposed by Boettcher and Percus [[43]17, [44]18]. Pizzuti
   has suggested an algorithm named GA-net that uses a special assessment
   function described as the community score in addition to the modularity
   function [[45]19]. There are also other approaches to the community
   detection problem in which the use of multiple objectives (or
   assessment criteria) is preferred over the use of a single objective
   for complex networks. Since the objectives are usually directly related
   to the network properties, one advantage of using multi-objective
   optimization is that it balances among the multiple (important)
   properties of the network. The benefits of using multi-objective
   approach have been explained by Shi et al. [[46]20].

   In this manuscript, we briefly review eight algorithms for finding
   communities in biological networks such as Protein-Protein Interaction
   (PPI) networks (discussed in the Methods section). In such networks,
   each node represents a protein (or gene) and each edge represents an
   interaction between two proteins. In particular, we will apply six
   algorithms to the Yeast PPI network with 6532 nodes and 229,696 edges
   and the Human PPI network with 20,644 nodes and 241,008 edges. Using
   several topological metrics, we assess which methods provide similar
   (or dissimilar) results. We evaluate the biological interpretation of
   the communities identified and compare the results in terms of their
   functional features. At a high level, the main criteria for assessment
   of the methods are 1) appropriate community size (neither too small nor
   too large), 2) representation within the community of only one or two
   broad biological functions, 3) most genes from the network belonging to
   a pathway should also belong to only one or two communities, and 4)
   performance speed.

   This paper is organized as follows: in the next section we will present
   the results of applying six methods on the Yeast and Human PPI networks
   and compare the communities based on their topological and functional
   features. In the last part of this section, we will describe an
   orthology analysis between the communities detected for the Yeast PPI
   network and the communities detected for the Human PPI network. In the
   following section, we will present discussion on the results providing
   insights into the algorithmic similarities and robustness of some of
   the methods. In the section after that, we will provide the conclusion
   of our paper. In the Methods section, we will describe eight different
   methods for finding communities in networks. We will also introduce
   three metrics to compare the communities identified by the algorithms.

Results

   Six community detection methods, namely, Combo, Conclude, Fast Greedy,
   Leading Eigen, Louvain and Spinglass, have been applied to the Yeast
   PPI network with 6532 nodes and 229,696 edges and the Human PPI network
   with 20,644 nodes and 241,008 edges. A detailed description of the
   methods is included in the Methods section. We used the BioGRID
   database [[47]21, [48]22] for the PPI networks for Yeast and Human.
   Since our focus in this paper is on undirected and unweighted networks,
   we removed repeated edges and self-loops from our data set.

   In the first part of this section, we will present the results for the
   Yeast PPI network. In the second part, the results for the Human PPI
   network will be presented. In the third part, an orthology comparison
   will be provided between the Yeast and Human PPI networks.

Yeast PPI network

   Among the methods tested to find communities of the Yeast PPI network,
   Combo, Conclude, Fast Greedy, Leading Eigen, Louvain and Spinglass give
   good partitioning results, i.e., the size of communities detected are
   not too small or too large compared to the size of the original
   network. Since the Yeast PPI network has 6532 nodes, Girvan-Newman
   algorithm is not an appropriate method to detect communities. It takes
   44 min (on a PC with 4 GB RAM with 4 2.4 GHz processors) for Rattus PPI
   network which has 3379 nodes and 4580 edges. Its computational
   complexity is proportional to m^2n (where n is the number of nodes and
   m is the number of edges), so, it will take ~ 148 days to find
   communities in the Yeast PPI network (using the computational resource
   mentioned above). Infomap, is also not a good method based on the size
   of communities it detects; the largest community has 6195 nodes and the
   smallest one has just 2 nodes. Since very small communities (e.g.,
   those with less than 100 nodes) are not expected to yield significant
   biological insights, we will not consider them in our analysis. We note
   that there may be some exceptions.

   In the next subsection, first we will compare the methods from a
   topological perspective of the communities identified. Then we will
   provide a functional comparison. To begin with, the results for all
   these methods are described in Table [49]1 in terms of the size of the
   communities detected for the Yeast PPI network.

Table 1.

   Number of nodes and edges for communities detected using different
   methods for the Yeast PPI network (6532 nodes and 229,696 edges). The
   number in parenthesis after the name of each method represents the
   number of communities detected by that method. For example, Combo finds
   8 communities. Modularity scores are also provided for different
   methods. For each method, we only consider the communities with 100 or
   more nodes and list up to 10 communities
   Combo (8) Q = 0.2654 # of nodes 2231 1514 1337 1284
   # of edges 25,137 23,690 38,523 30,585
   Conclude (66) Q = 0.2468 # of nodes 788 744 602 468 423 359 288 271 252
   199
   # of edges 14,585 10,506 3272 988 5826 4123 1404 3486 940 1703
   F. Greedy (10) Q = 0.2112 # of nodes 2608 2410 1466
   # of edges 61,665 72,998 7180
   L. Eigen (4) Q = 0.1686 # of nodes 2661 1910 984 977
   # of edges 75,812 28,664 7373 7203
   Louvain (9) Q = 0.2643 # of nodes 1538 1472 1190 1151 993 131
   # of edges 16,015 23,394 13,247 31,202 22,553 676
   Spinglass (9) Q = 0.2681 # of nodes 1607 1473 1194 1148 1076
   # of edges 16,616 23,876 12,282 32,641 23,854
   [50]Open in a new tab

Comparison based on topological features of communities

   The following table (Table [51]1) represents the results for applying
   six methods on the Yeast PPI network.

   Using three different metrics, namely, Rand Index (RI), Adjusted Rand
   Index (ARI), and Normalized Mutual Information (NMI) (described in the
   Methods section), we are able to compare different pair of methods.
   Table [52]2 represents the results of comparing six methods (Combo,
   Conclude, Fast Greedy, Leading Eigen, Louvain and Spinglass) with
   respect to three topological metrics (RI, ARI and NMI).

Table 2.

   Comparison of different methods with respect to three topological
   metrics, namely, Rand Index (RI), Adjusted Rand Index (ARI) and
   Normalized Mutual Information (NMI) for the Yeast PPI network. When a
   method is compared with itself, RI, ARI and NMI are 1 (diagonal
   elements). Larger (smaller) the value of RI, ARI and NMI, the more
   (less) similar are the two methods being compared. For example, Louvain
   and Spinglass are most similar to each other
                 Combo Conclude F. Greedy L. Eigen Louvain Spinglass
   RI  Combo       1    0.7608   0.7157   0.6788   0.8319  0.8409
   ARI Combo       1    0.1466   0.3125   0.1942   0.5163  0.5479
   NMI Combo       1    0.2905   0.4024   0.2447   0.5413  0.5723
   RI  Conclude           1      0.6815   0.7061   0.8083  0.8012
   ARI Conclude           1      0.0818   0.0825   0.1659  0.1637
   NMI Conclude           1      0.1956   0.1472   0.3016  0.2924
   RI  F. Greedy                    1     0.6334   0.7098  0.7129
   ARI F. Greedy                    1     0.146    0.2629  0.2764
   NMI F. Greedy                    1     0.1918   0.3545  0.3652
   RI  L. Eigen                           1        0.6952  0.6936
   ARI L. Eigen                           1        0.188   0.1914
   NMI L. Eigen                           1        0.2231  0.2285
   RI  Louvain                                     1       0.9021
   ARI Louvain                                     1       0.6922
   NMI Louvain                                     1       0.6644
   RI  Spinglass                                           1
   ARI Spinglass                                           1
   NMI Spinglass                                           1
   [53]Open in a new tab

   Based on the results of Table [54]2, Louvain and Spinglass are most
   similar to each other amongst all pairs of comparisons. To maintain
   consistency in finding dissimilar methods, we selected a method which
   is dissimilar to Louvain, e.g., Conclude or Leading Eigen. Since
   Conclude finds 66 communities with sizes (number of nodes) ranging from
   3 to 788, we compare Louvain with Leading Eigen here. We present the
   results from comparing Louvain and Conclude in the Additional file
   [55]1.

   Table [56]3 provides Jaccard index (as a percentage) between
   communities identified by Louvain and Spinglass. We used Intersect
   function in R to find common genes between two communities and then
   divided the number of common genes by the total number of unique genes
   between the two communities (union function in R) to get the Jaccard
   index. Table [57]4 uses the same approach to find Jaccard index for
   communities detected by dissimilar methods, in particular, Louvain and
   Leading Eigen. The rest of Jaccard index matrices amongst all pairs of
   communities for all methods can be found in the Additional file [58]1:
   Table S1.

Table 3.

   Jaccard index (as a percentage) between the communities identified by
   two similar methods, namely, Louvain and Spinglass, for the Yeast PPI
   network. L1 to L5 refer to the communities detected by Louvain method
   and sorted by their size. Similarly, S1 to S5 refer to the communities
   detected by Spinglass method. The numbers in parenthesis represent the
   number of genes in each community. Community pairs with maximum overlap
   (e.g., L1 vs. S1) are indicated in bold text

   graphic file with name 12859_2019_2746_Tab3_HTML.jpg
   [59]Open in a new tab

Table 4.

   Jaccard index (as a percentage) between the communities identified by
   two dissimilar methods, namely, Louvain and Leading Eigen for the Yeast
   PPI network. (L1 to L5: communities detected by Louvain; LE1 to LE4:
   communities detected by Leading Eigen). The numbers in parenthesis
   represent the number of genes in each community. Community pairs with
   maximum overlap (e.g., L1 vs. LE4) are indicated in bold text

   graphic file with name 12859_2019_2746_Tab4_HTML.jpg
   [60]Open in a new tab

Comparison based on biological/functional features of communities

   As described in the previous subsection, Louvain and Spinglass are most
   similar to each other and Louvain and Leading Eigen are most
   dissimilar. In order to know which communities of similar and
   dissimilar methods have to be compared to each other, we analyzed
   Tables [61]3 and [62]4 (Jaccard index) for all pairs of communities
   between similar and dissimilar methods. After selecting pairs of
   communities with highest value of Jaccard index for each column, we
   used Database for Annotation, Visualization and Integrated Discovery
   (DAVID) version 6.8 [[63]23, [64]24] to perform Kyoto Encyclopedia of
   Genes and Genomes (KEGG) [[65]25] pathway and Gene Ontology (GO) term
   (GOTERM_BP_3) enrichment analysis for each community. In the following
   Tables (Table [66]5 through Table [67]7), we have considered pathways
   with more than 10 genes and with p-values less than or equal to 0.01.
   The number in parenthesis (e.g., after L1 or S1) is the number of genes
   that DAVID could annotate for that specific community for the Yeast PPI
   network. For example, the first community of Louvain (L1) has 1538
   genes (Table [68]1) and of those, DAVID is able to annotate 1481 genes.
   In Tables [69]5, [70]6, [71]7, the first column lists the broad
   category of pathways (M: Metabolism, CP: Cellular Processes, GIP:
   Genetic Information Processing, HD: Human Diseases, and OS: Organismal
   Systems). The second column lists the different pathways enriched.
   Columns 3 and 7 (Count) represent number of genes enriched in the
   pathways, columns 4 and 8 (p-value) represent p-values for those
   pathways in the communities compared, and columns 5 and 9 (FE)
   represent Fold Enrichment for the pathways. FE is defined as
   (s/b)/(k/N) where b is the total number of genes in a chosen pathway;
   s, the number of genes from the community in this pathway; N, the total
   number of genes for the species; and k, the number of genes in the
   community; all the four numbers are based on intersection/overlap with
   the respective DAVID database (e.g., KEGG). Essentially, FE represents
   the relative increase or decrease of the fraction of genes from the set
   of interest belonging to a pathway as compared to the genes from a
   background set (generally covering the whole-genome) belonging to the
   same pathway. The values in columns 5 and 9 are shaded light to dark
   with increasing FE. Column 6 (common) is the number of genes common to
   both communities for the different pathways.

Table 5.

   A Comparison of KEGG pathway enrichment results between the first
   community of Louvain (L1) with 1538 genes and the first community of
   Spinglass (S1) with 1607 genes for the Yeast PPI network. The numbers
   inside parenthesis after L1 and S1 represent the number of genes that
   DAVID could annotate, which is generally less than the number of genes
   in those communities. The first column lists the broad category of
   pathways (M: Metabolism, CP: Cellular Processes). Many pathways
   enriched in L1 and S1 have good overlap (a large number of genes are
   common). FE: Fold Enrichment. False Discovery Rate (FDR) values for all
   pathways and both communities are approximately 1.10E+3 times p-value
   (the factor 1.10E+3 is related to the size of the community)

   [72]graphic file with name 12859_2019_2746_Tab5_HTML.jpg
   [73]Open in a new tab

Table 7.

   A comparison of KEGG pathway enrichment results between the first
   community of Louvain with 1538 genes and the fourth community of
   Leading Eigen (LE4) with 977 genes for the Yeast PPI network. The
   numbers inside parenthesis after L1 and LE4 represent the number of
   genes that DAVID could annotate, which is generally less than the
   number of genes in those communities. The first column lists the broad
   category of pathways (M: Metabolism, and CP: Cellular Processes), FE:
   Fold Enrichment, False Discovery Rate (FDR) values for all pathways and
   both communities are approximately 1.10E+3 times p-value

   [74]graphic file with name 12859_2019_2746_Tab7_HTML.jpg
   [75]Open in a new tab

Table 6.

   A comparison of KEGG pathway enrichment results between the second
   community of Louvain (L2) with 1472 genes and the second community of
   Spinglass (S2) with 1473 genes for the Yeast PPI network. The numbers
   inside parenthesis after L2 and S2 represent the number of genes that
   DAVID could annotate, which is generally less than the number of genes
   in those communities. The first column lists the broad category of
   pathways (GIP: Genetic Information Processing). Many pathways enriched
   in L2 and S2 have good overlap (a large number of genes are common).
   FE: Fold Enrichment. False Discovery Rate (FDR) values for all pathways
   and both communities are approximately 1.05E+3 times p-value

   [76]graphic file with name 12859_2019_2746_Tab6_HTML.jpg
   [77]Open in a new tab

Comparing similar methods

   As the first column of Table [78]3 shows, the first community of
   Louvain (L1) and the first community of Spinglass (S1) have the maximum
   overlap and the results of comparing KEGG pathway enrichment analysis
   between L1 and S1 are presented in Table [79]5. Additional file [80]1:
   Table S2 shows the results of comparing GO term enrichment analysis
   between these two communities. L2 and S2 are 71% similar to each other
   based on Table [81]3 and they are compared in Table [82]6 for KEGG
   pathway enrichment analysis and in Additional file [83]1: Table S3 for
   GO term enrichment analysis. The rest of the comparison Tables (KEGG
   pathway enrichments analysis for L3 vs. S3, L4 vs. S4, and L5 vs. S5)
   are in the Additional file [84]1: Tables S4-S6. Since DAVID did not
   find any pathways for small communities, such as L6 which has 131
   nodes, those communities are not considered in the comparison Tables.

   Based on Table [85]3, L1 and S1 are 76% similar to each other in terms
   of Jaccard index. In Table [86]5, KEGG pathway enrichment results of
   these two communities reveal that majority of these genes are related
   to various metabolic pathways such as carbohydrate metabolism, energy
   metabolism, amino acid metabolism and metabolism of cofactors and
   vitamins. The top four pathways represent broad metabolism pathways.
   There are 13 pathways categorized as amino acid metabolism such as
   cysteine and methionine metabolism, or glycine, serine and threonine
   metabolism. Among pathways that are categorized as energy metabolism,
   oxidative phosphorylation is the one with the lowest p-value.

   In terms of enzyme commission annotation, there are 1738 enzyme-coding
   genes in the entire network. L1 and S1 have 571 and 605 enzyme-coding
   genes, respectively. Of these, 529 enzyme-coding genes are common
   between the two communities, which shows a significant overlap. There
   are a few enzyme-coding genes which are present in L1 but not in S1
   such as aminoacyl-tRNA hydrolase (PTH1) or glutamate 5-kinase (PRO1).
   Similarly, for genes that are present in S1 but not in L1, sulfuric
   ester hydrolase (BDS1) is an example. Since both Louvain and Spinglass
   find 9 non-overlapping communities, all enzyme-coding genes are part of
   one of the communities.

   Table [87]6 shows KEGG pathway enrichment results for the communities
   L2 and S2. All pathways are related to genetic information processing
   with approximately similar genes enriched in the two methods. The first
   pathway with the lowest p-value is ribosome which is a complex molecule
   made of ribosomal RNA molecules and proteins. There are 151 genes
   enriched in L2 and 141 genes enriched in S2 for this pathway. Of these,
   139 genes are common (a 92% overlap). Similar trend is observed for
   other pathways as well, e.g., there is a 95% overlap between L2 and S2
   for Spliceosome and 95% overlap for RNA transport.

   The GO term enrichment results shown in Additional file [88]1: Tables
   S2 and S3 also verify the similarity between L1 and S1, and L2 and S2,
   respectively. Counting all genes for all pathways in Additional file
   [89]1: Table S2 yields 1062 unique genes for L1 and 1103 unique genes
   for S1 and of these, 957 genes are common between the two communities,
   which is an 87% overlap. This similarity value is 84% between L2 and S2
   (Additional file [90]1: Table S3).

   Additional file [91]1: Table S4 provides a comparison between the
   communities L3 and S3. The pathways enriched can be classified into
   four different groups (metabolic processes, environmental information
   processing, cellular processes and human diseases) as opposed to just
   one or two. Still, the overlap between L3 and S3 communities for each
   of the pathways is more than 80%.

   L4 and S4 are 81% similar to each other based on Table [92]3 and the
   results of their comparison are shown in Additional file [93]1: Table
   S5. Most pathways for these two communities are related to genetic
   information processing category. There are two pathways related to
   cellular processes and two pathways related to metabolic processes.
   Based on KEGG pathway enrichment results, there is a good overlap
   between genes enriched in different pathways for these two communities.
   For example, 71 genes of L4 are enriched in cell cycle pathway and 77
   genes of S4 are also enriched in this pathway. Among genes enriched in
   cell cycle pathway, 70 genes are common between L4 and S4, giving a 91%
   overlap.

   Additional file [94]1: Table S6 compares L5 and S5. Based on
   Table [95]3, they are 86% similar to each other. KEGG pathway
   enrichment results of these two communities show that most pathways are
   related to metabolic processes and there are also other pathways
   related to other categories such as endocytosis, which is in the
   cellular processes category. The results of KEGG pathway also verify
   the similarity of Table [96]3. As seen from the enriched pathways,
   almost all of them have the same genes enriched in both communities.
   For example, there are 29 genes for L5 enriched in N-Glycan
   biosynthesis and the same genes are found in S5 in the same pathway.

Comparing dissimilar methods

   In this subsection we will compare the methods that are most dissimilar
   to each other, namely Louvain and Leading Eigen. As Table [97]4 shows,
   the first community of Louvain has the maximum overlap with the fourth
   community of Leading Eigen (LE4). The results of this comparison based
   on KEGG and GO term enrichment analysis are shown in Tables [98]7 and
   Additional file [99]1: Table S7, respectively. The rest of the
   comparisons can be found in the Additional file [100]1: Table S8 for L2
   vs. LE1, Additional file [101]1: Table S9 for L4 vs. LE2 and Additional
   file 1: Table S10 for L5 vs. LE3).

   The first pathway of Table [102]7 with the lowest p-value is metabolic
   pathways with 346 genes enriched in L1 and 253 genes enriched in LE4.
   Of these genes, 224 genes are common between the two communities which
   is a 65% overlap. In contrast, the first pathway of Table [103]5 shows
   that 337 genes are common between L1 and S1, which is a 94% overlap.
   There are some pathways in Table [104]7 that are blank for LE4 such as
   biosynthesis of amino acids. For these pathways, although there are
   some genes enriched in L1, there are no genes enriched in LE4 or if
   there are any, the p-value for the pathway is higher than the defined
   cut-off of 0.01. Counting all genes for all pathways yields 406 unique
   genes for L1 and 273 unique genes for LE4 and of these, 239 genes are
   common between the two communities which is a 59% overlap. Based on GO
   term enrichment analysis shown in Additional file [105]1: Table S7,
   there are 474 genes common between L1 and LE4 out of 1062 unique genes
   for L1 and 662 unique genes for LE4, which is a 45% overlap. These
   relatively low best-overlaps also confirm that these two methods are
   dissimilar to each other.

   Based on Table [106]2, Louvain and Conclude methods are also dissimilar
   to each other. We compared the communities obtained from these two
   methods. As Additional file [107]1: Table S1 shows, the first community
   of Louvain (L1) has the maximum overlap with the third community of
   Conclude (CL3). The results of this comparison based on KEGG pathway
   enrichment analysis are shown in Additional file [108]1: Table S11. The
   metabolic pathways is the most enriched pathway with 346 genes enriched
   in L1, 230 genes enriched in CL3, and 173 genes common between the two
   communities, which is a 50% overlap. Counting all unique genes for all
   pathways yields 406 genes for L1 and 241 genes for CL3 and of these,
   181 genes are common between the two communities (which is a 45%
   overlap). This similarity value is close to what we calculated for
   Louvain vs. Leading Eigen, using KEGG pathway and GO term enrichment
   analysis (Table [109]7 and Additional file [110]1: Table S7). The rest
   of the comparisons can be found in Additional file [111]1: Table S12
   for L2 vs. CL2, Additional file 1: Table S13 for L4 vs. CL1, and
   Additional file 1: Table S14 for L5 vs. CL5. Overall, dissimilarity at
   the topological level translates into dissimilarity at the functional
   level as well.

Human PPI network

   Six methods, namely, Combo, Conclude, Fast Greedy, Leading Eigen,
   Louvain, and Spinglass, have been applied to the Human PPI network with
   20,644 nodes and 241,008 edges. Although all of them were able to find
   communities, we will not consider the results of Conclude because it
   finds 495 communities, many of which are very small communities with
   less than 50 nodes. For Combo and Spinglass, since they use a random
   number generator in the procedure of finding communities, we ran them
   10 times with 10 different seeds between 0 and 10,000 and used the
   results from the run with the largest modularity. Modularity scores and
   the number of communities detected in each run for Combo and Spinglass
   are summarized in Table [112]8 and Table [113]9, respectively.

Table 8.

   Modularity scores and number of communities detected by Combo for the
   Human PPI network. Each run uses a random seed between 0 and 10,000 in
   the procedure for finding communities
   Modularity                0.3735 0.3734 0.3734 0.3729 0.3723
   # of communities detected 11     13     11     15     11
   Modularity                0.3723 0.3718 0.3715 0.3711 0.3704
   # of communities detected 13     12     10     12     13
   [114]Open in a new tab

Table 9.

   Modularity scores and number of communities detected by Spinglass for
   the Human PPI network. Each run uses a random seed between 0 and 10,000
   in the procedure for finding communities
   Modularity                0.3729 0.3727 0.3725 0.3725 0.3724
   # of communities detected 21     22     22     24     21
   Modularity                0.3721 0.3716 0.3716 0.3711 0.3688
   # of communities detected 22     23     21     23     23
   [115]Open in a new tab

   After finding modularity scores and communities for 10 runs, we
   selected communities corresponding to the largest modularity score
   which is 0.3735 for Combo (11 communities) and 0.3729 (21 communities)
   for Spinglass.

   The results of comparing all methods excluding Conclude are presented
   in Table [116]10.

Table 10.

   Comparison of different methods with respect to three topological
   metrics, namely, RI, ARI and NMI for the Human PPI network (20,644
   nodes and 241,008 edges). When a method is compared with itself, RI,
   ARI and NMI are 1 (diagonal elements). Larger (smaller) the value of
   RI, ARI and NMI, the more (less) similar are the two methods being
   compared. For example, Combo and Spinglass are most similar to each
   other, Louvain being the next most similar to them. Overall, Combo,
   Louvain and Spinglass provide similar results
                 Combo F. Greedy L. Eigen Louvain Spinglass
   RI  Combo       1    0.7314   0.3606   0.8805  0.8948
   ARI Combo       1    0.1806   0.0315   0.416   0.4998
   NMI Combo       1    0.3025   0.0936   0.4601  0.5551
   RI  F. Greedy           1     0.444    0.7243  0.7258
   ARI F. Greedy           1     0.0624   0.1609  0.1739
   NMI F. Greedy           1     0.0787   0.2682  0.3063
   RI  L. Eigen                  1        0.3531  0.3649
   ARI L. Eigen                  1        0.0191  0.0326
   NMI L. Eigen                  1        0.0711  0.0951
   RI  Louvain                            1       0.8832
   ARI Louvain                            1       0.4479
   NMI Louvain                            1       0.4679
   RI  Spinglass                                  1
   ARI Spinglass                                  1
   NMI Spinglass                                  1
   [117]Open in a new tab

   From Table [118]10, we can see that Louvain and Spinglass are more
   similar to each other as compared to all other pairs of methods except
   Combo and Spinglass. Hence, we will compare Combo and Spinglass as well
   here. Since they are more similar to each other than Louvain and
   Spinglass, they will be compared first. The first community of Combo
   (C1) and the first community of Spinglass (S1) have been compared to
   each other using KEGG pathway enrichment analysis and the results for
   top 10 pathways are presented in Table [119]11. Table [120]12 presents
   the results of comparing top 10 pathways for the first community of
   Louvain (L1) and the first community of Spinglass (S1). Organization of
   these two tables is the same as that for Tables [121]5, [122]6 and
   [123]7 in the previous subsection. The complete versions of
   Tables [124]11 and [125]12 are in the Additional file [126]1: Tables
   S15 and S16, respectively. The results of comparing all pathways for
   the communities C1 and S1 and L1 and S1 (with p-values less than 0.01)
   are illustrated in Fig. [127]1 and Fig. [128]2, respectively. The pie
   charts in Figs. [129]1 and [130]2 show the broad functional categories.
   Essentially, pathways belonging to a broad category are selected and
   the genes of these pathways combined together and the number of unique
   genes is expressed as a percentage of total unique genes in all
   pathways with p-values less than 0.01. As an example, there are three
   different pathways belong to cellular processes in C1: lysosome,
   peroxisome and phagosome. There are 61 genes enriched in lysosome, 46
   genes enriched in peroxisome and 63 genes enriched in phagosome.
   Together, they have 159 unique genes, which is about 14% of the total
   unique genes for all pathways with p-value less than 0.01. We performed
   these calculations for all six broad categories of pathways for two
   community-pairs of Additional file [131]1: Tables S15 and S16 and the
   corresponding results are shown in Figs. [132]1 and [133]2,
   respectively.

Table 11.

   Top 10 pathways for a comparison of KEGG pathway enrichment results
   between C1 with 3252 genes and S1 with 3206 genes for the Human PPI
   network (20,644 nodes and 241,008 edges). The numbers inside
   parenthesis after C1 and S1 represent the number of genes that DAVID
   could annotate, which is generally less than the number of genes in
   those communities. The first column lists the broad category of
   pathways (M: Metabolism, HD: Human Diseases, CP: Cellular Processes,
   and GIP: Genetic Information Processing), FE: Fold Enrichment

   [134]graphic file with name 12859_2019_2746_Tab11_HTML.jpg
   [135]Open in a new tab

Table 12.

   Top 10 pathways for a comparison of KEGG pathway enrichment results
   between L1 with 3585 genes and S1 with 3206 genes for the Human PPI
   network. The numbers inside parenthesis after L1 and S1 represent the
   number of genes that DAVID could annotate, which is generally less than
   the number of genes in those communities. The first column lists the
   broad category of pathways (M: Metabolism, HD: Human Diseases, GIP:
   Genetic Information Processing, and CP: Cellular Processes), FE: Fold
   Enrichment

   [136]graphic file with name 12859_2019_2746_Tab12_HTML.jpg
   [137]Open in a new tab

Fig. 1.

   [138]Fig. 1
   [139]Open in a new tab

   Pie charts for KEGG pathway enrichment results of C1 with 3252 genes
   and S1 with 3206 genes for the Human PPI network. Left chart shows the
   results for C1 and right chart shows the results for S1

Fig. 2.

   [140]Fig. 2
   [141]Open in a new tab

   Pie charts for KEGG pathway enrichment results of L1 with 3585 genes
   and S1 with 3206 genes for the Human PPI network. Left chart shows the
   results for L1 and right chart shows the results for S1

   As seen in Table [142]11 (C1 vs. S1), there is a good overlap between
   enriched genes in C1 and S1 communities for different pathways. The
   first pathway is oxidative phosphorylation with the lowest p-value.
   This pathway has 102 genes enriched in C1 and 99 genes enriched in S1.
   Of these genes, there are 98 genes common between the two communities
   which is a 96% overlap. Counting all genes for all pathways yields 778
   unique genes for C1 and 756 unique genes for S1. Of these genes, there
   are 696 genes common between the two communities which is an 89%
   overlap. The results of GO term enrichment analysis between these two
   communities are presented in Additional file [143]1: Table S17, where a
   similarity of 91% is observed.

   As seen in Fig. [144]1, communities C1 and S1 represent six different
   broad categories of functions and they are similar to each other in
   terms of the percentage of enriched genes in each category.

   Next, we will compare Louvain and Spinglass. The results of comparing
   the top 10 pathways for L1 and S1 are summarized in Table [145]12
   (Additional file [146]1: Table S16 for the full list). Figure [147]2
   shows the broad functional categories for comparing all pathways with
   p-values less than 0.01 for L1 and S1. Comparison of Figs. [148]1 and
   [149]2 reveals that L1 and S1 are less similar as compared to C1 and
   S1. However, it is appropriate to say that Combo, Louvain and Spinglass
   broadly yield similar and reasonably sized communities.

Orthology comparison of communities from Yeast and Human PPI networks using
Louvain method

   In this sub-section, we will compare communities detected by Louvain
   for the Yeast PPI network and communities detected by the same method
   for the Human PPI network. Louvain could find 9 communities with sizes
   ranging from 4 to 1538 for the Yeast PPI network (named SC1 for the
   biggest and SC9 for the smallest community) and 14 communities with
   sizes ranging from 3 to 3585 for the Human PPI network. Using biomaRt
   package of R [[150]26], we were able to find orthologous genes between
   Yeast and Human. Since the sizes of communities (the number of genes in
   the community) detected for the Human PPI network are larger than the
   size of communities detected for the Yeast PPI network, we found
   orthologous genes of the communities detected for the Human PPI network
   in Yeast (denoted HS ➔ SC) and then used DAVID to perform KEGG pathway
   enrichment for those genes. KEGG pathway enrichment results for the HS
   ➔ SC genes were compared to that for the communities of the Yeast PPI
   network. Table [151]13 shows the Jaccard index (as a percentage)
   between different pairs of communities and guided us on which community
   pairs should be compared with each other. For example, SC2 should be
   compared with HS3 ➔ SC. The results of comparing SC2 and HS3 ➔ SC are
   presented in Table [152]14. Tables for other comparisons of this
   sub-section are in the supplementary section (SC4 vs. HS1 ➔ SC in
   Additional file [153]1: Table S18 and SC5 vs. HS2 ➔ SC in Additional
   file 1: Table S19).

Table 13.

   Jaccard index (as a percentage) between the communities detected by
   Louvain for the Yeast PPI network and orthologous genes of the
   communities detected by the same method for the Human PPI network in
   Yeast. Community pairs with maximum overlap of more than 10% (e.g., SC2
   vs. HS3 ➔ SC) are indicated in bold text

   graphic file with name 12859_2019_2746_Tab13_HTML.jpg
   [154]Open in a new tab

Table 14.

   A comparison of KEGG pathway enrichment results between the second
   community detected by Louvain for the Yeast PPI network (SC2) and HS3 ➔
   SC (orthologous genes of the third community of the Human PPI network
   in Yeast). The first column lists the broad category of pathways (GIP:
   Genetic Information Processing), FE: Fold Enrichment

   [155]graphic file with name 12859_2019_2746_Tab14_HTML.jpg
   [156]Open in a new tab

   As seen in Table [157]14, the most enriched pathway is the ribosome
   pathway with 151 genes enriched in SC2 and 112 genes enriched in HS3 ➔
   SC. Of these, 104 genes are common between the two communities, which
   is a 69% overlap. Counting all genes for all pathways yields 380 unique
   genes for SC2 and 233 unique genes for HS3 ➔ SC. Of these genes, there
   are 218 genes common between the two communities, which is a 57%
   overlap. Although this similarity level is not impressive by itself, we
   did not expect much overlap between the two communities since
   Table [158]13 represents only a 21% similarity between them.

Discussion

   As mentioned in the Results section, Louvain and Spinglass are most
   similar to each other for the Yeast PPI network (Table [159]2). Louvain
   tries to maximize the modularity (Q) whereas Spinglass tries to
   minimize the Hamiltonian (H). However, it has been shown that there is
   a relation between Q and H as
   [MATH: <mi>Q</mi><mo>=</mo><mo>−</mo><mfrac><mi
   mathvariant="script">H</mi><mrow><mn>2</mn><mi>M</mi></mrow></mfrac>
   :MATH]
   (Eq. [160]16 in the Methods section). Thus, minimizing H is equivalent
   to maximizing Q. Still, since they use different algorithms to optimize
   their objective functions, the results are not exactly the same. Combo
   also tries to maximize modularity but in a different way than that in
   Louvain, thus resulting in slightly different communities as compared
   to those obtained by the Louvain method.

   Table [161]2 suggests us that Louvain and Spinglass are most similar to
   each other while Louvain and Leading Eigen are most dissimilar for the
   Yeast PPI network. Fig. [162]3 illustrates the differences (as a
   percentage, i.e., 100*(#genes different between L1 and S1 (or L1 and
   LE4))/(max(L1,S1,LE4)) for each pathway) between the number of genes
   enriched in different pathways (with more than 10 genes and p-values
   less than 0.01) for L1 and S1 (black columns), and for L1 and LE4 (grey
   columns). As seen in Fig. [163]3, there is more difference between the
   number of genes enriched in L1 and LE4 compared to the difference
   between L1 and S1. This also verifies our results of topological
   comparison between L1 and S1, and L1 and LE4 (see also Table [164]2).

Fig. 3.

   [165]Fig. 3
   [166]Open in a new tab

   Comparing number of genes enriched in different pathways for the first
   community detected by Louvain (L1), the first community detected by
   Spinglass (S1) and the fourth community detected by Leading Eigen (LE4)
   for the Yeast PPI network

   KEGG pathway enrichment results for communities detected for the Yeast
   PPI network show that almost all pathways of each community belong to
   one broad function. For example, the first community of Louvain mostly
   includes pathways related to metabolic processes, the second community
   consists of pathways related to genetic information processing. On the
   other hand, the functions/pathways represented by communities detected
   for the Human PPI network are somewhat mixed and include several broad
   biological functions. Vis-a-vis the functional similarity of the
   methods, for the Human PPI network, Combo and Spinglass are similar,
   e.g., Fig. [167]1 shows that in the first community of Combo (C1), 15%
   of the genes belong to pathways related to human diseases and the first
   community of Spinglass (S1) also has 15% of the genes related to the
   same broad category.

   In order to confirm that the size of communities detected by different
   methods are reliable, we found sub-communities of the communities
   detected by Combo, Louvain and Spinglass for the Yeast PPI network and
   analyzed the results to see if detected sub-communities of one
   community include pathways related to different biological functions or
   to the same one as the main community. As an example, most pathways of
   L1 belong to metabolic processes and the pathways for its
   sub-communities also belong to metabolic processes. Due to the result
   of this comparison and other comparisons for all communities and
   sub-communities detected by Combo, Louvain and Spinglass, we can
   conclude that the size of communities detected are reliable.

   We were also curious to know if all genes enriched in each pathway
   belong to one community or to different communities. To find this, we
   compared genes enriched in different pathways for all communities
   detected by Combo, Louvain and Spinglass for the Yeast PPI network. The
   results of this comparison are shown in Tables [168]15, [169]16 and
   [170]17. The first column lists the broad category of pathways (M:
   Metabolism, and GIP: Genetic Information Processing). The second column
   lists the different pathways enriched, columns 3 through 7 represent
   number of enriched genes for each community, column 8 is the summation
   of all enriched genes and the last column specifies total number of
   genes in each pathway in DAVID KEGG database. Some numbers in these
   tables are colored in grey, meaning their p-values are larger than the
   cut-off of 0.01. As seen in these tables, most enriched genes in each
   pathway belong to one community and the corresponding pathway is also
   significantly enriched. There are only a few exceptions such as
   metabolic pathways (which has a total of 685 genes and they are mainly
   distributed into two communities while still maintaining p-value less
   than 0.01 for the pathway (Tables [171]15-[172]17, for the Yeast PPI
   network)), biosynthesis of amino acids (Table [173]16) and ribosome
   (Table [174]15). For the other pathways, most of the enriched genes
   belong to one community and if another community has some enriched
   genes, the p-value is greater than the cutoff of 0.01 (colored grey).

Table 15.

   Number of genes enriched in each pathway for different communities
   detected by Combo for the Yeast PPI network. The first column lists the
   broad category of pathways (M: Metabolism, and GIP: Genetic Information
   Processing). Column 2 lists the different pathways enriched. Columns 3
   through 7 represent number of enriched genes for different communities.
   Column 8 lists the total of all enriched genes of all communities
   together and the last column represents the maximum number of genes in
   that pathway. For example, in DAVID database, “metabolic pathways”
   contains 685 genes and of these, 384 were found in C1 and 140 were
   found in C4

   [175]graphic file with name 12859_2019_2746_Tab15_HTML.jpg
   [176]Open in a new tab

Table 16.

   Number of genes enriched in each pathway for different communities
   detected by Louvain for the Yeast PPI network. This table is arranged
   similar to Table [177]15

   [178]graphic file with name 12859_2019_2746_Tab16_HTML.jpg
   [179]Open in a new tab

Table 17.

   Number of genes enriched in each pathway for different communities
   detected by Spinglass for the Yeast PPI network. This table is arranged
   similar to Table [180]15

   [181]graphic file with name 12859_2019_2746_Tab17_HTML.jpg
   [182]Open in a new tab

   We note that the analysis provided above does not fully address the
   issue of selecting the best method. It is likely to be subjective and
   network specific. Hence, we recommend applying a few different methods,
   such as Louvain (from the group of similar methods) and one or two
   other (dissimilar) methods, and compare and interpret the results to
   obtain a consensus best method.

Robustness of communities obtained by the Louvain method

   We also analyzed the robustness of communities obtained by Louvain
   method for small perturbations in the network. Essentially, the network
   is randomly perturbed by deleting some nodes (and edges involving them)
   and the communities are identified. This is carried out 100 times to
   assess the robustness of the communities. First, the communities of the
   original Yeast PPI network are identified using Louvain method. Then,
   the following steps are repeated 100 times:
    1. Remove 1% of nodes randomly (e.g., 65 nodes out of 6532 nodes for
       the Yeast PPI network).
    2. Find the communities of the new network using Louvain method.
    3. From the communities of the original network, delete the random
       nodes of step 1.
    4. Calculate the Jaccard index matrix between the communities obtained
       in steps 2 and 3. We considered all communities with more than 100
       nodes.
    5. Compute the maximum value for each column of the Jaccard index
       matrix.
    6. Compute the average of the resulting row-vector.

   After performing the above steps 100 times, we have the vector of
   average of column-wise-maximum (avg-max) of Jaccard index values. Then,
   we generate scatter plots of avg-max. The scatter plots of avg-max for
   the Yeast and the Human PPI networks are shown in Fig. [183]4 (left
   panel: Yeast, right panel: Human).

Fig. 4.

   [184]Fig. 4
   [185]Open in a new tab

   Scatter plots of avg-max (average of column-wise-maximum of Jaccard
   index matrix) values vs. run number. The left panel (a) shows the
   results for the Yeast PPI network and the right panel (b) shows the
   results for the Human PPI network

   The mean and standard deviation of the avg-max vector are 72.21 and
   9.02%, respectively for the Yeast PPI network, whereas, for the Human
   PPI network, they are 60.11 and 7.14%, respectively. These relatively
   large numbers for the mean suggest that the communities identified by
   Louvain method are robust to small perturbations. We repeated the
   process for Leading Eigen. The mean and standard deviation of the
   avg-max vector for the Yeast network is 53.4 and 5.63%, respectively.
   Thus, Louvain is a better method, at least for the Yeast PPI network;
   for the Human PPI network, Leading Eigen finds only two or three very
   large communities, making avg-max artificially high (98%).

Generality of the overall results

   The PPI networks that we used in our analysis were from BioGRID and
   included both physical and genetic interactions (combined network).
   Hence, we have also applied the various approaches for finding
   communities to a Yeast PPI network comprising of only physical
   interactions, which has 6298 nodes and 83,788 edges, to find out if our
   broad conclusions based on the combined network are still valid for the
   physical interaction-only network. Additional file [186]1: Table S20
   shows the modularity scores and the number of communities detected by
   different methods for the physical interaction-only network. While Q is
   smaller for the combined network as compared to that for the physical
   interaction-only network for all methods, their relative trend for the
   different methods remain almost the same. Interestingly, the Q values
   for Louvain, Combo and Spinglass are similar and among the largest.
   Thus, these three methods are one of the best methods in terms of Q
   value as well.

   We have compared the various methods using three topological metrics
   (namely Rand Index, Adjusted Rand Index, and Normalized Mutual
   Information) with respect to the physical interaction-only network. The
   results for these comparisons are given in Additional file [187]1:
   Table S21. Based on results of Additional file [188]1: Table S21,
   Combo, Louvain and Spinglass are similar to each other in terms of the
   topological metrics. We also compared the KEGG pathway enrichment
   results for the first 5 communities of Louvain and Spinglass. To find
   which communities from Louvain and Spinglass methods are similar, we
   generated the Jaccard index matrix for communities with more than 100
   nodes for both methods (Additional file [189]1: Table S22). After
   selecting pairs of communities with highest value of Jaccard index for
   each column, we used DAVID version 6.8, to perform KEGG pathway
   enrichment analysis. The results of those comparisons are presented in
   Additional file [190]1: Tables S23-S25 for the L1 vs. S1, L2 vs. S2 and
   L3 vs. S4 comparisons, respectively. These tables are arranged similar
   to Additional file [191]1: Table S4.

   Two main results from Additional file [192]1: Tables S23-S25 are: 1)
   good functional similarity between the communities from Louvain and
   Spinglass methods (e.g., an overlap of 73.06% between L1 and S1, 79.41%
   between L2 and S2 and 85.24% between L3 and S4), and 2) segregation of
   biological functions represented by the communities, e.g., communities
   L2, L3 and L4 represent mostly metabolism related pathways. L1 shows a
   mixed enrichment, akin to the mixed pathways represented by the L3 and
   L5 communities of the combined network (Additional file [193]1: Tables
   S4 and S6). Some differences in the nature of the broad results for the
   combined network vs. the physical interaction-only network are likely
   due to the fact that the physical interaction-only network is much
   sparser (just 1/3rd of the edges are retained) as compared to the
   combined network. Using Cytoscape [[194]27], we also analyzed the
   properties (Additional file [195]1: Table S26) and node-degree
   distribution for the combined network and the physical interaction-only
   network (Figs. [196]5(a) and 5(b)). Figure [197]5(c) shows a comparison
   of count of nodes for a given degree between the combined and the
   physical interaction-only networks. As can be seen from Fig. [198]5(c),
   good R^2-value suggests good agreement between the two degree
   distributions.

Fig. 5.

   Fig. 5
   [199]Open in a new tab

   a Node-degree distribution for the combined network. b Node-degree
   distribution for the physical interaction-only network. c Comparison of
   counts of nodes between the combined network and the physical
   interaction-only network

Optimization of method-specific parameters

   We wanted to find out if the Q value for the different methods could be
   improved by optimizing their parameters, if any, or if their results
   varied across different runs. Of six methods, three of them, namely
   Combo, Conclude and Spinglass used a random number generator in the
   procedure for finding communities, although they do not have any
   parameters to be optimized. We carried out 10 repeats of Combo,
   Conclude and Spinglass on the combined Yeast PPI network to assess the
   variation in Q across the runs. Additional file [200]1: Table S27 shows
   the results of 10 runs for these methods (similar to those for the
   Combo and Spinglass methods for the Human PPI network in Tables [201]8
   and [202]9, respectively). The standard deviations of Q across the 10
   runs are much larger for Conclude (5.11%, std./mean) as compared to
   those for Combo (0.05%) and Spinglass (0.36%). Thus, for a small number
   of allowed runs, the results from Combo and Spinglass are more
   reliable.

   We selected the run with the highest modularity for Conclude method and
   performed KEGG pathway enrichment analysis. Then we compared these
   enrichment results with the enrichment results for the communities from
   the Conclude method reported earlier. Although the size of communities
   are a little different, the broad categories of pathways for majority
   of the communities are still the same. For example, there is one
   community in both runs which has pathways mainly related to metabolic
   processes. Additional file [203]1: Table S28 shows the results for
   comparison between the two communities with pathways related to
   metabolic processes (CL3) in the two runs.

   Spinglass method uses simulated annealing in its procedure. Hence, we
   ran it 10 times with 10 different start and stop temperatures and
   cooling factor, but the default values in the igraph package yielded
   the best result in terms of the largest modularity. Overall, at a broad
   level, we found that in our case studies, the methods providing
   well-interpretable communities also resulted in near-optimal (largest)
   Q values. The fact that the Q value for Conclude in one of the runs is
   slightly higher (0.29) than those for Combo and Spinglass (0.265) does
   not violate this conclusion because some of the communities from
   Conclude are also well-interpretable. However, we note that the Q
   values across different runs vary substantially for Conclude as
   compared to those for Combo and Spinglass, suggesting that Combo and
   Spinglass methods are likely more robust than the Conclude method.

Conclusions

   In this paper, we tested six methods to detect communities within the
   Yeast and Human PPI networks. An in-depth comparison of communities
   detected by these different methods has led us to conclude that Louvain
   and Spinglass are most similar for the Yeast PPI network whereas Combo
   and Spinglass are most similar for the Human PPI network. In terms of
   finding communities that include core pathways based on KEGG pathway
   and GO term enrichment results, Combo, Louvain and Spinglass were able
   to find similar communities in which important biological functions and
   pathways were enriched. For the Yeast PPI network, all genes from the
   network belonging to a pathway also generally belong to only one or two
   communities. In terms of running time or computational complexity, for
   the Yeast PPI network, Louvain was the fastest method and Combo was
   faster than Spinglass. For the Human PPI network, Louvain was much
   faster than Spinglass, and Spinglass was faster than Combo. Overall,
   Combo, Louvain and Spinglass provide reasonable results for community
   detection in biological networks. Their corresponding modularity values
   were also among the highest, except that Conclude yielded slightly
   better modularity for the Yeast PPI network in some runs; variation in
   modularity for Conclude was much larger than that for Combo and
   Spinglass across multiple runs. Overall, since Combo and Spinglass use
   stochastic search in their procedure and their running time is also
   more than that for Louvain, Louvain is likely the best method to find
   reasonably sized communities for biological networks in a reasonable
   time. While we applied these methods to PPI networks, we expect the
   broad results to be applicable to other types of networks such as
   gene-coexpression networks and hybrid/integrated networks.

Methods

   The focus of this work is on undirected, unweighted and connected
   graphs defined as G(n,m) where n = {n[1], n[2], …, n[|n|]} is an
   ordered set of nodes and m a subset of n*n, is the set of edges; for
   convenience, here, n and m represent sets. In our case studies on the
   Yeast and Human PPI networks, each node represents an Entrez gene ID
   and each edge represents an interaction between two nodes. By finding
   communities, we imply the segregation of nodes into groups such that
   there is a higher density of edges within each group than between
   groups. Although there are many algorithms for detecting communities,
   we selected six algorithms that performed well in terms of finding core
   pathways in biological networks with several thousands of nodes and
   hundreds of thousands of edges within reasonable computing time (e.g.,
   one day on a 24 processor machine).

   Each of these algorithms tries to optimize an objective function which
   in most cases is modularity. However, two algorithms with the same
   objective function will generally yield slightly different communities
   using different amounts of run time if they use different optimization
   strategies. The algorithms that will be briefly described below are:
   Girvan-Newman [[204]1, [205]7], Fast Greedy [[206]11], Combo [[207]28],
   Louvain [[208]10], Conclude [[209]29], Infomap [[210]30], Leading Eigen
   [[211]8] and Spinglass [[212]9]. We used an R package named igraph
   [[213]31] to run Fast Greedy, Louvain, Leading Eigen and Spinglass.
   There is a Cytoscape plugin, named GLay, to visualize the network and
   the communities [[214]32]. The GLay plugin utilizes igraph C library
   which provides the same methods as the igraph R package [[215]31]. For
   the other two methods (Conclude and Combo), authors have made java
   [[216]33] and C++ [[217]34] codes of their algorithms available online.
   Girvan-Newman method did not provide any result for our networks in
   24 h on a 24 processor machine, so we did not consider its results in
   our comparison. Another method, called Infomap, did not find
   reasonably-sized communities (e.g., for the Yeast PPI network with 6532
   nodes, one community has 6195 nodes, and the rest have less than 10
   nodes). Due to these reasons, we did not consider Girvan-Newman and
   Infomap methods in our analysis. However, Fast Greedy and Louvain,
   which we have used in our comparison, are based on the original
   Girvan-Newman method. So, we will briefly describe Girvan-Newman
   algorithm in the next subsection.

Algorithms for community detection

Girvan-Newman

   This is a divisive algorithm for identifying communities in networks
   where edges are iteratively removed based on the value of their
   betweenness. The steps of the algorithm are as follows [[218]1,
   [219]7]:
    1. Calculate betweenness score for all edges in the network.
       Betweenness is a measure that favors edges that lies between
       communities and disfavors those that lie inside communities. While
       there are many measures to find betweenness, Girvan-Newman
       algorithm uses the fast algorithm of Newman to find betweenness
       scores [[220]35].
    2. Find the edge that has the maximum value of betweenness and remove
       it from the network.
    3. Recalculate betweenness for all remaining edges in the network and
       repeat from step 2 for the new scores until there are no edges
       remaining.

   The output of this algorithm is a dendrogram that captures the possible
   divisions of the network into communities. In order to select the
   optimal division among all possible options, Modularity (Q) is used.
   [MATH:
   <mi>Q</mi><mo>=</mo><mfrac><mn>1</mn><mrow><mn>2</mn><mi>m</mi></mrow><
   /mfrac><munder><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow>
   </munder><mfenced close="]" open="["><mrow><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub><mo>−</mo><mfrac><mrow><msub><mi>k</
   mi><mi>i</mi></msub><msub><mi>k</mi><mi>j</mi></msub></mrow><mrow><mn>2
   </mn><mi>m</mi></mrow></mfrac></mrow></mfenced><mi>δ</mi><mfenced
   close=")" open="("
   separators=","><msub><mi>c</mi><mi>i</mi></msub><msub><mi>c</mi><mi>j</
   mi></msub></mfenced> :MATH]
   1

   where A[ij] is the weight of the edge between node i and j (is equal to
   1 when all edges have the same weight), k[i] is the sum of the weights
   of the edges attached to node i or degree of node i, c[i] is the
   community to which node i belongs to and the δ function is defined as
   δ(u,v) = 1 if u = v and 0 otherwise. Local peaks in the modularity
   during the process of community detection indicate good divisions of
   the network into communities. While using this algorithm gives good
   results in many cases, its computational complexity is O(m^2n) or
   O(n^3) on a sparse graph, where n is the number of nodes and m is the
   number of edges. This makes it impractical for very large graphs with
   several thousands of nodes and hundreds of thousands of edges.

Fast greedy by Clauset, Newman and Moore

   This algorithm (namely, Fast Greedy) involves finding the changes in
   modularity that results from combining pair of communities, selecting
   the combination yielding largest gain in modularity, and implementing
   the combination of the corresponding pair [[221]11]. At the beginning,
   each node is considered as a community. One way of performing the
   combination process is to consider the network as a multigraph where a
   whole community is represented by a node and the elements of the
   adjacency matrix are equal to the number of edges between the
   communities. Joining two communities (namely i and j) corresponds to
   replacing the i^th and j^th rows and columns of the adjacency matrix by
   a single row and column formed by their sum, respectively; the record
   of the list of nodes in the communities formed thus far is updated. In
   the algorithm proposed by Newman [[222]13], this operation is carried
   out explicitly on the entire adjacency matrix. Calculating the change
   of modularity (ΔQ) and finding the pair i, j with the largest gain is
   time-consuming. Hence, here, instead of using the adjacency matrix and
   calculating ΔQ[ij], a matrix of ΔQ[ij] values is initialized and
   updated directly. For sparse matrices, e.g., adjacency matrices for
   large networks, this results in substantial reduction in computation.
   The following parameters have to be set initially:
   [MATH: <msub><mrow><mi>∆</mi><mi>Q</mi></mrow><mi
   mathvariant="italic">ij</mi></msub><mo>=</mo><mfenced close=""
   open="{"><mtable><mtr><mtd><maligngroup></maligngroup><mfrac
   bevelled="true"><mn>1</mn><mrow><mn>2</mn><mi>m</mi></mrow></mfrac><msp
   ace width="0.35em"></mspace><mo>−</mo><mspace
   width="0.35em"></mspace><mfrac
   bevelled="true"><mrow><msub><mi>k</mi><mi>i</mi></msub><msub><mi>k</mi>
   <mi>j</mi></msub></mrow><msup><mfenced close=")"
   open="("><mrow><mn>2</mn><mi>m</mi></mrow></mfenced><mn>2</mn></msup></
   mfrac><mspace width="0.25em"></mspace><mtext>if</mtext><mspace
   width="0.25em"></mspace><mi mathvariant="normal">i</mi><mo>,</mo><mi
   mathvariant="normal">j</mi><mspace
   width="0.25em"></mspace><mi>are</mi><mspace
   width="0.25em"></mspace><mtext>connected</mtext><mspace
   width="0.25em"></mspace></mtd></mtr><mtr><mtd><maligngroup></maligngrou
   p><mn>0</mn><mspace
   width="0.25em"></mspace><mtext>otherwise</mtext></mtd></mtr></mtable></
   mfenced> :MATH]
   2
   [MATH:
   <msub><mi>a</mi><mi>i</mi></msub><mo>=</mo><mfrac><msub><mi>k</mi><mi>i
   </mi></msub><mrow><mn>2</mn><mi>m</mi></mrow></mfrac> :MATH]
   3

   for each i. The next steps are as follows:
    1. Calculate the initial values of ΔQ[ij] and a[i] based on Eqs.
       [223]2 and [224]3, and form the max-heap, H, which contains the
       largest element of each row of the matrix ΔQ[ij] along with the
       labels i, j of the corresponding pair of communities.
    2. Select the largest ΔQ[ij] from H, combine the corresponding
       communities, update the matrix ΔQ, H and a[i] (Eq. [225]4) and
       increase Q by ΔQ[ij.]
    3. Repeat step 2 until only one community exists.

   Due to the sparsity of the original adjacency matrix A, we will be able
   to perform updates in step 2 quickly and we need to only adjust a few
   elements of ΔQ. If communities i and j are joined together, labeling
   the combined community as j, we update the j^th row and column, and
   delete the i^th row and column altogether. The update rules are as
   follows:

   If community k is connected to both i and j, then:
   [MATH: <mi>∆</mi><msubsup><mi>Q</mi><mi
   mathvariant="italic">jk</mi><mo>′</mo></msubsup><mo>=</mo><mi>∆</mi><ms
   ub><mi>Q</mi><mi
   mathvariant="italic">ik</mi></msub><mo>+</mo><mi>∆</mi><msub><mi>Q</mi>
   <mi mathvariant="italic">jk</mi></msub> :MATH]
   4a

   If k is connected to i but not to j, then:
   [MATH: <mi>∆</mi><msubsup><mi>Q</mi><mi
   mathvariant="italic">jk</mi><mo>′</mo></msubsup><mo>=</mo><mi>∆</mi><ms
   ub><mi>Q</mi><mi
   mathvariant="italic">ik</mi></msub><mo>−</mo><mn>2</mn><msub><mi>a</mi>
   <mi>j</mi></msub><msub><mi>a</mi><mi>k</mi></msub> :MATH]
   4b

   If k is connected to j but not to i, then:
   [MATH: <mi>∆</mi><msubsup><mi>Q</mi><mi
   mathvariant="italic">jk</mi><mo>′</mo></msubsup><mo>=</mo><mi>∆</mi><ms
   ub><mi>Q</mi><mi
   mathvariant="italic">jk</mi></msub><mo>−</mo><mn>2</mn><msub><mi>a</mi>
   <mi>i</mi></msub><msub><mi>a</mi><mi>k</mi></msub> :MATH]
   4c

   To update a:
   [MATH:
   <msubsup><mi>a</mi><mi>j</mi><mo>′</mo></msubsup><mo>=</mo><msub><mi>a<
   /mi><mi>j</mi></msub><mo>+</mo><msub><mi>a</mi><mrow><mi>i</mi><mo>.</m
   o></mrow></msub> :MATH]

   Fast Greedy algorithm runs in time O(m.d.logn) for a network with n
   nodes and m edges where d ~ logn is the depth of the dendrogram. For
   sparse networks (m ~ n), the running time is O(nlog^2n), which is
   essentially linear [[226]11].

Combo

   Most community detection algorithms use one of the following steps in
   the process of finding communities: they may join two communities,
   split a community into two, or move nodes between two distinct
   communities. Combo involves all three possibilities [[227]28]. After
   selecting an initial partition made of a single community, the
   following steps are iterated until there is no gain in the objective
   function which may be modularity (Eq. [228]1) or description code
   length:
    1. For each source community, the best possible repartition of every
       source node into each destination community (either an existing
       community or a new community) is calculated. It would be possible
       that the source community totally joined the destination community
       in this step.
    2. The best merger, split, or recombination is performed.

   The basis of Combo is the selection of the best repartition of nodes
   between two communities. For each pair of source and (maybe empty)
   destination communities, a shift of all the nodes using Kernighan-Lin
   algorithm [[229]36] is performed. Particularly, Combo recombines the
   two communities starting from several initial configurations including:
   (a) the original communities, (b) the case in which the whole source
   community is moved to the destination community and (c) a few
   intermediate mergers, where a random subset of the source community is
   shifted to the destination community.

   For each starting configuration, a series of Kernighan-Lin shifts
   [[230]36] is iterated until no further improvement is possible. Each
   configuration is carried out by initializing a list of available nodes
   to cover all the nodes from the original source community and then
   iterating the following steps until there are no more nodes in the
   list:
    1. Find the node i in the list which when switched to another
       community results in the largest gain in modularity.
    2. Switch i to the other community, remove it from the original list
       and save the intermediate result.

   After performing a series of Kernighan-Lin iterations for each of the
   starting configurations, the intermediate result with the best score in
   terms of objective function (modularity) is selected.

   Combo outperforms all other known algorithms when the objective
   function is modularity. However, it has limitations on the size of the
   network it could handle within a reasonable time, which is currently
   around 30,000 nodes and is not a serious limitation for most biological
   networks. When the objective function is description code length,
   Combo’s results are similar to Infomap (which will be described later
   in this section) in most cases. Since the sequence of operations
   depends on the specific network, obtaining exact evaluations of the
   computational complexity of Combo is difficult, but the upper bound is
   O(n^2logc) where n is the number of nodes and c is the number of
   communities in the network [[231]28].

Louvain

   This algorithm detects communities in large networks by maximizing
   modularity and is much faster as compared to other methods [[232]10,
   [233]37, [234]38]. The limiting factor for this method is the memory
   (RAM) requirement rather than the computation time, as is the case with
   Girvan-Newman and Spinglass algorithms. The algorithm is divided in two
   phases, which are repeated iteratively. First phase is to assign a
   different community to each node of the network. So, in the beginning,
   there are as many communities as there are nodes. Then, the gain of
   modularity (Eq. [235]5) is calculated for removing node i from its
   community and placing it in one of its neighboring communities. The
   gain of modularity in moving node i into a community C can be computed
   by:
   [MATH: <mi mathvariant="normal">Δ</mi><mi>Q</mi><mo>=</mo><mfenced
   close="]" open="["><mrow><mfrac><mrow><msub><mi
   mathvariant="normal">Σ</mi><mtext
   mathvariant="italic">in</mtext></msub><mo>+</mo><msub><mi>k</mi><mrow><
   mi>i</mi><mo>,</mo><mtext
   mathvariant="italic">in</mtext></mrow></msub></mrow><mrow><mn>2</mn><mi
   >m</mi></mrow></mfrac><mo>−</mo><msup><mfenced close=")"
   open="("><mfrac><mrow><msub><mi mathvariant="normal">Σ</mi><mi
   mathvariant="italic">tot</mi></msub><mo>+</mo><msub><mi>k</mi><mi>i</mi
   ></msub></mrow><mrow><mn>2</mn><mi>m</mi></mrow></mfrac></mfenced><mn>2
   </mn></msup></mrow></mfenced><mo>−</mo><mfenced close="]"
   open="["><mrow><mfrac><msub><mi mathvariant="normal">Σ</mi><mtext
   mathvariant="italic">in</mtext></msub><mrow><mn>2</mn><mi>m</mi></mrow>
   </mfrac><mo>−</mo><msup><mfenced close=")" open="("><mfrac><msub><mi
   mathvariant="normal">Σ</mi><mi
   mathvariant="italic">tot</mi></msub><mrow><mn>2</mn><mi>m</mi></mrow></
   mfrac></mfenced><mn>2</mn></msup><mo>−</mo><msup><mfenced close=")"
   open="("><mfrac><msub><mi>k</mi><mi>i</mi></msub><mrow><mn>2</mn><mi>m<
   /mi></mrow></mfrac></mfenced><mn>2</mn></msup></mrow></mfenced> :MATH]
   5

   where Ʃ[in] is the sum of the weights (or count for un-weighted
   networks) of the edges inside C, Ʃ[tot] is the sum of the weights of
   the edges incident to nodes in C, k[i] is the sum of the weights of the
   edges incident to nodes i (degree of i), k[i,in] is the sum of the
   weights of the edges from i to nodes in C and m is the sum of the
   weights of all the edges in the network. If the gain is positive, the
   node i is placed in the community for which the gain is maximum. This
   process is applied repeatedly for all nodes until no further
   improvement can be achieved.

   The second phase is to build a network whose nodes are now the
   communities detected during the first phase. In order to perform that,
   the weights of the edges between the new nodes are given by the sum of
   the edges between nodes in the corresponding two communities. Edges
   between nodes of the same community result in self-loops for this
   community in the new network. When this phase is completed, the first
   phase of the algorithm is reapplied to the new network. The combination
   of these two phases is referred to as a “pass”. The passes are iterated
   until a maximum of modularity is reached. This algorithm is extremely
   fast (O(nlogn)) and could be even faster by using some heuristics,
   e.g., stopping the first phase when the gain of modularity is less than
   a given threshold [[236]10].

COmplex Network CLUster DEtection (CONCLUDE)

   CONCLUDE is a fast community detection method. The algorithm takes a
   graph G, an integer κ and an integer φ as inputs. The steps are:
    1. Compute κ-path edge centralities using Edge Random Walk κ-path
       Centrality (ERW-Kpath) algorithm (described by De Meo, et al.
       [[237]29]) on nodes of G by carrying out at most φ iteration. The
       output of ERW-Kpath algorithm is an array of weights.
    2. Calculate the distance among all pairs of nodes by taking two
       inputs: graph G and the array of weights from the previous step. It
       uses the following equation (Eq. [238]6) to find pairwise
       distances:

   [MATH: <msub><mi>σ</mi><mi
   mathvariant="italic">ij</mi></msub><mo>=</mo><msqrt><mrow><mfrac><mrow>
   <munder><mo>∑</mo><mrow><mi mathvariant="italic">kϵN</mi><mfenced
   close=")" open="("><mi>i</mi></mfenced><mo>−</mo><mi
   mathvariant="italic">CN</mi><mfenced close=")" open="("
   separators=","><mi>i</mi><mi>j</mi></mfenced></mrow></munder><msup><mfe
   nced close="]" open="["><mrow><msup><mi>L</mi><mi>κ</mi></msup><mfenced
   close=")" open="("><msub><mi>e</mi><mi
   mathvariant="italic">ki</mi></msub></mfenced></mrow></mfenced><mn>2</mn
   ></msup></mrow><mrow><mo>∣</mo><mi>N</mi><mfenced close=")"
   open="("><mi>i</mi></mfenced><mo>−</mo><mi
   mathvariant="italic">CN</mi><mfenced close=")" open="("
   separators=","><mi>i</mi><mi>j</mi></mfenced><mo>∣</mo></mrow></mfrac><
   mo>+</mo><mfrac><mrow><munder><mo>∑</mo><mrow><mi
   mathvariant="italic">kϵN</mi><mfenced close=")"
   open="("><mi>j</mi></mfenced><mo>−</mo><mi
   mathvariant="italic">CN</mi><mfenced close=")" open="("
   separators=","><mi>i</mi><mi>j</mi></mfenced></mrow></munder><msup><mfe
   nced close="]" open="["><mrow><msup><mi>L</mi><mi>κ</mi></msup><mfenced
   close=")" open="("><msub><mi>e</mi><mi
   mathvariant="italic">kj</mi></msub></mfenced></mrow></mfenced><mn>2</mn
   ></msup></mrow><mrow><mo>∣</mo><mi>N</mi><mfenced close=")"
   open="("><mi>j</mi></mfenced><mo>−</mo><mi
   mathvariant="italic">CN</mi><mfenced close=")" open="("
   separators=","><mi>i</mi><mi>j</mi></mfenced><mo>∣</mo></mrow></mfrac><
   mo>+</mo><mfrac><mrow><munder><mo>∑</mo><mrow><mtext
   mathvariant="italic">kϵCN</mtext><mfenced close=")" open="("
   separators=","><mi>i</mi><mi>j</mi></mfenced></mrow></munder><msup><mfe
   nced close="]" open="["><mrow><msup><mi>L</mi><mi>κ</mi></msup><mfenced
   close=")" open="("><msub><mi>e</mi><mi
   mathvariant="italic">ki</mi></msub></mfenced><mo>−</mo><msup><mi>L</mi>
   <mi>κ</mi></msup><mfenced close=")" open="("><msub><mi>e</mi><mi
   mathvariant="italic">kj</mi></msub></mfenced></mrow></mfenced><mn>2</mn
   ></msup></mrow><mrow><mo>∣</mo><mi mathvariant="italic">CN</mi><mfenced
   close=")" open="("
   separators=","><mi>i</mi><mi>j</mi></mfenced><mo>∣</mo></mrow></mfrac><
   /mrow></msqrt> :MATH]
   6

   where L^κ is κ-path edge centrality and is calculated by:
   [MATH: <msup><mi>L</mi><mi>k</mi></msup><mfenced close=")"
   open="("><mi>e</mi></mfenced><mo>=</mo><munder><mo
   movablelimits="false">∑</mo><mi
   mathvariant="italic">sϵV</mi></munder><mo>Pr</mo><mfenced close=")"
   open="(" separators=","><mi>e</mi><mi>s</mi></mfenced> :MATH]
   7

   Pr(e,s) is the probability of selecting the edge e in a random simple
   κ-path originating from an arbitrary source node s. The symbol N(i) is
   the set of neighbors of the node i and CN(i,j) indicates the subset of
   neighbors common to i and j.

   The output of this step is a matrix Δ containing all pairs of distances
   between nodes.
     * 3.
       Finally, apply Louvain method [[239]10] on matrix Δ to find
       communities of G.

Maps of random walk (Infomap)

   The Infomap approach closely follows the Louvain approach [[240]10];
   neighboring nodes are joined together to make small communities which
   subsequently are joined into bigger communities. The difference between
   these two methods is that the objective function of Louvain is
   modularity while the objective function of Infomap is a lower bound on
   a quantity referred to as code-length (M), defined as,
   [MATH: <mi>L</mi><mfenced close=")"
   open="("><mi>M</mi></mfenced><mo>=</mo><msub><mi>q</mi><mo>↷</mo></msub
   ><mi>H</mi><mfenced close=")"
   open="("><mi>Q</mi></mfenced><mo>−</mo><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>c</mi></munderover><msub><msup><mi>p</mi><mi>i</mi></msup><mo>↻</mo
   ></msub><mi>H</mi><mfenced close=")"
   open="("><msup><mi>P</mi><mi>i</mi></msup></mfenced> :MATH]
   8

   The aim of Infomap is to minimize the lower bound, L(M). The equation
   comprises of two terms: first is the entropy of the movement of nodes
   between communities and second is the entropy of movements of nodes
   within communities. Further details about this equation can be found
   elsewhere [[241]30].

   Each node is initially assigned to its own community. Then, in random
   sequential order, each node is placed into the neighboring community
   that results in the largest decrease in L(M) (Eq. [242]8). If no move
   decreases L(M), the node will stay in its original community. This
   procedure is repeated in a new random sequential order each time until
   no move could decrease L(M). In each iteration, the nodes of the new
   network are the communities found at the last level and the process of
   joining nodes into communities is repeated on the new network until
   L(M) cannot be reduced further. The computational complexity of Infomap
   is a linear function of the number of edges, i.e., O(m) [[243]30,
   [244]39].

Leading Eigen

   Leading Eigen method [[245]8] is also based on the modularity
   maximization but here, the modularity is expressed in terms of the
   eigenvalues and eigenvectors of a matrix called the modularity matrix,
   B:
   [MATH: <msub><mi>B</mi><mi
   mathvariant="italic">ij</mi></msub><mo>=</mo><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub><mo>−</mo><mfrac><mrow><msub><mi>k</
   mi><mi>i</mi></msub><msub><mi>k</mi><mi>j</mi></msub></mrow><mrow><mn>2
   </mn><mi>m</mi></mrow></mfrac> :MATH]
   9

   where, A is the adjacency matrix, k[i] is the sum of the weights of the
   edges attached to node i (or degree of node i), k[j] is the degree of
   node j and m is the total number of edges. The modularity matrix is a
   characteristic property of the network and is independent of any
   division of the network into communities. The procedure of finding
   communities of a network with this method consists of finding the
   eigenvector corresponding to the most positive eigenvalue of the
   modularity matrix, and then dividing the network into two groups based
   on the sign of the elements of the eigenvector. Defining an index
   vector, s, the sign of elements are:
   [MATH: <msub><mi>s</mi><mi>i</mi></msub><mo>=</mo><mfenced close=""
   open="{"><mtable><mtr><mtd><maligngroup></maligngroup><mo>+</mo><mn>1</
   mn><mspace width="0.25em"></mspace><mtext
   mathvariant="italic">if</mtext><mspace
   width="0.25em"></mspace><msup><msub><mi>u</mi><mi>i</mi></msub><mfenced
   close=")"
   open="("><mn>1</mn></mfenced></msup><mo>≥</mo><mn>0</mn></mtd></mtr><mt
   r><mtd><maligngroup></maligngroup><mo>−</mo><mn>1</mn><mspace
   width="0.25em"></mspace><mtext mathvariant="italic">if</mtext><mspace
   width="0.25em"></mspace><msup><msub><mi>u</mi><mi>i</mi></msub><mfenced
   close=")"
   open="("><mn>1</mn></mfenced></msup><mo><</mo><mn>0</mn></mtd></mtr></m
   table></mfenced> :MATH]
   10

   where, u[i]^(1) is the i^th element of u[1] (the normalized eigenvector
   of the modularity matrix). The nodes with positive sign form one
   community and the rest of the nodes form the other community. To avoid
   dividing the network into only two communities, an n (total number of
   nodes) by c (the number of non-overlapping communities) index matrix S
   has to be defined. Each column of this matrix is an index vector of
   (0,1) elements, such that:
   [MATH: <msub><mi>S</mi><mi
   mathvariant="italic">ij</mi></msub><mo>=</mo><mfenced close=""
   open="{"><mtable><mtr><mtd><maligngroup></maligngroup><mn>1</mn><mspace
   width="0.25em"></mspace><mrow><mspace width="0.25em"></mspace><mtext
   mathvariant="italic">if vertex</mtext><mspace
   width="0.25em"></mspace><mi>i</mi><mspace
   width="0.25em"></mspace><mtext mathvariant="italic">belongs to
   community</mtext><mspace
   width="0.25em"></mspace><mi>j</mi></mrow></mtd></mtr><mtr><mtd><maligng
   roup></maligngroup><mn>0</mn><mrow><mspace
   width="0.25em"></mspace><mtext
   mathvariant="italic">otherwise</mtext><mspace
   width="0.25em"></mspace></mrow></mtd></mtr></mtable></mfenced> :MATH]
   11

   The modularity of this division of the network is then equal to:
   [MATH: <mi>Q</mi><mo>=</mo><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>,</mo><mi>j</mi><mo>=</
   mo><mn>1</mn></mrow><mi>n</mi></munderover><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>c</mi></munderover><msub><mi>B</mi><mi
   mathvariant="italic">ij</mi></msub><msub><mi>S</mi><mi
   mathvariant="italic">ik</mi></msub><msub><mi>S</mi><mi
   mathvariant="italic">jk</mi></msub><mo>=</mo><mi
   mathvariant="italic">Tr</mi><mfenced close=")"
   open="("><mrow><msup><mi>S</mi><mi>T</mi></msup><mi
   mathvariant="italic">BS</mi></mrow></mfenced> :MATH]
   12

   This form of modularity is different from other forms in a leading
   multiplicative constant 1/(2 m) but since it has no effect on the
   position of the maximum of the modularity, it has been omitted from the
   equation. Writing B=UDU^T where U = (u[1]|u[2]| …) is the matrix of
   eigenvectors of B, and D is the diagonal matrix of eigenvalues
   D[ii] = β[i], Q can be written as:
   [MATH: <mi>Q</mi><mo>=</mo><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></munderover><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>c</mi></munderover><msub><mi>β</mi><mi>j</mi></msub><msup><mfenced
   close=")"
   open="("><mrow><msubsup><mi>u</mi><mi>j</mi><mi>T</mi></msubsup><msub><
   mi>s</mi><mi>k</mi></msub></mrow></mfenced><mn>2</mn></msup> :MATH]
   13

   The aim is still maximizing the modularity Q, but now, there is no
   constraint on the number of communities, c [[246]8].

Spinglass

   In this method, the community structure of a network is described as
   the spin configuration that minimizes the energy of the spin glass
   (Hamiltonian) with respect to the spin states (the community indices)
   [[247]9]. Similar to any other quality function for an assignment of
   nodes into communities, Hamiltonian has to follow the principle of
   grouping together the nodes that are linked (there is an edge between
   them) and keep apart the ones that are not linked. From this, four
   requirements have to be satisfied: a) rewarding internal edges between
   nodes of the same community (in the same spin state), b) penalizing
   missing edges (non-links) between nodes in the same community, c)
   penalizing existing edges between different communities and d)
   rewarding non-links between different communities. The following
   equation (Eq. [248]14) satisfies these properties:
   [MATH: <mtable columnalign="center" displaystyle="true"><mtr><mtd><mi
   mathvariant="script">H</mi><mfenced close=")" open="("><mfenced
   close="}"
   open="{"><mi>σ</mi></mfenced></mfenced><mo>=</mo><mo>−</mo><munder><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow>
   </munder><msub><mi>a</mi><mi
   mathvariant="italic">ij</mi></msub><munder><munder
   accentunder="true"><mrow><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub><mi>δ</mi><mfenced close=")"
   open="("
   separators=","><msub><mi>σ</mi><mi>i</mi></msub><msub><mi>σ</mi><mi>j</
   mi></msub></mfenced></mrow><mo stretchy="true">⏟</mo></munder><mtext
   mathvariant="italic">internal
   links</mtext></munder><mo>+</mo><munder><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow>
   </munder><msub><mi>b</mi><mi
   mathvariant="italic">ij</mi></msub><munder><munder
   accentunder="true"><mrow><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub></mrow></mfenced><mi>δ</mi><mfenced
   close=")" open="("
   separators=","><msub><mi>σ</mi><mi>i</mi></msub><msub><mi>σ</mi><mi>j</
   mi></msub></mfenced></mrow><mo
   stretchy="true">⏟</mo></munder><mrow><mtext
   mathvariant="italic">internal non</mtext><mo>−</mo><mtext
   mathvariant="italic">links</mtext></mrow></munder></mtd></mtr><mtr><mtd
   ><mo>+</mo><munder><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow>
   </munder><msub><mi>c</mi><mi
   mathvariant="italic">ij</mi></msub><munder><munder
   accentunder="true"><mrow><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mi>δ</mi><mfenced close=")"
   open="("
   separators=","><msub><mi>σ</mi><mi>i</mi></msub><msub><mi>σ</mi><mi>j</
   mi></msub></mfenced></mrow></mfenced></mrow><mo
   stretchy="true">⏟</mo></munder><mtext mathvariant="italic">external
   links</mtext></munder><mo>−</mo><munder><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow>
   </munder><msub><mi>d</mi><mi
   mathvariant="italic">ij</mi></msub><munder><munder
   accentunder="true"><mrow><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub></mrow></mfenced><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mi>δ</mi><mfenced close=")"
   open="("
   separators=","><msub><mi>σ</mi><mi>i</mi></msub><msub><mi>σ</mi><mi>j</
   mi></msub></mfenced></mrow></mfenced></mrow><mo
   stretchy="true">⏟</mo></munder><mrow><mtext
   mathvariant="italic">external non</mtext><mo>−</mo><mtext
   mathvariant="italic">links</mtext></mrow></munder></mtd></mtr></mtable>
   :MATH]
   14

   where, σ[i] denotes the community index of node i in the graph, δ is
   the Kronecker delta function and a[ij], b[ij], c[ij], and d[ij]
   represent the weights of the individual contributions, respectively. If
   links and non-links are each weighted equally, no matter they are
   external or internal, a[ij] = c[ij] and b[ij] = d[ij], then it would be
   enough to consider the internal links and non-links. Convenient choices
   of coefficients are a[ij] = 1 – γp[ij] and b[ij] = γp[ij] where
   p[ij]denotes the probability that a link exists between node i and j,
   normalized, such that
   [MATH: <munder><mo>∑</mo><mrow><mi
   mathvariant="normal">i</mi><mo>≠</mo><mi
   mathvariant="normal">j</mi></mrow></munder><msub><mi
   mathvariant="normal">p</mi><mi>ij</mi></msub><mo>=</mo><mn>2</mn><mi
   mathvariant="normal">m</mi> :MATH]
   , where m is the total number of edges in the network. When γ = 1, it
   leads to the natural situation in which the total amount of energy that
   can possibly be contributed by links and non-links is equal, i.e.,
   [MATH: <munder><mo>∑</mo><mrow><mi
   mathvariant="normal">i</mi><mo>≠</mo><mi
   mathvariant="normal">j</mi></mrow></munder><msub><mi
   mathvariant="normal">A</mi><mi>ij</mi></msub><msub><mi
   mathvariant="normal">a</mi><mi>ij</mi></msub><mo>=</mo><munder><mo>∑</m
   o><mrow><mi mathvariant="normal">i</mi><mo>≠</mo><mi
   mathvariant="normal">j</mi></mrow></munder><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msub><mi
   mathvariant="normal">A</mi><mi>ij</mi></msub></mrow></mfenced><msub><mi
   mathvariant="normal">b</mi><mi>ij</mi></msub> :MATH]
   . As we are dealing with undirected, unweighted networks, our choice of
   weights allows us to simplify the Hamiltonian (Eq. [249]14):
   [MATH: <mi mathvariant="script">H</mi><mfenced close=")"
   open="("><mfenced close="}"
   open="{"><mi>σ</mi></mfenced></mfenced><mo>=</mo><mo>−</mo><munder><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow>
   </munder><mfenced close=")" open="("><mrow><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub><mo>−</mo><mi>γ</mi><msub><mi>p</mi>
   <mi
   mathvariant="italic">ij</mi></msub></mrow></mfenced><mi>δ</mi><mfenced
   close=")" open="("
   separators=","><msub><mi>σ</mi><mi>i</mi></msub><msub><mi>σ</mi><mi>j</
   mi></msub></mfenced> :MATH]
   15

   where, p[ij], the probability, can be written as
   [MATH: <msub><mi
   mathvariant="normal">p</mi><mi>ij</mi></msub><mo>=</mo><mfrac><mrow><ms
   ub><mi mathvariant="normal">k</mi><mi
   mathvariant="normal">i</mi></msub><msub><mi
   mathvariant="normal">k</mi><mi
   mathvariant="normal">j</mi></msub></mrow><mrow><mn>2</mn><mi
   mathvariant="normal">m</mi></mrow></mfrac> :MATH]
   and k[i] and k[j] represent degree of node i and degree of node j,
   respectively. Now, minimizing H gives the number of spin states (or
   communities) in a network. The minimization is carried out by using
   simulated annealing on the entire network. This method is rather fast
   and the computational complexity is approximately O(n^3.2). However, it
   cannot be used for disconnected networks as there is no guarantee that
   nodes from disconnected parts of the network also have different spin
   states and belong to different communities.

   Substituting p[ij] as
   [MATH: <mfrac><mrow><msub><mi mathvariant="normal">k</mi><mi
   mathvariant="normal">i</mi></msub><msub><mi
   mathvariant="normal">k</mi><mi
   mathvariant="normal">j</mi></msub></mrow><mrow><mn>2</mn><mi
   mathvariant="normal">m</mi></mrow></mfrac> :MATH]
   and γ = 1 in Eq. [250]15, we have:
   [MATH: <mi mathvariant="script">H</mi><mfenced close=")"
   open="("><mfenced close="}"
   open="{"><mi>σ</mi></mfenced></mfenced><mo>=</mo><mo>−</mo><munder><mo>
   ∑</mo><mrow><mi>i</mi><mo>≠</mo><mi>j</mi></mrow></munder><mfenced
   close=")" open="("><mrow><msub><mi>A</mi><mi
   mathvariant="italic">ij</mi></msub><mo>−</mo><mfrac><mrow><msub><mi>k</
   mi><mi>i</mi></msub><msub><mi>k</mi><mi>j</mi></msub></mrow><mrow><mn>2
   </mn><mi>m</mi></mrow></mfrac></mrow></mfenced><mi>δ</mi><mfenced
   close=")" open="("
   separators=","><msub><mi>σ</mi><mi>i</mi></msub><msub><mi>σ</mi><mi>j</
   mi></msub></mfenced> :MATH]

   and comparing this equation with Eq. [251]1 (modularity) yields:
   [MATH: <mi>Q</mi><mo>=</mo><mo>−</mo><mfrac><mrow><mi
   mathvariant="script">H</mi><mfenced close=")" open="("><mfenced
   close="}"
   open="{"><mi>σ</mi></mfenced></mfenced></mrow><mrow><mn>2</mn><mi>m</mi
   ></mrow></mfrac> :MATH]
   16

   It is clear from Eq. [252]16 that minimizing Hamiltonian is equivalent
   to maximizing modularity. Thus, we expect to get same results for
   Louvain (which maximizes modularity) and Spinglass (which minimizes
   Hamiltonian) when applied to our networks.

   Table [253]18 summarizes the above methods described in this section.

Table 18.

   Summary of community detection methods
   Name of Method Equation # Computational complexity    Reference
   Girvan Newman  1          O(m^2n)                  [[254]1, [255]7]
   Fast Greedy    2, 3, 4    O(nlog^2n)               [[256]11]
   Combo          1          O(n^2logc)               [[257]28]
   Louvain        1, 5       O(nlogn)                 [[258]10]
   Conclude       6, 7       O(m)                     [[259]29]
   Infomap        8          O(m)                     [[260]30]
   Leading Eigen  9, 10, 12  O(n^2)                   [[261]8]
   Spinglass      15         O(n^3.2)                 [[262]9]
   [263]Open in a new tab

Metrics for comparison of different algorithms

   To compare different methods, we used three metrics, namely, Rand Index
   (RI), Adjusted Rand Index (ARI), and Normalized Mutual Information
   (NMI). We also used Jaccard Index for measuring similarity between
   different communities. These metrics are based on topological
   similarities of the communities identified and hence are relevant for
   biological networks; we expect that topologically similar communities
   will likely yield similar biological interpretations.

Rand index (RI)

   Rand proposes a simple measure of agreement between the results of
   (i.e., the communities identified by) two methods A and B [[264]40]. RI
   represents the fraction of node-pairs that are distributed to the
   communities obtained by the two methods in a similar manner. Let n[11]
   be the number of pairs of nodes from a network G which are both in the
   same community detected by method A and are also in the same community
   detected by method B. Let n[00] be the number of pairs of nodes from G
   which are in different communities in A and are also in different
   communities in B. n[00] and n[11] are interpreted as agreements in the
   classification of the nodes from a pair. Accordingly, two disagreement
   quantities n[01] and n[10] are also defined: n[01] (n[10]) is the
   number of pairs of nodes from G which are in the same community
   detected by method A (B) but they are in different communities detected
   by method B (A). Then, Rand Index (RI) is given by [[265]41]:
   [MATH: <mi mathvariant="italic">RI</mi><mfenced close=")" open="("
   separators=","><mi>A</mi><mi>B</mi></mfenced><mo>=</mo><mfrac><mrow><ms
   ub><mi>n</mi><mn>00</mn></msub><mo>+</mo><msub><mi>n</mi><mn>11</mn></m
   sub></mrow><mrow><msub><mi>n</mi><mn>00</mn></msub><mo>+</mo><msub><mi>
   n</mi><mn>11</mn></msub><mo>+</mo><msub><mi>n</mi><mn>01</mn></msub><mo
   >+</mo><msub><mi>n</mi><mn>10</mn></msub></mrow></mfrac> :MATH]
   17

   As seen from Eq. [266]17, RI has a probabilistic interpretation with
   respect to picking a pair of nodes at random, i.e.,
   [MATH:
   <mfrac><mrow><msub><mi>n</mi><mn>00</mn></msub><mo>+</mo><msub><mi>n</m
   i><mn>11</mn></msub></mrow><mfenced close=")"
   open="("><mtable><mtr><mtd><maligngroup></maligngroup><mi>N</mi></mtd><
   /mtr><mtr><mtd><maligngroup></maligngroup><mn>2</mn><mrow></mrow></mtd>
   </mtr></mtable></mfenced></mfrac> :MATH]
   , which is a probability of agreement (N is the total number of nodes).
   RI is not a normalized quantity, e.g., the upper bound is 1 but the
   lower bound is more than zero (network dependent). Due to this lack of
   normalization, Hubert and Arabie [[267]42] suggested an improvement to
   RI as described below.

Adjusted Rand index (ARI)

   ARI is equivalent to a normalized Rand Index. Consider a confusion
   matrix for methods A and B where rows correspond to the communities in
   A and columns correspond to the communities in B. N[ij], the (i,j)^th
   entry in this matrix, is the number of nodes in both community i of
   method A and community j in method B. Denote by N[i]. the sum of all
   columns for row i; thus N[i.] is the number of nodes in community i of
   method A. Define N[.j] to be the sum of all rows for column j, i.e.
   N[.j] is the number of nodes in community j in method B. The Adjusted
   Rand Index (ARI), is calculated from the values N[ij] of the confusion
   matrix for the two methods as follows [[268]42]:
   [MATH: <mtable
   displaystyle="true"><mtr><mtd><msub><mi>t</mi><mn>1</mn></msub><mo>=</m
   o><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><msu
   b><mi>c</mi><mi>A</mi></msub></munderover><mfenced close=")"
   open="("><mtable><mtr><mtd><maligngroup></maligngroup><msub><mi>N</mi><
   mrow><mi>i</mi><mo>.</mo></mrow></msub></mtd></mtr><mtr><mtd><maligngro
   up></maligngroup><mn>2</mn><mrow></mrow></mtd></mtr></mtable></mfenced>
   <mo>;</mo><msub><mi>t</mi><mn>2</mn></msub><mo>=</mo><munderover><mo>∑<
   /mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow><msub><mi>c</mi><mi>B</m
   i></msub></munderover><mfenced close=")"
   open="("><mtable><mtr><mtd><maligngroup></maligngroup><msub><mi>N</mi><
   mrow><mo>.</mo><mi>j</mi></mrow></msub></mtd></mtr><mtr><mtd><maligngro
   up></maligngroup><mn>2</mn></mtd></mtr></mtable></mfenced><mo>;</mo><ms
   ub><mi>t</mi><mn>3</mn></msub><mo>=</mo><mfrac><mrow><mn>2</mn><msub><m
   i>t</mi><mn>1</mn></msub><msub><mi>t</mi><mn>2</mn></msub></mrow><mrow>
   <mi>N</mi><mfenced close=")"
   open="("><mrow><mi>N</mi><mo>−</mo><mn>1</mn></mrow></mfenced></mrow></
   mfrac></mtd></mtr><mtr><mtd><mi mathvariant="italic">ARI</mi><mfenced
   close=")" open="("
   separators=","><mi>A</mi><mi>B</mi></mfenced><mo>=</mo><mfrac><mrow><mu
   nderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><msub><mi
   >c</mi><mi>A</mi></msub></munderover><munderover><mo>∑</mo><mrow><mi>j<
   /mi><mo>=</mo><mn>1</mn></mrow><msub><mi>c</mi><mi>B</mi></msub></munde
   rover><mfenced close=")"
   open="("><mtable><mtr><mtd><maligngroup></maligngroup><msub><mi>N</mi><
   mrow><mi>i</mi><mi>j</mi></mrow></msub></mtd></mtr><mtr><mtd><maligngro
   up></maligngroup><mn>2</mn><mrow></mrow></mtd></mtr></mtable></mfenced>
   <mo>−</mo><msub><mi>t</mi><mn>3</mn></msub></mrow><mrow><mfrac><mn>1</m
   n><mn>2</mn></mfrac><mfenced close=")"
   open="("><mrow><msub><mi>t</mi><mn>1</mn></msub><mo>+</mo><msub><mi>t</
   mi><mn>2</mn></msub></mrow></mfenced><mo>−</mo><msub><mi>t</mi><mn>3</m
   n></msub></mrow></mfrac></mtd></mtr></mtable><mspace
   width="0.25em"></mspace> :MATH]
   18

   where, c[A] and c[B] are the number of communities detected by methods
   A and B, respectively.

Normalized mutual information (NMI)

   Another metric to calculate the similarity between two methods is
   Normalized Mutual Information (NMI). NMI is the normalized form of
   Mutual Information (MI). MI measures similarity between the results of
   two methods and is given by [[269]41]:
   [MATH: <mi mathvariant="italic">MI</mi><mfenced close=")" open="("
   separators=","><mi>A</mi><mi>B</mi></mfenced><mo>=</mo><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <msub><mi>c</mi><mi>A</mi></msub></munderover><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <msub><mi>c</mi><mi>B</mi></msub></munderover><mfrac><msub><mi>N</mi><m
   i mathvariant="italic">ij</mi></msub><mi>N</mi></mfrac><mo
   mathvariant="italic">log</mo><mfenced close=")"
   open="("><mfrac><mrow><msub><mi>N</mi><mi
   mathvariant="italic">ij</mi></msub><mi>N</mi></mrow><mrow><msub><mi>N</
   mi><mrow><mi>i</mi><mo>.</mo></mrow></msub><msub><mi>N</mi><mrow><mo>.<
   /mo><mi>j</mi></mrow></msub></mrow></mfrac></mfenced> :MATH]
   19

   where, A and B are the methods being compared. The terms used in this
   equation are the same as those used in the equation for ARI (Eq.
   [270]18). Then, NMI between methods A and B is calculated as:
   [MATH: <mi mathvariant="italic">NMI</mi><mfenced close=")" open="("
   separators=","><mi>A</mi><mi>B</mi></mfenced><mo>=</mo><mfrac><mrow><mo
   >−</mo><mn>2</mn><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1<
   /mn></mrow><msub><mi>c</mi><mi>A</mi></msub></munderover><munderover><m
   o>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow><msub><mi>c</mi><mi>
   B</mi></msub></munderover><mfrac><msub><mi>N</mi><mi
   mathvariant="italic">ij</mi></msub><mi>N</mi></mfrac><mo
   mathvariant="italic">log</mo><mfenced close=")"
   open="("><mfrac><mrow><msub><mi>N</mi><mi
   mathvariant="italic">ij</mi></msub><mi>N</mi></mrow><mrow><msub><mi>N</
   mi><mrow><mi>i</mi><mo>.</mo></mrow></msub><msub><mi>N</mi><mrow><mo>.<
   /mo><mi>j</mi></mrow></msub></mrow></mfrac></mfenced></mrow><mrow><mund
   erover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><msub><mi>c
   </mi><mi>A</mi></msub></munderover><msub><mi>N</mi><mrow><mi>i</mi><mo>
   .</mo></mrow></msub><mo mathvariant="italic">log</mo><mfenced close=")"
   open="("><mfrac><msub><mi>N</mi><mrow><mi>i</mi><mo>.</mo></mrow></msub
   ><mi>N</mi></mfrac></mfenced><mo>+</mo><munderover><mo>∑</mo><mrow><mi>
   j</mi><mo>=</mo><mn>1</mn></mrow><msub><mi>c</mi><mi>B</mi></msub></mun
   derover><msub><mi>N</mi><mrow><mo>.</mo><mi>j</mi></mrow></msub><mo
   mathvariant="italic">log</mo><mfenced close=")"
   open="("><mfrac><msub><mi>N</mi><mrow><mo>.</mo><mi>j</mi></mrow></msub
   ><mi>N</mi></mfrac></mfenced></mrow></mfrac> :MATH]
   20

Jaccard index

   Jaccard index is a measure of similarity for two sets of nodes, with a
   range from 0 to 1 and is defined as the size of the intersection
   (overlap) divided by the size of the union of the sets:
   [MATH: <mi>J</mi><mfenced close=")" open="("
   separators=","><mi>A</mi><mi>B</mi></mfenced><mo>=</mo><mfrac><mrow><mo
   >#</mo><mfenced close=")"
   open="("><mrow><mi>A</mi><mo>∩</mo><mi>B</mi></mrow></mfenced></mrow><m
   row><mo>#</mo><mfenced close=")"
   open="("><mrow><mi>A</mi><mo>∪</mo><mi>B</mi></mrow></mfenced></mrow></
   mfrac> :MATH]
   21

   where, the numerator is the number of common elements between the sets
   A and B and the denominator is the number of all the elements in A and
   B combined.

Overall approach for topological and functional comparison of communities
detected by different algorithms

   Applying the above methods to PPI networks yields different number of
   communities with different number of nodes. The communities detected
   are compared in two ways: topological comparison and functional
   comparison. In topological comparison, methods are compared using
   different metrics (RI, ARI and NMI). Based on the results of these
   metrics, we are able to figure out which methods are similar to each
   other and which are dissimilar. When a method is compared with itself,
   RI, ARI and NMI are 1. Larger (smaller) the value of metrics, more
   (less) similar are the two methods being compared. After finding which
   methods are similar to each other from a topological perspective,
   functional comparisons (such as KEGG pathway enrichment analysis) have
   been used to further assess the functional similarity of the
   communities identified by these methods. Figure [271]6 shows a flow
   chart of our analysis pipeline.

Fig. 6.

   Fig. 6
   [272]Open in a new tab

   Flow chart of the steps used in our analysis

Additional file

   [273]Additional file 1:^ (7.7MB, xlsx)

   Supplemental Tables S1-S31. (XLSX 7912 kb)

Acknowledgments