Abstract

   The accessibility of a huge amount of protein-protein interaction (PPI)
   data has allowed to do research on biological networks that reveal the
   structure of a protein complex, pathways and its cellular organization.
   A key demand in computational biology is to recognize the modular
   structure of such biological networks. The detection of protein
   complexes from the PPI network, is one of the most challenging and
   significant problems in the post-genomic era. In Bioinformatics, the
   frequently employed approach for clustering the networks is Markov
   Clustering (MCL). Many of the researches for protein complex detection
   were done on the static PPI network, which suffers from a few
   drawbacks. To resolve this problem, this paper proposes an approach to
   detect the dynamic protein complexes through Markov Clustering based on
   Elephant Herd Optimization Approach (DMCL-EHO). Initially, the proposed
   method divides the PPI network into a set of dynamic subnetworks under
   various time points by combining the gene expression data and secondly,
   it employs the clustering analysis on every subnetwork using the MCL
   along with Elephant Herd Optimization approach. The experimental
   analysis was employed on different PPI network datasets and the
   proposed method surpasses various existing approaches in terms of
   accuracy measures. This paper identifies the common protein complexes
   that are expressively enriched in gold-standard datasets and also the
   pathway annotations of the detected protein complexes using the KEGG
   database.

   Subject terms: Computational biology and bioinformatics, Data mining

Introduction

   The protein complexes are molecular combinations of proteins
   accumulated by multiple PPI networks, which plays a significant part in
   numerous biological processes. Several proteins are biologically
   functional only when they interact with their neighbour proteins and
   create their protein complex. It is crucial to recognize the sets of
   proteins that form complexes. Thus, numerous computational approaches
   have been developed to detect and predict protein complexes from the
   PPI networks.

   High-throughput approaches have created a huge quantity of protein
   interactions that helps to discover the protein complexes from a large
   PPI network. During the clustering process, the PPI network is
   considered as an undirected graph N[et] = (V[er], E[dg]) where V[er] is
   the set of nodes and E[dg] is the set of edges. The set of nodes
   signifies the proteins and set of edges signifies the interaction
   between proteins.

   To cluster the PPI, the network has been modelled into two types,
   static PPI network that detects the protein functional modules and the
   second is the dynamic PPI network that detects protein complexes. The
   dynamic PPI network is defined as the division of static PPI in a
   series of time-sequenced subnetworks using gene expression data. There
   exists the variance between protein functional module and protein
   complexes. The protein functional module is defined as the cluster of
   proteins which contributes to a specific cellular process and binds
   with each other at various time points, whereas protein complexes are
   defined as the cluster of proteins that interacts with each other at
   the same time point^[28]1.

   Many computational approaches of protein complex detection have been
   focused on static PPI that extract the dense region in PPI networks,
   which concentrates only on the topological structure of PPI. Some of
   the methods that use the static PPI for protein complex detection are
   MCode^[29]2, CFinder^[30]3, MCL^[31]4, COACH^[32]5, ClusterOne^[33]6,
   RNSC^[34]7, CMC^[35]8, and many more. Maulik et al., identified the
   protein complexes using non-cooperative sequential game^[36]9.

   As PPI network continuously transforms with respect to the environment,
   time and various phases of the cell cycle, the clustering analysis on
   static PPI does not emulate these dynamic attributes and it is far from
   optimal solution. Thus, in recent times, various attempts on the
   clustering process of dynamic PPI network has been initiated along with
   the gene expression data to enhance the protein complex detection.
   Also, many evolutionary approaches were employed for analysing the
   clustering process of the PPI network such as ant colony optimization
   ACC-DPC^[37]1, ACO-MCL^[38]10, cuckoo search optimization (CSO)^[39]11,
   BiCAMWI using genetic algorithm^[40]12, Soft Regularized-MCL^[41]13,
   particle swarm optimization (PSO-MCL)^[42]14 and artificial fish school
   algorithm (AFA-MCL)^[43]15. The firefly optimization was employed along
   with Markov Clustering (F-MCL) on the dynamic PPI network for
   predicting complexes. The execution time for F-MCL is higher as all the
   fireflies (proteins) in the population (network) tries to reach the
   optimal solution (cluster). There are few proteins that are not
   eligible to come under the cluster and take more iterations to reach
   the cluster, which may take a long time^[44]16.

   The above-mentioned approaches were effective, but they do not promise
   a global outcome since they suffer from the effect of unwanted clusters
   which leads to time consuming. In order to discard the drawbacks of the
   above-mentioned approaches, a novel approach was proposed to detect the
   dynamic protein complexes through Markov Clustering based on Elephant
   Herd Optimization Approach. One of the most important advantages for
   EHO is that it is the most computationally efficient and has less time
   consuming compared to F-MCL and other approaches. This is because the
   unwanted noisy data (unclustered proteins) will be removed from the
   clan separating operation of EHO approach. The remaining sections of
   this paper is ordered as follows: Section 2 discusses briefly about the
   methodology of the proposed approach. Section 3 illustrates the
   experimental results with various performance measures, Section 4
   deliberates about the implementation and discussion of the proposed
   method in detail and finally Section 5 concludes the paper and
   recommends for the future enrichments.

Methods

   For detecting the protein complexes, initially, the proposed method
   divides a static PPI network into a sequence of subnetworks below
   diverse time points by combining gene expression data to form dynamic
   model. In order to build a dynamic model, the static PPI network is
   integrated with gene expression data, which declare the level of gene
   expression, as well as protein expression. As a protein does not always
   becomes active at a cell cycle, it is assumed that a protein was active
   at the time points with its highest expression level^[45]17. The
   expression level of a protein will be increased before its expression
   and will be decreased once the protein has completed its function, and
   the time points are identified with its expression level, which are
   higher than a threshold.

   Given is a static PPI network P[P] = (P[ver], P[Edg]), where P[ver], is
   a set of proteins and P[Edg], is a set of interactions between these
   proteins. In gene expression data, there is a series of T time stamps
   coming with |P[ver]| × (T * TR) matrix M, where TR is the number of
   repetitions of the time series. Each element M(P[ver], j) of this
   matrix represents the level of gene expression.

   The three-sigma principle is employed to determine if a gene is
   expressed in a single stamp. For each gene P[ver], the gene expression
   is defined as given in the following Eqs ([46]1–[47]5)
   [MATH: <msub><mrow><mi mathvariant="normal">Ev</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mo
   stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub><mo
   stretchy="false">)</mo><mo>=</mo><mfrac><mrow><msubsup><mo>∑</mo><mrow>
   <mi mathvariant="normal">tr</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">TR</mi></mrow></msubsup><mi
   mathvariant="normal">M</mi><mo stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub><mo>,</mo><mi
   mathvariant="normal">i</mi><mo>+</mo><mi
   mathvariant="normal">T</mi><mo>×</mo><mo stretchy="false">(</mo><mi
   mathvariant="normal">tr</mi><mo>−</mo><mn>1</mn><mo
   stretchy="false">)</mo><mo stretchy="false">)</mo></mrow><mrow><mi
   mathvariant="normal">TR</mi></mrow></mfrac> :MATH]
   1
   [MATH: <mi mathvariant="normal">UE</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><mfrac><mrow><msubsup><mo>∑</mo>
   <mrow><mi mathvariant="normal">i</mi><mo>=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">T</mi></msubsup><msub><mrow><mi
   mathvariant="normal">Ev</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mo
   stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub><mo
   stretchy="false">)</mo></mrow><mi mathvariant="normal">T</mi></mfrac>
   :MATH]
   2
   [MATH: <msup><mrow><mi
   mathvariant="normal">σ</mi></mrow><mn>2</mn></msup><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><mfrac><mrow><msubsup><mo>∑</mo>
   <mrow><mi mathvariant="normal">i</mi><mo>=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">T</mi></msubsup><msup><mrow><mo
   stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">Ev</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mo
   stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub><mo
   stretchy="false">)</mo><mo>−</mo><mi mathvariant="normal">UE</mi><mo
   stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub><mo
   stretchy="false">)</mo><mo
   stretchy="false">)</mo></mrow><mn>2</mn></msup></mrow><mi
   mathvariant="normal">T</mi></mfrac> :MATH]
   3
   [MATH: <mi mathvariant="normal">Fl</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><mfrac><mn>1</mn><mrow><mn>1</mn
   ><mo>+</mo><msup><mrow><mi
   mathvariant="normal">σ</mi></mrow><mn>2</mn></msup><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mfrac> :MATH]
   4
   [MATH: <mtable columnalign="left"><mtr><mtd><mi
   mathvariant="normal">AT</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><msub><mrow><mi
   mathvariant="normal">S</mi></mrow><mrow><mn>1</mn></mrow></msub><mrow><
   mo stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>×</mo><mi
   mathvariant="normal">Fl</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>+</mo><msub><mrow><mi
   mathvariant="normal">S</mi></mrow><mrow><mn>2</mn></mrow></msub><mrow><
   mo stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>×</mo><mrow><mo
   stretchy="true">(</mo><mrow><mn>1</mn><mo>−</mo><mi
   mathvariant="normal">fl</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">)</mo></mrow></mtd></mtr><mtr><mtd><mspace
   width="1em"></mspace><mspace width="1em"></mspace><mspace
   width="1em"></mspace><mspace width=".25em"></mspace><mspace
   width=".25em"></mspace><mspace width=".25em"></mspace><mo>=</mo><mspace
   width=".25em"></mspace><mi mathvariant="normal">UE</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>+</mo><mn>3</mn><mi
   mathvariant="normal">σ</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mspace width=".25em"></mspace><mrow><mo
   stretchy="true">(</mo><mrow><mn>1</mn><mo>−</mo><mi
   mathvariant="normal">fl</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">ver</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">)</mo></mrow></mtd></mtr></mtable> :MATH]
   5

   where Ev[i](P[ver]) is the mean of the expression value of gene P[ver]
   at timestamp i, UE(P[ver]) is the mean of its expression values over
   times ranging from 1 to T, σ(P[ver]) is the standard deviation of its
   expression values, Fl(P[ver]) is used to show fluctuation of the
   expression curve of gene P[ver]. Suppose that the gene expression data
   is governed by a normal distribution, then S[1](P[ver]) and
   S[2](P[ver]) are the associated mean and three-sigma value, that is
   S[1](P[ver]) = UE(P[ver]) and S[2](P[ver]) = UE(P[ver]) + 3σ(P[ver]).
   In virtue of three-sigma principle, the probability that a value
   greater than S[2](P[ver]) is not an active point is less than 0.1%.
   AT(P[ver]) is the active threshold of gene P[ver]. Consider the gene
   (P[ver]) at timestamp i. If Ev[i](P[ver]) > AT(P[ver]), then the gene
   P[ver] is expressed and the gene product exists^[48]16.

   In the clustering procedure of every subnetwork, the proposed method
   starts with constructing the initial protein clusters depending on the
   protein complexes attained at the prior time point. The initial
   clusters constructed in the first generation have three steps The
   procedure for constructing initial clusters has three steps: seed node
   selection, attachment nodes addition and finally refining^[49]1. To
   clearly demonstrate the three steps, a subnetwork of time point t with
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>p</mi></mrow><mrow><mi>t</mi>
   </mrow></msubsup> :MATH]
    = (
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>v</mi><mi>e</mi><mi>r</mi></m
   row><mrow><mi>t</mi></mrow></msubsup> :MATH]
   ,
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>E</mi><mi>d</mi><mi>g</mi></m
   row><mrow><mi>t</mi></mrow></msubsup> :MATH]
   ), where
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>v</mi><mi>e</mi><mi>r</mi></m
   row><mrow><mi>t</mi></mrow></msubsup> :MATH]
   , is a set of proteins and
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>E</mi><mi>d</mi><mi>g</mi></m
   row><mrow><mi>t</mi></mrow></msubsup> :MATH]
   , is a set of interactions between these proteins at the time t.
    1. Selecting seed nodes: This step first computes the clustering
       coefficient of every node. Then it selects the nodes whose
       clustering coefficients are greater than a given threshold λ[c] as
       seed nodes, and puts them into the set of seed nodes at the current
       time point t, denoted by S^t. The seed nodes are considered as the
       candidate clustering centers and represent different clusters of
       protein complexes. The clustering coefficient of any node i is
       defined in Eq. ([50]6):

   [MATH: <mi
   mathvariant="normal">Ψ</mi><mo>=</mo><mfrac><mrow><mn>2</mn><mo>×</mo><
   msubsup><mrow><mi>n</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t</mi><
   /mrow></msubsup></mrow><mrow><mo
   stretchy="false">|</mo><mi>N</mi><mi>e</mi><mi>i</mi><mi>g</mi><mi>h</m
   i><mo stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo><mo
   stretchy="false">|</mo><mo>×</mo><mo stretchy="false">(</mo><mo
   stretchy="false">|</mo><mi>N</mi><mi>e</mi><mi>i</mi><mi>g</mi><mi>h</m
   i><mo stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo><mo
   stretchy="false">|</mo><mo>−</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></mfrac> :MATH]
       6
       where Neigh(i) = {j є
       [MATH:
       <msubsup><mrow><mi>P</mi></mrow><mrow><mi>v</mi><mi>e</mi><mi>r</mi
       ></mrow><mrow><mi>t</mi></mrow></msubsup> :MATH]
       |(i · j) є 
       [MATH:
       <msubsup><mrow><mi>P</mi></mrow><mrow><mi>E</mi><mi>d</mi><mi>g</mi
       ></mrow><mrow><mi>t</mi></mrow></msubsup> :MATH]
       } represents the neighbor nodes of node i, and |Neigh(i)| is the
       number of neighbor nodes of node i,
       [MATH:
       <msubsup><mrow><mi>n</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t<
       /mi></mrow></msubsup> :MATH]
       is the number of links between neighbour nodes of i at the time
       point t.
    2. Attachment nodes addition: For any seed node i (i є 
       [MATH: <msup><mrow><mi>S</mi></mrow><mi>t</mi></msup> :MATH]
       ) of current time point t, if it is also the seed node of previous
       time point (t − 1), then the nodes which are in the cluster i at
       the previous time point (t − 1) and also exists in the subnetwork
       [MATH:
       <msubsup><mrow><mi>P</mi></mrow><mrow><mi>p</mi></mrow><mrow><mi>t<
       /mi></mrow></msubsup> :MATH]
       at the current time point t are put into the cluster i of current
       time point t. In this way, initial clusters are built. However,
       some clusters may be too sparse since that not all proteins of
       previous time point (t − 1) exist at the current time point t.
       Thus, a refining step is needed to be carried out on the initial
       clusters.
    3. Refining: For any initial cluster of protein complex
       [MATH:
       <msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t<
       /mi></mrow></msubsup> :MATH]
       at the current time point t, if its density is smaller than a given
       threshold λ[d] all the nodes in
       [MATH:
       <msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t<
       /mi></mrow></msubsup> :MATH]
       are sorted in a descending order according to their clustering
       coefficients, and the node with the smallest clustering coefficient
       is iteratively removed until the density of cluster
       [MATH:
       <msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t<
       /mi></mrow></msubsup><mi mathvariant="normal">i</mi> :MATH]
       is not smaller than the given threshold λ[d]. The density of a
       protein complex
       [MATH:
       <msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t<
       /mi></mrow></msubsup> :MATH]
       is computed by Eq. ([51]7):

   [MATH: <mi>d</mi><mi>e</mi><mi>n</mi><mo
   stretchy="false">(</mo><msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi>
   </mrow><mrow><mi>t</mi></mrow></msubsup><mo
   stretchy="false">)</mo><mfrac><mrow><mn>2</mn><mo>×</mo><msub><mrow><mi
   >l</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><m
   i>n</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>×</mo><mo
   stretchy="false">(</mo><msub><mrow><mi>n</mi></mrow><mrow><mi>i</mi></m
   row></msub><mo>−</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></mfrac> :MATH]
   7

   where n[i] and l[i] are number of nodes and edges in cluster
   [MATH:
   <msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>t</mi>
   </mrow></msubsup> :MATH]
   respectively^[52]1.

   Now, the clustering analysis of the remaining generations is employed
   by utilizing the Markov Clustering technique along with the EHO
   algorithm on every subnetwork. The matrix is constructed that depicts
   the probabilities of transition of a Markov Chain (random walk) based
   on the graph. The MCL procedure comprises of two activities such as
   expansion and inflation, which was applied to the matrix that was
   constructed. The construction of matrix M[at] for a graph description
   and the process of Markov clustering method is briefly
   described^[53]18.

   Let P[p] = (P[ver], P[Edg]), where P[ver], is a set of proteins and
   P[Edg], is a set of interactions between these proteins. Denote a node
   in P[ver] by p[vi] and an edge between p[vi] and p[vj] in P[Edg] by
   (p[vi], p[vj]), in which i and j are the indexes of the corresponding
   nodes^[54]16. W(p[vi], p[vj]) is the weight of edge (p[vi], p[vj]),
   which represents the confidence level of the interaction in a weighted
   PPI networks. Adj is the adjacency matrix of a weighted graph given as
   Eq. ([55]8),
   [MATH: <mi mathvariant="normal">Adj</mi><mrow><mo
   stretchy="true">(</mo><mrow><mi
   mathvariant="normal">i</mi><mo>,</mo><mi
   mathvariant="normal">j</mi></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><mrow><mo
   stretchy="true">{</mo><mrow><mtable columnalign="left"><mtr
   columnalign="left"><mtd columnalign="left"><mrow><mi
   mathvariant="normal">W</mi><mo stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vi</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vj</mi></mrow></msub><mo
   stretchy="false">)</mo></mrow></mtd><mtd columnalign="left"><mrow><mi
   mathvariant="normal">if</mi><mspace width=".25em"></mspace><mo
   stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vi</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vj</mi></mrow></msub><mo
   stretchy="false">)</mo><mo>∈</mo><msub><mrow><mi
   mathvariant="normal">P</mi></mrow><mrow><mi
   mathvariant="normal">Edg</mi></mrow></msub></mrow></mtd></mtr><mtr
   columnalign="left"><mtd columnalign="left"><mrow><msub><mrow><mi
   mathvariant="normal">max</mi></mrow><mrow><mi
   mathvariant="normal">x</mi><mo>≠</mo><mi
   mathvariant="normal">j</mi></mrow></msub><mspace
   width=".25em"></mspace><mi mathvariant="normal">W</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vi</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vj</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mtd><mtd
   columnalign="left"><mrow><mi mathvariant="normal">if</mi><mspace
   width=".25em"></mspace><mo stretchy="false">(</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vi</mi></mrow></msub><mo>=</mo><msub><mrow><mi
   mathvariant="normal">p</mi></mrow><mrow><mi
   mathvariant="normal">vj</mi></mrow></msub><mo
   stretchy="false">)</mo></mrow></mtd></mtr><mtr columnalign="left"><mtd
   columnalign="left"><mn>0</mn></mtd><mtd columnalign="left"><mrow><mi
   mathvariant="normal">else</mi></mrow></mtd></mtr></mtable></mrow></mrow
   > :MATH]
   8

   A canonical flow matrix M[at] is an n × n (n = |P[ver]|) matrix that
   shows the probabilities of transition of a random walk defined on the
   graph. M[at](i, j) represents the probability of a transition from node
   p[vi] to p[vj]. The transition probability from p[vi] to p[vj] is
   referred to as the stochastic flow from p[vi] to p[vj]. All the
   elements in each column of M[at] will sum up to 1 and the matrix is
   expressed as given in Eq. ([56]9)
   [MATH: <msub><mrow><mi mathvariant="normal">M</mi></mrow><mrow><mi
   mathvariant="normal">at</mi></mrow></msub><mrow><mo
   stretchy="true">(</mo><mrow><mi
   mathvariant="normal">i</mi><mo>,</mo><mi
   mathvariant="normal">j</mi></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><mfrac><mrow><mi
   mathvariant="normal">Adj</mi><mrow><mo stretchy="true">(</mo><mrow><mi
   mathvariant="normal">i</mi><mo>,</mo><mi
   mathvariant="normal">j</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><msubsup><mo>∑</mo><mrow><mi
   mathvariant="normal">k</mi><mo>=</mo><mn>1</mn></mrow><mi
   mathvariant="normal">n</mi></msubsup><mi
   mathvariant="normal">Adj</mi><mrow><mo stretchy="true">(</mo><mrow><mi
   mathvariant="normal">k</mi><mo>,</mo><mi
   mathvariant="normal">j</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mfrac> :MATH]
   9

   The three crucial parameters of MCL are inflation constant (ic),
   balance (b) and penalty proportion (P[p]), where ic defines the size of
   each cluster, b defines the user-specific balance constant that is
   employed for penalizing higher-propensity neighbours and P[p] defines
   the penalty ratio of the protein nodes, which is also
   user-specified^[57]16. The clustering process using EHO algorithm is
   briefly explained here for clustering protein complexes. The overall
   flowchart of the proposed method is shown in Fig. [58]1.

Figure 1.

   [59]Figure 1
   [60]Open in a new tab

   Overall Flowchart of the proposed method.

Elephant herd optimization

   One of the contemporary swarm intelligence technique is the elephant
   herd optimization which was projected in 2016^[61]19. This algorithm
   was stimulated by the herding characteristics of elephants. In general,
   elephants are social mammals with the composite social group comprising
   of numerous clans under the guidance of a matriarch. A clan comprises
   of one or more female elephant with their calves. Female desires to
   live in domestic clusters while male elephants prefer to live alone and
   they will exit from the clan when they grow with each
   generation^[62]20. The characteristics of the clans signifies
   exploitation and leaving elephants signifies the exploration of the
   population.

   The characteristics of an elephant are measured using two main
   operators, namely clan updating and clan separating operators that are
   used for producing better clustering of proteins. Here, the elephant
   population is referred to as the static PPI network, each clan is
   referred to as the dynamic PPI subnetwork, and the elephants inside
   each clan is represented as proteins.

Clan updating operator

   The static PPI is initially separated into k dynamic PPI. Each dynamic
   PPI is headed by the individual protein, which represents the best
   solution of the dynamic PPI. In each generation, protein e of dynamic
   PPI cl[i] moves towards the
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>b</mi><mi>e</mi><mi>s</mi><mi>t<
   /mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></
   mrow></msub></mrow></msub> :MATH]
   which has the best fitness in dynamic PPI cl[i]. The fitness of the
   dynamic PPI is computed by employing the accuracy values of the protein
   complex. For new protein e in dynamic PPI cl[i], the position is
   updated by following Eq. ([63]10).
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>n</mi><mi>e</mi><mi>w</mi><mo>,<
   /mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow></msu
   b><mo>,</mo><mi>e</mi></mrow></msub><mo>=</mo><msub><mrow><mi>p</mi></m
   row><mrow><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><mo>,</mo><mi>e</mi></mrow></msub><mo>+</mo><mi>α</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>b</m
   i><mi>e</mi><mi>s</mi><mi>t</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</
   mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></msub><mo>−</mo><msub><
   mrow><mi>p</mi></mrow><mrow><mi>c</mi><msub><mrow><mi>l</mi></mrow><mro
   w><mi>i</mi></mrow></msub><mo>,</mo><mi>e</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>×</mo><mi>r</mi><mi>a</mi><mi>n</mi><m
   i>d</mi> :MATH]
   10

   where
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>n</mi><mi>e</mi><mi>w</mi><mo>,<
   /mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow></msu
   b><mo>,</mo><mi>e</mi></mrow></msub> :MATH]
   is the new position of protein e in dynamic PPI cl[i] and
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><msub><mrow><mi>l</mi></mr
   ow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi>e</mi></mrow></msub>
   :MATH]
   denotes the position in previous generation.
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>b</mi><mi>e</mi><mi>s</mi><mi>t<
   /mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></
   mrow></msub></mrow></msub> :MATH]
   signifies dynamic PPI cl[i] which has the best fitness, α is the scale
   factor that determines the influence of best fitness and rand is the
   random variable employed to enhance the diversity of the populations
   and defined in the range (0, 1).

   The movement of a protein e for best fitness can be updated using Eq.
   ([64]11).
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>b</mi><mi>e</mi><mi>s</mi><mi>t<
   /mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><mi>e</mi></mrow></msub><mo>=</mo><mi>β</mi><mo>×
   </mo><msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><mi>e</mi><mi>n</mi><
   mi>t</mi><mi>e</mi><mi>r</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi>
   </mrow><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   11

   where β belongs to (0, 1) which is a scale to regulate the effect of
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><mi>e</mi><mi>n</mi><mi>t<
   /mi><mi>e</mi><mi>r</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mro
   w><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   on
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>b</mi><mi>e</mi><mi>s</mi><mi>t<
   /mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><mi>e</mi></mrow></msub> :MATH]
   .
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><mi>e</mi><mi>n</mi><mi>t<
   /mi><mi>e</mi><mi>r</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mro
   w><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   is the centre of dynamic PPI cl[i] and for the di^th dimension it can
   be computed using the Eq. ([65]12).
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><mi>e</mi><mi>n</mi><mi>t<
   /mi><mi>e</mi><mi>r</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mro
   w><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi>d</mi></mrow></msub><mo>=
   </mo><mfrac><mn>1</mn><mrow><msub><mrow><mi>n</mi></mrow><mrow><mi>c</m
   i><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></m
   sub></mrow></mfrac><mo>×</mo><munderover><mo>∑</mo><mrow><mi>e</mi><mo>
   =</mo><mn>1</mn></mrow><mrow><msub><mrow><mi>n</mi></mrow><mrow><mi>c</
   mi><msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow></
   msub></mrow></munderover><msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><
   msub><mrow><mi>l</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi
   >e</mi><mo>,</mo><mi>d</mi></mrow></msub> :MATH]
   12

   where 1 ≤ di ≤ D, denotes the di^th dimension and D is its total
   dimension.
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>c</mi><msub><mrow><mi>l</mi></mr
   ow><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   is the number of proteins in dynamic PPI cl[i],
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><msub><mrow><mi>l</mi></mr
   ow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi>e</mi><mo>,</mo><mi>d</m
   i></mrow></msub> :MATH]
   is the di^th dimension of the protein in
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><msub><mrow><mi>l</mi></mr
   ow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi>e</mi></mrow></msub>
   :MATH]
   . The centre of the dynamic PPI cl[i] is computed through DI
   computations using Eq. ([66]12). The pseudocode for the dynamic PPI
   updating operator is depicted in Algorithm 1.

Algorithm 1.

   Algorithm 1
   [67]Open in a new tab

   Pseudocode for Clan Updating Operator.

Clan separating operator

   To enhance the search capacity of the proposed method, the unclustered
   proteins and clusters with the lowest fitness will exit in every
   generation as given in Eq ([68]13)^[69]19.
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>w</mi><mi>o</mi><mi>r</mi><mi>s<
   /mi><mi>t</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></msub><mo>=</mo><msub><mrow><mi>p</mi></m
   row><mrow><mi>m</mi><mi>i</mi><mi>n</mi></mrow></msub><mo>+</mo><mrow><
   mo
   stretchy="true">(</mo><mrow><msub><mrow><mi>p</mi></mrow><mrow><mi>m</m
   i><mi>a</mi><mi>x</mi></mrow></msub><mo>−</mo><msub><mrow><mi>p</mi></m
   row><mrow><mi>m</mi><mi>i</mi><mi>n</mi></mrow></msub><mo>+</mo><mn>1</
   mn></mrow><mo
   stretchy="true">)</mo></mrow><mo>×</mo><mi>r</mi><mi>a</mi><mi>n</mi><m
   i>d</mi> :MATH]
   13

   where p[max] and p[min] are the upper and lower bound of the single
   protein.
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>w</mi><mi>o</mi><mi>r</mi><mi>s<
   /mi><mi>t</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></msub> :MATH]
   is the protein or complex with the lowest fitness. The rand is the
   random variable that has stochastic and uniform distribution in the
   range (0, 1). The pseudocode for the clan separating operator is given
   in Algorithm 2.

Algorithm 2.

   Algorithm 2
   [70]Open in a new tab

   Pseudocode for Clan Separating Operator.

   Depending on the clan updating and separating operator, the module of
   the proposed algorithm is framed as given in Algorithm 3.

Algorithm 3.

   [71]Algorithm 3
   [72]Open in a new tab

   Pseudocode for the Proposed Method.

   The relationship between the DMCL-EHO and the protein complex is given
   in the Table [73]1.

Table 1.

   The association between the components of DMCL-EHO and the protein
   complex
          DMCL-EHO                        Protein Complex
   Elephant                The temporary proteins in dynamic subnetwork.
   Population              Static PPI Network
   Clan                    Dynamic PPI Network
   Fitness of Clan         Clustering result of the proposed method
   Fittest Clan            Best result of the proposed method
   Position of an elephant Value of Parameters
   [74]Open in a new tab

Experimental Results

Datasets

   In this experiment, the datasets which consists of interactions for
   both Saccharomyces cerevisiae and Homo Sapiens are DIP^[75]21,
   BioGRID^[76]22 and STRING^[77]23. The benchmark PPI datasets employed
   only for Saccharomyces cerevisiae are Gavin2 and Gavin6^[78]24,
   Krogan-core and Krogan-extended^[79]25, Collins^[80]26, and
   WI-PHI^[81]27. The Gavin + Krogan dataset was generated by merging
   Gavin and Krogan Core datasets. The PPI datasets employed only for Homo
   Sapiens are HPRD^[82]28, HPID^[83]29 and PIPs^[84]30. Table [85]2 shows
   the list of datasets used in this experiment.

Table 2.

   List of datasets and gold standard benchmark databases.
   S. No Saccharomyces cerevisiae Homo Sapiens Saccharomyces cerevisiae &
   Homo Sapiens
   Dataset No of Proteins No of Interactions Dataset No of Proteins No of
   Interactions Dataset ORGANISM No of Proteins No of Interactions
   1 Gavin2 1430 6531 HPRD 10080 39209 DIP Yeast 5221 24918
   Human 5048 9141
   2 Gavin6 1855 7669 HPID 27049 16390 BioGRID Yeast 7161 53791
   Human 23373 365293
   3 Krogan-Core 2708 7123 PIPs 32179 14979 STRING Yeast 6691 184596
   Human 19566 1258291
   4 Krogan-Extended 3581 14076 — — — — — — —
   5 Collins 1622 9074 — — — — — — —
   6 Gavin + Krogan 2964 13507 — — — — — — —
   7 WI-PHI 5955 50000 — — — — — — —
   — — — — — — —
   Gold Standard Databases
   S. No Standard Database Number of Proteins Number of Interactions
   Number of Complexes
   1 CYC2008 1627 408 408
   2 MIPS 1189 11119 203
   3 SGD 1279 19854 323
   4 PCDq 9268 32198 1264
   [86]Open in a new tab

   The gene expression data used in this study for Saccharomyces
   cerevisiae ([87]GSE3431)^[88]31 and Homo Sapiens ([89]GSE3933)^[90]32
   are taken from the GEO database.

   The predicted complexes are compared to gold standard benchmark
   databases such as CYC2008^[91]33, MIPS^[92]34, SGD^[93]35 for
   Saccharomyces cerevisiae organism and PCDq^[94]36 benchmark dataset for
   Homo sapiens organism. The percentage of overlapping interactions among
   the datasets in Gavin2 is 32%, Gavin6 is 53%, Krogan-core is 46%,
   Collins is 56%, HPRD is 23%, PIPs is 57%, DIP is 2%, BioGRID is 55% and
   STRING is 47%^[95]37,[96]38.

Performance measures

   To evaluate and compare the clustering results of predicted protein
   complexes, the generated complexes were compared and matched with the
   gold standard benchmark protein complexes. Assume P[r](V[Pr],E[Pr]) and
   B[e](V[Be],E[Be]) be the set of vertices (proteins) and edges
   (interaction) of a predicted protein complex and benchmark protein
   complexes.

Complex similarity score (CSS)

   CSS is defined as the closeness of two protein complexes namely
   predicted (P[r]) and benchmark (B[e]) protein complexes and they are
   computed based on Eq. [97]14.
   [MATH: <mi>C</mi><mi>S</mi><mi>S</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi>P</mi></mrow><mrow><mi>r</m
   i></mrow></msub><mo>,</mo><msub><mrow><mi>B</mi></mrow><mrow><mi>e</mi>
   </mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mo>=</mo><mfrac><mrow><mo
   stretchy="false">|</mo><msub><mrow><mi>V</mi></mrow><mrow><mi>P</mi><mi
   >r</mi></mrow></msub><mspace width=".25em"></mspace><mspace
   width=".25em"></mspace><msub><mrow><mi>V</mi></mrow><mrow><mi>B</mi><mi
   >e</mi></mrow></msub><msup><mrow><mo
   stretchy="false">|</mo></mrow><mn>2</mn></msup></mrow><mrow><mrow><mo
   stretchy="true">|</mo><mrow><msub><mrow><mi>V</mi></mrow><mrow><mi>P</m
   i><mi>r</mi></mrow></msub></mrow><mo
   stretchy="true">|</mo></mrow><mo>∗</mo><mo
   stretchy="false">|</mo><msub><mrow><mi>V</mi></mrow><mrow><mi>B</mi><mi
   >e</mi></mrow></msub><mo stretchy="false">|</mo></mrow></mfrac> :MATH]
   14

   where V[pr] and V[Be] denotes the set of proteins in predicted and
   benchmark protein complexes. If CSS(P[r], B[e]) is equal to 0, it
   denotes that the predicted and benchmark protein complexes do not have
   any common protein complexes. On the contradictory, if CSS(P[r], B[e])
   is equal to 1, then the predicted complex P[r](V[Pr], E[Pr]) has the
   same equal nodes as the benchmark complex B[e](V[Be], E[Be]). Here, if
   CSS(P[r], B[e]) > 0.2, it is considered as the predicted and benchmark
   protein complexes are identical^[98]39.

   Now, to assess the performance of predicted protein clusters, four
   commonly employed measures are utilized such as Precision, Recall,
   F-Measure, Coverage Ratio and Accuracy.

Precision

   Precision is defined as the accuracy of predicted protein complexes
   that are identical to the benchmark protein complexes. If the precision
   value is high, it indicates that the predicted complexes are likely to
   be true positive. The precision of the protein complexes is computed
   based on Eq. ([99]15).
   [MATH:
   <mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi>i</mi><mi>s</mi><mi>i</mi><
   mi>o</mi><mi>n</mi><mo>=</mo><mfrac><mrow><msub><mrow><mi>N</mi></mrow>
   <mrow><mi>P</mi><mi>c</mi></mrow></msub></mrow><mrow><mo
   stretchy="false">|</mo><mi>P</mi><mi>r</mi><mi>e</mi><mi>d</mi><mi>i</m
   i><mi>c</mi><mi>t</mi><mi>e</mi><msub><mrow><mi>d</mi></mrow><mrow><mi>
   s</mi><mi>e</mi><mi>t</mi></mrow></msub><mo
   stretchy="false">|</mo></mrow></mfrac> :MATH]
   15

Recall

   Recall is defined as the accuracy of benchmark protein complexes that
   are identical to the predicted complexes. If the recall value is high,
   it indicates that the predicted complex has a good number of coverage
   of the proteins in the gold standard complexes. The recall of the
   protein complexes is computed based on Eq. ([100]16).
   [MATH:
   <mi>R</mi><mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi><mo>=</mo><
   mfrac><mrow><msub><mrow><mi>N</mi></mrow><mrow><mi>B</mi><mi>c</mi></mr
   ow></msub></mrow><mrow><mo
   stretchy="false">|</mo><mi>K</mi><mi>n</mi><mi>o</mi><mi>w</mi><msub><m
   row><mi>n</mi></mrow><mrow><mi>s</mi><mi>e</mi><mi>t</mi></mrow></msub>
   <mo stretchy="false">|</mo></mrow></mfrac> :MATH]
   16

   where N[Pc] is denoted as the number of predicted complexes which match
   at least one recognized benchmark complex, N[Bc] is denoted as the
   number of recognised benchmark complexes which match at least one
   predicted complex, Predicted[set] is denoted as the set of complexes
   predicted by the proposed approach and Known[set] is denoted as the set
   of recognised gold standard benchmark protein complexes.

Coverage ratio (CR)

   CR is defined as the fraction of proteins in benchmark complex V[Be]
   found in predicted complex V[pr] and they are computed based on Eq.
   ([101]17).
   [MATH:
   <mi>C</mi><mi>R</mi><mo>=</mo><mfrac><mrow><munder><mo>∑</mo><mi>i</mi>
   </munder><mrow><mi mathvariant="italic">max</mi><mspace
   width=".25em"></mspace><msub><mrow><mi>T</mi></mrow><mrow><mi>i</mi><mo
   >,</mo><mi>j</mi></mrow></msub></mrow></mrow><mrow><msub><mo>∑</mo><mi>
   i</mi></msub><mo
   stretchy="false">|</mo><msub><mrow><mi>V</mi></mrow><mrow><mi>B</mi><mi
   >e</mi></mrow></msub><mo stretchy="false">|</mo></mrow></mfrac> :MATH]
   17

   where V[Be] is denoted as the set of proteins in benchmark protein
   complexes. T[i, j] is denoted as the common number of proteins between
   V[pr] and V[Be].

F-Measure

   F-Measure is defined as the harmonic mean, i.e., a rational mixture of
   both precision and recall and it is computed based on Eq. ([102]18).
   [MATH:
   <mi>F</mi><mo>−</mo><mi>M</mi><mi>e</mi><mi>a</mi><mi>s</mi><mi>u</mi><
   mi>r</mi><mi>e</mi><mo>=</mo><mfrac><mrow><mn>2</mn><mspace
   width=".25em"></mspace><mrow><mo
   stretchy="true">(</mo><mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi
   >i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo>∗</mo><mi>R</mi><mi>
   e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mo
   stretchy="true">(</mo><mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi><mi
   >i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo>+</mo><mi>R</mi><mi>
   e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow><mo
   stretchy="true">)</mo></mrow></mfrac> :MATH]
   18

Accuracy

   Accuracy is defined as the geometrical mean i.e the trade-off between
   precision and recall and it is computed based on Eq. ([103]19).
   [MATH:
   <mi>A</mi><mi>c</mi><mi>c</mi><mi>u</mi><mi>r</mi><mi>a</mi><mi>c</mi><
   mi>y</mi><mo>=</mo><msqrt><mrow><mi>P</mi><mi>r</mi><mi>e</mi><mi>c</mi
   ><mi>i</mi><mi>s</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo>∗</mo><mi>R</mi>
   <mi>e</mi><mi>c</mi><mi>a</mi><mi>l</mi><mi>l</mi></mrow></msqrt>
   :MATH]
   19

Number of Clusters

   The number of clusters is defined as the total quantity of clusters
   formed from the PPI network after the clustering process has been
   completed.

   The performance measures such as coverage ratio, the number of
   clusters, precision, recall, f-measure and accuracy of the proposed
   method for Saccharomyces cerevisiae are compared with various datasets
   and existing algorithms against CYC2008 benchmark database and the
   graphical representation of the comparison is depicted in
   Figs [104]2–[105]7. Also, the performance measures such as coverage
   ratio, the number of clusters, precision, recall, f-measure and
   accuracy of the proposed method for Homo sapiens are compared with
   various datasets and existing algorithms against PCDq benchmark
   database and the graphical representation of the comparison is depicted
   in Figs [106]8 and [107]9. The comparison of performance measures for
   the proposed method with various datasets and existing algorithms
   against the MIPS and SGD benchmark database for Saccharomyces
   cerevisiae is given in supplementary material.

Figure 5.

   [108]Figure 5
   [109]Open in a new tab

   Comparison of Recall with various Datasets and Algorithms against
   CYC2008 Benchmark Dataset.

Figure 2.

   [110]Figure 2
   [111]Open in a new tab

   Comparison of Number of Clusters with various Datasets and Algorithms
   against CYC2008 Benchmark Dataset,

Figure 7.

   [112]Figure 7
   [113]Open in a new tab

   Comparison of Accuracy with various Datasets and Algorithms against
   CYC2008 Benchmark Dataset.

Figure 8.

   Figure 8
   [114]Open in a new tab

   Comparison of Number of Clusters and Coverage Ratio with HPRD Dataset
   and Algorithms against PCDq Benchmark Dataset.

Figure 9.

   [115]Figure 9
   [116]Open in a new tab

   Comparison of Precision, Recall, F-Measure and Accuracy with HPRD
   Dataset and Algorithms against PCDq Benchmark Dataset.

   From Figs [117]2 and [118]8, it is inferred that the number of clusters
   in the proposed method is less when compared to FOCA, AFA-MCL and
   ACO-MCL as they try to get solution from all the proteins in the
   network. These methods will not discard the undesirable proteins which
   may result in false positives. But in the proposed method, the clusters
   which has less than three proteins are discarded. Hence the precision,
   recall, F-Measure and accuracy are high for the proposed method.

   From Figs [119]3 and [120]8, it is observed that the proposed method
   has more coverage ratio than the existing methods since it employs the
   iterated clustering approach. This enhances the coverage of proteins in
   the network as the proteins in the benchmark complexes are highly found
   in the predicted complexes. From Figs [121]4–[122]7 and [123]9 it is
   observed that the precision, recall, F-Measure and accuracy shows
   fluctuations for PSO-MCL, ACO-MCL, AFA-MCL, F-MCl, FOCA and EHO-MCL.
   The mean of these measures for all the datasets shows that the proposed
   method performs better than the existing methods because it has
   employed the dynamic PPI along with EHO.

Figure 3.

   [124]Figure 3
   [125]Open in a new tab

   Comparison of Coverage Ratio with various Datasets and Algorithms
   against CYC2008 Benchmark Dataset.

Figure 4.

   [126]Figure 4
   [127]Open in a new tab

   Comparison of Precision with various Datasets and Algorithms against
   CYC2008 Benchmark Dataset.

Implementation and Discussion

   The computational issue of attaining a solution with a high accuracy
   solution for protein complex detection from dynamic PPI is still a
   challenging task. In this paper, the elephant herd optimization
   algorithm along with Markov clustering technique is combined to solve
   the protein complex detection problem. The proposed method provides an
   enhancement of the results compared to all the other popular existing
   methods. This work was executed on 2.00 GHz Intel i3 with 8GB of memory
   running on Windows 10.

   The number of clusters is small in an average when compared to other
   existing methods, due to the deletion of proteins without interactions.
   Here, the minimum number of proteins inside a cluster should be three
   or more and that are considered as a protein complex. The protein
   cluster with less than three proteins are removed. The proposed method
   was evaluated based on the removal of noise, insertion and deletion of
   random protein interactions, large PPI network, namely WI-PHI, various
   parameter analysis, statistical significance and finally with
   biological significance.

Evaluation by noise removal

   The PPI networks are obtained from high-throughput experiments, the
   large coverage of the PPI network comprises of noise in the format of
   false positive interactions and redundant data. The main challenge of
   clustering these PPI networks is present in the PPI networks itself. In
   this method, after the clustering process is accomplished, the proteins
   that do not present in any of the clusters is also considered as a
   noise. These solitary proteins that do not interact with any other
   proteins will not provide any valuable information. The minimum number
   of proteins inside the cluster is set to be three in this work. Thus,
   the isolated proteins and clusters with below three proteins are
   considered as a noise and they are removed by the clan separating
   operator by the elephant herd optimization method. Many evolutionary
   approaches are inheriting the undesirable proteins from one generation
   to another which may lead to loss of accuracy, but EHO approach will
   discard the undesirable proteins from the population in the clan
   separating operator that leads to the optimal solution. The comparison
   of EHO with other existing methods is depicted in Figs [128]2–[129]9.

Evaluation by adding and removing random protein interactions

   The testing of the proposed method is accomplished by inserting and
   deleting the random interactions of the PPI network to evaluate its
   performance. The noise can also be any missing information (false
   negatives) or added noise (false positives) in the PPI network. The DIP
   dataset is used for evaluation of adding and removing random
   interactions. The missing information of PPI network is processed by
   removing the proportion of edges randomly (0%, 20%, 40%, 60%, 80%) and
   the false positive information of PPI network is processed by adding
   the proportion of edges randomly (0%, 20%, 40%, 60%, 80%, 100%). The
   performance of the proposed method by adding and removing the random
   interactions are depicted in Figs [130]10 and [131]11.

Figure 10.

   Figure 10
   [132]Open in a new tab

   Comparison of Accuracy with Random Deletion of Protein Interactions on
   DIP Dataset against CYC2008 benchmark database.

Figure 11.

   Figure 11
   [133]Open in a new tab

   Comparison of Accuracy with Random Insertion of Protein Interactions on
   DIP Dataset against CYC2008 benchmark database.

   From the Figs [134]10 and [135]11, it is observed that even though the
   random insertions and deletion of the protein interactions are employed
   on the dataset, the proposed method performs better than other existing
   approaches.

Evaluation by large PPI network WI-PHI dataset

   In addition to analyse the performance of the proposed method on the
   large PPI dataset, WI-PHI^[136]27 dataset of Saccharomyces cerevisiae
   was employed which comprises of 5955 proteins and 50,000 protein
   interactions. The proposed method and also the existing methods were
   executed on this large dataset and compared the predicted clusters with
   the various gold standard benchmark databases. The comparison of the
   existing and proposed method on WI-PHI dataset is depicted in
   Figs [137]2–[138]7.

Evaluation by parameter analysis

   Generally, every metaheuristic approach is based on certain stochastic
   dissemination. Hence, diverse runs will produce various diverse
   results. This work implements 500 independent runs in order to score
   optimal solution. In general, 20 numbers of clans were employed as per
   literature. The execution process will be terminated, if the best
   result generated in each iteration remains interchangeable for 100
   successive iterations or the maximum number of generations is attained.
   The assignment of parameter values was adjusted based on the
   experimental results. It was identified that the parameters of the
   proposed method that has values of α = 0.5 and β = 0.1 produced better
   solution among different values and hence were allocated. It was
   observed that the optimal solution was identified after 315^th
   generation. For all the performance measures, there were fluctuations
   during the first 10 runs of the experiment and in the future runs
   reliability was observed. Figures [139]2–[140]9 shows the average
   outcome of performance measures for the above parameter values of the
   proposed method. Table [141]3 shows the various parameter values for
   the proposed approach and the other existing approaches of protein
   complex detection.

Table 3.

   Various Parameter Values of proposed and existing methods for protein
   complex detection.
   Parameters Variable MCL SR-MCL CSO PSO-MCL ACO-MCL AFA-MCL F-MCL FOCA
   EHO-MCL
   Inflation constant ic 2 2 automatic automatic automatic automatic
   automatic automatic Automatic
   Lowest ic Lic 1 1 1 1 1 1 1
   Highest ic Hic 6 6 6 6 6 6 6
   Balance B 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
   Penalty proportion P[p] 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25 1.25
   Number of population K 10 10 10 10 10 10 10 10
   Maximum geneartion mGen 5 5 5 5 5 5 5 5
   Cognitive and Social acceleration coefficient C1 and c2 2
   Maximum velocity Max[V] 0.5
   evaporation rate р 0.1
   Heuristic information H 1.2
   pseudo random proportion selection rule q0 0.9
   Visual range Vis 0.9
   Step length S 0.05 0.05 0.05
   Light absorption coefficient λ 1.0
   Maximum attractiveness Ma 1.0
   Scale regulates
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><msub><mrow><mi>l</mi></mr
   ow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi>e</mi></mrow></msub>
   :MATH]
   α 0.5
   Scale regulates
   [MATH:
   <msub><mrow><mi>p</mi></mrow><mrow><mi>c</mi><mi>e</mi><mi>n</mi><mi>t<
   /mi><mi>e</mi><mi>r</mi><mo>,</mo><mi>c</mi><msub><mrow><mi>l</mi></mro
   w><mrow><mi>i</mi></mrow></msub></mrow></msub> :MATH]
   β 0.1
   Number of clans allclan 20
   [142]Open in a new tab

Evaluation by statistical significance

   The proposed method was also assessed by utilizing non-parametric test
   such as, Wilcoxon Matched-Pair Signed-Rank Test among each pair of
   approaches that produces the statistical consequence. The discrepancy
   between the F-Measure and Accuracy for every entry in Figs [143]6 and
   [144]7 was tested based on the confidence level of 1% (p-value < 0.01).
   The p-value less than 0.01 are assumed as highly significant and the
   values greater than 0.01 are assumed as insignificant values. The
   scores of F-Measure and Accuracy is alone considered as they are
   computed based on precision and recall. The Statistical Significance of
   the proposed and existing approaches based on F-Measure and Accuracy is
   depicted in Table [145]4. The scores of upper right positions of the
   table are attained from F-Measure of proposed and various existing
   algorithms based on DIP dataset against CYC2008 benchmark database. The
   scores of lower left positions of the table are attained from Accuracy
   of proposed and various existing algorithms based on DIP dataset
   against CYC2008 benchmark database. From Table [146]4, it is shown that
   the proposed method is statistically significant in nature compared to
   all the existing methods.

Figure 6.

   [147]Figure 6
   [148]Open in a new tab

   Comparison of F-Measure with various Datasets and Algorithms against
   CYC2008 Benchmark Dataset.

Table 4.

   Statistical Significance of proposed and existing approaches based on
   F-Measure and Accuracy.
   MCODE MCL COACH ClusterONE RNSC Maulik U et al.^[149]37 CFinder CMC CSO
   SR-MCL PSO-MCL ACO-MCL AFA-MCL F-MCL FOCA EHO-MCL
   MCODE 0 0.005 0.241 0.028 0.015 0.508 0.012 0.017 0.019 0.026 0.018
   0.006 0.006 0.006 0.005 0.004
   MCL 0.005 0 0.047 0.022 0.059 0.035 0.333 0.878 0.52 0.008 0.007 0.008
   0.006 0.006 0.006 0.005
   COACH 0.241 0.037 0 0.646 0.022 0.508 0.013 0.093 0.008 0.009 0.009
   0.005 0.006 0.007 0.005 0.005
   ClusterONE 0.174 0.089 0.521 0 0.089 0.017 0.015 0.059 0.025 0.012
   0.008 0.008 0.007 0.008 0.005 0.005
   RNSC 0.05 0.074 0.022 0.022 0 0.025 0.022 0.015 0.016 0.028 0.008 0.007
   0.007 0.009 0.005 0.005
   Maulik U et al.^[150]37 0.333 0.065 0.103 0.035 0.005 0 0.035 0.027
   0.025 0.015 0.005 0.006 0.006 0.006 0.005 0.003
   CFinder 0.024 0.138 0.013 0.029 0.028 0.035 0 0.285 0.025 0.015 0.005
   0.005 0.005 0.008 0.005 0.002
   CMC 0.035 0.093 0.017 0.022 0.203 0.03 0.059 0 0.015 0.025 0.015 0.005
   0.005 0.009 0.005 0.001
   CSO 0.038 0.045 0.015 0.019 0.169 0.027 0.025 0.017 0 0.028 0.012 0.004
   0.005 0.006 0.005 0.003
   SR-MCL 0.027 0.035 0.035 0.025 0.017 0.021 0.035 0.015 0.013 0 0.022
   0.005 0.004 0.005 0.005 0.004
   PSO-MCL 0.023 0.028 0.018 0.015 0.005 0.019 0.012 0.011 0.009 0.018 0
   0.013 0.004 0.004 0.004 0.004
   ACO-MCL 0.015 0.019 0.013 0.01 0.005 0.015 0.011 0.005 0.009 0.008
   0.013 0 0.016 0.003 0.005 0.003
   AFA-MCL 0.012 0.002 0.01 0.009 0.005 0.01 0.005 0.009 0.005 0.009 0.008
   0.139 0 0.017 0.009 0.002
   F-MCL 0.009 0.005 0.009 0.005 0.005 0.005 0.005 0.007 0.006 0.004 0.005
   0.005 0.007 0 0.003 0.003
   FOCA 0.004 0.007 0.006 0.002 0.005 0.004 0.005 0.005 0.005 0.005 0.004
   0.005 0.005 0.005 0 0.002
   EHO-MCL 0.005 0.004 0.004 0.003 0.002 0.005 0.002 0.008 0.002 0.003
   0.003 0.002 0.004 0.005 0.001 0
   [151]Open in a new tab

Evaluation by biological significance

   Many of the existing methods solve the protein complex detection
   problem based on the topological similarity. But to obtain some useful
   biological information, the computational methods should be
   biologically significant in nature. This proposed method is evaluated
   in the biological significance test. The gold standard benchmark
   databases are manually annotated based on the information from
   biologically experimental analysis. Thus, the detected protein
   complexes obtained from the proposed method is compared and matched
   with the benchmark databases. Few benchmark databases such as CYC2008,
   MIPS, SGD databases for Saccharomyces cerevisiae and the PCDq database
   for Homo sapiens are employed for assessing the proposed method.
   Table [152]5 displays the common protein complexes between the CYC2008
   benchmark database and the proposed method for DIP and Krogan-extended.
   Also, the common biological process, molecular function and the
   cellular component of the obtained protein complexes are displayed.
   Correspondingly, the common pathway annotations of the predicted
   protein complexes are obtained from the KEGG database are displayed.

Table 5.

   Top 5 Common Protein Complexes, Gene Ontology Functions and KEGG
   Pathways of the Predicted Complexes of proposed method.
   S.no Complex name Real Complexes Correctly Predicted Complexes Wrong
   Complexes BP MF CC Pathways
   Krogan-extended
   1 Paf1p complex YOL145C, YLR418C, YBR279W, YML069W, YGL207W, YGL244W,
   YOL145C, YLR418C, YBR279W, YGL244W, YEL037C

   Positive regulation of transcription elongation from RNA polymerase I
   promoter (GO:2001209)

   P-Value: 1.4E-9

   Enrichment Score: 5.3E-8

   RNA polymerase II C-terminal domain phosphoserine binding (GO:1990269)

   P-Value: 3.5E-9

   Enrichment Score: 6.6E-8

   Cdc73/Paf1 complex(GO:0016593)

   P-Value: 2.9E-8

   Enrichment Score: 2.8E-7
   NIL
   2 Condensin complex YFR031C, YLR086W, YDR325W, YBL097W, YNL088W,
   YFR031C, YLR086W, YDR325W, YBL097W, YNL088W, NIL

   tRNA gene clustering (GO:0070058)

   P-Value: 1.3E-3

   Enrichment Score: 1.7E-2

   Chromatin binding (GO:0003682)

   P-Value: 2.1E-2

   Enrichment Score: 1.4E-1

   Condensed nuclear chromosome(GO:0000794)

   P-Value: 2.2E-2

   Enrichment Score: 4.4E-2

   Cell- Cycle Yeast

   P-Value: 5.6E-2

   Enrichment Score: 5.6E-2
   3 RNA polymerase II mediator complexX YHR058C, YDR308C, YHR041C,
   YNR010W, YOL135C, YBR253W, YOR174W, YMR112C, YPR168W YHR058C, YDR308C,
   YHR041C, YBR253W, YOR174W, YMR112C, YPR168W YKL081W, YCR033W

   Positive regulation of transcription from RNA polymerase II promoter
   (GO:0045944)

   P-Value: 3.4E-7

   Enrichment Score: 1.5E-6

   transcription factor activity, RNA polymerase II transcription factor
   binding (GO:0001076)

   P-Value: 5.3E-12

   Enrichment Score: 6.3E-11

   Mediator complex (GO:0016592)

   P-Value: 4.1E-15

   Enrichment Score: 6.2E-15
   NIL
   4 RNA polymerase I subunit YNL248C, YJR063W, YJL148W, YOR340C, YPR010C,
   YPR187W, YBR154C, YOR224C, YNL113W YNL248C, YJR063W, YJL148W, YOR340C,
   YPR010C, YPR187W, YBR154C, YNL113W YIL7095W

   Ribosome biogenesis (GO:0042254)

   P-Value: 7.1E-10

   Enrichment Score: 2.8E1

   DNA-directed RNA polymerase activity (GO:0003899)

   P-Value: 2.3E-14

   Enrichment Score: 3.6E2

   DNA-directed RNA polymerase I complex (GO:0005736)

   P-Value: 5.3E-17

   Enrichment Score: 1.5E2

   RNA polymerase,

   P-Value: 4.0E-12

   Enrichment Score: 6.2E1

   Pyrimidine metabolism,

   P-Value: 8.8E-10

   Enrichment Score: 2.7E1

   Purine metabolism,

   P-Value: 6.2E-9

   Enrichment Score: 2.0E1

   Metabolic pathways

   P-Value: 9.1E-4

   Enrichment Score: 2.7E0
   5 Small Subunit (SSU) processome complexes YLR409C, YLR222C, YJL069C,
   YDR398W, YGR128C, YJL109C, YDR324C, YDR449C, YDL148C, YLR129W YLR409C,
   YLR222C, YJL069C, YDR398W, YGR128C, YJL109C, YDR324C, YDR449C, YDL148C,
   YLR129W NIL

   Ribosomal small subunit biogenesis (GO:0042274)

   P-Value: 3.7E-2

   Enrichment Score: 5.2E-2

   snoRNA binding(GO:0030515)

   P-Value: 1.3E-9

   Enrichment Score: 3.8E-9

   Small-subunit processome (GO:0032040)

   P-Value: 2.8E-15

   Enrichment Score: 4.0E-14

   Ribosome biogenesis in eukaryotes

   P-Value: 3.9E-9

   Enrichment Score: 3.9E-9
   DIP
   1 NOT core complex YDL165W, YCR093W, YAL021C, YIL038C, YGR134W,
   YNL288W, YDR252W, YER068W, YNR052C, YPR072W YDL165W, YCR093W, YAL021C,
   YIL038C, YDR252W, YER068W, YNR052C, YPR072W YMR149WYJR035W, YJR112W

   Nuclear-transcribed mRNA catabolic process, deadenylation-dependent
   decay

   (GO:0000288)

   P-Value: 6.0E-14

   Enrichment Score: 1.6E-12
   NIL

   CCR4-NOT core complex (GO:0030015)

   P-Value: 2.6E-21

   Enrichment Score: 1.8E-20

   RNA degradation

   P-Value: 3.7E-10

   Enrichment Score: 3.7E-10
   2 Mitochondrial F1F0 ATP synthase YLR295C, YDL004W, YBL099W, YBR039W,
   YJR121W, YPL078C, YKL016C, YEL027W, YEL051W, YDL185W, YLR447C YLR295C,
   YDL004W, YBL099W, YBR039W, YJR121W, YPL078C, YKL016C, YEL027W, YEL051W,
   YNL189W, YER031C, YGL181W

   ATP hydrolysis coupled proton transport (GO:0015991)

   P-Value: 2.2E-4

   Enrichment Score: 6.8E-4

   Proton-transporting ATPase activity, rotational mechanism (GO:0046961)

   P-Value: 8.0E-14

   Enrichment Score: 3.6E-*13

   Mitochondrial proton-transporting ATP synthase complex(GO:0005753)

   P-Value:

   4.1E-3

   Enrichment Score: 9.1E-3

   Oxidative phosphorylation,

   P-Value: 9.4E-13

   Enrichment Score: 2.8E-12

   Metabolic pathways

   P-Value: 8.8E-5

   Enrichment Score: 1.3E-4
   3 Putative ferric reductase YBR207W, YLR214W, YER145C, YLR047C,
   YOL152W, YKL220C, YFL041W, YMR319C, YLL051C, YBR207W, YLR214W, YER145C,
   YLR047C, YLL051C, YOR227W, YKL196C,

   Iron ion homeostasis (GO:0055072)

   P-Value: 6.0E-10

   Enrichment Score: 9.6E-9

   Ferroxidase activity (GO:0004322)

   P-Value: 1.8E-5

   Enrichment Score: 1.3E-4

   Plasma membrane (GO:0005886)

   P-Value: 2.1E-3

   Enrichment Score: 2.1E-2
   NIL
   4 Component of spindle pole body YKL042W, YDR356W, YPL124W, YMR117C,
   YAL047C, YHR172W, YNL126W, YML124C, YLR212C, YNL188W YDR356W, YPL124W,
   YMR117C, YAL047C, YHR172W, YNL126W, YML124C, YLR212C, YNL188W YML048W

   Microtubule nucleation(GO:0007020)

   P-Value: 1.3E-14

   Enrichment Score: 3.2E-13

   Structural constituent of cytoskeleton(GO:0005200)

   P-Value: 9.5E-16

   Enrichment Score: 8.0E-15

   Microtubule organizing center part(GO:0044450)

   P-Value: 2.7E-3

   Enrichment Score: 1.1E-2
   NIL
   5 PRoteinase yscE YKL206C, YER012W, YJL001W, YFR050C, YMR314W, YOL038W,
   YBL041W, YML092C, YGR135W, YOR362C, YER094C YKL206C, YER012W, YJL001W,
   YFR050C, YOL038W, YBL041W, YML092C, YGR135W, YOR362C, YER094C YOL061W,
   YIL006W

   Proteasomal ubiquitin-independent protein catabolic process(GO:0010499)

   P-Value: 1.6E-18

   Enrichment Score: 1.1E-17

   Threonine-type endopeptidase activity (GO:0004298)

   P-Value: 2.6E-19

   Enrichment Score: 1.6E-18

   Proteasome storage granule (GO:0034515)

   P-Value: 1.1E-16

   Enrichment Score: 4.4E-16

   Proteasome

   P-Value: 1.4E-13

   Enrichment Score: 1.4E-13
   [153]Open in a new tab

   The predicted complex gene ontology and KEGG pathway enrichment
   analysis were predicted by using the DAVID gene function classification
   online tool. The overall predicted complex enrichment score and the
   respective gene ontology elements and KEGG pathway enrichment scores
   are displayed in Table [154]5. The pictorial representation of the
   common RNA Polymerase KEGG Pathway of the predicted protein Complex on
   Krogan-extended dataset and common Oxidative Phosphorylation KEGG
   Pathway of the predicted protein complex on DIP dataset is exhibited in
   the supplementary information. The RNA polymerase is essential for
   nucleolar assembly and for high polymerase loading rate. Oxidative
   phosphorylation is the metabolic pathway in which cells use enzymes to
   oxidize nutrients, thereby releasing energy which is used to produce
   adenosine triphosphate (ATP)^[155]40–[156]42. The pictorial
   representation of the Top 5 common protein complexes, gene ontology
   functions and KEGG pathways of the predicted complexes of proposed
   method is given as Venn diagram in Figs [157]12 and [158]13.

Figure 12.

   [159]Figure 12
   [160]Open in a new tab

   Top 5 common protein complexes, gene ontology functions and KEGG
   pathways of the predicted complexes of proposed method on
   Krogan-Extended Dataset.

Figure 13.

   [161]Figure 13
   [162]Open in a new tab

   Top 5 common protein complexes, gene ontology functions and KEGG
   pathways of the predicted complexes of proposed method on DIP Dataset.

Execution time

   Besides the accuracy, the time required to detect the dynamic protein
   complexes is also an important factor. Processing the various benchmark
   datasets with various numbers of proteins and different interactions
   requires more time complexity due to stochastic optimization methods.
   Subsequently, not all methods were available under the same platform,
   the execution of many of the approaches were done on virtual machines,
   which prohibited us from accomplishing an exact comparison of their
   relative execution times. Thus, here the average execution time of
   SR-MCL, ACO-MCL, PSO-MCL, AFA-MCL, F-MCL, FOCA AND EHO-MCL is displayed
   in Fig. [163]14.

Figure 14.

   Figure 14
   [164]Open in a new tab

   Comparison of Average Execution Time of the proposed algorithm with the
   existing algorithms.

   From Fig. [165]14, it is observed that in this research, the proposed
   algorithm has less execution time when compared to other algorithms,
   due to the clan separating operator of EHO approach. It is inferred
   that the proposed EHO-MCL is efficient for detecting dynamic protein
   complexes. In future, the EHO-MCL can be further optimized in multicore
   CPU.

Conclusion

   The Protein Complex detection is an exposed problem for scientists. The
   solution for the complex problem should be recurrently improved as they
   are important in the analysis of the biological process. The volume of
   PPI networks has also been increased due to high-throughput
   experiments, the lack of accurate computational model for protein
   complex detection exists. Many of the existing researches were employed
   on the static PPI data that do not provide accurate biological results.
   Thus, in this proposed method initially, the static PPI data is
   converted into dynamic PPI data by integrating the gene expression
   data. Later, every dynamic subnetwork was clustered based on the
   popular clustering technique MCL along with the elephant herd
   optimization method for exploring and exploiting the better solution.
   The proposed method was employed on various 11 widespread datasets and
   the predicted complexes were compared with 4 different benchmark
   databases. Also, the proposed method was evaluated based on noise
   removal, insertion and deletion of random protein interactions, using
   the large PPI dataset, various parameter analyses, statistical
   significance and biological significance. On every evaluation phase,
   the proposed method was outperforming all other existing approaches and
   identified the common protein complexes, Gene Ontology functions and
   KEGG pathways of predicted protein complexes. As a future work,
   additional information on the unknown protein complexes predicted by
   the proposed method is to be addressed with the help of biological
   experts. The proposed method can also be applied and analyzed on
   weighted PPI networks. Also, various other diseased databases can be
   used to experiment.

Supplementary information

   [166]Supplementary Information^ (857KB, docx)

Acknowledgements