Abstract
Background
Protein function prediction is the task of assigning biological or biochemical
functions to proteins. It is a challenging computational problem
characterized by several factors: (1) the number of function labels
(annotations) is large; (2) a protein may be associated with multiple
labels; (3) the function labels are structured in a hierarchy; and (4)
the labels are incomplete. Current predictive models often assume that
the labels of the labeled proteins are complete, i.e. no label is
missing. But in real scenarios, we may be aware of only some
hierarchical labels of a protein, and we may not know whether
additional ones are actually present. The scenario of incomplete
hierarchical labels, a challenging and practical problem, is seldom
studied in protein function prediction.
Results
In this paper, we propose an algorithm to Predict protein functions
using Incomplete hierarchical LabeLs (PILL in short). PILL takes into
account both the hierarchical and the flat taxonomic similarity between
function labels, and defines a Combined Similarity (ComSim) to measure
the correlation between labels. PILL estimates the missing labels of a
protein based on ComSim and the protein's known labels, and uses a
regularization term to exploit the interactions between proteins for
function prediction. PILL is shown to outperform other related
techniques in replenishing missing labels and in predicting the
functions of completely unlabeled proteins on publicly available PPI
datasets annotated with MIPS Functional Catalogue and Gene Ontology
labels.
Conclusion
The empirical study shows that it is important to account for
incomplete annotations in protein function prediction. The proposed
method (PILL) can serve as a valuable tool for protein function
prediction using incomplete labels. The Matlab code of PILL is
available upon request.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0430-y)
contains supplementary material, which is available to authorized
users.
Keywords: Function prediction, Incomplete hierarchical labels, Combined
similarity, Gene ontology
Background
The increasing amount of proteomic data produced by high-throughput
technologies makes it crucial, but challenging, to develop computational
models that can identify hypothetical functions of proteins. Such
techniques have the potential to drive the biological validation and
discovery of novel protein functions, and to save experimental cost. At
the same time, functional annotations of proteins have been incorporated
into several bioinformatics tools (e.g., Panther [1], IntPath [2], and
InterProScan [3]) to investigate the semantic similarity between
proteins, functional interactions between proteins, pathway enrichment,
functional enrichment, and phylogenetic trees [4,5].
Protein function prediction is a challenging computational problem,
characterized by several intrinsic difficulties: the number of function
labels is rather large, each protein can have several labels, and the
labels are structured in a hierarchy and are unbalanced. Furthermore,
the function labels associated with proteins are uncertain and incomplete.
Various computational models have been proposed to address one or more
of these issues [3,6-10]. Some models use cost-sensitive learning and
hierarchical classification [8,11]; others apply multi-label learning
[12,13], classifier ensembles [8,12], and multiple network (kernel)
integration [14] to exploit the complementary information spread across
different heterogeneous data sources. More recent approaches incorporate
evolutionary knowledge [15], pathways [1,2,16], domains [17], or
negative example selection [7,18]. For a complete review of protein
function prediction, see [6,10,19]. Radivojac et al. [9,19] organized a
large-scale community-based critical assessment of protein function
annotation, and found that there is significant room for improving
protein function prediction.
Protein function prediction can be viewed as a multi-label learning
problem [7,10,12,20,21]. Recently, multi-label learning approaches that
use the correlation (or similarity) between function labels have been
introduced. Pandey et al. [22] incorporated label correlation using
Lin's similarity [23] into the k-nearest neighborhood (LkNN)
classifier; the authors observed that utilizing the correlation between
function labels can boost the prediction accuracy. Zhang and Dai [24]
investigated the usefulness of functional interrelationships based on
Jaccard coefficients for protein function prediction. Wang et al. [25]
introduced a function-function correlated multi-label learning method
to infer protein functions. Yu et al. [12] studied a directed
bi-relational graph (composed of protein nodes and function label
nodes) to utilize the correlation between function labels for protein
function prediction. Chi and Hou [26] assumed that the label sets of
two proteins can influence their similarity, and introduced a Cosine
Iterative Algorithm (CIA). In each iteration of CIA, the function
predicted with highest confidence is appended to the label set of a
protein; the pairwise similarity between training and testing proteins
is then updated based on the extended label sets. CIA combines the
updated pairwise similarity, the function correlation based on cosine
similarity, and the PPI network topology to predict functions in
consecutive iterations.
Most of these multi-label learning algorithms focus on exploiting label
correlations to boost prediction accuracy, under the assumption that
the labels of labeled proteins used for training are complete, i.e. no
label is missing. For various reasons (e.g., evolving Gene Ontology
terms, or limitations of experimental methods), in practice we may be
aware of only some functions, while additional functions (unknown to
us) may also be associated with the protein. In other words, proteins
are partially labeled. Learning from partially labeled multi-label
instances (or proteins) can be formulated as a multi-label, weak-label
learning problem [27-29].
Several multi-label, weak-label learning algorithms have been
introduced in the past years. Sun et al. [27] studied a multi-label,
weak-label learning method called WELL. WELL assumes there is a margin
between instances of different classes and that any given label has a
small number of member instances. To make use of the label correlation
among multi-label instances, this approach assumes that there is a
group of low-rank similarities, and that the similarity between
instances of different labels can be approximated from these
similarities. However, WELL relies on quadratic programming to compute
the low-rank similarities and to make the final predictions; it is
therefore computationally expensive and can hardly make predictions for
samples with a large number of labels. Bucak et al. [30] proposed a
weak-label learning approach called MLR-GL. MLR-GL optimizes a convex
objective function that includes a ranking loss and a group Lasso loss;
it aims at labeling instances with no labels by using partially labeled
instances. Yang et al. [28] introduced a multi-instance, multi-label
weak-label learning algorithm. Yu et al. [29] proposed an approach
called ProWL to predict protein functions using partially labeled
proteins. ProWL exploits the label correlation and the available labels
of a protein to estimate the likelihood of a missing function for the
protein, and integrates these estimates with a smoothness loss function
to replenish the missing function labels and to predict functions for
proteins with no labels. Yu et al. [31] assumed that a function label
depends on the feature information of proteins and introduced an
algorithm called ProDM, which maximizes this dependency to replenish
the missing function labels and to predict functions for unlabeled
proteins.
However, these weak-label learning techniques only use the flat
relationships among function labels, and do not explicitly take into
account the hierarchical relationship among labels. The MIPS Functional
Catalogue (FunCat) [32] organizes the function labels in a tree
structure, and the Gene Ontology (GO) [33] organizes the function terms
(or labels) in a directed acyclic graph. It has been reported that
exploiting the hierarchical relationship among function labels can
boost the accuracy of protein function prediction [7,8,11,22]. For
example, Barutcuoglu et al. [11] suggested that organizing the
predictions produced by the binary classifiers for the individual
function labels in a Bayes network can improve the accuracy of gene
function prediction. Tao et al. [34] utilized an information-theoretic
metric to measure the interrelationships between function labels and to
determine whether a certain function label belongs to a protein or not.
However, this method cannot predict functions for unlabeled proteins,
since it only employs the known annotations of a protein to infer its
other potential annotations. Jiang et al. [35] combined the relational
PPI network and the label hierarchy to predict consistent functions by
setting the descendants of a function label as negative whenever this
label is set to negative. Pandey et al. [22] used Lin's similarity to
capture the relationship among hierarchically organized labels.
Schietgat et al. [36] integrated hierarchical multi-label decision
trees for protein function prediction. Valentini [7] post-processed the
predictions made by a binary classifier for each label according to the
true path rule in the GO and FunCat hierarchies, in a method called
TPR. Cesa-Bianchi et al. [8] integrated cost-sensitive learning and
data fusion with TPR to further boost the accuracy of protein function
prediction. Valentini [10] advocated in his recent survey that it is
paramount to exploit the hierarchical relationship among function
labels for protein function prediction.
According to the true path rule [7] in GO and FunCat: (i) if a protein
is labeled with a function, then it should also be labeled with the
ancestor functions (if any) of this function; (ii) if a protein cannot
be labeled with a function, then it should not be labeled with the
descendant functions (if any) of this function. In [29,31], the
incomplete annotation problem was simulated by randomly masking
function labels in a flat style, ignoring the hierarchical relationship
between labels. In such a simulation, if a function label of a protein
is masked, the protein may still be labeled with the descendant
functions of this label, and the masked function can then be directly
inferred from its descendant function labels.
In this paper, we study the incomplete label problem in a hierarchical
manner, as opposed to a flat style. We propose an approach called PILL
to predict protein functions using partially labeled proteins with
hierarchical labels. PILL integrates the hierarchical and flat
relationships between function labels to estimate the likelihoods of
missing labels, and exploits the interactions between proteins to
replenish the missing annotations and to predict the functions of
unlabeled proteins. In particular, PILL simulates incomplete
hierarchical labels by randomly masking the leaf function labels of a
protein, which is closer to the real situation than the simulation in
the previous studies [29,31]. We conducted experiments on three
publicly available PPI datasets, each annotated with FunCat labels and
GO labels. The experimental results show that PILL outperforms other
related algorithms in replenishing the missing labels of partially
labeled proteins and in predicting functions for completely unlabeled
proteins.
The incomplete hierarchical label problem
Figure 1 illustrates an example of the incomplete hierarchical label
problem for proteins annotated with FunCat labels. A corresponding
example for GO labels is given in Figure S1 of Additional file 1. In
Figure 1, p1 and p2 are partially labeled (missing labels are marked by
a question mark ?), and p3 is completely unlabeled. Note that other
FunCat labels (i.e., '12.03' and '03') are not really missing for these
proteins, and thus are not shown in the figure; these function labels
will nevertheless also be treated as candidate 'missing' labels. The
missing labels are leaf function labels: if a non-leaf function label
of a protein is missing, we can directly append it to the protein from
its descendant function labels. Whether a function label is a leaf or a
non-leaf is defined with respect to a single protein. For example,
'12.04' is a leaf function label for p2, but it is a non-leaf function
label for p1, since p1 is labeled with a descendant label ('12.04.02')
of '12.04'. Our task is to replenish the missing labels of p1 and p2,
and to predict functions for p3. To this end, we define three kinds of
relationships between function labels: (i) parent-child (e.g., '01.03'
is a child function label of '01'); (ii) grandparent-grandson (e.g.,
'01.03.02' is a grandson label of '01'); and (iii) uncle-nephew (e.g.,
if we consider '01' as a sibling of '12', although these two labels do
not have an explicit common parent label, '12' is an uncle label of
'01.03'). These relationships are discussed further in the next
section.
Figure 1.
Illustration of incomplete hierarchical labels for proteins annotated
with MIPS FunCat labels. A rectangle represents a protein (pi,
i∈{1,2,3}); an ellipse denotes a function label; and an undirected line
between rectangles captures a protein-protein interaction (the more
reliable the interaction, the thicker the line). All the function
labels in the ellipses (including the missing function labels denoted
by colored ellipses with question marks '?') should be associated with
the proteins, but only the function labels in the white ellipses are
known. For better visualization, other function labels (i.e., '12.03'
and '03', which are not ground-truth labels for these proteins) are not
plotted in the figure.
Methods
Function correlation definition
A protein often has multiple functions, which are organized in a tree
hierarchy (FunCat) or in a directed acyclic graph (GO). Previous work
[7,10,11,22] has demonstrated that exploiting the hierarchical
relationship among function labels can boost the performance of protein
function prediction. Pandey et al. [22] used Lin's similarity [23] to
take advantage of the hierarchical relationship between function
labels. Lin's similarity measures the similarity of two function labels
in terms of their proximity in the hierarchical ontology, as well as
their information content. It is defined as follows:
$$\mathrm{LinSim}(s,t) = \frac{2 \times \log p_{ca}(s,t)}{\log p(s) + \log p(t)},$$
(1)
and
$$p_{ca}(s,t) = \min_{k \in ca(s,t)} p(k)$$
(2)
Here s and t are two function labels, and p(s) denotes the probability
for a protein to be labeled with s; p(s) can be estimated from the
number of member proteins of s in an organism. ca(s,t) is the set of
common ancestors of s and t, and p_ca(s,t) denotes the probability of
the most specific function label in the hierarchy that subsumes both s
and t. Intuitively, Eq. (1) measures the semantic similarity of s and t
in terms of the content of their minimum subsumer node in the
hierarchy. Clearly, LinSim(s,t)=1 if s=t, and LinSim(s,t)=0 when the
minimum subsumer is the root node of the ontology, or when the function
label corresponding to the minimum subsumer node is associated with all
the proteins of an organism. LinSim(s,t) can also be viewed as a
correlation measure between s and t. According to this definition,
LinSim(s,t) is large if s and t often co-annotate the same proteins and
their most specific ancestor label is close to s and t but far away
from the root node. On the other hand, if the most specific ancestor of
s and t is (close to) the root node, while s and t are far away from
the root node in the hierarchy, LinSim(s,t) will be small.
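As a concrete illustration, Eqs. (1)-(2) can be sketched as follows; the toy hierarchy, the probabilities, and all helper names are illustrative assumptions, not part of the paper's (Matlab) implementation.

```python
import math

# Toy sketch of Lin's similarity (Eqs. 1-2) on a FunCat-like fragment:
# '01' -> {'01.01', '01.03'}, '01.03' -> '01.03.02'. Probabilities assumed.
def lin_sim(s, t, prob, ancestors):
    """prob[k]: fraction of an organism's proteins annotated with label k;
    ancestors[k]: strict ancestors of k in the label hierarchy."""
    if s == t:
        return 1.0
    common = ancestors[s] & ancestors[t]         # ca(s, t)
    if not common:
        return 0.0                               # no common ancestor
    p_ca = min(prob[k] for k in common)          # Eq. (2): most specific subsumer
    if p_ca >= 1.0:
        return 0.0                               # subsumer annotates every protein
    return 2 * math.log(p_ca) / (math.log(prob[s]) + math.log(prob[t]))

prob = {'01': 0.5, '01.01': 0.2, '01.03': 0.25, '01.03.02': 0.1}
ancestors = {'01': set(), '01.01': {'01'}, '01.03': {'01'},
             '01.03.02': {'01', '01.03'}}
print(round(lin_sim('01.01', '01.03.02', prob, ancestors), 3))
```

Here the minimum subsumer of '01.01' and '01.03.02' is '01', so the similarity is moderate; two labels whose only subsumer is the root would score 0.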
However, if s is an ancestor of t, taking s itself as the common
ancestor of s and t is preferable to any other common ancestor label,
since s is more specific than any other label in the common ancestor
set, and s also subsumes both s and t. The more specific a function,
the fewer member proteins it has, and the smaller the probability that
a protein is labeled with it. Therefore, we substitute p_ca(s,t) with
p_sa(s,t), defined as follows:
$$p_{sa}(s,t) = \min_{k \in sa(s,t)} p(k)$$
(3)
sa(s,t) represents the set of shared ancestors of s and t, which
additionally includes s if t is a descendant label of s, or t if t is
an ancestor label of s. Thus, ca(s,t) ⊆ sa(s,t). We extend Lin's
similarity to a similarity named HSim(s,t) by substituting p_ca(s,t) in
Eq. (1) with p_sa(s,t). If s is an ancestor label of t, HSim(s,t) is no
smaller than LinSim(s,t), since s is more specific than any function
label in ca(s,t) (i.e., p(s) ≤ p_ca(s,t)). When s and t are siblings
(or cousins), HSim(s,t) and LinSim(s,t) coincide. sa(s,t) often
includes more specific functions (e.g., the parent function label of t)
than ca(s,t), since ca(s,t) ⊆ sa(s,t). If t is missing for a protein,
but the ancestor function labels of t (including its parent function
label s) are associated with this protein, the missing-label estimate
obtained from the parent function is more reliable than that obtained
from other ancestor functions (e.g., grandparent functions). This
property of function label hierarchies motivates us to estimate the
missing labels using HSim instead of LinSim. The statistics computed in
the next section support this rationale.
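The substitution of sa(s,t) for ca(s,t) can be sketched as follows; again, the toy hierarchy and probabilities are assumptions for illustration only.

```python
import math

# Toy sketch of HSim: Lin's formula with ca(s,t) replaced by sa(s,t) (Eq. 3).
prob = {'01': 0.5, '01.03': 0.25, '01.03.02': 0.1}
ancestors = {'01': set(), '01.03': {'01'}, '01.03.02': {'01', '01.03'}}

def shared_ancestors(s, t):
    # sa(s,t): common strict ancestors, plus s (or t) when one subsumes the other
    return (ancestors[s] | {s}) & (ancestors[t] | {t})

def sim(s, t, p_subsumer):
    return 2 * math.log(p_subsumer) / (math.log(prob[s]) + math.log(prob[t]))

def h_sim(s, t):
    return sim(s, t, min(prob[k] for k in shared_ancestors(s, t)))

def lin_sim(s, t):
    return sim(s, t, min(prob[k] for k in ancestors[s] & ancestors[t]))

# '01.03' subsumes '01.03.02', so sa(s,t) contains '01.03' itself and
# HSim exceeds LinSim, as argued above.
print(round(h_sim('01.03', '01.03.02'), 3), round(lin_sim('01.03', '01.03.02'), 3))
```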
Nevertheless, when s and t have no shared ancestor (e.g., the function
labels in the first level of MIPS FunCat do not have an ancestor
label), p_sa(s,t)=0; when the most specific shared function label is
associated with almost all the proteins (e.g., the function label
corresponding to the root node of the GO Biological Process
sub-ontology), p_sa(s,t)≈1 and HSim(s,t)≈0. But HSim(s,t)≈0 does not
mean that s and t are uncorrelated. For example, there are 272 proteins
in S. cerevisiae labeled with '40' (CELL FATE), 448 proteins labeled
with '43' (CELL TYPE DIFFERENTIATION), and 170 proteins labeled with
both '40' and '43'. If a protein is labeled with '40' and it is unknown
whether this protein has '43', we have 170/272=62.5% confidence that
this protein is also labeled with '43'. However, neither HSim nor
LinSim can capture this, because '40' and '43' do not share any
ancestor label and both measures only consider the hierarchical
relationship between function labels. In fact, it has been observed
that flat label relationships are also beneficial for protein function
prediction [24,25,29]. To overcome this limitation of HSim(s,t), we
introduce ComSim(s,t) to describe the correlation between function
labels:
$$\mathrm{ComSim}(s,t) =
\begin{cases}
\mathrm{HSim}(s,t), & \text{if } p_{sa}(s,t) \in (0,1) \\
\mathrm{JcdSim}(s,t), & \text{otherwise}
\end{cases}$$
(4)
where JcdSim is the similarity based on the Jaccard coefficient,
JcdSim(s,t)=|N(s)∩N(t)|/|N(s)∪N(t)|. N(·) denotes the set of proteins
labeled with the corresponding function label and |N(·)| is the
cardinality of the set. From this definition, if s and t do not have
shared ancestor function labels, ComSim(s,t) is large when they are
often associated with the same set of proteins, and small when they
seldom are. When s, t, and the most specific shared ancestor of these
two function labels are always associated with the same proteins,
ComSim(s,t)=1; in this case, JcdSim(s,t) is also 1. As such, ComSim
captures both the hierarchical and the flat relationships between
functions.
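The case analysis of Eq. (4) can be sketched as follows. The member sets mimic the '40'/'43' example above (272, 448, and 170 shared proteins), while the organism size and all helper names are assumptions.

```python
import math

# Toy sketch of ComSim (Eq. 4): hierarchical branch (HSim) when a shared
# ancestor exists with 0 < p_sa < 1, flat branch (Jaccard) otherwise.
def jcd_sim(s, t, members):
    a, b = members[s], members[t]
    return len(a & b) / len(a | b)                    # Jaccard coefficient

def com_sim(s, t, prob, ancestors, members):
    sa = (ancestors[s] | {s}) & (ancestors[t] | {t})  # shared ancestors sa(s,t)
    p_sa = min((prob[k] for k in sa), default=0.0)
    if 0.0 < p_sa < 1.0:                              # hierarchical branch: HSim
        return 2 * math.log(p_sa) / (math.log(prob[s]) + math.log(prob[t]))
    return jcd_sim(s, t, members)                     # flat branch: JcdSim

n = 6000                                              # assumed organism size
members = {'40': set(range(272)),                     # 272 proteins with '40'
           '43': set(range(102, 550))}                # 448 with '43', 170 shared
prob = {'40': 272 / n, '43': 448 / n}
ancestors = {'40': set(), '43': set()}                # top level: no shared ancestor
print(round(com_sim('40', '43', prob, ancestors, members), 3))
```

Since '40' and '43' share no ancestor, ComSim falls back to the Jaccard branch, 170/550 ≈ 0.309, instead of the uninformative HSim value of 0.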
Statistics of hierarchical function label relationships
From the true path rule of function label hierarchies, it’s easy to
observe that:
  * p(s|par(s)) ≥ p(s|gpar(s))
  * p(s|gpar(s)) ≥ p(s|uncle(s))
where par(s) denotes the parent function label of s, gpar(s) the
grandparent function label of s, and uncle(s) an uncle (parent's
sibling) function label of s. p(s|par(s)) is the conditional
probability that a protein is labeled with s given that it is already
labeled with par(s). These inequalities hold because if a protein is
labeled with s, then it is also labeled with the ancestor functions of
s (including par(s) and gpar(s)), and if a protein is labeled with
uncle(s), it is also labeled with gpar(s). In contrast, if a protein is
labeled with par(s) (or gpar(s)), it is uncertain whether the protein
is labeled with s (or uncle(s)).
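A minimal counting sketch of the first inequality, on a made-up annotation table whose every row obeys the true path rule:

```python
# Toy check of p(s | par(s)) >= p(s | gpar(s)); hierarchy and annotations
# are illustrative assumptions.
parent = {'01.01': '01', '01.03': '01', '01.03.02': '01.03'}

annotations = [
    {'01', '01.03', '01.03.02'},
    {'01', '01.03'},
    {'01', '01.01'},
    {'01'},
]

def cond_prob(s, given):
    # p(s | given): fraction of proteins labeled with `given` that also have s
    both = sum(1 for a in annotations if s in a and given in a)
    return both / sum(1 for a in annotations if given in a)

s = '01.03.02'
p_given_par = cond_prob(s, parent[s])              # p(s | par(s))  = 1/2
p_given_gpar = cond_prob(s, parent[parent[s]])     # p(s | gpar(s)) = 1/4
print(p_given_par >= p_given_gpar)                 # True
```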
Based on these rules, we investigate the parent-child relationship by
counting the cases in which a protein is labeled with both par(s) and
s. Similarly, we investigate the grandparent-grandson (or uncle-nephew)
relationship by counting the cases in which a protein is labeled with
both gpar(s) (or uncle(s)) and s. The distributions of these three
statistics for S. cerevisiae proteins labeled with FunCat labels are
shown in the first three boxplots of Figure 2. In addition, we report
p(s|par(s))−p(s|gpar(s)) in the fourth boxplot of Figure 2. We also
show the distributions of all pairwise function correlations based on
the proposed ComSim, Lin's similarity, Cosine similarity, and the
Jaccard coefficient on the same data in Figure 2. The corresponding
distributions obtained on the S. cerevisiae proteins labeled with GO
labels are given in Figure S2 of Additional file 1. For a fair
comparison, all zero elements in these likelihoods and similarities are
removed, since some pairs of function labels do not have a hierarchical
(e.g., parent-child) relationship, or are not associated with the same
proteins.
Figure 2.
Label relationship statistics and four label similarities on proteins
of S. cerevisiae annotated with FunCat labels. Each boxplot describes
the distribution of a likelihood (or similarity): the central line is
the median, the edges of the box are the 25th and 75th percentiles, the
whiskers extend to the most extreme data points not considered
outliers, and the outliers are plotted individually as '+'.
The boxplots of Figure 2 support the relationships p(s|par(s)) ≥
p(s|gpar(s)) and p(s|gpar(s)) ≥ p(s|uncle(s)). If s is missing for a
protein, and the protein is labeled with par(s), gpar(s), and uncle(s),
the likelihood of the missing label s estimated from par(s) is more
reliable than that estimated from gpar(s) or uncle(s). The explanation
is straightforward: the more specific a function label, the fewer
member proteins it has. In other words, if par(s) is associated with a
protein, we can be sure that gpar(s) is also associated with that
protein, but not vice versa. Similarly, given that uncle(s) is a
sibling of par(s) and the two share the same parent, if a protein is
annotated with uncle(s), it is also annotated with gpar(s). Similar
results are obtained for the S. cerevisiae proteins annotated with GO
labels (see Figure S2 of Additional file 1).
In Figure 2, p(s|par(s)) is more evenly distributed than p(s|gpar(s))
and p(s|uncle(s)), and it has fewer outliers than the latter two. We
can also observe that the distributions of the function correlations
defined by LinSim and ComSim are closer to that of p(s|par(s)) than the
correlations defined by the Cosine similarity and the Jaccard
coefficient, and that the label correlations based on LinSim and ComSim
are more evenly distributed than those based on Cosine and Jaccard
similarity, since the former two have fewer outliers than the latter
two. ComSim considers both the hierarchical (measured by HSim) and flat
(measured by JcdSim) relationships among labels, and its margin between
the 25th and 75th percentiles is wider than that of LinSim. In
addition, the overlap between ComSim and p(s|par(s)) is larger than
that between LinSim and p(s|par(s)). We also studied the Gaussian
function $\exp(-(x-\mu)^2/\sigma^2)$ (where x is a likelihood or
similarity value, and μ and σ are the mean and standard deviation of x)
fitted to these likelihoods and similarities, and again observed that
ComSim overlaps more with p(s|par(s)) than the other similarity metrics
do (not reported). Since ComSim will be used to estimate the
likelihoods of missing labels, these differences indicate that ComSim
can estimate the missing labels more accurately than the other three
techniques. The advantage of ComSim is also verified in our
experiments.
Objective function
Given n proteins, let K be the number of distinct functions across all
proteins. Let Y=[y_1,y_2,…,y_n] be the original label matrix, with
y_ik=1 if protein i has the k-th function, and y_ik=0 if it is unknown
whether this protein has the k-th function. We assume the first l ≤ n
proteins are partially labeled and the remaining n−l proteins are
completely unlabeled. We define the normalized function correlation
matrix as
$$C_m(s,t) = \frac{\mathrm{ComSim}(s,t)}{\sum_{t'=1}^{K} \mathrm{ComSim}(s,t')}.$$
Based on the definition of C_m, we can estimate the likelihood of a
missing function label for the i-th (i ≤ l) partially labeled protein
as follows:
$$\tilde{y}_{ik} =
\begin{cases}
y_i^T C_m(\cdot,k), & \text{if } y_{ik}=0 \\
1, & \text{otherwise}
\end{cases}$$
(5)
If y_ik=0 and the k-th function label has a large correlation with the
already known functions of protein i, then it is likely that the k-th
function is missing for this protein, and ỹ_ik is assigned a large
value. ỹ_i is thus the label vector combining the confirmed labels (the
corresponding entries are set to 1) with the likelihoods of the missing
labels estimated from y_i and C_m (for the entries with y_ik=0).
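Eq. (5) can be sketched in a few lines of NumPy; the correlation matrix and label vector below are illustrative assumptions.

```python
import numpy as np

# Sketch of Eq. (5): estimate missing-label likelihoods from known labels
# and the row-normalized correlation matrix C_m. All numbers are made up.
com = np.array([[1.0, 0.6, 0.1, 0.0],     # assumed pairwise ComSim values
                [0.6, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.5],
                [0.0, 0.1, 0.5, 1.0]])
C_m = com / com.sum(axis=1, keepdims=True)          # normalize each row

y_i = np.array([1.0, 0.0, 1.0, 0.0])                # known labels of protein i
y_tilde = np.where(y_i == 1, 1.0, y_i @ C_m)        # Eq. (5)
print(np.round(y_tilde, 3))
```

The second label, strongly correlated with the protein's known first label, receives a higher estimate (≈0.46) than the fourth (≈0.28); confirmed labels stay at 1.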
Based on ỹ_i, we can define the empirical loss function on the l
partially labeled proteins as follows:
$$\Psi_1(f) = \min_{f} \sum_{i=1}^{l} \big\| f_i - \tilde{y}_i \big\|_2^2
            = \min_{F} \operatorname{tr}\!\big( (F-\tilde{Y})^T U (F-\tilde{Y}) \big)$$
(6)
where f_i ∈ R^K is the likelihood vector to be predicted for the i-th
protein, F=[f_1,f_2,…,f_n] collects the predictions for the n proteins,
Ỹ=[ỹ_1,ỹ_2,…,ỹ_n] is the likelihood matrix of confirmed labels and
estimated missing labels for the n proteins, and U is an n×n diagonal
matrix with U_ii=1 if i ≤ l, and U_ii=0 otherwise.
Proteins with similar amino acid sequences are likely to share the same
functions. Schwikowski et al. [37] observed that two interacting
proteins are more likely to share functions than two non-interacting
proteins; this observation is known as the 'guilt by association' rule.
Inspired by work [38] stating that the labels of an unlabeled instance
can be linearly inferred from the labels of its neighbors, we introduce
a smoothness term that utilizes the interactions (or similarity)
between proteins:
$$\Psi_2(f) = \min_{f} \sum_{i=1}^{n} \Big\| f_i - \sum_{p_j \in N(p_i)} W_{ij} f_j \Big\|_2^2
            = \min_{F} \operatorname{tr}\!\big( F^T (I-W)^T (I-W) F \big),
  \quad \text{s.t. } \sum_{j=1}^{n} W_{ij} = 1$$
(7)
where N(p_i) is the set of proteins interacting with p_i, W_ij is the
weight of the interaction (similarity) between proteins i and j, and I
is the n×n identity matrix. Our motivation for minimizing Eq. (7) is
three-fold: (i) if two proteins i and j are quite similar to one
another (i.e., W_ij is large), then the margin between f_i and f_j
should be small, otherwise a large loss is incurred; (ii) if protein i
has missing labels and its interacting partners do have those labels,
then we can leverage this information to assist the replenishment of
the missing labels of protein i; (iii) if protein i is completely
unlabeled, its labels can be predicted from the labels of its partners.
Alternative ways to transfer labels among proteins (e.g., based on
functional connectivity or homology between proteins) have been
suggested in the literature (see [5,39-41]), and these methods could
also be adapted to replace Eq. (7). Since our work focuses on how to
replenish the missing labels and how to predict protein functions using
incomplete hierarchical labels, how to more efficiently utilize the
guilt-by-association rule and how to reduce noise in PPI networks to
boost the accuracy (e.g., by enhancing the functional content [42], or
by incorporating additional data sources [5,15,16]) is out of scope.
Based on Eq. (6) and Eq. (7), the objective function to be minimized by
the PILL algorithm is:
$$\Psi(F) = \operatorname{tr}\!\big( (F-\tilde{Y})^T U (F-\tilde{Y}) \big)
          + \lambda \, \operatorname{tr}\!\big( F^T (I-W)^T (I-W) F \big)$$
(8)
where λ>0 is a scalar parameter that balances the importance of the
empirical loss and the smoothness loss.
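The paper does not spell out its optimizer here, but setting the gradient of Eq. (8) to zero gives the linear system (U + λ(I−W)^T(I−W)) F = U Ỹ, so one plausible minimizer is a direct solve; the NumPy sketch below uses random synthetic data only to check that the gradient vanishes at the solution.

```python
import numpy as np

# One plausible way to minimize Eq. (8): grad = 2U(F - Y~) + 2*lam*M*F = 0,
# with M = (I-W)^T (I-W), yields (U + lam*M) F = U Y~. Data is synthetic.
rng = np.random.default_rng(0)
n, K, l, lam = 5, 3, 3, 1.0

W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)                  # rows sum to 1 (Eq. 7 constraint)
U = np.diag([1.0] * l + [0.0] * (n - l))           # loss only on labeled proteins
Y_tilde = np.zeros((n, K))
Y_tilde[:l] = rng.random((l, K))                   # label estimates (as in Eq. 5)

M = (np.eye(n) - W).T @ (np.eye(n) - W)
F = np.linalg.solve(U + lam * M, U @ Y_tilde)      # candidate minimizer of Eq. (8)

grad = 2 * U @ (F - Y_tilde) + 2 * lam * M @ F     # should vanish at the optimum
print(np.abs(grad).max() < 1e-8)
```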
Results and discussion
Datasets and experimental setup
We report results on three PPI networks, namely CollinsPPI, KroganPPI,
and ScPPI. We annotated the proteins in these networks according to
MIPS FunCat [32] and the Gene Ontology [33] (Biological Process non-IEA
terms), respectively. The statistics of these preprocessed datasets are
listed in Table 1. The CollinsPPI dataset, for example, has 1620
proteins labeled with 168 distinct GO labels and 176 distinct FunCat
labels; these proteins in total carry 22,023 GO label annotations and
13,320 FunCat label annotations, and on average each protein has about
13.59 GO labels and 8.22 FunCat labels. More details on these datasets
and the experimental setup are provided in Additional file 1. The label
vector of a protein implicitly encodes the hierarchical relationship
among labels: for example, if the entry index corresponding to '01.01'
in the label vector y_i ∈ R^K is t, and the entry index corresponding
to '01' (the ancestor function label of '01.01') is s, then y_it=1
implies y_is=1.
Table 1. Dataset statistics

Dataset     #Proteins  #FunCat labels  #GO labels   Avg±Std (FunCat)  Avg±Std (GO)
CollinsPPI  1620       176 (13320)     168 (22023)  8.22±5.60         13.59±8.28
KroganPPI   2670       228 (20384)     241 (32639)  7.63±5.81         12.22±8.83
ScPPI       5700       305 (36909)     372 (61048)  6.48±5.71         10.71±8.83
'#Proteins' is the number of proteins in a dataset. '#FunCat labels' is
the number of distinct FunCat labels of these proteins, with the total
number of FunCat label annotations over all proteins in brackets; '#GO
labels' is defined analogously for GO labels. 'Avg±Std (FunCat)' and
'Avg±Std (GO)' give the average number of FunCat (resp. GO) labels per
protein and the standard deviation.
There are no off-the-shelf proteomic datasets that can be directly used
to test solutions to the incomplete label problem, although this
problem is practical and common in real-world scenarios. To address
this issue, we assume the labels of the currently labeled proteins are
complete and randomly mask some of the ground-truth leaf functions of
each protein; the masked functions are considered missing for that
protein.
For notation, we use m to denote the number of missing functions of a protein. For example, if a protein has 10 function labels, m=3 means that three of its labels are masked. If a protein has no more than m labels, we do not mask all of its available labels, ensuring that it retains at least one function label. A small number of proteins in these networks do not have any label; we keep these proteins in the network to retain the network's structure, but do not test on them. We introduce N_m to denote the total number of missing labels over all the proteins for a given setting of m.
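The masking protocol above can be sketched as follows (our reading of the setup, not the authors' code; the label lists are toy data, and the sketch treats the given labels as maskable leaf labels):

```python
# Sketch: hide up to m labels per protein, never masking all of a
# protein's labels, and accumulate N_m over all proteins.
import random

def mask_labels(protein_labels, m, rng=random.Random(0)):
    """Return (kept, masked) lists for one protein's labels."""
    k = min(m, len(protein_labels) - 1)  # keep at least one label
    if k <= 0:
        return list(protein_labels), []
    masked = rng.sample(protein_labels, k)
    kept = [l for l in protein_labels if l not in masked]
    return kept, masked

proteins = {'P1': ['01.01', '02.04', '03.01'], 'P2': ['01.03']}
N_m = 0
for labs in proteins.values():
    kept, masked = mask_labels(labs, m=3)
    assert len(kept) >= 1          # every protein retains a label
    N_m += len(masked)
```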
Comparing methods and evaluation metrics
We compare PILL against ProDM [14], ProWL [29], LkNN [22], TPR [7], MLR-GL [30], CIA [26], and Naive [9]. ProDM and ProWL are designed to replenish missing labels and to predict protein functions using partially labeled proteins; neither explicitly considers the hierarchical relationship among function labels. LkNN utilizes LinSim in Eq. (1) to predict the functions of unlabeled proteins. TPR uses the true path rule (or hierarchical relationship) in label hierarchies to refine the predictions of binary classifiers trained for each label; we use the weighted version, TPRw, in the experiments. MLR-GL uses partially labeled instances in the training set to predict the labels of unlabeled instances. CIA is an iterative algorithm that uses function correlations based on Cosine similarity to infer protein functions. Naive, which ranks function labels by their frequencies, is a baseline approach in the community-based critical assessment of protein function annotation [9]; it has been reported that very few methods perform above the Naive method, so we include Naive as a reference. More details about the implementations and parameter settings of these methods are reported in Additional file 1.
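The Naive baseline is simple enough to sketch directly (a minimal illustration with toy annotations; `naive_scores` is a hypothetical helper name):

```python
# Sketch of the Naive baseline: score every candidate label by its
# relative frequency among the annotated proteins, so all proteins
# receive the same ranking of function labels.
from collections import Counter

def naive_scores(annotations):
    """annotations: list of per-protein label lists.
    Returns label -> fraction of proteins carrying that label."""
    counts = Counter(l for labels in annotations for l in labels)
    n = len(annotations)
    return {label: c / n for label, c in counts.items()}

scores = naive_scores([['01', '01.01'], ['01'], ['02']])
# '01' appears in 2 of 3 proteins; '01.01' and '02' in 1 of 3 each
```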
The performance of protein function prediction can be evaluated according to different criteria, and the choice of evaluation metric affects different prediction algorithms differently [9,29]. For a fair and comprehensive comparison, we use five representative metrics, namely MacroF1, MicroF1, AvgROC, RankingLoss and Fmax. These metrics are extensively applied to evaluate the performance of multi-label learning algorithms and protein function prediction [9,21,29]; their formal definitions are provided in Additional file 1. To keep consistency across all evaluation metrics, we report 1-RankLoss instead of RankingLoss, so that the higher the value, the better the performance for all the metrics used. These metrics evaluate function prediction from different aspects, and thus it is difficult for one algorithm to outperform another on all the metrics.
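As an illustration of why 1-RankLoss is reported, ranking loss for one protein can be sketched as below (our reading of the standard definition, which the paper takes from the multi-label learning literature; some variants additionally count ties as half an error):

```python
# Sketch: ranking loss = fraction of (relevant, irrelevant) label
# pairs that the predicted scores order incorrectly; the paper
# reports 1 - RankLoss so that larger is better for every metric.

def rank_loss(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return 0.0
    bad = sum(1 for p in pos for n in neg if p <= n)
    return bad / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
loss = rank_loss(y_true, scores)   # one of four pairs mis-ordered
one_minus = 1 - loss
```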
Replenishing missing function labels
In this section, we conduct experiments to study the performance of PILL on replenishing the missing annotations of hierarchically and partially labeled proteins. In these experiments, we consider all the proteins in a dataset as both training and testing data. The experimental results with m=1,3,5 on CollinsPPI with respect to the FunCat labels are reported in Table 2 (the best and comparable results are in bold font, with statistical significance examined by a pairwise t-test at the 95% significance level). Other results on CollinsPPI, KroganPPI and ScPPI are reported in Tables S1-S5 of Additional file 1. For each setting of m, the experiments are repeated 20 times. In each run, the masked labels of a protein are randomly chosen from its leaf function labels, and these masked labels are considered as missing for testing. If s is a non-leaf function label of a protein and all of its descendant function labels are missing (or masked), then s becomes a leaf function label and can itself be masked for this protein.
Table 2.
Results of replenishing missing labels on CollinsPPI wrt FunCat labels

Metric      m (N_m)   PILL        ProDM       ProWL       LkNN        TPRw        Naive
MicroF1     1 (1526)  93.91±0.11  83.30±0.30  90.31±0.08  44.07±0.14  50.00±0.12  29.00±0.01
            3 (4330)  81.70±0.29  72.16±0.77  78.38±0.23  41.61±0.16  43.60±0.18  29.77±0.13
            5 (6580)  70.53±0.31  60.10±1.01  66.61±0.16  37.54±0.21  36.79±0.13  30.09±0.06
MacroF1     1 (1526)  89.29±0.25  69.53±0.40  85.75±0.34  34.23±0.21  43.33±0.15  4.70±0.01
            3 (4330)  70.19±0.63  60.78±1.73  69.03±0.46  29.23±0.48  35.45±0.39  5.06±0.04
            5 (6580)  55.32±0.94  45.37±1.90  52.95±0.59  24.12±0.62  27.06±0.75  5.13±0.05
AvgROC      1 (1526)  99.47±0.01  97.44±0.06  98.27±0.09  66.14±0.05  69.67±0.18  49.44±0.00
            3 (4330)  97.77±0.16  93.86±0.44  93.35±0.18  64.86±0.11  64.93±0.21  49.44±0.00
            5 (6580)  94.64±0.33  87.03±1.04  86.24±0.49  63.25±0.34  60.41±0.36  49.44±0.00
1-RankLoss  1 (1526)  99.43±0.03  96.80±0.04  98.55±0.05  69.38±0.04  55.75±0.14  79.33±0.04
            3 (4330)  97.58±0.11  92.15±0.26  94.62±0.17  66.09±0.27  46.90±0.47  76.72±0.22
            5 (6580)  94.55±0.27  86.63±0.67  89.30±0.25  59.65±0.65  36.88±0.41  74.52±0.41
Fmax        1 (1526)  90.88±0.07  76.28±0.31  80.49±0.34  42.74±0.09  58.43±0.36  28.32±0.00
            3 (4330)  76.82±0.10  67.39±0.84  66.14±0.29  42.16±0.25  51.12±0.50  27.93±0.01
            5 (6580)  66.11±0.50  56.26±4.01  57.76±0.52  40.39±0.37  44.01±0.53  27.04±0.00

m is the number of missing labels per protein and N_m (in brackets) is the total number of missing labels over all the proteins. The numbers in boldface denote the best performance.
From the results reported in these tables, we observe that PILL outperforms the competing methods across all the evaluation metrics in most cases. In summary, out of 90 configurations (3 datasets × 2 kinds of labels × 5 evaluation metrics × 3 settings of m), PILL outperforms ProDM in 85.56% of the cases and ProWL in 91.11% of the cases, ties with them in 4.44% and 4.44% of the cases, and loses to them in 5.56% and 4.44% of the cases, respectively. PILL outperforms LkNN, TPRw and Naive in all configurations. Taking MacroF1 on CollinsPPI annotated with FunCat labels as an example, PILL is on average 23.30% better than ProDM, 4.27% better than ProWL, 147.33% better than LkNN, and 106.61% better than TPRw. These results corroborate the effectiveness of PILL in replenishing missing labels.
PILL largely outperforms ProDM and ProWL, even though the latter two also leverage the correlation between function labels and the interactions between proteins. The reason is that ProDM and ProWL use a Cosine-based similarity to define the correlation between function labels, and they do not explicitly make use of the hierarchical relationship among labels. Since the label vector implicitly encodes the hierarchical relationship of labels to some extent, ProDM and ProWL can achieve results comparable to (or slightly better than) PILL in a few cases.
LkNN and TPRw explicitly utilize the hierarchical relationship among labels, but they are not able to compete with PILL, ProDM and ProWL. The cause is twofold: (i) LkNN and TPRw assume that the labels of the labeled proteins are complete, so they use partially labeled proteins to predict missing labels without estimating the missing labels in advance; and (ii) they do not utilize the flat relationships among function labels. Naive ranks the function labels according to their frequencies and sets the frequency as the predicted probability of each label. Since the missing labels are leaf function labels, whose frequencies are smaller than those of non-leaf function labels, Naive achieves the lowest AvgROC and MacroF1 scores, a medium 1-RankLoss score, and almost the lowest Fmax and MicroF1 scores among the comparing methods. Naive performs better than some methods in a few cases, but it is outperformed by PILL by a large margin across all the evaluation metrics. These results show that PILL can exploit the hierarchical and flat relationships among labels to boost the performance of protein function prediction.
Real Life Examples: Another experiment is performed to study the ability of PILL to provide hypothetical missing labels. In particular, we use the GO term associations (download date: 2014-02-01) of S. cerevisiae to annotate the proteins in ScPPI (here we do not apply the filtering process used in the previous experiments to remove overly specific and overly general labels, and these 5700 proteins were annotated with 2381 distinct biological process labels). We use PILL to replenish the missing labels of these proteins. There are 117 proteins in ScPPI with new labels in the updated GO term annotations [33] (download date: 2014-06-01), amounting to 451 newly appended labels for these proteins. We choose the top 50 function labels (from the 2381 distinct labels) as the hypothetical labels for each of these proteins. We observe that PILL can correctly replenish 30.38% (137/451) of the missing labels, and if we append the ancestor labels of these 137 labels to these 117 proteins, 40.80% (184/451) of the labels are correctly replenished. These proteins are provided in Additional file 2, and some examples are reported in Table 3. The table lists the original (dated before 2014-02-01) GO labels and the replenished labels of each protein, the supporting reference's PMID, the date on which the GO term annotation (or protein-label association) was added, and the GO term annotation evidence code. The evidence code indicates the type of evidence that supports the GO term annotation: 'IMP' is Inferred from Mutant Phenotype, 'IDA' is Inferred from Direct Assay, 'IGI' is Inferred from Genetic Interaction, and 'ISS' is Inferred from Sequence or Structural Similarity. These real-life examples demonstrate that PILL can confidently provide hypothetical missing labels from a large number of candidate labels.
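The ancestor-appending step described above can be sketched as follows (a minimal illustration; the parent map is a toy stand-in for the real GO DAG, and `go_ancestors` is a hypothetical helper name):

```python
# Sketch: collect all ancestors of a predicted GO term in a
# parent-map representation of the GO DAG, so that crediting a
# term also credits the more general terms it implies.

def go_ancestors(term, parents):
    """All ancestors of `term`, following every parent edge in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Toy DAG: GO:3 has two parents, one of which is itself a child of GO:1.
parents = {'GO:3': ['GO:1', 'GO:2'], 'GO:2': ['GO:1'], 'GO:1': []}
anc = go_ancestors('GO:3', parents)
```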
Table 3.
Examples of replenished labels for proteins by PILL and their support
references