Abstract
Background
Protein function prediction is the task of assigning biological or biochemical
functions to proteins. It is a challenging computational problem
characterized by several factors: (1) the number of function labels
(annotations) is large; (2) a protein may be associated with multiple
labels; (3) the function labels are structured in a hierarchy; and (4)
the labels are incomplete. Current predictive models often assume that
the labels of the labeled proteins are complete, i.e. no label is
missing. But in real scenarios, we may be aware of only some
hierarchical labels of a protein, and we may not know whether
additional ones are actually present. The scenario of incomplete
hierarchical labels, a challenging and practical problem, is seldom
studied in protein function prediction.
Results
In this paper, we propose an algorithm to Predict protein functions
using Incomplete hierarchical LabeLs (PILL in short). PILL takes into
account both the hierarchical and the flat taxonomic similarity between
function labels, and defines a Combined Similarity (ComSim) to measure
the correlation between labels. PILL estimates the missing labels of a
protein based on ComSim and the protein's known labels, and uses a
regularization term to exploit the interactions between proteins for
function prediction. PILL is shown to outperform other related
techniques in replenishing missing labels and in predicting the
functions of completely unlabeled proteins on publicly available PPI
datasets annotated with MIPS Functional Catalogue and Gene Ontology
labels.
Conclusion
The empirical study shows that it is important to account for
incomplete annotations in protein function prediction. The proposed
method (PILL) can serve as a valuable tool for protein function
prediction using incomplete labels. The Matlab code of PILL is
available upon request.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0430-y)
contains supplementary material, which is available to authorized
users.
Keywords: Function prediction, Incomplete hierarchical labels, Combined
similarity, Gene ontology
Background
The increasing amount of proteomic data produced by high-throughput
technologies makes it crucial, but challenging, to develop computational
models that can identify hypothetical functions of proteins. Such
techniques have the potential to drive the biological validation and
discovery of novel protein functions, and to save experimental cost. At
the same time, functional annotations of proteins have been incorporated
into several bioinformatics tools (e.g., Panther [1], IntPath [2], and
InterProScan [3]) to investigate the semantic similarity between
proteins, functional interactions between proteins, pathway enrichment,
functional enrichment, and phylogenetic trees [4,5].
Protein function prediction is a challenging computational problem,
characterized by several intrinsic difficulties: the number of function
labels is rather large, each protein can have several labels, and the
labels are structured in a hierarchy and are unbalanced. Furthermore,
the function labels associated with proteins are uncertain and incomplete.
Various computational models have been proposed to address one or more
of these issues [3,6-10]. Some models use cost-sensitive learning and
hierarchical classification [8,11]; others apply multi-label learning
[12,13], classifier ensembles [8,12], and multiple network (kernel)
integration [14] to exploit the complementary information spread across
different heterogeneous data sources. More recent approaches incorporate
evolutionary knowledge [15], pathways [1,2,16], domains [17], or
negative example selection [7,18]. For a complete review of protein
function prediction, see [6,10,19]. Radivojac et al. [9,19] organized a
large-scale community-based critical assessment of protein function
annotation, and found that there is significant room for improving
protein function prediction.
Protein function prediction can be viewed as a multi-label learning
problem [7,10,12,20,21]. Recently, multi-label learning approaches that
use the correlation (or similarity) between function labels have been
introduced. Pandey et al. [22] incorporated label correlation using
Lin's similarity [23] into the k-nearest neighborhood (LkNN)
classifier; the authors observed that utilizing the correlation between
function labels can boost the prediction accuracy. Zhang and Dai [24]
investigated the usefulness of functional interrelationships based on
Jaccard coefficients for protein function prediction. Wang et al. [25]
introduced a function-function correlated multi-label learning method
to infer protein functions. Yu et al. [12] studied a directed
bi-relational graph (composed of protein nodes and function label
nodes) to utilize the correlation between function labels for protein
function prediction. Chi and Hou [26] assumed that the label sets of
two proteins can influence their similarity, and introduced a Cosine
Iterative Algorithm (CIA). In each iteration of CIA, the function
predicted with highest confidence is appended to the label set of a
protein; the pairwise similarity between training and testing proteins
is then updated based on the extended label sets. CIA combines the
updated pairwise similarity, the function correlation based on cosine
similarity, and the PPI network topology to predict functions in
consecutive iterations.
Most of these multi-label learning algorithms focus on exploiting label
correlations to boost prediction accuracy, under the assumption that
the labels of labeled proteins used for training are complete, i.e. no
label is missing. For various reasons (e.g., evolving Gene Ontology
terms, or limitations of experimental methods), in practice we may be
aware of only some functions, while additional functions (unknown to
us) may also be associated with the protein. In other words, proteins
are partially labeled. Learning from partially labeled multi-label
instances (or proteins) can be formulated as a multi-label, weak-label
learning problem [27-29].
Several multi-label, weak-label learning algorithms have been
introduced in the past years. Sun et al. [27] studied a multi-label,
weak-label learning method called WELL. WELL assumes there is a margin
between instances of different classes and that any given label has a
small number of member instances. To make use of the label correlation
among multi-label instances, this approach assumes that there is a
group of low-rank similarities, and that the similarity between
instances of different labels can be approximated from these
similarities. However, WELL relies on quadratic programming to compute
the low-rank similarities and to make the final predictions; it is
therefore computationally expensive and can hardly make predictions for
samples with a large number of labels. Bucak et al. [30] proposed a
weak-label learning approach called MLR-GL. MLR-GL optimizes a convex
objective function that includes a ranking loss and a group Lasso loss;
it aims at labeling instances with no labels by using partially labeled
instances. Yang et al. [28] introduced a multi-instance, multi-label
weak-label learning algorithm. Yu et al. [29] proposed an approach
called ProWL to predict protein functions using partially labeled
proteins. ProWL exploits the label correlation and the available labels
of a protein to estimate the likelihood of a missing function for the
protein, and integrates these estimates with a smoothness loss function
to replenish the missing function labels and to predict functions for
proteins with no labels. Yu et al. [31] assumed that a function label
depends on the feature information of proteins and introduced an
algorithm called ProDM, which maximizes this dependency to replenish
the missing function labels and to predict functions for unlabeled
proteins.
However, these weak-label learning techniques only use the flat
relationships among function labels, and do not explicitly take into
account the hierarchical relationship among labels. The MIPS Functional
Catalogue (FunCat) [32] organizes the function labels in a tree
structure, and the Gene Ontology (GO) [33] organizes the function terms
(or labels) in a directed acyclic graph. It has been reported that
exploiting the hierarchical relationship among function labels can
boost the accuracy of protein function prediction [7,8,11,22]. For
example, Barutcuoglu et al. [11] suggested that organizing the
predictions produced by the binary classifiers for the individual
function labels in a Bayes network can improve the accuracy of gene
function prediction. Tao et al. [34] utilized an information-theoretic
metric to measure the interrelationships between function labels and to
determine whether a certain function label belongs to a protein or not.
However, this method cannot predict functions for unlabeled proteins,
since it only employs the known annotations of a protein to infer its
other potential annotations. Jiang et al. [35] combined the relational
PPI network and the label hierarchy to predict consistent functions by
setting the descendants of a function label as negative whenever this
label is set to negative. Pandey et al. [22] used Lin's similarity to
capture the relationship among hierarchically organized labels.
Schietgat et al. [36] integrated hierarchical multi-label decision
trees for protein function prediction. Valentini [7] post-processed the
predictions made by a binary classifier for each label according to the
true path rule in the GO and FunCat hierarchies, in a method called
TPR. Cesa-Bianchi et al. [8] integrated cost-sensitive learning and
data fusion with TPR to further boost the accuracy of protein function
prediction. Valentini [10] advocated in his recent survey that it is
paramount to exploit the hierarchical relationship among function
labels for protein function prediction.
According to the true path rule [7] in GO and FunCat: (i) if a protein
is labeled with a function, then it should also be labeled with the
ancestor functions (if any) of this function; (ii) if a protein cannot
be labeled with a function, then it should not be labeled with the
descendant functions (if any) of this function. In [29,31], the
incomplete annotation problem was simulated by randomly masking
function labels in a flat style, ignoring the hierarchical relationship
between labels. In such a simulation, if a function label of a protein
is masked, the protein may still be labeled with the descendant
functions of this label, and the masked function can then be directly
inferred from its descendant function labels.
In this paper, we study the incomplete label problem in a hierarchical
manner, as opposed to a flat style. We propose an approach called PILL
to predict protein functions using partially labeled proteins with
hierarchical labels. PILL integrates the hierarchical and flat
relationships between function labels to estimate the likelihoods of
missing labels, and exploits the interactions between proteins to
replenish the missing annotations and to predict the functions of
unlabeled proteins. In particular, PILL simulates incomplete
hierarchical labels by randomly masking the leaf function labels of a
protein, which is closer to the real situation than the simulation in
the previous studies [29,31]. We conducted experiments on three
publicly available PPI datasets, each annotated with FunCat labels and
GO labels. The experimental results show that PILL outperforms other
related algorithms in replenishing the missing labels of partially
labeled proteins and in predicting functions for completely unlabeled
proteins.
The incomplete hierarchical label problem
Figure 1 illustrates an example of the incomplete hierarchical label
problem for proteins annotated with FunCat labels. A corresponding
example for GO labels is given in Figure S1 of Additional file 1. In
Figure 1, p1 and p2 are partially labeled (missing labels are marked by
a question mark ?), and p3 is completely unlabeled. Note that other
FunCat labels (i.e., '12.03' and '03') are not really missing for these
proteins, and thus are not shown in the figure; these function labels
will nevertheless also be treated as candidate 'missing' labels. The
missing labels are leaf function labels: if a non-leaf function label
of a protein is missing, we can directly append it to the protein from
its descendant function labels. Whether a function label is a leaf or a
non-leaf is defined with respect to a single protein. For example,
'12.04' is a leaf function label for p2, but it is a non-leaf function
label for p1, since p1 is labeled with a descendant label ('12.04.02')
of '12.04'. Our task is to replenish the missing labels of p1 and p2,
and to predict functions for p3. To this end, we define three kinds of
relationships between function labels: (i) parent-child (e.g., '01.03'
is a child function label of '01'); (ii) grandparent-grandson (e.g.,
'01.03.02' is a grandson label of '01'); and (iii) uncle-nephew (e.g.,
if we consider '01' as a sibling of '12', although these two labels do
not have an explicit common parent label, '12' is an uncle label of
'01.03'). These relationships are discussed further in the next
section.
Figure 1.
Illustration of incomplete hierarchical labels for proteins annotated
with MIPS FunCat labels. A rectangle represents a protein (pi,
i∈{1,2,3}); an ellipse denotes a function label; and an undirected line
between rectangles captures a protein-protein interaction (the more
reliable the interaction, the thicker the line). All the function
labels in the ellipses (including the missing function labels denoted
by colored ellipses with question marks '?') should be associated with
the proteins, but only the function labels in the white ellipses are
known. For better visualization, other function labels (i.e., '12.03'
and '03', which are not ground-truth labels for these proteins) are not
plotted in the figure.
Methods
Function correlation definition
A protein often has multiple functions, which are organized in a tree
hierarchy (FunCat) or in a directed acyclic graph (GO). Previous work
[7,10,11,22] has demonstrated that exploiting the hierarchical
relationship among function labels can boost the performance of protein
function prediction. Pandey et al. [22] used Lin's similarity [23] to
take advantage of the hierarchical relationship between function
labels. Lin's similarity measures the similarity of two function labels
in terms of their proximity in the hierarchical ontology, as well as
their information content. It is defined as follows:
$$\mathrm{LinSim}(s,t) = \frac{2 \times \log p_{ca}(s,t)}{\log p(s) + \log p(t)},$$
(1)
and
$$p_{ca}(s,t) = \min_{k \in ca(s,t)} p(k)$$
(2)
Here s and t are two function labels, and p(s) denotes the probability
for a protein to be labeled with s; p(s) can be estimated from the
number of member proteins of s in an organism. ca(s,t) is the set of
common ancestors of s and t, and p_ca(s,t) denotes the probability of
the most specific function label in the hierarchy that subsumes both s
and t. Intuitively, Eq. (1) measures the semantic similarity of s and t
in terms of the content of their minimum subsumer node in the
hierarchy. Clearly, LinSim(s,t)=1 if s=t, and LinSim(s,t)=0 when the
minimum subsumer is the root node of the ontology, or when the function
label corresponding to the minimum subsumer node is associated with all
the proteins of an organism. LinSim(s,t) can also be viewed as a
correlation measure between s and t. According to this definition,
LinSim(s,t) is large if s and t often co-annotate the same proteins and
their most specific ancestor label is close to s and t but far away
from the root node. On the other hand, if the most specific ancestor of
s and t is (close to) the root node, while s and t are far away from
the root node in the hierarchy, LinSim(s,t) will be small.
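As a concrete illustration, Eqs. (1)-(2) can be sketched as follows; the toy hierarchy, the probabilities, and all helper names are illustrative assumptions, not part of the paper's (Matlab) implementation.

```python
import math

# Toy sketch of Lin's similarity (Eqs. 1-2) on a FunCat-like fragment:
# '01' -> {'01.01', '01.03'}, '01.03' -> '01.03.02'. Probabilities assumed.
def lin_sim(s, t, prob, ancestors):
    """prob[k]: fraction of an organism's proteins annotated with label k;
    ancestors[k]: strict ancestors of k in the label hierarchy."""
    if s == t:
        return 1.0
    common = ancestors[s] & ancestors[t]         # ca(s, t)
    if not common:
        return 0.0                               # no common ancestor
    p_ca = min(prob[k] for k in common)          # Eq. (2): most specific subsumer
    if p_ca >= 1.0:
        return 0.0                               # subsumer annotates every protein
    return 2 * math.log(p_ca) / (math.log(prob[s]) + math.log(prob[t]))

prob = {'01': 0.5, '01.01': 0.2, '01.03': 0.25, '01.03.02': 0.1}
ancestors = {'01': set(), '01.01': {'01'}, '01.03': {'01'},
             '01.03.02': {'01', '01.03'}}
print(round(lin_sim('01.01', '01.03.02', prob, ancestors), 3))
```

Here the minimum subsumer of '01.01' and '01.03.02' is '01', so the similarity is moderate; two labels whose only subsumer is the root would score 0.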
However, if s is an ancestor of t, taking s itself as the common
ancestor of s and t is preferable to any other common ancestor label,
since s is more specific than any other label in the common ancestor
set, and s also subsumes both s and t. The more specific a function,
the fewer member proteins it has, and the smaller the probability that
a protein is labeled with it. Therefore, we substitute p_ca(s,t) with
p_sa(s,t), defined as follows:
$$p_{sa}(s,t) = \min_{k \in sa(s,t)} p(k)$$
(3)
sa(s,t) represents the set of shared ancestors of s and t, which
additionally includes s if t is a descendant label of s, or t if t is
an ancestor label of s. Thus, ca(s,t) ⊆ sa(s,t). We extend Lin's
similarity to a similarity named HSim(s,t) by substituting p_ca(s,t) in
Eq. (1) with p_sa(s,t). If s is an ancestor label of t, HSim(s,t) is no
smaller than LinSim(s,t), since s is more specific than any function
label in ca(s,t) (i.e., p(s) ≤ p_ca(s,t)). When s and t are siblings
(or cousins), HSim(s,t) and LinSim(s,t) coincide. sa(s,t) often
includes more specific functions (e.g., the parent function label of t)
than ca(s,t), since ca(s,t) ⊆ sa(s,t). If t is missing for a protein,
but the ancestor function labels of t (including its parent function
label s) are associated with this protein, the missing-label estimate
obtained from the parent function is more reliable than that obtained
from other ancestor functions (e.g., grandparent functions). This
property of function label hierarchies motivates us to estimate the
missing labels using HSim instead of LinSim. The statistics computed in
the next section support this rationale.
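The substitution of sa(s,t) for ca(s,t) can be sketched as follows; again, the toy hierarchy and probabilities are assumptions for illustration only.

```python
import math

# Toy sketch of HSim: Lin's formula with ca(s,t) replaced by sa(s,t) (Eq. 3).
prob = {'01': 0.5, '01.03': 0.25, '01.03.02': 0.1}
ancestors = {'01': set(), '01.03': {'01'}, '01.03.02': {'01', '01.03'}}

def shared_ancestors(s, t):
    # sa(s,t): common strict ancestors, plus s (or t) when one subsumes the other
    return (ancestors[s] | {s}) & (ancestors[t] | {t})

def sim(s, t, p_subsumer):
    return 2 * math.log(p_subsumer) / (math.log(prob[s]) + math.log(prob[t]))

def h_sim(s, t):
    return sim(s, t, min(prob[k] for k in shared_ancestors(s, t)))

def lin_sim(s, t):
    return sim(s, t, min(prob[k] for k in ancestors[s] & ancestors[t]))

# '01.03' subsumes '01.03.02', so sa(s,t) contains '01.03' itself and
# HSim exceeds LinSim, as argued above.
print(round(h_sim('01.03', '01.03.02'), 3), round(lin_sim('01.03', '01.03.02'), 3))
```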
Nevertheless, when s and t have no shared ancestor (e.g., the function
labels in the first level of MIPS FunCat do not have an ancestor
label), p_sa(s,t)=0; when the most specific shared function label is
associated with almost all the proteins (e.g., the function label
corresponding to the root node of the GO Biological Process
sub-ontology), p_sa(s,t)≈1 and HSim(s,t)≈0. But HSim(s,t)≈0 does not
mean that s and t are uncorrelated. For example, there are 272 proteins
in S. cerevisiae labeled with '40' (CELL FATE), 448 proteins labeled
with '43' (CELL TYPE DIFFERENTIATION), and 170 proteins labeled with
both '40' and '43'. If a protein is labeled with '40' and it is unknown
whether this protein has '43', we have 170/272=62.5% confidence that
this protein is also labeled with '43'. However, neither HSim nor
LinSim can capture this, because '40' and '43' do not share any
ancestor label and both measures only consider the hierarchical
relationship between function labels. In fact, it has been observed
that flat label relationships are also beneficial for protein function
prediction [24,25,29]. To overcome this limitation of HSim(s,t), we
introduce ComSim(s,t) to describe the correlation between function
labels:
$$\mathrm{ComSim}(s,t) =
\begin{cases}
\mathrm{HSim}(s,t), & \text{if } p_{sa}(s,t) \in (0,1) \\
\mathrm{JcdSim}(s,t), & \text{otherwise}
\end{cases}$$
(4)
where JcdSim is the similarity based on the Jaccard coefficient,
JcdSim(s,t)=|N(s)∩N(t)|/|N(s)∪N(t)|. N(·) denotes the set of proteins
labeled with the corresponding function label and |N(·)| is the
cardinality of the set. From this definition, if s and t do not have
shared ancestor function labels, ComSim(s,t) is large when they are
often associated with the same set of proteins, and small when they
seldom are. When s, t, and the most specific shared ancestor of these
two function labels are always associated with the same proteins,
ComSim(s,t)=1; in this case, JcdSim(s,t) is also 1. As such, ComSim
captures both the hierarchical and the flat relationships between
functions.
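The case analysis of Eq. (4) can be sketched as follows. The member sets mimic the '40'/'43' example above (272, 448, and 170 shared proteins), while the organism size and all helper names are assumptions.

```python
import math

# Toy sketch of ComSim (Eq. 4): hierarchical branch (HSim) when a shared
# ancestor exists with 0 < p_sa < 1, flat branch (Jaccard) otherwise.
def jcd_sim(s, t, members):
    a, b = members[s], members[t]
    return len(a & b) / len(a | b)                    # Jaccard coefficient

def com_sim(s, t, prob, ancestors, members):
    sa = (ancestors[s] | {s}) & (ancestors[t] | {t})  # shared ancestors sa(s,t)
    p_sa = min((prob[k] for k in sa), default=0.0)
    if 0.0 < p_sa < 1.0:                              # hierarchical branch: HSim
        return 2 * math.log(p_sa) / (math.log(prob[s]) + math.log(prob[t]))
    return jcd_sim(s, t, members)                     # flat branch: JcdSim

n = 6000                                              # assumed organism size
members = {'40': set(range(272)),                     # 272 proteins with '40'
           '43': set(range(102, 550))}                # 448 with '43', 170 shared
prob = {'40': 272 / n, '43': 448 / n}
ancestors = {'40': set(), '43': set()}                # top level: no shared ancestor
print(round(com_sim('40', '43', prob, ancestors, members), 3))
```

Since '40' and '43' share no ancestor, ComSim falls back to the Jaccard branch, 170/550 ≈ 0.309, instead of the uninformative HSim value of 0.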
Statistics of hierarchical function label relationships
From the true path rule of function label hierarchies, it’s easy to
observe that:
  * p(s|par(s)) ≥ p(s|gpar(s))
  * p(s|gpar(s)) ≥ p(s|uncle(s))
where par(s) denotes the parent function label of s, gpar(s) the
grandparent function label of s, and uncle(s) an uncle (parent's
sibling) function label of s. p(s|par(s)) is the conditional
probability that a protein is labeled with s given that it is already
labeled with par(s). These inequalities hold because if a protein is
labeled with s, then it is also labeled with the ancestor functions of
s (including par(s) and gpar(s)), and if a protein is labeled with
uncle(s), it is also labeled with gpar(s). In contrast, if a protein is
labeled with par(s) (or gpar(s)), it is uncertain whether the protein
is labeled with s (or uncle(s)).
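A minimal counting sketch of the first inequality, on a made-up annotation table whose every row obeys the true path rule:

```python
# Toy check of p(s | par(s)) >= p(s | gpar(s)); hierarchy and annotations
# are illustrative assumptions.
parent = {'01.01': '01', '01.03': '01', '01.03.02': '01.03'}

annotations = [
    {'01', '01.03', '01.03.02'},
    {'01', '01.03'},
    {'01', '01.01'},
    {'01'},
]

def cond_prob(s, given):
    # p(s | given): fraction of proteins labeled with `given` that also have s
    both = sum(1 for a in annotations if s in a and given in a)
    return both / sum(1 for a in annotations if given in a)

s = '01.03.02'
p_given_par = cond_prob(s, parent[s])              # p(s | par(s))  = 1/2
p_given_gpar = cond_prob(s, parent[parent[s]])     # p(s | gpar(s)) = 1/4
print(p_given_par >= p_given_gpar)                 # True
```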
Based on these rules, we investigate the parent-child relationship by
counting the cases in which a protein is labeled with both par(s) and
s. Similarly, we investigate the grandparent-grandson (or uncle-nephew)
relationship by counting the cases in which a protein is labeled with
both gpar(s) (or uncle(s)) and s. The distributions of these three
statistics for S. cerevisiae proteins labeled with FunCat labels are
shown in the first three boxplots of Figure 2. In addition, we report
p(s|par(s))−p(s|gpar(s)) in the fourth boxplot of Figure 2. We also
show the distributions of all pairwise function correlations based on
the proposed ComSim, Lin's similarity, Cosine similarity, and the
Jaccard coefficient on the same data in Figure 2. The corresponding
distributions obtained on the S. cerevisiae proteins labeled with GO
labels are given in Figure S2 of Additional file 1. For a fair
comparison, all zero elements in these likelihoods and similarities are
removed, since some pairs of function labels do not have a hierarchical
(e.g., parent-child) relationship, or are not associated with the same
proteins.
Figure 2.
Label relationship statistics and four label similarities on proteins
of S. cerevisiae annotated with FunCat labels. Each boxplot describes
the distribution of a likelihood (or similarity): the central line is
the median, the edges of the box are the 25th and 75th percentiles, the
whiskers extend to the most extreme data points not considered
outliers, and the outliers are plotted individually as '+'.
The boxplots of Figure 2 support the relationships p(s|par(s)) ≥
p(s|gpar(s)) and p(s|gpar(s)) ≥ p(s|uncle(s)). If s is missing for a
protein, and the protein is labeled with par(s), gpar(s), and uncle(s),
the likelihood of the missing label s estimated from par(s) is more
reliable than that estimated from gpar(s) or uncle(s). The explanation
is straightforward: the more specific a function label, the fewer
member proteins it has. In other words, if par(s) is associated with a
protein, we can be sure that gpar(s) is also associated with that
protein, but not vice versa. Similarly, given that uncle(s) is a
sibling of par(s) and the two share the same parent, if a protein is
annotated with uncle(s), it is also annotated with gpar(s). Similar
results are obtained for the S. cerevisiae proteins annotated with GO
labels (see Figure S2 of Additional file 1).
In Figure 2, p(s|par(s)) is more evenly distributed than p(s|gpar(s))
and p(s|uncle(s)), and it has fewer outliers than the latter two. We
can also observe that the distributions of the function correlations
defined by LinSim and ComSim are closer to that of p(s|par(s)) than the
correlations defined by the Cosine similarity and the Jaccard
coefficient, and that the label correlations based on LinSim and ComSim
are more evenly distributed than those based on Cosine and Jaccard
similarity, since the former two have fewer outliers than the latter
two. ComSim considers both the hierarchical (measured by HSim) and flat
(measured by JcdSim) relationships among labels, and its margin between
the 25th and 75th percentiles is wider than that of LinSim. In
addition, the overlap between ComSim and p(s|par(s)) is larger than
that between LinSim and p(s|par(s)). We also studied the Gaussian
function $\exp(-(x-\mu)^2/\sigma^2)$ (where x is a likelihood or
similarity value, and μ and σ are the mean and standard deviation of x)
fitted to these likelihoods and similarities, and again observed that
ComSim overlaps more with p(s|par(s)) than the other similarity metrics
do (not reported). Since ComSim will be used to estimate the
likelihoods of missing labels, these differences indicate that ComSim
can estimate the missing labels more accurately than the other three
techniques. The advantage of ComSim is also verified in our
experiments.
Objective function
Given n proteins, let K be the number of distinct functions across all
proteins. Let Y=[y_1,y_2,…,y_n] be the original label matrix, with
y_ik=1 if protein i has the k-th function, and y_ik=0 if it is unknown
whether this protein has the k-th function. We assume the first l ≤ n
proteins are partially labeled and the remaining n−l proteins are
completely unlabeled. We define the normalized function correlation
matrix as
$$C_m(s,t) = \frac{\mathrm{ComSim}(s,t)}{\sum_{t'=1}^{K} \mathrm{ComSim}(s,t')}.$$
Based on the definition of C_m, we can estimate the likelihood of a
missing function label for the i-th (i ≤ l) partially labeled protein
as follows:
$$\tilde{y}_{ik} =
\begin{cases}
y_i^T C_m(\cdot,k), & \text{if } y_{ik}=0 \\
1, & \text{otherwise}
\end{cases}$$
(5)
If y_ik=0 and the k-th function label has a large correlation with the
already known functions of protein i, then it is likely that the k-th
function is missing for this protein, and ỹ_ik is assigned a large
value. ỹ_i is thus the label vector combining the confirmed labels (the
corresponding entries are set to 1) with the likelihoods of the missing
labels estimated from y_i and C_m (for the entries with y_ik=0).
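Eq. (5) can be sketched in a few lines of NumPy; the correlation matrix and label vector below are illustrative assumptions.

```python
import numpy as np

# Sketch of Eq. (5): estimate missing-label likelihoods from known labels
# and the row-normalized correlation matrix C_m. All numbers are made up.
com = np.array([[1.0, 0.6, 0.1, 0.0],     # assumed pairwise ComSim values
                [0.6, 1.0, 0.2, 0.1],
                [0.1, 0.2, 1.0, 0.5],
                [0.0, 0.1, 0.5, 1.0]])
C_m = com / com.sum(axis=1, keepdims=True)          # normalize each row

y_i = np.array([1.0, 0.0, 1.0, 0.0])                # known labels of protein i
y_tilde = np.where(y_i == 1, 1.0, y_i @ C_m)        # Eq. (5)
print(np.round(y_tilde, 3))
```

The second label, strongly correlated with the protein's known first label, receives a higher estimate (≈0.46) than the fourth (≈0.28); confirmed labels stay at 1.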
Based on ỹ_i, we can define the empirical loss function on the l
partially labeled proteins as follows:
$$\Psi_1(f) = \min_{f} \sum_{i=1}^{l} \big\| f_i - \tilde{y}_i \big\|_2^2
            = \min_{F} \operatorname{tr}\!\big( (F-\tilde{Y})^T U (F-\tilde{Y}) \big)$$
(6)
where f_i ∈ R^K is the likelihood vector to be predicted for the i-th
protein, F=[f_1,f_2,…,f_n] collects the predictions for the n proteins,
Ỹ=[ỹ_1,ỹ_2,…,ỹ_n] is the likelihood matrix of confirmed labels and
estimated missing labels for the n proteins, and U is an n×n diagonal
matrix with U_ii=1 if i ≤ l, and U_ii=0 otherwise.
Proteins with similar amino acid sequences are likely to share the same
functions. Schwikowski et al. [37] observed that two interacting
proteins are more likely to share functions than two non-interacting
proteins; this observation is known as the 'guilt by association' rule.
Inspired by work [38] stating that the labels of an unlabeled instance
can be linearly inferred from the labels of its neighbors, we introduce
a smoothness term that utilizes the interactions (or similarity)
between proteins:
$$\Psi_2(f) = \min_{f} \sum_{i=1}^{n} \Big\| f_i - \sum_{p_j \in N(p_i)} W_{ij} f_j \Big\|_2^2
            = \min_{F} \operatorname{tr}\!\big( F^T (I-W)^T (I-W) F \big),
  \quad \text{s.t. } \sum_{j=1}^{n} W_{ij} = 1$$
(7)
where N(p_i) is the set of proteins interacting with p_i, W_ij is the
weight of the interaction (similarity) between proteins i and j, and I
is the n×n identity matrix. Our motivation for minimizing Eq. (7) is
three-fold: (i) if two proteins i and j are quite similar to one
another (i.e., W_ij is large), then the margin between f_i and f_j
should be small, otherwise a large loss is incurred; (ii) if protein i
has missing labels and its interacting partners do have those labels,
then we can leverage this information to assist the replenishment of
the missing labels of protein i; (iii) if protein i is completely
unlabeled, its labels can be predicted from the labels of its partners.
Alternative ways to transfer labels among proteins (e.g., based on
functional connectivity or homology between proteins) have been
suggested in the literature (see [5,39-41]), and these methods could
also be adapted to replace Eq. (7). Since our work focuses on how to
replenish the missing labels and how to predict protein functions using
incomplete hierarchical labels, how to more efficiently utilize the
guilt-by-association rule and how to reduce noise in PPI networks to
boost the accuracy (e.g., by enhancing the functional content [42], or
by incorporating additional data sources [5,15,16]) is out of scope.
Based on Eq. (6) and Eq. (7), the objective function to be minimized by
the PILL algorithm is:
$$\Psi(F) = \operatorname{tr}\!\big( (F-\tilde{Y})^T U (F-\tilde{Y}) \big)
          + \lambda \, \operatorname{tr}\!\big( F^T (I-W)^T (I-W) F \big)$$
(8)
where λ>0 is a scalar parameter that balances the importance of the
empirical loss and the smoothness loss.
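The paper does not spell out its optimizer here, but setting the gradient of Eq. (8) to zero gives the linear system (U + λ(I−W)^T(I−W)) F = U Ỹ, so one plausible minimizer is a direct solve; the NumPy sketch below uses random synthetic data only to check that the gradient vanishes at the solution.

```python
import numpy as np

# One plausible way to minimize Eq. (8): grad = 2U(F - Y~) + 2*lam*M*F = 0,
# with M = (I-W)^T (I-W), yields (U + lam*M) F = U Y~. Data is synthetic.
rng = np.random.default_rng(0)
n, K, l, lam = 5, 3, 3, 1.0

W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)                  # rows sum to 1 (Eq. 7 constraint)
U = np.diag([1.0] * l + [0.0] * (n - l))           # loss only on labeled proteins
Y_tilde = np.zeros((n, K))
Y_tilde[:l] = rng.random((l, K))                   # label estimates (as in Eq. 5)

M = (np.eye(n) - W).T @ (np.eye(n) - W)
F = np.linalg.solve(U + lam * M, U @ Y_tilde)      # candidate minimizer of Eq. (8)

grad = 2 * U @ (F - Y_tilde) + 2 * lam * M @ F     # should vanish at the optimum
print(np.abs(grad).max() < 1e-8)
```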
Results and discussion
Datasets and experimental setup
We report results on three PPI networks, namely CollinsPPI, KroganPPI,
and ScPPI. We annotated the proteins in these networks according to
MIPS FunCat [32] and the Gene Ontology [33] (Biological Process non-IEA
terms), respectively. The statistics of these preprocessed datasets are
listed in Table 1. The CollinsPPI dataset, for example, has 1620
proteins labeled with 168 distinct GO labels and 176 distinct FunCat
labels; these proteins in total carry 22,023 GO label annotations and
13,320 FunCat label annotations, and on average each protein has about
13.59 GO labels and 8.22 FunCat labels. More details on these datasets
and the experimental setup are provided in Additional file 1. The label
vector of a protein implicitly encodes the hierarchical relationship
among labels: for example, if the entry index corresponding to '01.01'
in the label vector y_i ∈ R^K is t, and the entry index corresponding
to '01' (the ancestor function label of '01.01') is s, then y_it=1
implies y_is=1.
Table 1. Dataset statistics

Dataset     #Proteins  #FunCat labels  #GO labels   Avg±Std (FunCat)  Avg±Std (GO)
CollinsPPI  1620       176 (13320)     168 (22023)  8.22±5.60         13.59±8.28
KroganPPI   2670       228 (20384)     241 (32639)  7.63±5.81         12.22±8.83
ScPPI       5700       305 (36909)     372 (61048)  6.48±5.71         10.71±8.83
'#Proteins' is the number of proteins in a dataset. '#FunCat labels' is
the number of distinct FunCat labels of these proteins, with the total
number of FunCat label annotations over all proteins in brackets; '#GO
labels' is defined analogously for GO labels. 'Avg±Std (FunCat)' and
'Avg±Std (GO)' give the average number of FunCat (resp. GO) labels per
protein and the standard deviation.
There are no off-the-shelf proteomic datasets that can be directly used
to test solutions to the incomplete label problem, although this
problem is practical and common in real-world scenarios. To address
this issue, we assume the labels of the currently labeled proteins are
complete and randomly mask some of the ground-truth leaf functions of
each protein; the masked functions are considered missing for that
protein.
For notation, we use m to denote the number of missing functions of a protein. For example, if a protein has 10 function labels, m=3 means that three of its labels are masked. If a protein has no more than m labels, we do not mask all of its available labels, ensuring that it retains at least one function label. A small number of proteins in these networks do not have any label; we keep these proteins in the network to retain the network's structure, but do not test on them. We introduce N_m to denote the total number of missing labels over all the proteins for a given setting of m.
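The masking protocol above can be sketched as follows (our reading of the setup, not the authors' code; the label lists are toy data, and the sketch treats the given labels as maskable leaf labels):

```python
# Sketch: hide up to m labels per protein, never masking all of a
# protein's labels, and accumulate N_m over all proteins.
import random

def mask_labels(protein_labels, m, rng=random.Random(0)):
    """Return (kept, masked) lists for one protein's labels."""
    k = min(m, len(protein_labels) - 1)  # keep at least one label
    if k <= 0:
        return list(protein_labels), []
    masked = rng.sample(protein_labels, k)
    kept = [l for l in protein_labels if l not in masked]
    return kept, masked

proteins = {'P1': ['01.01', '02.04', '03.01'], 'P2': ['01.03']}
N_m = 0
for labs in proteins.values():
    kept, masked = mask_labels(labs, m=3)
    assert len(kept) >= 1          # every protein retains a label
    N_m += len(masked)
```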
Comparing methods and evaluation metrics
We compare PILL against ProDM [14], ProWL [29], LkNN [22], TPR [7], MLR-GL [30], CIA [26], and Naive [9]. ProDM and ProWL are designed to replenish missing labels and to predict protein functions using partially labeled proteins; neither explicitly considers the hierarchical relationship among function labels. LkNN utilizes LinSim in Eq. (1) to predict the functions of unlabeled proteins. TPR uses the true path rule (or hierarchical relationship) in label hierarchies to refine the predictions of binary classifiers trained for each label; we use the weighted version, TPRw, in the experiments. MLR-GL uses partially labeled instances in the training set to predict the labels of unlabeled instances. CIA is an iterative algorithm that uses function correlations based on Cosine similarity to infer protein functions. Naive, which ranks function labels by their frequencies, is a baseline approach in the community-based critical assessment of protein function annotation [9]; it has been reported that very few methods perform above the Naive method, so we include Naive as a reference. More details about the implementations and parameter settings of these methods are reported in Additional file 1.
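The Naive baseline is simple enough to sketch directly (a minimal illustration with toy annotations; `naive_scores` is a hypothetical helper name):

```python
# Sketch of the Naive baseline: score every candidate label by its
# relative frequency among the annotated proteins, so all proteins
# receive the same ranking of function labels.
from collections import Counter

def naive_scores(annotations):
    """annotations: list of per-protein label lists.
    Returns label -> fraction of proteins carrying that label."""
    counts = Counter(l for labels in annotations for l in labels)
    n = len(annotations)
    return {label: c / n for label, c in counts.items()}

scores = naive_scores([['01', '01.01'], ['01'], ['02']])
# '01' appears in 2 of 3 proteins; '01.01' and '02' in 1 of 3 each
```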
The performance of protein function prediction can be evaluated according to different criteria, and the choice of evaluation metric affects different prediction algorithms differently [9,29]. For a fair and comprehensive comparison, we use five representative metrics, namely MacroF1, MicroF1, AvgROC, RankingLoss and Fmax. These metrics are extensively applied to evaluate the performance of multi-label learning algorithms and protein function prediction [9,21,29]; their formal definitions are provided in Additional file 1. To keep consistency across all evaluation metrics, we report 1-RankLoss instead of RankingLoss, so that the higher the value, the better the performance for all the metrics used. These metrics evaluate function prediction from different aspects, and thus it is difficult for one algorithm to outperform another on all the metrics.
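As an illustration of why 1-RankLoss is reported, ranking loss for one protein can be sketched as below (our reading of the standard definition, which the paper takes from the multi-label learning literature; some variants additionally count ties as half an error):

```python
# Sketch: ranking loss = fraction of (relevant, irrelevant) label
# pairs that the predicted scores order incorrectly; the paper
# reports 1 - RankLoss so that larger is better for every metric.

def rank_loss(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        return 0.0
    bad = sum(1 for p in pos for n in neg if p <= n)
    return bad / (len(pos) * len(neg))

y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
loss = rank_loss(y_true, scores)   # one of four pairs mis-ordered
one_minus = 1 - loss
```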
Replenishing missing function labels
In this section, we conduct experiments to study the performance of PILL on replenishing the missing annotations of hierarchically and partially labeled proteins. In these experiments, we consider all the proteins in a dataset as both training and testing data. The experimental results with m=1,3,5 on CollinsPPI with respect to the FunCat labels are reported in Table 2 (the best and comparable results are in bold font, with statistical significance examined by a pairwise t-test at the 95% significance level). Other results on CollinsPPI, KroganPPI and ScPPI are reported in Tables S1-S5 of Additional file 1. For each setting of m, the experiments are repeated 20 times. In each run, the masked labels of a protein are randomly chosen from its leaf function labels, and these masked labels are considered as missing for testing. If s is a non-leaf function label of a protein and all of its descendant function labels are missing (or masked), then s becomes a leaf function label and can itself be masked for this protein.
Table 2.
Results of replenishing missing labels on CollinsPPI wrt FunCat labels

Metric      m (N_m)   PILL        ProDM       ProWL       LkNN        TPRw        Naive
MicroF1     1 (1526)  93.91±0.11  83.30±0.30  90.31±0.08  44.07±0.14  50.00±0.12  29.00±0.01
            3 (4330)  81.70±0.29  72.16±0.77  78.38±0.23  41.61±0.16  43.60±0.18  29.77±0.13
            5 (6580)  70.53±0.31  60.10±1.01  66.61±0.16  37.54±0.21  36.79±0.13  30.09±0.06
MacroF1     1 (1526)  89.29±0.25  69.53±0.40  85.75±0.34  34.23±0.21  43.33±0.15  4.70±0.01
            3 (4330)  70.19±0.63  60.78±1.73  69.03±0.46  29.23±0.48  35.45±0.39  5.06±0.04
            5 (6580)  55.32±0.94  45.37±1.90  52.95±0.59  24.12±0.62  27.06±0.75  5.13±0.05
AvgROC      1 (1526)  99.47±0.01  97.44±0.06  98.27±0.09  66.14±0.05  69.67±0.18  49.44±0.00
            3 (4330)  97.77±0.16  93.86±0.44  93.35±0.18  64.86±0.11  64.93±0.21  49.44±0.00
            5 (6580)  94.64±0.33  87.03±1.04  86.24±0.49  63.25±0.34  60.41±0.36  49.44±0.00
1-RankLoss  1 (1526)  99.43±0.03  96.80±0.04  98.55±0.05  69.38±0.04  55.75±0.14  79.33±0.04
            3 (4330)  97.58±0.11  92.15±0.26  94.62±0.17  66.09±0.27  46.90±0.47  76.72±0.22
            5 (6580)  94.55±0.27  86.63±0.67  89.30±0.25  59.65±0.65  36.88±0.41  74.52±0.41
Fmax        1 (1526)  90.88±0.07  76.28±0.31  80.49±0.34  42.74±0.09  58.43±0.36  28.32±0.00
            3 (4330)  76.82±0.10  67.39±0.84  66.14±0.29  42.16±0.25  51.12±0.50  27.93±0.01
            5 (6580)  66.11±0.50  56.26±4.01  57.76±0.52  40.39±0.37  44.01±0.53  27.04±0.00

m is the number of missing labels per protein and N_m (in brackets) is the total number of missing labels over all the proteins. The numbers in boldface denote the best performance.
From the results reported in these tables, we observe that PILL outperforms the competing methods across all the evaluation metrics in most cases. In summary, out of 90 configurations (3 datasets × 2 kinds of labels × 5 evaluation metrics × 3 settings of m), PILL outperforms ProDM in 85.56% of the cases and ProWL in 91.11% of the cases, ties with them in 4.44% and 4.44% of the cases, and loses to them in 5.56% and 4.44% of the cases, respectively. PILL outperforms LkNN, TPRw and Naive in all configurations. Taking MacroF1 on CollinsPPI annotated with FunCat labels as an example, PILL is on average 23.30% better than ProDM, 4.27% better than ProWL, 147.33% better than LkNN, and 106.61% better than TPRw. These results corroborate the effectiveness of PILL in replenishing missing labels.
PILL largely outperforms ProDM and ProWL, even though the latter two also leverage the correlation between function labels and the interactions between proteins. The reason is that ProDM and ProWL use a Cosine-based similarity to define the correlation between function labels, and they do not explicitly make use of the hierarchical relationship among labels. Since the label vector implicitly encodes the hierarchical relationship of labels to some extent, ProDM and ProWL can achieve results comparable to (or slightly better than) PILL in a few cases.
LkNN and TPRw explicitly utilize the hierarchical relationship among labels, but they are not able to compete with PILL, ProDM and ProWL. The cause is twofold: (i) LkNN and TPRw assume that the labels of the labeled proteins are complete, so they use partially labeled proteins to predict missing labels without estimating the missing labels in advance; and (ii) they do not utilize the flat relationships among function labels. Naive ranks the function labels according to their frequencies and sets the frequency as the predicted probability of each label. Since the missing labels are leaf function labels, whose frequencies are smaller than those of non-leaf function labels, Naive achieves the lowest AvgROC and MacroF1 scores, a medium 1-RankLoss score, and almost the lowest Fmax and MicroF1 scores among the comparing methods. Naive performs better than some methods in a few cases, but it is outperformed by PILL by a large margin across all the evaluation metrics. These results show that PILL can exploit the hierarchical and flat relationships among labels to boost the performance of protein function prediction.
Real Life Examples: Another experiment is performed to study the ability of PILL to provide hypothetical missing labels. In particular, we use the GO term associations (download date: 2014-02-01) of S. cerevisiae to annotate the proteins in ScPPI (here we do not apply the filtering process used in the previous experiments to remove overly specific and overly general labels, and these 5700 proteins were annotated with 2381 distinct biological process labels). We use PILL to replenish the missing labels of these proteins. There are 117 proteins in ScPPI with new labels in the updated GO term annotations [33] (download date: 2014-06-01), amounting to 451 newly appended labels for these proteins. We choose the top 50 function labels (from the 2381 distinct labels) as the hypothetical labels for each of these proteins. We observe that PILL can correctly replenish 30.38% (137/451) of the missing labels, and if we append the ancestor labels of these 137 labels to these 117 proteins, 40.80% (184/451) of the labels are correctly replenished. These proteins are provided in Additional file 2, and some examples are reported in Table 3. The table lists the original (dated before 2014-02-01) GO labels and the replenished labels of each protein, the supporting reference's PMID, the date on which the GO term annotation (or protein-label association) was added, and the GO term annotation evidence code. The evidence code indicates the type of evidence that supports the GO term annotation: 'IMP' is Inferred from Mutant Phenotype, 'IDA' is Inferred from Direct Assay, 'IGI' is Inferred from Genetic Interaction, and 'ISS' is Inferred from Sequence or Structural Similarity. These real-life examples demonstrate that PILL can confidently provide hypothetical missing labels from a large number of candidate labels.
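The ancestor-appending step described above can be sketched as follows (a minimal illustration; the parent map is a toy stand-in for the real GO DAG, and `go_ancestors` is a hypothetical helper name):

```python
# Sketch: collect all ancestors of a predicted GO term in a
# parent-map representation of the GO DAG, so that crediting a
# term also credits the more general terms it implies.

def go_ancestors(term, parents):
    """All ancestors of `term`, following every parent edge in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        for p in parents.get(stack.pop(), []):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Toy DAG: GO:3 has two parents, one of which is itself a child of GO:1.
parents = {'GO:3': ['GO:1', 'GO:2'], 'GO:2': ['GO:1'], 'GO:1': []}
anc = go_ancestors('GO:3', parents)
```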
Table 3.
Examples of replenished labels for proteins by PILL and their support
references