Abstract
Phosphorylation is catalyzed by protein kinases and plays an irreplaceable role in regulating biological processes. Identifying phosphorylation sites together with their corresponding kinases contributes to the understanding of molecular mechanisms. Mass spectrometry analysis of phosphoproteomes generates large numbers of phosphorylated sites. However, experimental methods are costly and time-consuming, and most phosphorylation sites determined experimentally lack kinase information. Computational methods are therefore urgently needed to address the kinase identification problem. To this end, we propose a new kernel-based machine learning method called Supervised Laplacian Regularized Least Squares (SLapRLS), which adopts a new way of constructing kernels based on the similarity matrix and minimizes both the structural risk and the overall inconsistency between labels and similarities. Results on both Phospho.ELM and an additional independent test dataset indicate that SLapRLS identifies kinases more effectively than existing algorithms.
Introduction
Protein phosphorylation is one of the most pervasive posttranslational modifications and plays an important role in regulating nearly all types of cellular processes, including signal transduction, translation and transcription [1-5]. Phosphorylation is catalyzed by protein kinases [6], which regulate most cellular processes. More than one-third of proteins can be phosphorylated, and half of all protein kinases are closely linked to cancers and other diseases [7]. Each protein kinase specifically catalyzes a certain subset of substrates, and deficiencies in protein kinases often cause diseases, including cancers [6]. In this regard, identifying potential phosphorylation sites and their corresponding protein kinases is beneficial for elucidating molecular mechanisms.
Conventional experimental methods, such as the high-throughput technique mass spectrometry [8], have been developed to identify phosphorylation sites. Although these methods provide the foundation for understanding the mechanisms underlying phosphorylation, they are often costly and time-consuming. Additionally, although mass spectrometry can generate large numbers of phosphorylated sites, most of these sites lack kinase annotation: the kinase that catalyzes each site is unknown. For example, Phospho.ELM, a database of experimentally verified phosphorylation sites, contains 37,145 phosphorylation sites, but only a small fraction of them (3,599 entries) have corresponding kinase information. Given the limitations of experimental methods, computational methods are required to identify the protein kinases of specific phosphorylation sites based on experimentally verified data.
Many computational methods have been developed for protein phosphorylation prediction. The first, proposed by Blom [9], was based on an artificial neural network applied to peptide sequences. Since then, a large number of methods have been developed, such as PPSP [10] and Musite [11]. PPSP constructs a classifier using Bayesian decision theory (BDT) under the assumption that all flanking residues are independent of each other. Musite calculates the distance between two peptide sequences using Blosum62, a substitution matrix reflecting the relationships between amino acids, and then constructs a classifier with a support vector machine (SVM). Owing to the increasing demand for kinase identification, a few machine learning-based methods have been developed in recent years. Among them, NetworKIN [12] uses consensus sequence motifs and a probabilistic protein association network, while iGPS [13] is based on peptide sequence similarity and uses protein-protein interaction (PPI) information to control the false positive rate.
Despite the success achieved by these computational approaches, most of them neglect the geometry of the probability distribution [14], which hampers further improvement of prediction accuracy. For example, SVM focuses only on structural risk minimization and, with quadrature encoding, ignores the intrinsic relationships between different amino acids. Additionally, the distance between two peptide sequences defined in Musite may fail to fulfill the triangle inequality [11]. To address these problems, Belkin et al. [14] proposed a framework that exploits the geometry of the probability distribution; their test results showed that the framework efficiently addressed classification problems.
In this study, we propose a kernel-based supervised learning algorithm called Supervised Laplacian Regularized Least Squares (SLapRLS), which incorporates a new kernel construction method and brings together spectral graph theory, regularization and the geometry of the probability distribution for kinase identification. In SLapRLS, suitable translations are performed on a similarity matrix to make it act as a kernel matrix [15]. Additionally, we introduce the overall inconsistency between sample similarities and labels for each class [16] and minimize both this inconsistency and the structural risk. To compare the proposed algorithm with existing ones, we perform 10-fold cross-validation using data retrieved from Phospho.ELM and compare SLapRLS with three classical algorithms: SVM, BDT and k-nearest neighbors (KNN). To further confirm the effectiveness of SLapRLS, an additional independent test dataset is used to compare SLapRLS with two other kinase identification tools, iGPS and NetworKIN. The results show that SLapRLS is more effective than the competing algorithms and that the kernel matrix construction method is useful for identifying the kinases of known phosphorylation sites.
Materials and Methods
Data description
In this work, we extracted 37,145 experimentally verified human phosphorylation sites, including 3,599 sites with corresponding kinase information, from the most recent version of Phospho.ELM [17]. Among the sites with kinase information, 2,398 unique phosphorylation sites in 934 proteins remain after removing duplicated data. To avoid the overestimation caused by homology bias and redundancy, we cluster the protein sequences using Blastclust with a threshold of 70%; only one representative of each cluster is retained [18]. As a result, 2,289 sites in 889 proteins are employed for the analysis, comprising 1,823 serine (S)/threonine (T) phosphorylation sites and 446 tyrosine (Y) phosphorylation sites. For each kinase, the corresponding phosphorylation sites are treated as positive data, whereas sites phosphorylated by other kinases are treated as negative data. Kinases with too few known phosphorylated substrates are excluded to ensure reliable results; after removing kinases with fewer than 20 positive items, 23 types of kinases remain for investigation.
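As a rough illustration of this labeling scheme, the sketch below builds per-kinase positive and negative sets and applies the 20-substrate cutoff; the record format and all names are our assumptions, not the authors' pipeline code.

```python
# Hypothetical sketch of the per-kinase dataset construction; the record
# format (protein_id, site_position, kinase) is an assumption.
from collections import defaultdict

MIN_POSITIVES = 20  # kinases with fewer known substrates are excluded

def build_datasets(records):
    """records: iterable of (protein_id, site_position, kinase) tuples."""
    by_kinase = defaultdict(set)
    for protein, site, kinase in records:
        by_kinase[kinase].add((protein, site))
    kept = {k: s for k, s in by_kinase.items() if len(s) >= MIN_POSITIVES}
    all_sites = set().union(*kept.values()) if kept else set()
    # For each kinase: its own sites are positives; sites of other kinases
    # are negatives.
    return {k: (sorted(pos), sorted(all_sites - pos)) for k, pos in kept.items()}
```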
Because iGPS and NetworKIN were trained on data retrieved from the Phospho.ELM database, the test dataset in this study at least partially overlaps with the training data of these two methods. This overlap would inevitably inflate the apparent prediction performance of iGPS and NetworKIN. To obtain a fair comparison, an independent dataset is adopted in this work [19]. Similarly, protein kinases in the independent test dataset with fewer than 20 items are excluded to ensure reliable results. Finally, we select 6 kinases from the independent dataset: PKC alpha, Erk2, Erk1, P38a, SRC and SYK.
Algorithm
In this work, we propose SLapRLS, which brings together spectral graph theory, regularization and the geometry of the probability distribution, building on regularized least squares (RLS) theory [14]. Like SVM, RLS minimizes the structural risk [20]. SLapRLS is based on the manifold assumption that similar samples tend to receive similar predictions, so samples of the same class should be assigned similar outputs. Therefore, the overall inconsistency between labels and pairwise similarities within each class should be minimized. SLapRLS aims to minimize both the structural risk and this overall inconsistency.
Feature description
In this work, we make full use of sequence information in modeling. A 15-amino-acid local sequence, comprising the 7 residues upstream and the 7 residues downstream of the phosphorylation site (S, T or Y), is used to represent each candidate site. Thus, a phosphorylation site can be denoted as s = (s(1), s(2), ..., s(8), ..., s(15)), where s(i) is the amino acid at the i-th position and s(8) is the phosphorylation site itself.
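A minimal sketch of this windowing, assuming sites near the protein termini are padded with a placeholder residue 'X' (the padding convention is our assumption):

```python
def local_window(sequence, site_index, flank=7):
    """Return the 15-residue peptide centred on sequence[site_index]."""
    padded = 'X' * flank + sequence + 'X' * flank  # pad ends for edge sites
    centre = site_index + flank
    return padded[centre - flank: centre + flank + 1]

# Example: a tyrosine at 0-based position 3 of a short sequence.
print(local_window("MKTYQRLS", 3))  # -> 'XXXXMKTYQRLSXXX'
```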
Structural risk minimization and RLS
Structural risk minimization aims to minimize both the VC confidence and the empirical risk [21]. The squared difference between the true label and the predicted result is often used as the loss function when calculating the empirical risk [22]. The optimization problem of RLS is:
$$\min_{f}\ \frac{1}{l}\sum_{i=1}^{l}\big(y_i-f(x_i)\big)^2+\gamma_A\,\lVert f\rVert_K^2 \qquad (1)$$

where y_i and f(x_i) represent the true label and the predicted result of the i-th sample, respectively, and l is the number of training samples.
Inconsistency between labels and pairwise similarities
A good predictor should map similar samples to similar results, and thus the overall inconsistency between labels and pairwise similarities should be minimized [16]. The inconsistency contains two parts: the inconsistency within the positive dataset and the inconsistency within the negative dataset. The overall inconsistency to be minimized is:
$$\min\ \frac{1}{p^2}\sum_{i,j=1}^{p}\big(f(x_i)-f(x_j)\big)^2 W_{ij}+\frac{1}{n^2}\sum_{i,j=p+1}^{p+n}\big(f(x_i)-f(x_j)\big)^2 W_{ij} \qquad (2)$$
where p and n represent the numbers of positive and negative samples, respectively, and W_ij is the similarity between samples x_i and x_j.
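Each sum in Eq (2) is a graph-Laplacian quadratic form; the standard spectral graph theory identity (stated here for completeness, with f the vector of predicted values over one class) is

$$\sum_{i,j}\big(f(x_i)-f(x_j)\big)^2 W_{ij}=2\,\mathbf{f}^{\top}\big(D-W\big)\,\mathbf{f},\qquad D_{ii}=\sum_{j}W_{ij},$$

which is what connects Eq (2) to the normalized graph Laplacians L_P and L_N used in the solution below.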
Supervised LapRLS
Supervised LapRLS is based on the principle that both the inconsistency between labels and pairwise similarities and the structural risk should be minimized. The optimization problem combines Eqs (1) and (2), as shown in Eq (3). Because this is a multi-objective optimization problem, a weight parameter γ_I is introduced to balance the two objectives [23].
$$\min_{f}\ \frac{1}{l}\sum_{i=1}^{l}\big(y_i-f(x_i)\big)^2+\gamma_A\,\lVert f\rVert_K^2+\gamma_I\left(\frac{1}{p^2}\sum_{i,j=1}^{p}\big(f(x_i)-f(x_j)\big)^2 W_{ij}+\frac{1}{n^2}\sum_{i,j=p+1}^{p+n}\big(f(x_i)-f(x_j)\big)^2 W_{ij}\right) \qquad (3)$$
Data imbalance is a common problem in bioinformatics: negative samples often greatly outnumber positive ones. However, few methods have been proposed to address this problem. In this paper, we assign different penalty coefficients to different samples [24]. Therefore, SLapRLS solves the following optimization problem:
$$\min_{f}\ \frac{1}{l}\sum_{i=1}^{l}c_i\big(y_i-f(x_i)\big)^2+\gamma_A\,\lVert f\rVert_K^2+\gamma_I\left(\frac{1}{p^2}\sum_{i,j=1}^{p}\big(f(x_i)-f(x_j)\big)^2 W_{ij}+\frac{1}{n^2}\sum_{i,j=p+1}^{p+n}\big(f(x_i)-f(x_j)\big)^2 W_{ij}\right) \qquad (4)$$
where c = (c_1, c_2, ..., c_i, ..., c_l) is the vector of penalty coefficients and c_i represents the misclassification cost of the i-th sample x_i. The misclassification cost has two parts: the cost of misclassifying positive samples as negative and the cost of misclassifying negative samples as positive. If class A contains more samples than class B, the model tends to classify test data as class A; if the penalty for each class were equal, more class B samples would be misclassified. To address this imbalance, the smaller class is assigned a larger penalty and the larger class a smaller penalty: the penalty coefficients for positive and negative samples are set to n/(p + n) and p/(p + n), respectively, according to the numbers of positive and negative samples. The two tuning parameters γ_A and γ_I in Eq (4) were selected by grid search over the range [10^-5, 10^5] via ten-fold cross-validation [25,26]; the selected values of γ_A and γ_I for each kinase are listed in S1 Table. Belkin et al. proved that optimization problems with objective functions similar to Eq (2) are all convex [14] and that their solutions share the form shown in Eq (5). Using K to represent the kernel function, we can calculate f*(x) as follows:
$$f^{*}(x)=\sum_{i=1}^{l}\alpha_i\,K(x_i,x) \qquad (5)$$
By solving the convex optimization problem in Eq (4), we obtain α as follows:
$$\boldsymbol{\alpha}=\left(K+\gamma_A l I+\gamma_I l\left(\frac{1}{p^2}C^{-1}L_P K_{P,P}+\frac{1}{n^2}C^{-1}L_N K_{N,N}\right)\right)^{-1}Y \qquad (6)$$
Here, K is the kernel matrix, of size (p + n) × (p + n), which can be written in block form as

$$K=\begin{bmatrix}K_{P,P} & K_{P,N}\\ K_{N,P} & K_{N,N}\end{bmatrix}$$

where K_{A,B} denotes the kernel matrix between two datasets A and B, and P and N represent the positive and negative datasets, respectively. L_P is the graph Laplacian of the positive data, given by
$$L_P=D_P^{-1/2}\left(D_P-W_{P,P}\right)D_P^{-1/2} \qquad (7)$$
where the diagonal matrix D_P is given by:

$$\big(D_P\big)_{ii}=\sum_{x_j\in P}W_{i,j},\qquad x_i\in P \qquad (8)$$
Similarly, L_N is the graph Laplacian of the negative data, given by

$$L_N=D_N^{-1/2}\left(D_N-W_{N,N}\right)D_N^{-1/2} \qquad (9)$$
where the diagonal matrix D_N is given by:

$$\big(D_N\big)_{ii}=\sum_{x_j\in N}W_{i,j},\qquad x_i\in N \qquad (10)$$
In Eq (6), C is the diagonal matrix C = diag(c_1, c_2, ..., c_l), Y = (y_1, y_2, ..., y_l), W_{P,P} is the similarity matrix among samples in the positive dataset, and W_{N,N} is the similarity matrix among samples in the negative dataset.
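For concreteness, the following is a minimal numpy sketch of Eqs (5)-(10), assuming the l samples are ordered with the p positives first; embedding L_P K_{P,P} and L_N K_{N,N} as diagonal blocks of a full-size matrix is our reading of the block notation in Eq (6), and all function names are illustrative.

```python
import numpy as np

def normalized_laplacian(W):
    """L = D^{-1/2} (D - W) D^{-1/2} with D_ii = sum_j W_ij (Eqs 7-10)."""
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    return np.eye(len(W)) - W * np.outer(d_inv_sqrt, d_inv_sqrt)

def slaprls_alpha(K, W, y, p, gamma_A, gamma_I):
    """Closed-form solution for alpha, Eq (6); positives occupy rows 0..p-1."""
    l, n = len(y), len(y) - p
    c = np.where(np.arange(l) < p, n / l, p / l)  # imbalance penalties c_i
    C_inv = np.diag(1.0 / c)
    M = np.zeros((l, l))                          # block-diagonal Laplacian term
    M[:p, :p] = normalized_laplacian(W[:p, :p]) @ K[:p, :p] / p**2
    M[p:, p:] = normalized_laplacian(W[p:, p:]) @ K[p:, p:] / n**2
    A = K + gamma_A * l * np.eye(l) + gamma_I * l * (C_inv @ M)
    return np.linalg.solve(A, y)

def predict(alpha, K_new_train):
    """f*(x) = sum_i alpha_i K(x_i, x), Eq (5); K_new_train[j, i] = K(x_i, x_j)."""
    return K_new_train @ alpha
```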
Similarity among samples
Because SLapRLS is based on sample similarities, the method used to calculate similarity can have a large impact. Blosum62 is a matrix that reflects the relationships among amino acids and has proven efficient for calculating pairwise similarity [11]. Here, we denote Blosum62 as matrix B and let a and b represent two amino acids. The similarity W_ij between two samples s_i and s_j is then calculated as follows:
$$W_{i,j}=\sum_{t=1}^{w}\operatorname{sim}\big(s_i(t),s_j(t)\big) \qquad (11)$$
where w is the window size of the local peptide sequence, set to 15 in this study, and s_i(t) is the amino acid at the t-th position of s_i. Because the similarity between samples should be non-negative, we normalize B using:
$$\operatorname{sim}(a,b)=\frac{B(a,b)-\min(B)}{\max(B)-\min(B)}\ \geq\ 0 \qquad (12)$$
Because sim(a, b) is non-negative for every pair of amino acids, it follows that W_ij is also non-negative.
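A short sketch of Eqs (11) and (12); loading Blosum62 from Biopython is our assumption, and any substitution matrix indexed by residue pairs would serve:

```python
from Bio.Align import substitution_matrices

B = substitution_matrices.load("BLOSUM62")
B_MIN, B_MAX = B.min(), B.max()

def sim(a, b):
    """Eq (12): min-max normalised, non-negative amino acid similarity."""
    return (B[a, b] - B_MIN) / (B_MAX - B_MIN)

def peptide_similarity(s_i, s_j):
    """Eq (11): W_ij as the sum of position-wise similarities."""
    return sum(sim(a, b) for a, b in zip(s_i, s_j))
```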
Kernel matrix construction
Kernel-based algorithms embed the dataset into a Hilbert space, and the kernel matrix completely reflects the relative positions of the samples in the embedding space. Several mathematically defined kernel functions exist (e.g., the Gaussian kernel and the spline kernel) and have been widely used in many research fields. However, these mathematically defined kernels have only a few parameters and cannot effectively reflect the relationships between objects in a specific field. For example, in kinase identification the Gaussian kernel requires pairwise distances. A common way to calculate such a distance is to encode the peptide sequence using quadrature encoding and then compute the Euclidean distance, which assumes that each amino acid is independent of the others. However, close relationships exist between amino acids, so calculating similarity through the Gaussian kernel may miss important information in the substrate sequence [11]. Because a kernel function essentially encodes pairwise similarity, a more reliable approach is to calculate the similarity directly from expert knowledge and other domain information rather than from a generic kernel function [27]. A kernel matrix must be symmetric and positive definite [28]. In this regard, we can translate the similarity matrix so that it fulfills these two properties. The similarity matrix calculated with Eq (11) is already symmetric, so we only need to add a small multiple of the identity to its diagonal to force it to be positive definite; the translated similarity matrix can then be treated as the kernel matrix [15]. A summary of SLapRLS is shown in Fig 1, and the procedure of this work is shown in Fig 2.
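A minimal sketch of this translation, under our assumption that the diagonal shift is chosen just large enough to make the smallest eigenvalue positive:

```python
import numpy as np

def similarity_to_kernel(W, eps=1e-6):
    """Turn a symmetric similarity matrix into a positive definite kernel."""
    K = 0.5 * (W + W.T)                    # enforce exact symmetry
    lam_min = np.linalg.eigvalsh(K).min()  # smallest eigenvalue
    if lam_min <= 0:
        K = K + (eps - lam_min) * np.eye(len(K))  # shift the diagonal
    return K
```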
Fig 1. A summary of SLapRLS.
Fig 2. Procedure of this work.
First, the labeled dataset is derived from Phospho.ELM and split into a training set and a test set. Second, a model is built from the training set and its similarity matrix using SLapRLS, and predictions are obtained for the test set. Additionally, an independent test dataset is used; the model that predicts the independent dataset is trained on the entire labeled dataset.
Performance evaluation
To evaluate the performance of the classifiers, we calculate the specificity (Sp), sensitivity (Sn), accuracy (Acc), precision (Pre) and Matthews correlation coefficient (Mcc). Sp and Sn represent the ratios of correctly predicted negative and positive sites, respectively; Acc indicates the percentage of correctly predicted sites; and Pre indicates the ratio of true positive sites among all predicted positive sites. Mcc reflects the balance between the true and predicted classes and measures the correlation between them. The definitions of Sn, Sp, Acc, Pre and Mcc are shown in Eqs (13)-(17), respectively. The receiver operating characteristic (ROC) curve is widely used to evaluate classifier performance in machine learning and plots (1 - Sp, Sn) using each predicted value as the threshold. The corresponding area under the ROC curve (AUC) represents the overall accuracy of a classifier.
$$\mathrm{Sn}=\frac{TP}{TP+FN} \qquad (13)$$

$$\mathrm{Sp}=\frac{TN}{TN+FP} \qquad (14)$$

$$\mathrm{Acc}=\frac{TP+TN}{TP+TN+FP+FN} \qquad (15)$$

$$\mathrm{Pre}=\frac{TP}{TP+FP} \qquad (16)$$

$$\mathrm{Mcc}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FN)(TP+FP)(TN+FN)(TN+FP)}} \qquad (17)$$
where TP, FP, TN and FN represent the number of true positives, false
positives, true negatives and false negatives, respectively.
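For reference, a direct sketch of Eqs (13)-(17) from the four confusion-matrix counts:

```python
import math

def metrics(TP, FP, TN, FN):
    sn  = TP / (TP + FN)                    # sensitivity, Eq (13)
    sp  = TN / (TN + FP)                    # specificity, Eq (14)
    acc = (TP + TN) / (TP + TN + FP + FN)   # accuracy,    Eq (15)
    pre = TP / (TP + FP)                    # precision,   Eq (16)
    mcc = (TP * TN - FP * FN) / math.sqrt(  # Matthews cc, Eq (17)
        (TP + FN) * (TP + FP) * (TN + FN) * (TN + FP))
    return sn, sp, acc, pre, mcc
```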
AUC is used as the overall performance measure for comparison with other algorithms, and Acc, Pre and Mcc are used to evaluate performance when Sp is extremely high. Notably, AUC cannot be used for comparison with existing tools because their prediction scores are not available. In that situation, we compare Sn values at a comparable Sp by selecting a suitable threshold.
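A sketch of this matched-Sp comparison, assuming prediction scores and binary labels as inputs (the names are ours):

```python
import numpy as np

def sn_at_matched_sp(scores, labels, target_sp):
    """Sweep thresholds; return the (Sp, Sn) pair whose Sp is closest to target."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    best = None
    for t in np.unique(scores):
        pred = scores >= t
        sp = np.mean(~pred[labels == 0])  # fraction of negatives rejected
        sn = np.mean(pred[labels == 1])   # fraction of positives accepted
        if best is None or abs(sp - target_sp) < abs(best[0] - target_sp):
            best = (sp, sn)
    return best
```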
Results and Discussion
Comparison with other algorithms
We first compare our method with three existing algorithms (SVM, BDT and KNN) by 10-fold cross-validation using local peptide sequence information. For SVM, peptide sequences are coded into numeric features with a quadrature encoder; we adopt LIBSVM with an RBF kernel [29], and the parameters C and γ are chosen by cross-validation. For BDT, the method introduced in PPSP [10] is adopted. For KNN, the parameter K is set to 11 and the Blosum62 matrix is employed to calculate the similarity among samples. Letting S_i+ be the total similarity between sample s_i and its K nearest positive samples, and S_i- the total similarity between s_i and its K nearest negative samples, the final predicted score of s_i is the ratio of S_i+ to S_i- [11].
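A sketch of this KNN scoring rule, with similarities computed as in Eq (11) (the function signature is our assumption):

```python
import numpy as np

def knn_score(sim_to_pos, sim_to_neg, k=11):
    """Ratio of total similarity to the K nearest positives vs. negatives."""
    s_plus = np.sort(sim_to_pos)[-k:].sum()   # S_i+: top-K positive similarities
    s_minus = np.sort(sim_to_neg)[-k:].sum()  # S_i-: top-K negative similarities
    return s_plus / s_minus
```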
ROC curves are used to compare the four algorithms. The ROC curves for the kinases Erk2, Erk1, CDC2 and PKC alpha are shown in Fig 3, with the red, blue, yellow and cyan lines representing SLapRLS, SVM, BDT and KNN, respectively. As shown in Fig 3, the red curve lies above the other three, indicating that SLapRLS achieves better overall performance than the other algorithms.
Fig 3. ROC curves of different algorithms.
ROC curves for the kinases Erk2, Erk1, CDC2 and PKC alpha obtained with the four algorithms. The red, blue, yellow and cyan lines represent SLapRLS, SVM, BDT and KNN, respectively.
To illustrate the robustness of the proposed method, we repeat the 10-fold cross-validation five times and compare the AUCs. Detailed results are listed in Table 1. As expected, SLapRLS achieves better performance than the other three algorithms on the S/T and Y substrate kinases. For example, the average AUC achieved by SLapRLS on kinase PKC alpha is 7.7%, 5.7% and 14.2% higher than that of SVM, BDT and KNN, respectively.
Table 1. Comparison of mean AUC values for the four algorithms: SLapRLS, SVM, BDT and KNN.

Kinase        SLapRLS  SVM    BDT    KNN
S/T
PKC alpha     0.833    0.756  0.776  0.691
ATM           0.964    0.964  0.905  0.876
CDC2          0.918    0.714  0.894  0.880
Erk2          0.882    0.733  0.825  0.816
Erk1          0.888    0.760  0.831  0.815
AurA          0.782    0.652  0.724  0.718
BARK1         0.747    0.659  0.566  0.644
CaMK2a        0.862    0.682  0.688  0.757
CDK2          0.887    0.783  0.780  0.833
GSK3B         0.806    0.694  0.758  0.704
CK2_a1ora2    0.954    0.932  0.938  0.930
MAPKAPK2      0.651    0.502  0.500  0.716
PDK1          0.854    0.787  0.811  0.842
P38a          0.838    0.708  0.781  0.732
PLK1          0.734    0.655  0.606  0.594
Y
ABL1          0.560    0.609  0.467  0.519
EGFR          0.559    0.460  0.594  0.530
FYN           0.730    0.629  0.650  0.683
INSR          0.547    0.537  0.427  0.506
LCK           0.638    0.573  0.589  0.634
LYN           0.619    0.641  0.560  0.672
SRC           0.564    0.570  0.509  0.548
SYK           0.718    0.726  0.677  0.658
Additionally, Sn, Acc, Pre and Mcc are used to evaluate the performance of the four algorithms at a high stringency level (Sp = 0.99). The phosphorylated S/T and Y site kinases are divided into two groups (S/T substrate kinases and Y substrate kinases), and the performance on the two groups is plotted in Fig 4. The results show that SLapRLS achieves higher Sn, Pre, Acc and Mcc and a slightly higher Sp. For example, the average Sn and Pre achieved by SLapRLS on the Y substrate kinases are more than 2% and 9% higher, respectively, than those of the other algorithms. Table 1 and Fig 4 also show that SLapRLS performs better on the S/T substrate kinases than on the Y substrate kinases.
Fig 4. Comparison of the four algorithms at a high stringency level (Sp = 0.99).
The Sn, Sp, Acc and Mcc values of the four algorithms on the S/T and Y kinases at a high stringency level (Sp = 0.99).
Because SLapRLS relies on the similarity between samples, a good similarity measure is essential for satisfactory performance. In this work, the similarity between samples is represented by peptide similarity, which is calculated from sequence information. Sequence conservation is strong in the S/T substrate kinases but weak in the Y substrate kinases [30]. Fig 5 shows that the amino acid distributions of two S/T substrate kinases (ATM and CK2_a1ora2) are more strongly conserved than those of two Y substrate kinases (INSR and EGFR). As shown in Fig 5, substrates of kinases for which performance is good tend to exhibit strong conservation, whereas conservation is weak for kinases with poor performance. Local peptide sequence similarity may therefore fail to reflect the true similarity between samples for kinases with weak sequence conservation, which is why all four algorithms tend to perform better on the S/T substrate kinases than on the Y substrate kinases. It should also be noted that although only sequence information is used as the input feature here, SLapRLS can handle any type of data from which similarities between samples can be calculated.
Fig 5. WebLogos of S/T substrate kinases and Y substrate kinases.
A and C are the WebLogos of the kinases ATM and CK2_a1ora2; B and D are the WebLogos of the kinases EGFR and INSR.
Comparison with existing tools
To further evaluate the performance of SLapRLS, we compare it with two existing kinase identification tools, iGPS and NetworKIN. Because cross-validation is not feasible for iGPS and NetworKIN, we adopt an independent test dataset and compare Sn at the same high-stringency Sp level across the different methods. The comparison results are shown in Table 2. Although SLapRLS uses only sequence information as a feature, whereas both iGPS and NetworKIN additionally use protein-protein interaction information to filter their results, SLapRLS still achieves satisfactory performance (for 5 of the 6 kinases, it outperforms iGPS and NetworKIN). For instance, iGPS achieves an Sp of 0.780 and an Sn of 0.525 for PKC alpha. To make a fair comparison, we set the threshold so that SLapRLS has a comparable Sp (0.782) and then calculate the corresponding Sn (0.983); the Sn of SLapRLS is thus 46% higher than that of iGPS. These results indicate that SLapRLS can effectively identify the corresponding kinase of a new site.
Table 2. Comparison among SLapRLS, iGPS and NetworKIN on the independent test data.

            iGPS            SLapRLS         NetworKIN       SLapRLS
Kinase      Sp      Sn      Sp      Sn      Sp      Sn      Sp      Sn
PKC alpha   0.780   0.525   0.782   0.983   0.997   0.475   0.997   0.559
Erk2        0.466   0.709   0.471   0.993   0.865   0.278   0.870   0.329
Erk1        0.508   0.709   0.510   0.974   0.939   0.222   0.942   0.247
P38a        0.367   0.703   0.369   0.865   1.000   0.027   0.933   0.054
SRC         0.300   0.875   0.310   0.867   0.300   0.100   0.310   0.867
SYK         0.283   0.850   0.301   0.850   0.830   0.300   0.849   0.400

For each tool, the SLapRLS threshold is chosen so that its Sp matches that of the tool in the adjacent columns.
A case study
To illustrate the usefulness of SLapRLS and to further elucidate the biological mechanisms underlying phosphorylation, we perform an enrichment analysis on phosphorylation sites and their corresponding kinases, using the kinase PKC alpha to illustrate the ability of SLapRLS to discover new phosphorylation sites. A total of 110 sites on 66 proteins phosphorylated by PKC alpha are extracted from the Phospho.ELM database. Additionally, we perform cross-validation on PKC alpha with SLapRLS and identify 56 candidate sites among the top 100 predicted scores. Enrichment analysis of the combined known and predicted proteins is performed with DAVID [31] to identify enriched pathways. As shown in Table 3, 6 KEGG pathways are highly enriched, with the most significant term related to the regulation of phosphorylation (Benjamini P-value 9.2E-5). Additionally, 6 proteins in this pathway are predicted as substrates of PKC alpha by SLapRLS but are not included in Phospho.ELM, indicating that SLapRLS is able to identify potential kinases for known phosphorylation sites.
Table 3. Pathway enrichment analysis of PKC alpha.

Term                              Count    P-value   Benjamini P-value
Regulation of phosphorylation     14 (6)   1.4e-7    9.2e-5
Cell migration                    10 (3)   4.2e-6    7.7e-4
Learning or memory                5 (2)    1.7e-3    1.8e-2
Regulation of heart contraction   3 (2)    2.4e-3    3.6e-2

Counts in parentheses are the proteins predicted as PKC alpha substrates by SLapRLS but not annotated in Phospho.ELM.
Conclusion
Phosphorylation plays an important role in multiple biological processes, and protein kinases are closely related to many diseases. Identifying the corresponding kinases of known phosphorylation sites is therefore important. Because experimental methods are costly and time-consuming, computational methods for kinase identification are urgently needed. At present, existing computational methods for phosphorylation prediction neglect the geometry of the data distribution, and most kernel-based methods rely on distance. These distance-based methods typically assume that amino acids are independent when computing distances from local protein sequence information, neglecting the connections between amino acids, which are also very important for expressing the relationships among samples.
In this work, we propose the kernel-based algorithm SLapRLS, which relies on similarity rather than distance and introduces the inconsistency between labels and pairwise similarities to reflect the geometric distribution of the data. Instead of optimizing a single objective, SLapRLS minimizes two: the structural risk and the overall inconsistency between labels and pairwise similarities. Because the kernel function reflects the closeness of two samples, we derive the kernel matrix from the similarity matrix by translation rather than using an off-the-shelf kernel function. The results show that SLapRLS outperforms three other algorithms (SVM, BDT and KNN) and two existing kinase identification tools (iGPS and NetworKIN).
It should be noted that SLapRLS is built on a similarity matrix. Although only local sequence information is used as a feature in this work, SLapRLS can handle any type of data as a feature, including categorical and numerical types. Although SLapRLS solves the kinase identification problem efficiently, there is room for further improvement. For example, we focus only on local sequence information and disregard all other biological information; however, protein function and structural information have been proven useful for phosphorylation prediction [30,32]. Such information could therefore be utilized for kinase identification in future work, for which a combination rule should be introduced to incorporate different types of features.
Supporting Information
S1 Table. The selected parameters γ_A and γ_I for each kinase.
(XLSX)
Acknowledgments