Abstract

Background

   S-sulphenylation is a ubiquitous protein post-translational
   modification (PTM) where an S-hydroxyl (−SOH) bond is formed via the
   reversible oxidation of the sulfhydryl group of cysteine (C). Recent
   experimental studies have revealed that S-sulphenylation plays critical
   roles in many biological functions, such as protein regulation and cell
   signaling. State-of-the-art bioinformatic advances have facilitated
   high-throughput in silico screening of protein S-sulphenylation sites,
   thereby significantly reducing the time and labour costs traditionally
   required for the experimental investigation of S-sulphenylation.

Results

   In this study, we have proposed a novel hybrid computational framework,
   termed SIMLIN, for accurate prediction of protein S-sulphenylation
   sites using a multi-stage neural-network based ensemble-learning model
   integrating both protein sequence derived and protein structural
   features. Benchmarking experiments against the current state-of-the-art
   predictors for S-sulphenylation demonstrated that SIMLIN delivered
   competitive prediction performance. The empirical studies on the
   independent testing dataset demonstrated that SIMLIN achieved 88.0%
   prediction accuracy and an AUC score of 0.82, which outperforms
   currently existing methods.

Conclusions

   In summary, SIMLIN predicts human S-sulphenylation sites with high
   accuracy, thereby facilitating biological hypothesis generation and
   experimental validation. The web server, datasets, and online
   instructions are freely available at [35]http://simlin.erc.monash.edu/
   for academic purposes.

   Keywords: Protein post-translational modification, S-sulphenylation,
   Bioinformatics software, Machine learning, Ensemble learning

Background

   Post-translational modifications (PTMs) of the cellular proteome
   provide a dynamic regulatory landscape that includes both rapid
   reversible modifications and long-lasting irreversible modifications in
   response to cellular perturbations [[36]1]. In particular, reactive
   oxygen species
   (ROS), which are highly reactive and toxic molecules generated during
   mitochondrial metabolism, have been shown to play important signalling
   roles in the presence of oxidative stress and cellular pathophysiology
   in various complex diseases when their levels are altered in periods of
   cellular stress [[37]2–[38]5]. In the redox environment,
   S-sulphenylation (i.e. S-sulfenylation), a type of PTM that occurs at
   cysteine residues, is a fleeting and reversible covalent oxidation of
   cysteinyl thiols (Cys-SH) towards sulfenic acids (Cys-SOH) in the
   presence of hydrogen peroxide, which thereby acts as a rapid sensor of
   oxidative stress [[39]6–[40]12]. Thus far, a number of experiments have
   validated that S-sulphenylation plays important roles in regulating
   protein functions under both physiologic and oxidatively stressed
   conditions [[41]7, [42]9–[43]11, [44]13–[45]19]. Despite the lack of
   knowledge regarding the specific functionality of this redox
   modification in human cell systems, it has been reported that
   S-sulphenylation is involved in many signal transduction processes,
   such as the deubiquitinase activity in ovarian tumors and growth factor
   stimulation [[46]11, [47]17, [48]20]. Furthermore, including
   S-sulphenylation, more than 200 sulfenic modifications have been
   identified in various proteins, such as transcription factors,
   signaling proteins, metabolic enzymes, proteostasis regulators, and
   cytoskeletal components [[49]17]. Although only approximately 2% of
   proteins in the human, mouse, and rat proteomes contain cysteine
   residues [[50]21], it is essential to understand the underlying
   mechanisms that contribute to the residues’ critical roles in various
   biological processes, such as S-sulphenylation, regulation of oxidative
   PTMs, and the quantification of sulfenic modification processes [[51]6,
   [52]7, [53]9, [54]10, [55]14–[56]16].

   Despite the significant progress in selective labelling methods for
   S-sulphenylation using β-dicarbonyl compounds dimedone and analogues,
   it remains challenging to accurately characterize protein
   S-sulphenylation sites experimentally, due to their intrinsic
   instability and low abundance of cysteine residues [[57]6–[58]8,
   [59]11, [60]17, [61]20, [62]22]. Moreover, experimental identification
   of S-sulphenylation is labour-intensive and particularly difficult due
   to its intrinsically unstable nature and the diversity of the redox
   reaction [[63]7, [64]8, [65]11]. Therefore, in order to assist
   biologists with characterization of S-sulphenylation sites and
   S-sulphenylated sequences, it is imperative to construct a
   generalizable computational tool for highly accurate prediction of
   protein S-sulphenylation sites.

   To date, several algorithms for S-sulphenylation prediction have been
   published, including MDD-SOH, SOHSite [[66]6, [67]7], SOHPRED [[68]23],
   PRESS [[69]24], iSulf-Cys [[70]25], SulCysSite [[71]26], PredSCO
   [[72]27], the predictor by Lei et al. [[73]28], and SVM-SulfoSite
   [[74]29]. Among these computational tools, to the best of our
   knowledge, the most representative algorithm for S-sulphenylation
   prediction is MDD-SOH, together with which the training dataset used in
   this study was assembled. MDD-SOH is a two-stage ensemble learning model
   based only on SVM classifiers, built upon the previous “SOHSite” project
   [[75]6, [76]7]. Despite the progress of computational methods for
   S-sulphenylation prediction, the prediction performance needs to be
   further improved, due to the low abundance of cysteine residues and the
   insufficient number of experimentally verified S-sulphenylation sites.

   In this study, we propose a novel bioinformatics tool for improved
   prediction of protein S-sulphenylation sites, named SIMLIN, integrating
   a number of protein sequence-derived and protein structural features
   based on the sequence motifs previously identified in [[77]6, [78]7].
   SIMLIN is a two-layer framework consisting of Support Vector Machine
   (SVM) and Random Forests (RF) in the first layer and neural network
   models in the second layer. To further improve the prediction accuracy
   of SIMLIN, an incremental feature selection method was employed, based
   on the mRMR approach implemented in the R package “mRMRe” [[79]30].
   The constructed SVM and RF models, trained on different feature
   clusters plus the selected feature set, were used as the input for the
   neural network in the second layer. Empirical assessment on the
   independent testing dataset demonstrated that SIMLIN achieved a
   prediction accuracy of 88% and an AUC score of 0.82, outperforming the
   existing methods for S-sulphenylation site prediction.

Implementation

   Figure [80]1 provides an overview of the framework of SIMLIN, which
   consists of four major steps: (i) data collection, (ii) feature
   calculation and selection, (iii) model training, and (iv) performance
   evaluation. During the data collection process, we collected
   experimentally verified S-sulphenylation sites from the study of Bui et
   al. [[81]7]. The negative dataset (defined as proteins without
   experimentally validated S-sulphenylation sites) was extracted from the
   UniProt database [[82]31]. Refer to the ‘Data collection and
   pre-processing’ section for more details. For feature extraction, a
   variety of protein sequence and structural features were extracted and
   selected using the MDL (minimum description length) technique [[83]32]
   and mRMR (minimum-redundancy maximum-relevance) algorithm [[84]30,
   [85]33]. A detailed description and statistical summary of the
   calculated features are provided in the ‘Feature extraction and
   calculation’ section. To construct
   accurate predictive models, at the ‘Model construction’ step, a
   generalized ensemble framework of SIMLIN was developed by integrating
   various machine-learning algorithms including Artificial Neural
   Networks (ANNs) [[86]34, [87]35], SVMs with various kernel functions
   [[88]36, [89]37], and RFs [[90]38]. To evaluate and compare the
   prediction performance of SIMLIN with the existing methods, at the last
   step, we assessed the prediction performance of different algorithms on
   both 10-fold stratified cross-validation sets and independent datasets
   assembled in the previous study of Bui et al. [[91]7].

Fig. 1.

   The overall framework illustrating the model construction and
   performance evaluation for SIMLIN. (a) The four major steps for
   constructing SIMLIN include data collection, feature engineering,
   model construction, and performance evaluation; (b) a detailed
   breakdown of the construction of the two-stage hybrid SIMLIN model

Data collection and pre-processing

   Both benchmark and independent test datasets in this study were
   extracted from the ‘SOHSite’ web server, constructed by Bui et al.
   [[94]6, [95]7]. This dataset was reported to be the most complete
   dataset for S-sulphenylation to date, integrating experimentally
   validated S-sulphenylation sites from four different resources: (i)
   the human S-sulphenylation dataset assembled using a chemoproteomic
   workflow involving the S-sulfenyl-mediated redox regulation [[96]11],
   by which the S-sulphenylation cysteines were identified; (ii) the
   RedoxDB database [[97]39], which curates the protein oxidative
   modifications including S-sulphenylation sites; (iii) the UniProt
   database [[98]31]; and (iv) related literature. Sequence redundancy of
   the dataset was removed in this study (using 30% as the sequence
   identity threshold). Considering the frequent updates of UniProt, based
   on the gene names provided in the datasets, we further mapped these
   proteins to the UniProt database (downloaded November 2016). The
   canonical protein sequences harboring experimentally verified
   S-sulphenylation sites were retrieved and downloaded from the UniProt
   database. Motifs of 21 amino acids with the S-sulphenylation site in
   the center and flanked by 10 amino acids each side were then extracted
   from the protein sequences. The highly homologous motifs have been
   further removed to maximize the sequence diversity according to [[99]7,
   [100]13]. The resulting dataset contains a total of 1235 positive
   samples (i.e. with S-sulphenylation sites) and 9349 negative samples
   (i.e. without S-sulphenylation sites). Table [101]1 provides a
   statistical summary of the benchmark and independent test datasets,
   respectively.

Table 1.

   The statistics of datasets employed in this study
                              Number of positive motifs   Number of negative motifs   Total
   Training dataset           1019                        7937                        8956
   Independent test dataset   216                         1412                        1628
   Total                      1235                        9349                        10,584
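
   The motif extraction step described above can be sketched as follows.
   This is a minimal illustration rather than the code used in this study:
   the function name is hypothetical, and padding the sequence termini with
   the gap-filling residue “X” is an assumption made so that every motif
   has exactly 21 residues.

```python
def extract_cysteine_motifs(sequence, flank=10, pad="X"):
    """Return (1-based position, motif) for every cysteine in `sequence`.

    Each motif has length 2 * flank + 1 with the cysteine in the centre.
    Termini are padded with `pad` (an assumption, not stated in the text).
    """
    sequence = sequence.upper()
    padded = pad * flank + sequence + pad * flank
    motifs = []
    for idx, residue in enumerate(sequence):
        if residue == "C":
            # Position idx in `sequence` sits at idx + flank in `padded`,
            # so the slice starting at idx is centred on the cysteine.
            motifs.append((idx + 1, padded[idx: idx + 2 * flank + 1]))
    return motifs


if __name__ == "__main__":
    for position, motif in extract_cysteine_motifs("MKCAGTCLL"):
        print(position, motif)
```

   Whether a given motif is labelled positive or negative depends on the
   experimental annotation of its central cysteine, which is not shown
   here.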

Feature extraction and calculation

   To numerically represent the sequence motifs in the datasets, we
   calculated and extracted both sequence-based and structural features
   [[103]40]. In total, nine types of sequence-derived and structural
   features were extracted and used, including the composition of k-spaced
   amino acid pairs (CKSAAP) [[104]41], motif binary representations
   [[105]42], amino acid substitution matrix (BLOSUM62) [[106]43],
   position-specific scoring matrix (PSSM) by PSI-BLAST [[107]44], amino
   acid index (AAindex) [[108]45], amino acid composition (AAC), surface
   accessibility (ACC) based on protein secondary structure prediction,
   protein predicted disordered region, and protein predicted secondary
   structure. Detailed information about each feature type and its
   feature dimensionality is shown in Table [109]2.

Table 2.

   The sequence and structural features extracted and the feature
   dimensionalities
   Feature type            Feature Cluster            Dimension
   Sequence     AAC                                   20
                CKSAAP                                2400
                BLOSUM62                              441
                PSSM                                  400
                AAindex                               1344
                Binary                                441
   Structural   Predicted protein disordered region   20
                Predicted protein secondary structure 84
                Predicted surface accessibility       147
   Total                                              5297

Composition of k-spaced amino acid pairs (CKSAAP)

   The CKSAAP encoding scheme, which has been widely applied
   [[111]46–[112]49], represents a protein sequence using the compositions
   of amino acid pairs spaced by k residues [[113]41, [114]50, [115]51].
   The composition of each possible k-spaced amino acid pair i can
   therefore be calculated based on the following formula:
   \[
   \mathrm{CKSAAP}\left[i = 1, 2, 3, \ldots, (k_{max} + 1) \times 400\right] = N_i / (W - k - 1), \tag{1}
   \]

   where N_i is the number of the k-spaced amino acid pair i, W denotes
   the window size, and k_max represents the maximum space considered,
   which has been optimized as k_max = 5 in this study [[116]42]. In
   total, the CKSAAP scheme generated a feature vector of 2400 dimensions
   for each motif.
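
   As an illustration of Eq. 1, the following sketch computes the CKSAAP
   vector of a single motif. The function and constant names are
   hypothetical, and skipping pairs that contain the gap-filling residue
   “X” is an assumption; the normalisation by W − k − 1 follows the
   formula above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino acid pairs, in a fixed order.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(motif, k_max=5):
    """Composition of k-spaced amino acid pairs for k = 0..k_max."""
    window = len(motif)
    features = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_positions = window - k - 1  # number of k-spaced pairs in the window
        for start in range(n_positions):
            pair = motif[start] + motif[start + k + 1]
            if pair in counts:  # pairs containing "X" are skipped (assumption)
                counts[pair] += 1
        features.extend(counts[p] / n_positions for p in PAIRS)
    return features


if __name__ == "__main__":
    vector = cksaap("XXXXXXXXMKCAGTCLLAAAA")
    print(len(vector))  # (k_max + 1) * 400 values
```

   With k_max = 5 the vector has (5 + 1) × 400 = 2400 entries, matching
   the dimensionality reported for this feature type.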

Motif one-hot encoding (binary)

   Each motif was also represented using a binary encoding scheme
   [[117]42], where each amino acid in the motif was denoted using a
   21-dimensional vector organized in the alphabetical order of the 20
   natural amino acids and a gap-filling residue “X”. The value “1” was
   placed at the position in the vector corresponding to the amino acid
   observed in the motif, while all other positions in the vector were
   filled with “0”. For instance, the residue C (cysteine) is denoted as
   {0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}. Therefore, for a motif
   with 21 amino acids, a total of 441 (21 × 21) features were generated
   using the motif binary representation scheme.
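
   A minimal sketch of this binary encoding is given below. The names are
   hypothetical, and placing the gap-filling residue “X” in the last
   position of the alphabet is an assumption; the ordering of the 20
   amino acids follows their one-letter codes, which reproduces the
   cysteine example above.

```python
# 20 amino acids in alphabetical order of their one-letter codes, then "X".
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"


def one_hot_motif(motif):
    """Concatenate one 21-dimensional indicator vector per residue."""
    features = []
    for residue in motif:
        vector = [0] * len(ALPHABET)
        # str.index raises ValueError for residues outside the alphabet.
        vector[ALPHABET.index(residue)] = 1
        features.extend(vector)
    return features


if __name__ == "__main__":
    print(one_hot_motif("C"))
    print(len(one_hot_motif("XXXXXXXXMKCAGTCLLAAAA")))
```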

Amino acid substitution matrix (BLOSUM62)

   The BLOSUM62 is a widely used amino acid substitution matrix based on
   sequence alignment [[118]43, [119]52] and has been employed in a
   variety of bioinformatic studies [[120]6, [121]22, [122]53–[123]55].
   For each amino acid, a 21-dimensional vector was used, consisting of the
   substitution scores against all 20 amino acids and an additional
   terminal signal. For each motif, a 21 × 21 matrix was therefore used,
   adding a total of 441 features.

Position-specific scoring matrix (PSSM)

   Using the UniRef90 dataset from the UniProt database, we performed a
   PSI-BLAST (version 2.2.26) search to generate the PSSM for each motif
   in our dataset to represent the sequence conservation and similarity
   scores. PSSM has been widely applied in a variety of bioinformatics
   studies as a crucial sequence feature type. Similar to the feature
   representation of BLOSUM62, 441 features were finally generated for
   each motif.

Amino acid index (AAindex)

   AAindex is a collective database that provides a variety of physical
   and chemical properties of amino acids [[124]45]. A number of
   bioinformatics studies have benefited from use of these amino acid
   properties provided in the AAindex database [[125]46, [126]48,
   [127]56]. Due to the high diversity of the properties offered in the
   AAindex database, Saha et al. [[128]57] further categorized these
   indices into eight clusters, which were used for the AAindex feature
   set for each motif in our study. Therefore, we utilized a selected set
   of AAindex attributes (i.e. a vector of 1344 dimensions (21 × 8 × 8))
   [[129]52] to represent each motif.

Amino acid composition (AAC)

   For the AAC encoding, each motif is represented as a 20-dimensional
   vector, where each dimension denotes the number of occurrences of each
   amino acid within the given motif and is further normalized (i.e.
   divided by the length of the motif [[130]22]).
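
   The composition vector described above amounts to a normalised count
   per amino acid, as in the following sketch (hypothetical names; the
   ordering of the 20 dimensions is an assumption).

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(motif):
    """Fraction of the motif occupied by each of the 20 amino acids."""
    length = len(motif)
    return [motif.count(aa) / length for aa in AMINO_ACIDS]


if __name__ == "__main__":
    print(amino_acid_composition("XXXXXXXXMKCAGTCLLAAAA"))
```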

Predicted protein disordered region

   Given the strong relationships between protein disordered regions and
   PTMs [[131]58–[132]63], we also integrated the predicted disordered
   region of a protein as a feature set. To do so, we conducted protein
   disordered region prediction using DISOPRED (Version 3.1) [[133]64]
   based on protein sequence. Each amino acid is given a predictive score
   by DISOPRED, which indicates the likelihood of being located in the
   protein’s disordered region. For a sequence motif of 21 residues, a
   20-dimensional vector of predicted scores (i.e. 10 scores for the
   upstream and 10 scores for the downstream amino acids, respectively)
   was constructed.

Predicted protein secondary structure

   PSIPRED (Version 3.5) [[134]65, [135]66] was employed to predict
   protein secondary structure based on the protein’s amino acid sequence.
   The predictive outputs of PSIPRED contain four scores for each residue
   including the predicted structural class (i.e. C, coil; E, beta strand;
   and H, alpha helix) and the probabilities of each structural class. As
   a result, for a motif with 21 amino acids, an 84-dimensional (including
   three probabilities and the recommendation for each residue) vector was
   generated for the predicted protein secondary structure feature.

Predicted surface accessibility (ACC)

   The surface accessibility feature was calculated using the NetSurfP-1.1
   algorithm [[136]67] based on the protein sequences. Each residue in the
   protein is represented using seven predictive scores, indicating the
   accessibility (i.e. if this residue is buried), relative surface
   accessibility, absolute surface accessibility, Z-fit score, probability
   of this residue being in alpha-helices, beta-strands, and coils. Note
   that the predictive scores of each category generated by NetSurfP range
   widely. Therefore, we employed the Min-Max method to normalize the
   prediction scores of each type [[137]35]. The formula we used for the
   data normalization was as follows:
   \[
   V_{ij} = \frac{V_{ij} - \min_{j \in \{1 \ldots m\}} \{V_{ij}\}}{\max_{j \in \{1 \ldots m\}} \{V_{ij}\} - \min_{j \in \{1 \ldots m\}} \{V_{ij}\}}, \tag{2}
   \]

   where V_ij represents the value i of the feature category vector j,
   and m denotes the number of observations represented in the vector j.
   As a result, all values were rescaled to the range between 0 and 1.
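
   Eq. 2 can be illustrated with a short sketch that rescales one score
   category at a time. The function name is hypothetical, and returning
   zeros for a constant vector (where the denominator would be zero) is an
   assumption not covered by the text.

```python
def min_max_normalize(values):
    """Rescale a list of scores to the range [0, 1] (Eq. 2)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Constant vector: the formula is undefined, so return zeros (assumption).
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


if __name__ == "__main__":
    print(min_max_normalize([2.0, 4.0, 6.0]))
```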

Feature selection

   As shown in Table [138]2, a total of 5297 sequence and structural
   features were calculated and extracted. Such high-dimensional feature
   vectors might contain misleading and noisy information, which would
   lead to biased model training. Furthermore, it would require
   considerable time and effort to build computational models based on
   such a high-dimensional feature set. Therefore, we employed the mRMR
   (minimum Redundancy Maximum Relevance) [[139]30, [140]33] package and
   forward incremental feature selection to eliminate noisy and less
   informative features from the original feature vector. To perform
   feature selection, we first applied mRMR to calculate and rank the
   importance score of each feature. Then, based on the feature importance
   ranking provided by mRMR, we initiated an empty set and added one
   feature from the original feature set at a time. The AUC values based
   on the current feature set were evaluated for both RF and SVM
   independently, and the resulting feature subset was formed using the
   features that resulted in higher AUC values for both SVM and RF models.
   Each feature was incrementally added into the optimized feature set
   based on the scores of feature importance provided by the mRMR until
   the curve of AUC values achieved its peak. As described, by applying
   this forward stepwise sequential variable elimination, the features with
   the highest importance were selected. According to the RF algorithm, the
   global permuted importance is based on the out-of-bag sample B of the
   tree t in the forest F for each feature X[j] and is defined as follows
   [[141]22, [142]35, [143]38]:
   \[
   f_{imp}(X_j) = \frac{\sum_{i \in B} I(y_i = y'_i) - I(y_i = y'_{ij})}{|B|}. \tag{3}
   \]
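
   The incremental selection procedure described in this section can be
   sketched as follows. This is an illustration under stated assumptions,
   not the implementation used in this study: the function name is
   hypothetical, the ranking is assumed to come from mRMR (best feature
   first), and `evaluate_auc` is a hypothetical callback standing in for
   the cross-validated AUC of the RF and SVM models on a candidate subset.

```python
def incremental_feature_selection(ranked_features, evaluate_auc):
    """Add features in ranked order and keep the prefix with the peak AUC.

    ranked_features: feature names, most important first (e.g. from mRMR).
    evaluate_auc: callable taking a list of features and returning an AUC.
    """
    current, best_subset, best_auc = [], [], float("-inf")
    for feature in ranked_features:
        current.append(feature)
        auc = evaluate_auc(current)
        if auc > best_auc:  # remember the subset at the peak of the AUC curve
            best_auc, best_subset = auc, list(current)
    return best_subset, best_auc


if __name__ == "__main__":
    # Toy AUC curve that peaks after three features (illustrative values only).
    toy_curve = {1: 0.60, 2: 0.70, 3: 0.72, 4: 0.71}
    subset, auc = incremental_feature_selection(
        ["f1", "f2", "f3", "f4"], lambda feats: toy_curve[len(feats)]
    )
    print(subset, auc)
```

   The sketch scans the whole ranking and returns the prefix at which the
   AUC curve is highest, rather than stopping at the first decrease, so a
   local dip in the curve does not end the search early.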

Model construction

   As shown in Fig. [144]1, the development of SIMLIN consists of two
   major stages after feature selection: (i) employing SVM and RF models
   based on different feature types (Table [145]2) to generate the input
   for the neural network models, and (ii) training of the neural network
   model based on the optimized RF and SVM models to deliver the final
   predictive outputs. During the first stage, ten RF and ten SVM models
   were constructed based on the nine types of features and the selected
   feature set. 10-fold stratified cross-validation was performed on the
   training dataset to select the best model (i.e. with the highest AUC
   value) for each feature type. During the second stage, we built a
   neural network model consisting of three layers: an input layer, a
   hidden layer, and an output layer. The input layer harbours 20 nodes
   that take as input the outputs of the best RF and SVM models selected by
   the 10-fold stratified cross-validation performed during the first
   stage, while the hidden and output layers each have only one node
   (denoted as H_1 and O_1, respectively). Furthermore, in the hidden
   layer, in addition to H_1, two extra nodes, B_1 and B_2, were
   auto-generated by the neural network algorithm for the purpose of
   model balancing. Lastly, the O_1 node in the output layer represents
   the prediction outcome from the entire algorithm.
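
   To make the second-stage architecture concrete, the sketch below
   performs one forward pass through a 20-1-1 network. It is an
   illustration only: the function names and all weights are
   hypothetical, the nodes B_1 and B_2 are interpreted here as bias
   terms, and the logistic activation in both layers is an assumption.

```python
import math


def sigmoid(x):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))


def second_stage_forward(first_stage_outputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: 20 inputs (I_1..I_20) -> hidden node H_1 -> output O_1.

    first_stage_outputs: scores from the ten RF and ten SVM models.
    w_hidden: 20 input-to-hidden weights; b_hidden, b_out: bias terms
    (interpreted as B_1 and B_2, an assumption); w_out: hidden-to-output weight.
    """
    if len(first_stage_outputs) != len(w_hidden):
        raise ValueError("expected one weight per first-stage model output")
    h1 = sigmoid(b_hidden + sum(w * x for w, x in zip(w_hidden, first_stage_outputs)))
    return sigmoid(b_out + w_out * h1)


if __name__ == "__main__":
    scores = [0.8] * 10 + [0.6] * 10   # placeholder RF and SVM outputs
    weights = [0.1] * 20               # placeholder weights, not fitted values
    print(second_stage_forward(scores, weights, b_hidden=-1.0, w_out=2.0, b_out=-1.0))
```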

   We applied a number of software packages to implement SIMLIN in our
   study, including the Python-based machine learning package
   “scikit-learn” [[146]68], and various R packages for SVM (combining
   “kernlab” and “e1071”) and the neural network model (“nnet”) [[147]35,
   [148]69]. The feature selection techniques employed in our study,
   including mRMR and MDL, were implemented based on the R packages
   “mRMRe” and “discretization” [[149]70–[150]72], respectively.
   Additionally, R packages “caret” [[151]73] and “fscaret” [[152]74] have
   been used in combination for the control of overall workflow for model
   training and parameter optimization.

Prediction performance evaluation

   We applied widely used measures to evaluate and compare the prediction
   performance of SIMLIN, including the Area Under the Curve (AUC),
   Accuracy, Sensitivity, Specificity and Matthews Correlation
   Coefficient (MCC) [[153]75–[154]77]. During the model training process,
   AUC was used as the main measure for parameter optimization. The
   performance measures used are defined as follows:
   \[
   \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},
   \]
   \[
   \mathrm{Sensitivity} = \frac{TP}{TP + FN},
   \]
   \[
   \mathrm{Specificity} = \frac{TN}{TN + FP},
   \]
   \[
   \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}},
   \]

   where TP, TN, FP, and FN denote the numbers of true positives, true
   negatives, false positives and false negatives, respectively. In this
   study, the S-sulphenylation sites were regarded as the positives, while
   the non-S-sulphenylation sites were considered as the negatives for the
   statistics of AUC, specificity and sensitivity.
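
   The measures defined above can be computed directly from the confusion
   matrix counts, as in the following sketch. The function names are
   hypothetical, returning an MCC of 0 when the denominator is zero is an
   assumption, and the AUC is computed with the standard rank-based
   (Mann-Whitney) formulation rather than any specific package used in
   this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion counts."""
    denominator = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # MCC is undefined when a marginal total is zero; 0.0 is an assumption.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


def auc_score(labels, scores):
    """Probability that a positive is scored above a negative (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
    print(auc_score([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```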

Results and discussion

Motif conservation analysis and feature selection

   We first performed the motif conservation analysis using both
   benchmarking and independent test datasets. Two sequence logos with the
   human proteome as the background set generated by pLogo are shown in
   Fig. [155]2. In general, the over- and under-represented amino acids
   surrounding the central cysteine are similar across the benchmarking
   and independent test datasets. In accordance with the conclusion by Bui
   et al., amino acids such as leucine (L), lysine (K), glutamate (E), and
   aspartate (D) are over-represented, while cysteine (C), serine (S), and
   phenylalanine (F) are under-represented.

Fig. 2.

   Motif conservation analysis of S-sulphenylation using the human
   proteome as the background set for (a) benchmarking and (b) independent
   datasets

   Prior to the construction of SIMLIN, based on the calculated and
   extracted features (Table [158]2), we generated another feature set
   which contains selected features from the original combined features
   (i.e. AAC, CKSAAP, BLOSUM62, PSSM, AAindex, ACC, Protein predicted
   disordered region, Protein secondary structure prediction, and Binary)
   using stepwise forward sequential variable elimination. As a result,
   the AUC achieved its highest value of 0.72 (sensitivity: 0.95;
   specificity: 0.19; accuracy: 86.6%; MCC: 0.182) when 166 features were
   selected. Among the selected 166 features, 110 (66.3%) and 56 (33.7%)
   were sequence and structural features, respectively. A detailed
   breakdown list of these features in terms of feature types and names is
   available in supplementary material (Additional file [159]1: Table S1).

Model construction in the two stages of SIMLIN

   At the first stage of SIMLIN construction, we built nine SVM and nine
   RF models based on the nine clusters of calculated features (Table
   [160]2), respectively. Additionally, one SVM model and one RF model were
   constructed using the set of selected features (Additional file [161]1:
   Table S1). The RF and SVM models were constructed and assessed via
   10-fold stratified cross-validation, and the average AUC values are
   shown in Table [162]3. For the RF models, to reach the optimal
   performance, the number of trees was set to the nearest integer of the
   subspace dimensionality of the classification task, which is the square
   root of the number of predictors. For the SVM models, different kernels
   were used for each feature set, including the polynomial, radial basis
   function, and linear kernels. The AUC-based performance optimization
   and kernel selection were performed automatically by the R packages
   “caret” and “kernlab”. The best-performing kernels and their
   corresponding AUC values are listed in Table [163]3. It can be seen from
   Table 3 that SVM and RF models provided competitive performance when
   using different types of features; however, the RF model outperformed
   the SVM model on the selected feature set. As shown in Fig. [164]3, the
   outputs of the 20 constructed models (i.e. ten RF and ten SVM models;
   the first layer) were used as inputs for the second layer, i.e. the
   neural network model, whose input nodes I_1 to I_20 each took the
   output of one of the 20 models.

Table 3.

   The AUC values of RF and SVM models constructed using different feature
   sets at the first stage
   Feature sets                              RF AUC                    SVM AUC
                                             (class weight balanced)   (kernel function)
   AAC                                       0.68                      0.63 (Polynomial kernel)
   AAindex                                   0.69                      0.69 (Radial basis function kernel with grid
                                                                       search hyperparameter tuning)
   ACC                                       0.71                      0.64 (Radial basis function kernel)
   BINARY                                    0.59                      0.71 (Polynomial kernel)
   BLOSUM62                                  0.68                      0.74 (Radial basis function kernel)
   CKSAAP                                    0.66                      0.63 (Polynomial kernel)
   DISOPRED                                  0.54                      0.55 (Linear kernel)
   PSIPRED                                   0.62                      0.60 (Polynomial kernel)
   PSSM                                      0.73                      0.71 (Polynomial kernel)
   Selected features                         0.75                      0.72 (Linear kernel)
   (mRMR + forward sequential elimination)

   The bold font shows the highest performance of each feature among the
   RF and SVM

Fig. 3.

   Prediction performance of SIMLIN on the independent test dataset in
   terms of (a) ROC and (b) MCC

   At the second stage, a feed-forward neural network with three layers,
   namely an input layer (20 nodes), a hidden layer (3 nodes) and an
   output layer (1 node), was constructed using the R package ‘nnet’ and
   subsequently evaluated. As with the RF and SVM models, 10-fold
   stratified cross-validation on the training dataset was used to build
   the neural network model. During training, two parameters (the number
   of units in the hidden layer, and the weight decay, which regularises
   the model to limit overfitting) were tuned automatically; the
   resulting AUC, sensitivity, and specificity are given in Table [168]4.
   Performance was consistently high across the tested numbers of hidden
   units and weight decay values, with AUC above 0.999 in every
   configuration. Based on these results, the number of units and the
   weight decay were set to 1 and 0.1, respectively, in the final neural
   network model (Additional file [169]1: Table S2), so as to minimise
   the number of nodes in the hidden layer while maximising the AUC value
   and convergence rate.
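
   To make the second-stage architecture concrete, the following Python
   sketch runs one forward pass through a 20-input, single-output
   feed-forward network and evaluates an L2 weight-decay penalty. It is
   an illustration under stated assumptions, not the ‘nnet’ model used
   here: the logistic activations, the random weights, the inclusion of
   bias terms in the penalty, and all function names are assumptions,
   and the hidden-layer size is left as a parameter.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_inputs=20, n_hidden=3, seed=0):
    """Small random weights for an n_inputs -> n_hidden -> 1 network."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b_out": rng.uniform(-0.5, 0.5),
    }


def forward(net, inputs):
    """Map the first-stage model outputs to a single score in (0, 1)."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    return sigmoid(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])


def decay_penalty(net, decay):
    """Weight-decay term: decay times the sum of squared parameters.

    Assumption: bias terms are penalised together with the other weights.
    """
    squares = sum(w * w for row in net["w_hidden"] for w in row)
    squares += sum(b * b for b in net["b_hidden"])
    squares += sum(w * w for w in net["w_out"]) + net["b_out"] ** 2
    return decay * squares


def n_parameters(net):
    """Total number of weights and biases in the network."""
    return (sum(len(row) for row in net["w_hidden"]) + len(net["b_hidden"])
            + len(net["w_out"]) + 1)


net = init_network()
base_scores = [0.5] * 20  # hypothetical outputs of ten RF and ten SVM models
score = forward(net, base_scores)
print(n_parameters(net), 0.0 < score < 1.0, decay_penalty(net, 0.1) > 0.0)
```

   Shrinking the hidden layer reduces the number of free parameters (67
   with three hidden units versus 23 with one in this sketch), which is
   one way to see why a smaller hidden layer is attractive when
   performance is comparable.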

Table 4.

   Prediction performance of the neural network model with different units
   in the hidden layer via 10-fold stratified cross-validation test
   #Units in the hidden layer  Decay   AUC                  Sensitivity          Specificity
   1                           0       0.999842 ± 3.15E-4   0.999685 ± 6.30E-4   1
   1                           0.0004  0.999994 ± 6.30E-5   0.999887 ± 3.62E-4   1
   1                           0.1     1                    0.999874 ± 3.68E-4   1
   3                           0       0.999874 ± 3.35E-4   0.999723 ± 6.84E-4   1
   3                           0.0004  0.999987 ± 8.85E-5   0.999937 ± 2.76E-4   1
   3                           0.1     1                    0.999874 ± 3.80E-4   1
   5                           0       0.999793 ± 5.90E-4   0.999685 ± 7.02E-4   0.999902 ± 9.80E-4
   5                           0.0004  0.999869 ± 7.28E-4   0.999912 ± 4.48E-4   0.999704 ± 2.20E-3
   5                           0.1     1                    0.999899 ± 3.44E-4   1
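
   The "mean ± deviation" entries in Table 4 come from repeated
   evaluation over cross-validation folds. The sketch below shows one
   simple way to build stratified folds and to summarise per-fold
   results in Python. It is illustrative only: the round-robin fold
   assignment, the use of the sample standard deviation, the function
   names, and the toy values are all assumptions rather than details of
   this study.

```python
import random
import statistics


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions.

    Indices of each class are shuffled and dealt to the folds in turn,
    so every fold receives roughly the same share of each class.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def mean_sd(values):
    """Mean and sample standard deviation of per-fold results."""
    return statistics.mean(values), statistics.stdev(values)


# Toy dataset (hypothetical): 30 positive and 70 negative sites.
labels = [1] * 30 + [0] * 70
folds = stratified_folds(labels, k=10)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)

# Hypothetical per-fold AUC values, summarised as mean and deviation.
fold_aucs = [0.99, 1.0, 0.98, 1.0, 0.99]
mean_auc, sd_auc = mean_sd(fold_aucs)
print(round(mean_auc, 3), round(sd_auc, 3))
```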

Independent test and performance comparison with existing methods

   We assessed and compared the prediction performance of SIMLIN with
   state-of-the-art methods for S-sulphenylation prediction on the
   independent test dataset. The compared approaches included MDD-SOH,
   SOHSite [[171]6, [172]7], SOHPRED, PRESS, iSulf-Cys, and SulCysSite.
   Several other computational frameworks have been published recently,
   including PredSCO [[173]27], the predictor by Lei et al. [[174]28],
   and SVM-SulfoSite [[175]29]. However, because their source code or
   web servers were not accessible, we could not compare their
   predictions on our independent test dataset with those of SIMLIN.
   Table [176]5 and Fig. [177]3 show that SIMLIN generally outperformed
   the compared approaches. Compared to MDD-SOH, an important advantage of
   SIMLIN is that it does not require any pre-classified motifs.
   iSulf-Cys is another computational framework that employs a similar
   approach to create a unified predictive model, but it uses only SVM
   models with three major encoding features (AAindex, binary and PSAAP)
   for model construction. The overall performance of iSulf-Cys is lower
   than that of SIMLIN: the accuracy of iSulf-Cys is 0.7155 ± 0.0085
   (95% CI), whereas SIMLIN achieved a prediction accuracy of 0.88
   (95% CI 0.857–0.892). The MCC value of SIMLIN was also higher than
   that of iSulf-Cys (0.39 vs. 0.3122). The SulCysSite model is based
   mainly on multistage RFs with four major features (AAindex, binary
   amino acid codes, PSSM, and compositions of profile-based amino
   acids). Although SulCysSite achieved an AUC of 0.819, it used a biased
   approach whose final decision depended on a complex series of rules,
   each of which covers only a small subset. In general, SIMLIN
   outperformed all the compared methods in terms of sensitivity, MCC,
   and AUC, demonstrating its ability to accurately predict human
   S-sulphenylation sites.
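
   For readers less familiar with the measures used in this comparison,
   the following Python sketch computes sensitivity, specificity,
   accuracy, and MCC from the four confusion-matrix counts using their
   standard definitions. The function name and the counts are
   hypothetical and chosen only for illustration; they are not the
   counts behind any value reported in this study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: positive sites predicted correctly/incorrectly.
    tn/fp: negative sites predicted correctly/incorrectly.
    Assumes at least one positive and one negative sample.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for a test set of 50 positives and 50 negatives.
metrics = binary_metrics(tp=40, fp=20, tn=30, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

   Because MCC uses all four counts, it is less sensitive to class
   imbalance than accuracy, which is why both measures are usually
   reported together.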

Table 5.

   Performance comparison with existing approaches for S-sulphenylation
   prediction on the independent test
     Method   Sensitivity Specificity MCC  Accuracy AUC
   SOHPRED    0.73        0.74        0.34 N.A.^b   0.80
   PRESS      0.68        0.69        0.27 73.8%    N.A.
   iSulf-Cys  0.73        0.64        0.31 66.8%    0.72
   SulCysSite 0.77        0.71        N.A. 72.0%    0.76
   SIMLIN     0.88        0.56        0.39 88.0%    0.82
   MDD–SOH^a  0.85        0.87        0.58 87.0%    N.A.

   ^aThe performance values of MDD-SOH were extracted from the study of
   Bui et al. [[179]6]

   ^bN.A.: not available

   The bold font shows the highest performance of each feature among the
   RF and SVM

Proteome-wide prediction and functional enrichment analysis

   In order to more effectively portray the distribution of predicted
   S-sulphenylation sites and their potential molecular functions, we
   performed human proteome-wide S-sulphenylation site prediction using
   the protein sequences collected from the UniProt database (Version Sep
   2017) and our proposed SIMLIN framework. We first conducted statistical
   analysis on the distribution of predicted S-sulphenylation sites in
   proteins followed by a Gene Ontology (GO) enrichment analysis to reveal
   the potential cellular localization, biological function, and
   signalling/metabolic pathways involved in the predicted
   S-sulphenylation sites using the DAVID biological functional annotation
   tool (Version 6.8) [[180]78, [181]79].
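
   Enrichment p-values of this kind are commonly derived from an
   over-representation test. The Python sketch below implements a plain
   one-sided hypergeometric test as an illustration. It is an assumed
   stand-in: DAVID reports a modified Fisher exact statistic (the EASE
   score), which this sketch does not reproduce, and the function name
   and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: number of background proteins.
    K: background proteins annotated with the term.
    n: proteins in the list of interest (e.g. with predicted sites).
    k: proteins in the list that carry the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy example (hypothetical): 20 background proteins, 5 annotated with a
# term, 5 proteins of interest, 3 of which carry the term.
p_value = hypergeom_upper_tail(k=3, n=5, K=5, N=20)
print(f"{p_value:.4f}")
```

   In a real analysis, the resulting p-values would additionally need
   correction for multiple testing across all tested terms.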

   Figure [182]4a-d displays the top ten enriched terms from our gene
   ontology and pathway enrichment analysis, in terms of molecular
   function, biological process, cellular component, and pathway. In
   terms of molecular function, ATPase-related activities (ATPase
   activity, coupled to movement of substances, p-value 8.5 × 10^−21;
   ATPase activity, coupled to transmembrane movement of substances,
   8.5 × 10^−21; ATPase activity, 3.42 × 10^−14) were significantly
   enriched in proteins with predicted S-sulphenylation sites (Fig.
   [184]4a). An example of such a relationship was demonstrated in the
   study by Wojdyla et al. [[185]80], in which acetaminophen (APAP)
   treatment was shown to influence ATP production, and APAP-induced
   S-sulphenylation may be one contributing factor to this effect. All
   enriched biological processes shown in Fig. [186]4b are metabolic
   processes, which indicates the important roles of S-sulphenylation in
   metabolism [[187]11]. For instance, S-sulphenylation at C212 of fatty
   acid synthase (FASN) may play a role in blocking an active site (C161)
   that is responsible for fatty acid synthesis (Fig. [188]4b; fatty acid
   metabolic process, 5.82 × 10^−17) [[189]11, [190]81]. For the cellular
   component category (Fig. [191]4c), the top three localisations are
   organelle (5.30 × 10^−08), intracellular organelle (5.30 × 10^−08) and
   membrane-enclosed lumen (5.30 × 10^−08), which is consistent with the
   analysis of Bui et al. [[192]6, [193]7]. RNA transport is an important
   process associated with protein synthesis; this pathway includes 14
   proteins enriched in S-sulphenylation and S-nitrosylation sites
   [[194]80], highlighting the necessity of protein S-sulphenylation
   sites in RNA transport (Fig. [195]4d; 1.50 × 10^−05). Figure [196]4e
   shows the distribution of the number of predicted S-sulphenylation
   sites contained in each protein. As expected, most of the proteins
   (72.3%) contain only one predicted site, while only 1.5% of the human
   proteome harbours five or more predicted sites. A full list of the
   predicted S-sulphenylation sites in the human proteome is freely
   available on the SIMLIN webserver.
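
   A distribution of this kind can be derived directly from a list of
   predicted sites. The Python sketch below tallies how many proteins
   carry one, two, or more predicted sites. It is an illustration only:
   the input format, the function name, and the protein identifiers are
   hypothetical, and the fractions are taken over proteins with at least
   one predicted site, which is an assumption about the denominator.

```python
from collections import Counter


def site_count_distribution(predicted_sites):
    """Fraction of proteins carrying each number of predicted sites.

    predicted_sites: iterable of (protein_id, position) pairs, one per
    predicted site. Proteins without any predicted site are not counted.
    """
    sites_per_protein = Counter(protein_id for protein_id, _ in predicted_sites)
    histogram = Counter(sites_per_protein.values())
    n_proteins = len(sites_per_protein)
    return {count: num / n_proteins for count, num in sorted(histogram.items())}


# Toy predictions (hypothetical identifiers and positions).
toy_sites = [("A", 10), ("B", 5), ("B", 77), ("C", 3), ("D", 8)]
distribution = site_count_distribution(toy_sites)
print(distribution)
```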

Fig. 4.

   Gene ontology enrichment analysis of the predicted protein
   S-sulphenylation sites in the human proteome using SIMLIN: top 10
   significant (a) molecular function terms (GO_MF), (b) biological
   process terms (GO_BP), (c) cellular component terms (GO_CC), (d)
   pathways; and (e) distribution of the numbers of predicted
   S-sulphenylation sites

Case study of predicted S-sulphenylation using SIMLIN

   As mentioned above, three S-sulphenylation sites that are not in the
   dataset used for training SIMLIN have recently been identified and
   added to the UniProt database: BRF2_HUMAN (position 361 of
   [199]Q9HAW0) [[200]82], PTN7_HUMAN (position 361 of [201]P35236; by
   similarity according to UniProt) and UCP1_HUMAN (position 254 of
   [202]P25874; by similarity according to UniProt). SIMLIN correctly
   predicted all three S-sulphenylation sites, with prediction scores of
   0.997, 0.999 and 0.998, respectively, illustrating its predictive
   power for human S-sulphenylation sites.

Implementation and usage of the SIMLIN webserver

   The open-access web application for SIMLIN was implemented using the
   Shiny framework (Version 1.3.0.403) in R, combined with Node.js
   (Version 0.10.21), and is freely available for academic use at
   [203]http://simlin.erc.monash.edu/. The SIMLIN server resides on a
   Linux server equipped with dual AMD Opteron CPUs, 8 GB memory, and
   10 GB disk space. SIMLIN accepts either an individual protein sequence
   or a sequence file (size limit 1 MB) in FASTA format as input. An
   ‘Example’ link is provided to demonstrate the predictive functionality
   of the service and to guide users. As the training dataset of SIMLIN
   was collected from the human proteome, the prediction results should
   be interpreted at the user's discretion if the input protein is from a
   species other than Homo sapiens. A graphical illustration of the
   SIMLIN webserver in terms of input and output is provided in Fig.
   [204]5.
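
   To illustrate the kind of input handling such a service needs, the
   Python sketch below parses FASTA text, enforces a size limit, and
   lists the cysteine-centred sequence windows that a site predictor
   would score. It is not the SIMLIN server code: the function names,
   the interpretation of 1 MB as 1,000,000 bytes, the flank length of 10
   residues, and the 'X' padding character are all assumptions made for
   the example.

```python
def parse_fasta(text, max_bytes=1_000_000):
    """Parse FASTA text into {identifier: sequence}, enforcing a size limit."""
    if len(text.encode("utf-8")) > max_bytes:
        raise ValueError("input exceeds the size limit")
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else "unnamed"), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def cysteine_windows(sequence, flank=10, pad="X"):
    """Yield (1-based position, window) for every cysteine in the sequence.

    Each window holds `flank` residues on either side of the cysteine and
    is padded with `pad` where it runs past the sequence ends.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue == "C":
            yield i + 1, padded[i:i + 2 * flank + 1]


# Toy input (hypothetical protein, not a real UniProt entry).
fasta_text = ">demo hypothetical protein\nMKC\nAAC\n"
records = parse_fasta(fasta_text)
windows = list(cysteine_windows(records["demo"], flank=2))
print(windows)
```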

Fig. 5.

   Screenshots of SIMLIN server (a) home page, (b) submission page, and
   (c) full list of the predicted S-sulphenylation sites of human proteome
   on the SIMLIN webserver

Conclusion

   In light of the biological importance of S-sulphenylation, it is
   imperative to develop easy-to-use computational approaches for the
   accurate identification of S-sulphenylation sites. In this article, we
   present SIMLIN, a hybrid computational framework that integrates RF,
   SVM, and neural network models with sequence and structural features
   of S-sulphenylated motifs and proteins. Performance assessment on both
   cross-validation and independent test sets demonstrated that SIMLIN
   achieved outstanding prediction performance compared to
   state-of-the-art computational approaches (MDD-SOH, SOHSite, SOHPRED,
   PRESS, iSulf-Cys, and SulCysSite) for S-sulphenylation prediction. A
   user-friendly webserver has also been implemented to provide
   high-quality predictions of human S-sulphenylation sites using the
   optimised hybrid SIMLIN framework. Proteome-wide predictions of
   S-sulphenylation sites for the entire human proteome extracted from
   the UniProt database have been made available on the SIMLIN
   webserver, with the aim of providing highly accurate S-sulphenylation
   site predictions and facilitating biologists’ efforts in experimental
   validation, hypothesis generation, and data analysis. We anticipate
   that SIMLIN will prove a useful tool for human S-sulphenylation
   prediction. This framework can also be applied more generally to the
   prediction of other protein PTMs.

Supplementary information

   [207]12859_2019_3178_MOESM1_ESM.docx (13.9 KB, docx)

   Additional file 1: Table S1. A detailed summary of the selected
   sequence and structural features using the MDL and mRMR feature
   selection methods. Table S2. The assigned weights of each node in the
   final neural network model.

Acknowledgments