Abstract The use of peptide drugs to treat cancer is gaining popularity because of their efficacy, fewer side effects, and several advantages over other properties. Identifying the peptides that interact with cancer proteins is crucial in drug discovery. Several approaches related to predicting peptide-protein interactions have been conducted. However, problems arise due to the high costs of resources and time and the smaller number of studies. This study predicts peptide-protein interactions using Random Forest, XGBoost, and SAE-DNN. Feature extraction is also performed on proteins and peptides using intrinsic disorder, amino acid sequences, physicochemical properties, position-specific assessment matrices, amino acid composition, and dipeptide composition. Results show that all algorithms perform equally well in predicting interactions between peptides derived from venoms and target proteins associated with cancer. However, XGBoost produces the best results with accuracy, precision, and area under the receiver operating characteristic curve of 0.859, 0.663, and 0.697, respectively. The enrichment analysis revealed that peptides from the Calloselasma rhodostoma venom targeted several proteins (ESR1, GOPC, and BRD4) related to cancer. Keywords: Bioinformatics, Biomedical, Cancer, Venom, Deep learning, Peptide 1. Introduction Patients with cancer are treated with three options, including surgery, radiotherapy, or pharmacotherapy, following their stage and tumor [[41]1]. Recently, cancer treatment using targeted therapeutics in the form of peptide drugs is becoming an emerging trend that provides improved efficacy and reduced side effects [[42]2]. Peptides have a crucial role in humans by interacting with various proteins and programming many cellular processes, such as cell death, and regulating gene expression [[43]3]. Furthermore, peptides have many advantages over small molecules and biologics, including simpler design, their ability to interact with unknown or unexplored targets, enhanced tissue penetration, and cheaper synthesis [[44]1]. Research in peptides has become a good start for designing novel therapeutics due to their safety and tolerability in human bodies, and identifying peptide-protein interaction is crucial for this task [[45]4]. However, finding new peptide-protein interactions often requires high costs and is very time-consuming [[46]3]. Several methods have been developed to speed up peptide drug discovery. Identifying peptide-protein interaction has two general approaches: sequence-based and structure-based. Sequence-based approaches use information in peptide and protein sequences to predict the interaction. For example, CGKronRLS [[47]5] and NRLMF [[48]6] calculate the similarity between sequences, and then use machine learning models to predict peptide-protein interaction. This approach generally needs known interaction between peptide and ligand and its similarity score as the input feature, which then created large-scale data and costs high computational power due to the enormous computational complexity of calculating similarity and the use of large-dimensional peptide and protein features [[49]4]. Structure-based approaches, such as molecular docking, solved the problem by modeling the structure of peptides and proteins at the atom level and predicting their affinities. Several docking strategies to determine peptide-protein interaction can be divided into local and global docking methods. Most of the docking approaches need three-dimensional (3D) structure information from the peptide and protein to calculate the binding energies. The usage of 3D structures consumes large computational power, resources, and time [[50]3]. The number of research and work regarding the identification of peptide-protein interaction using in silico approaches, such as machine learning or deep learning algorithm, remains relatively low although peptide drugs are becoming an emerging trend and the number of approved therapeutics using peptides has been increasing over the last decades. While the utilization of machine learning and deep learning techniques for peptide and protein interaction prediction is limited, these approaches have found significant application in the field of compound-protein interactions, specifically drug-target interactions (DTI). Several studies have successfully employed ensemble methods or deep learning methods to predict DTI. For instance, Ramadhanti et al. [[51]7] utilized Random Forest to predict the compounds that interact with proteins in the SARS-COV-2 virus, focusing on COVID-19-related DTI. Sajadi et al. [[52]8] introduced AutoDTI++, a deep unsupervised learning method that addresses the sparsity challenge in the interaction matrix and enables accurate prediction of drug-target interactions. Sulistiawan et al. [[53]9] employed a stacked autoencoder (SAE) for pretraining the weights in a deep neural network (DNN), aiming to achieve higher prediction accuracy in DTI, particularly those related to COVID-19. Furthermore, Fadli et al. [[54]10] conducted multilabel DTI prediction using the stack autoencoder-deep neural network (SAE-DNN) algorithm to address the limitations of binary classification models, which often overlook possible correlations between labels that could provide valuable information for enhancing the accuracy of DTI predictions. Thus, our study aimed to utilize ensemble method and deep learning techniques to predict peptide-protein interactions using peptides derived from Calloselasma rhodostoma venom, with a particular focus on cancer-associated proteins. The methodologies employed included ensemble methods such as Random Forest (RF) and XGBoost, as well as a deep learning approach utilizing a deep neural network (DNN) that was enhanced with stacked autoencoders for pre-training (SAE-DNN). The goal was to leverage these techniques to accurately predict interactions between the venom-derived peptides and the target proteins associated with cancer. RF and XGBoost are bagging and boosting tree-based algorithms known for their good classification performance in bioinformatics and can handle high-dataset issues with the feature selection process even with a higher number of variables [[55]11]. DNN is known for its ability to learn data representation and has achieved high performance in pattern recognition, computer vision, and bioinformatics [[56]12,[57]13]. We constructed two versions of the datasets: version 1 with a peptide sequence length of ≤50 and protein sequences length of ≤1500 and version 2 with a peptide sequence length of ≤19 (the average length of all peptide sequences) and protein sequences length of ≤1500. The feature profiles for peptides and proteins are constructed using intrinsic disorder, amino acid sequences, physicochemical properties, position-specific scoring matrix, amino acid composition, and dipeptide composition. We tested the peptide-protein interaction prediction model using the Calloselasma rhodostoma venom-derived peptides. 2. Materials and methods 2.1. Dataset Three datasets were collected: the cancer protein dataset, the peptide-protein interaction dataset, and the peptide extracted from the Calloselasma rhodostoma venom, which will be used as test data to obtain new peptide-protein interaction. The cancer protein dataset is taken from The Cancer Genome Atlas website ([58]https://www.cancer.gov/tcga accessed on October 3, 2022), resulting in a total of 571 proteins associated with cancer. The peptide data is collected from Protein Data Bank (PDB) [[59]14] and DrugBank [[60]15]. The peptide data in PDB is taken by first getting all the peptide and protein FASTA available, then using a protein-ligand interaction profiler (PLIP) to determine the interactions between peptide and protein. PLIP determines the interaction of peptide-protein by looking at seven types of interaction: hydrogen bonds, hydrophobic contact, pi-stacking, pi-caption interactions, salt bridges, water bridges, and halogen bonds [[61]16]. Peptide-protein pairs are considered to have interaction if at least one of the interaction types is detected. Then, the generated interactions were filtered by selecting the cancer-associated proteins. We screened the drug-target interactions with peptide-type drugs and cancer proteins in DrugBank. A total of 452 positive interactions were obtained, including 404 from PDB and 48 from DrugBank. Then, we generated the negative interactions to generate full datasets by randomly shuffling the peptides and proteins with unknown interactions. Five negative interactions were generated for each peptide by randomly sampling the non-interacting proteins. Overall, 2712 peptide-protein pairs were obtained in the training dataset. We combined every unique venom peptide and cancer protein for the test data, which resulted in 23,925 peptide-protein pairs. The test data consisted of 145 venom peptides. Venom was directly milked from Calloselasma rhodostoma snake. Freeze-dried venom was dissolved in buffer and digested using Trypsin Gold. The hydrolysate was measured for absorbance with a UV–Vis spectrophotometer at a wavelength of 585 nm. Then, trypsin-digested venom was subjected to high-resolution mass spectrometry. Hydrolysate was injected into the Acclaim® PepMap RSLC column and eluted using mobile phase A (water and 0.05 % formic acid) and B (water:acetonitrile at 20:80 and 0.1 % TFA). The MS and MS/MS were conducted using the NSI ionization method. Chromatograms of peptide fractions were obtained and analyzed using Xcalibur software. The dataset can be seen in [62]Supplementary File 1. 2.2. Feature extraction All peptide and protein sequences used have different lengths. We set the sequence to a certain length to make the same input feature size for all the peptides and proteins. We used two versions of data for the peptide data: peptide with a length of sequences of ≤50 [[63]4] and peptide with a length of sequences of ≤19 (the average length of all peptide sequences). We set the length of protein sequences at≥50 and ≤ 1500 for the protein data. Peptides with sequence lengths of <50 and 19 and proteins with <1500 amino acids were zero-padded to give the same input feature size. We then constructed the feature for both peptides and proteins using six feature profiles: amino acid sequences, physicochemical properties, intrinsic disorder, position-specific scoring matrix (PSSM), amino acid composition (AAC), and dipeptide composition (DPC). We encoded each residue using an integer between 1 and 21 (the number of amino acid types) for the amino acid sequences (AAS) feature. We encoded residue in sequence using an integer between 1 and 7, which is based on the combination of polarity and the hydropathy index, for physicochemical property representation. This index measures the energy of the transfer of an amino acid side chain from a hydrophobic solvent to water [[64]17]. [65]Table 1 shows the encoded rules. Table 1. Encoded physicochemical feature. Polarity Positive/negative of hydropathy index Amino acid Encoded integer Non-polar Positive Ala, Phe, Ile, Met, Leu, Pro, Val 1 Non-polar Negative Gly, Trp 2 Polar-uncharged Positive Cys 3 Polar-uncharged Negative Asn, Gln, Ser, Thr, Tyr 4 Negatively-charged Negative Asp, Glu 5 Positively-charged Negative Lys, His, Arg 6 Unknown Unknown Otherwise 7 [66]Open in a new tab Intrinsic disorder (ID) features represent the tendencies of disordered amino acid pairs to form contacts with other objects such as protein or RNA. Its value ranges from 0 (complete order) to 1 (complete disorder). Three forms of intrinsic disorder scores are used: long disorder score (considering the long disorder regions), short disorder score (considering the short stretches of disorder), and ANCHOR score (the probability of specific amino acids to be a part of the disorder region) [[67]18]. The use of intrinsic disorder for protein and peptide features can help increase the accuracy of peptide-protein interaction [[68]4]. This feature is generated using IUpred2 [[69]18]. PSSM feature is one the representation of protein sequences that can detect the homology between other protein sequences. This feature consists of an array S of size N × 20, with N as the length of the sequence and every position in Si, j as the probability of residue j in position i. PSSM profile is generated using psi-blast [[70]19] with the number of iteration of 3 and e-value of 0.001 which have been proved to work well for generating effective feature profiles from the protein sequences [[71]20]. The profile is then transformed using auto cross-covariance (ACC) after getting PSSM the profile of each peptide and protein. ACC can measure the correlation between any two properties and transform the PSSM profile into the same matrix. ACC is the combination of auto covariance between the same residue and cross-covariance (CC) between two different residues [[72]21]. ACC transformation in PSSM profiles resulted in 400 features of peptide and protein. AAC and dipeptide composition represent the percentage of single amino acids and dipeptides (2 combinations of amino acids, such as AA, AR, and AN) in the sequences [[73]22]. AAC converts the sequences into 20 features, while DPC converts them into 400 features. 2.3. Prediction model The prediction model used three algorithms: deep neural network (DNN) with pre-training using stacked autoencoder (SAE), which will be referred to as SAE-DNN, RF, and XGBoost. SAE is used as pre-training to get the initial weight for DNN, which can help produce an optimal model compared to a model using random initial weights [[74]23]. SAE training resulted in the weight and bias which will be used in the DNN training process. The activation functions in the output layer of DNN uses sigmoid because the data class is binary, while the other parameter will be tuned to find its optimal value. RF is popular machine learning based on multiple decision trees (ensemble). RF achieves better prediction performance through the usage of bagging, randomly subsetting the variables, and a majority voting system [[75]24]. RF can handle the high-dimensional dataset in bioinformatics by performing feature selection processes while building the prediction rules [[76]25]. This study will tune several parameters of RF, namely: n_estimators (the number of trees built), max_samples (the number of variable samples used in the training process), and max_depth (the depth of individual trees). XGBoost is a tree-based algorithm which is an optimization model that uses both the linear model and boosting tree model. XGBoost can improve the efficiency of the optimal results and achieve global optimal faster through the usage of the first derivative and second derivative of the loss function for second-order derivation [[77]26]. XGBoost also performs a feature selection process while being trained through the use of a gain score in each split tree and has already achieved good performance in the bioinformatics area [[78]21]. This study attempted several scenarios, with scenarios 1–7 searching an optimal dataset for the model while scenarios 8–13 compared SAE-DNN, RF, and XGBoost performance. [79]Table 2 shows the details of scenarios, with the only DrugBank data used from models 6–13. We used the peptide length of ≤50 and then compared the best model with the performance of the peptide length of ≤19 datasets to compare the performance of each scenario. Table 2. Scenarios details. Scenario Prediction model Feature used Model 1[80]^a SAE-DNN ID, PSSM Model 2[81]^a SAE-DNN ID, PSSM, and Amino Acid Composition (AAC) Model 3[82]^a SAE-DNN ID, PSSM, and Dipeptide Composition (DPC) Model 4[83]^a SAE-DNN ID, PSSM, AAS, and PP Model 5[84]^a SAE-DNN ID, PSSM, AAS, PP, and AAC Model 6 SAE-DNN ID, PSSM, AAS, and PP Model 7 SAE-DNN ID, PSSM, AAS, PP, and AAC Model 8 SAE-DNN with RF feature selection ID, PSSM, AAS, PP, and AAC Model 9 SAE-DNN with XGBoost feature selection ID, PSSM, AAS, PP, and AAC Model 10 RF prediction model ID, PSSM, AAS, PP, and AAC Model 11 XGBoost prediction model ID, PSSM, AAS, PP, and AAC Model 12[85]^b RF prediction model with class_weight settings ID, PSSM, AAS, PP, and AAC Model 13[86]^b XGBoost prediction model with class_weight settings ID, PSSM, AAS, PP, and AAC [87]Open in a new tab ^a DrugBank dataset excluded. ^b The weights are calculated using the inverse proportion of class frequencies. Hyperparameter tuning was also conducted using Bayesian Optimization to produce the optimal model. [88]Table 3 shows the hyperparameter search space. Table 3. Hyperparameter search space. Hyperparameter Values HL0 Node 100–2000 HLi Node 0.5 × (HL i-1)–0.75 × (HLi-1) Hidden layer 1–6 Learning rate/eta[89]^b 0.01–0.1 Dropout rate 0.2–0.7 n_estimators[90]^a[91]^b 100–500 colsample_bytree[92]^b/max_samples[93]^a 0.1–1 max_depth[94]^a 3–6 [95]Open in a new tab ^a for the RF model. ^b for XGBoost model. 2.4. Model evaluation The model is trained using StratifiedKFold Cross Validation to balance the number of positive and negative interactions, with the number of folds, K = 5. The evaluation metrics include accuracy, recall, precision, Fscore, receiver operating characteristic curve (ROC), and precision-recall curve (PRC). The accuracy value measures how well the test data predicts. Precision measures the percentage of positive predictions against a positive class. Recall measures the accuracy of the positive prediction of the model. Fscore measures the performance of the minority class. The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 - FPR) [[96]9]. The PRC curve shows the comparison between precision and recall which aimed to measure how well the model predicted the class in imbalanced settings [[97]27]. 2.5. Enrichment analysis Enrichr ([98]https://maayanlab.cloud/Enrichr/) [[99]28] is a database used to search for protein pathway enrichment. Pathway enrichment analysis consists of identifying the KEGG Pathway and Gene Ontology (Biological Processes, Molecular Functions, and Cellular Components). The online tool SRPlot ([100]http://www.bioinformatics.com.cn/). Cytoscape v3.9.1 ([101]https://cytoscape.org/) [[102]29] was used to build protein-enrichment and protein-peptide-enrichment network topologies to visualize predicted results of pathway enrichment. 3. Results and discussion 3.1. Prediction results of dataset version 1 [103]Supplementary Table 4 shows all the prediction models trained using the optimal hyperparameters. [104]Table 4 shows all scenario comparison results. Table 4. All scenario comparison results with peptide length of ≤50 dataset. Metrics Accuracy Recall Precision ROC-AUC Fscore Model 1 0.825 ± 0.005 0.136 ± 0.046 0.429 ± 0.040 0.647 ± 0.030 0.202 ± 0.053 2 0.831 ± 0.024 0.341 ± 0.062 0.509 ± 0.094 0.723 ± 0.036 0.402 ± 0.058 3 0.832 ± 0.016 0.428 ± 0.047 0.502 ± 0.052 0.733 ± 0.031 0.460 ± 0.042 4 0.825 ± 0.020 0.349 ± 0.062 0.494 ± 0.091 0.726 ± 0.013 0.397 ± 0.035 5 0.822 ± 0.016 0.495 ± 0.014 0.485 ± 0.089 0.761 ± 0.024 0.486 ± 0.047 6 0.804 ± 0.003 0.433 ± 0.079 0.428 ± 0.011 0.654 ± 0.023 0.428 ± 0.043 7 0.818 ± 0.020 0.358 ± 0.118 0.480 ± 0.094 0.736 ± 0.025 0.393 ± 0.084 8 0.828 ± 0.012 0.427 ± 0.072 0.499 ± 0.044 0.743 ± 0.013 0.457 ± 0.051 9 0.830 ± 0.017 0.413 ± 0.100 0.518 ± 0.055 0.753 ± 0.017 0.448 ± 0.050 10 0.842 ± 0.007 0.148 ± 0.012 0.709 ± 0.133 0.688 ± 0.039 0.243 ± 0.021 11 0.859 ± 0.006 0.369 ± 0.040 0.663 ± 0.055 0.697 ± 0.020 0.466 ± 0.025 12 0.800 ± 0.014 0.433 ± 0.067 0.419 ± 0.037 0.685 ± 0.042 0.455 ± 0.032 13 0.843 ± 0.009 0.400 ± 0.023 0.560 ± 0.038 0.690 ± 0.029 0.466 ± 0.023 [105]Open in a new tab All models generally produced fairly good average accuracy with a score of ≥80 %. All models have a decent result in an average ROC-AUC of >0.6. However, all models produced a bad recall and Fscore, with a score of under 0.4 in both metrics. A low recall score indicated a high number of false negatives in the models which is a common case in imbalanced class problems. The ROC curve is used for another measurement to evaluate the model. The ROC curve of each model can be seen in [106]Fig. 1a and b. The ROC curve closer to the top-left corner indicates the following. Higher AUC means that the model can distinguish classes between positive and negative. Hence, model 5 has the best AUC in one of its folds of 0.8, while the other models produced almost similar scores of AUC with only model 1 scoring below 0.7. This means that all scenarios, except model 1, have at least a 73 % chance to differentiate the class between positive and negative. Fig. 1a. [107]Fig. 1a [108]Open in a new tab ROC curve for model 1–5. [109]Fig. 1b. ROC curve for model 6-13. An imbalanced setting is PRC is another metric used to measure model performance, which can be seen in [110]Fig. 2a and b. PRC shows the trade-off between precision and recall, and a model is considered to perform better when moving closer to point (1,1) in the graph. Hence, no models tend to be close to the ideal point with some of the models having a high precision and low recall. A model with high precision but low recall means returning very few positive classes, but most of its predictions were correct. Fig. 2a. [111]Fig. 2a [112]Open in a new tab Precision-Recall (PR) curve for model 1–5. [113]Fig. 2b. Precision-Recall (PR) curve for model 6-13. The scenarios were further compared by measuring the confusion matrix on the last fold of the CV. [114]Table 5 shows a comparison of the results. Table 5. All scenarios’ confusion matrix comparison results. Metrics Model __________________________________________________________________ __________________________________________________________________ __________________________________________________________________ __________________________________________________________________ __________________________________________________________________ __________________________________________________________________ __________________________________________________________________ 1 2 3 4 5 6 7 8 9 10 11 12 13 True Positive (TP) 11 32 29 31 40 37 17 41 31 13 32 34 33 True Negative (TN) 392 387 374 374 360 390 428 408 418 437 427 390 413 False Negative (FN) 69 48 51 49 40 53 73 49 59 77 58 56 57 False Positive (FP) 12 17 30 30 44 48 10 30 20 1 11 48 25 [115]Open in a new tab Models 5 and 8 had the highest number of TP (40 and 41 consecutively), but they also produced many FP which is something that should be avoided in peptide-protein interactions. Model 10 produced the lowest number of positive predictions with only 14. Hence, we considered models 5, 10, and 11 to be the better model among others with model 11 being the best due to its high number of TP (32) and low FP (11) as well as its performance in all evaluation metrics in [116]Table 4. 3.2. Prediction comparison of dataset version 1 and version 2 using the three best model Then, we compared the performance of models 5, 10, and 11 with the data version 2 which used the peptide sequence length of ≤19 (the average length of all peptide sequences). The comparison results can be seen in [117]Fig. 3. Fig. 3. [118]Fig. 3 [119]Open in a new tab Comparison results for dataset versions 1 and 2 using models 5, 10, and 11. Overall, the performance of dataset versions 1 and 2 revealed no significant differences. Dataset version 2 has a slightly better performance in model 10 while dataset version 1 had a better performance in models 5 and 11. [120]Fig. 4 shows the comparison for the ROC curve, while [121]Fig. 5 shows the PR curve comparison of each dataset. Fig. 4. [122]Fig. 4 [123]Open in a new tab ROC curve comparison for dataset versions 1 and 2 using models 5, 10, and 11. Fig. 5. [124]Fig. 5 [125]Open in a new tab Precision-Recall curve comparison for dataset versions 1 and 2 using models 5, 10, and 11. 3.3. Venom peptide prediction results in dataset version 1 The venom peptide prediction is performed using models 5, 10, and 11. Prediction using model 5 resulted in 23,532 pairs of peptide-protein with all peptides predicted to have interaction with cancer proteins. Only protein AFDN had no interaction with venom peptide. Prediction using model 10 produced 169 pairs of peptide-protein with 113 unique peptides predicted to have an interaction with only four proteins (Estrogen Receptor 1 [ESR1], Golgi-associated PDZ and coiled-coil motif-containing protein [GOPC], Bromodomain-containing protein 4 [BRD4], and androgen receptor [AR]), with protein ESR1 having the most peptide interaction of 112 peptides, followed by GOPC (48), BRD4 (8), and AR (1). Prediction using model 11 produced 765 peptide-protein pairs with 140 distinct peptides and 66 proteins. The top 5 protein with the most interaction with a peptide is ESR1 with 90 peptides, followed by EGFR (76), GOPC (69), BCL6 (47), and YWHAE/14-3-3 protein epsilon (42). Venom peptide prediction with these three models also yielded the same 121 pairs of peptide-protein that consisted of 86 unique peptides and 3 proteins (ESR1, GOPC, and BRD4). [126]Supplementary File 2 shows all of these overlap prediction results. 3.4. Venom peptide prediction results in dataset version 2 Venom peptide prediction using dataset version 2 produced a slight difference result. Model 5 prediction resulted in 145 pairs of peptide-protein with all peptides predicted to have interaction with only 1 protein (CREBPP). Model 10 prediction yielded 121 pairs of peptide-protein with 108 unique peptides predicted to have an interaction with only three proteins (ESR1, GOPC, and BRD4), with protein ESR1 having the most peptide interaction of 107 peptides, followed by GOPC (7) and BRD4 (7). Prediction using model 11 produced 1056 peptide-protein pairs with 140 distinct peptides and 71 proteins. The top 5 protein with the most interaction with a peptide is ESR1 with 102 peptides, followed by AR (86), BRD4 (83), GOPC (73), and BCL6 (58). Peptide-protein pairs have no overlap among these three prediction model results. The overlap pairs were found in models 10 and 11 prediction results, with a total of 95 pairs that consisted of 86 unique peptides and 3 proteins (ESR1, GOPC, and BRD4). [127]Supplementary File 3 shows all of these overlap prediction results. A similarity was found in pairs of peptide-protein between the prediction of dataset versions 1 and 2, with 67 pairs which consisted of 64 unique peptides and 3 proteins (ESR1, GOPC, and BRD4). [128]Supplementary File 4 shows these overlap pairs. 3.5. Enrichment analysis results The identification results of protein overlap interactions with peptides included 83 peptides for ESR1, 6 for GOPC, and 6 for BRD4. Gene ontology and KEGG pathways are identified from ESR1, GOPC, and BRD4 proteins using the Enrichr database, as seen in [129]Fig. 6, [130]Fig. 7, [131]Fig. 8, [132]Fig. 9 and [133]Supplementary File 5. [134]Fig. 10 shows the topological network of protein-peptide-enrichment. Fig. 6. [135]Fig. 6 [136]Open in a new tab GO Biological Processes: a. Top 10 terms with the lowest p-value; b. network of protein-GO Biological Processes, darker colors indicate the terms with a p-value of <0.01(enrichment terms explanation can be seen in [137]Supplementary File 5). (For interpretation of the references to