Abstract

   The use of peptide drugs to treat cancer is gaining popularity because
   of their efficacy, fewer side effects, and several advantages over
   other properties. Identifying the peptides that interact with cancer
   proteins is crucial in drug discovery. Several approaches related to
   predicting peptide-protein interactions have been conducted. However,
   problems arise due to the high costs of resources and time and the
   smaller number of studies. This study predicts peptide-protein
   interactions using Random Forest, XGBoost, and SAE-DNN. Feature
   extraction is also performed on proteins and peptides using intrinsic
   disorder, amino acid sequences, physicochemical properties,
   position-specific assessment matrices, amino acid composition, and
   dipeptide composition. Results show that all algorithms perform equally
   well in predicting interactions between peptides derived from venoms
   and target proteins associated with cancer. However, XGBoost produces
   the best results with accuracy, precision, and area under the receiver
   operating characteristic curve of 0.859, 0.663, and 0.697,
   respectively. The enrichment analysis revealed that peptides from the
   Calloselasma rhodostoma venom targeted several proteins (ESR1, GOPC,
   and BRD4) related to cancer.

   Keywords: Bioinformatics, Biomedical, Cancer, Venom, Deep learning,
   Peptide

1. Introduction

   Patients with cancer are treated with three options, including surgery,
   radiotherapy, or pharmacotherapy, following their stage and tumor
   [[41]1]. Recently, cancer treatment using targeted therapeutics in the
   form of peptide drugs is becoming an emerging trend that provides
   improved efficacy and reduced side effects [[42]2]. Peptides have a
   crucial role in humans by interacting with various proteins and
   programming many cellular processes, such as cell death, and regulating
   gene expression [[43]3].

   Furthermore, peptides have many advantages over small molecules and
   biologics, including simpler design, their ability to interact with
   unknown or unexplored targets, enhanced tissue penetration, and cheaper
   synthesis [[44]1]. Research in peptides has become a good start for
   designing novel therapeutics due to their safety and tolerability in
   human bodies, and identifying peptide-protein interaction is crucial
   for this task [[45]4]. However, finding new peptide-protein
   interactions often requires high costs and is very time-consuming
   [[46]3]. Several methods have been developed to speed up peptide drug
   discovery.

   Identifying peptide-protein interaction has two general approaches:
   sequence-based and structure-based. Sequence-based approaches use
   information in peptide and protein sequences to predict the
   interaction. For example, CGKronRLS [[47]5] and NRLMF [[48]6] calculate
   the similarity between sequences, and then use machine learning models
   to predict peptide-protein interaction. This approach generally needs
   known interaction between peptide and ligand and its similarity score
   as the input feature, which then created large-scale data and costs
   high computational power due to the enormous computational complexity
   of calculating similarity and the use of large-dimensional peptide and
   protein features [[49]4]. Structure-based approaches, such as molecular
   docking, solved the problem by modeling the structure of peptides and
   proteins at the atom level and predicting their affinities. Several
   docking strategies to determine peptide-protein interaction can be
   divided into local and global docking methods. Most of the docking
   approaches need three-dimensional (3D) structure information from the
   peptide and protein to calculate the binding energies. The usage of 3D
   structures consumes large computational power, resources, and time
   [[50]3].

   The number of research and work regarding the identification of
   peptide-protein interaction using in silico approaches, such as machine
   learning or deep learning algorithm, remains relatively low although
   peptide drugs are becoming an emerging trend and the number of approved
   therapeutics using peptides has been increasing over the last decades.
   While the utilization of machine learning and deep learning techniques
   for peptide and protein interaction prediction is limited, these
   approaches have found significant application in the field of
   compound-protein interactions, specifically drug-target interactions
   (DTI). Several studies have successfully employed ensemble methods or
   deep learning methods to predict DTI. For instance, Ramadhanti et al.
   [[51]7] utilized Random Forest to predict the compounds that interact
   with proteins in the SARS-COV-2 virus, focusing on COVID-19-related
   DTI. Sajadi et al. [[52]8] introduced AutoDTI++, a deep unsupervised
   learning method that addresses the sparsity challenge in the
   interaction matrix and enables accurate prediction of drug-target
   interactions. Sulistiawan et al. [[53]9] employed a stacked autoencoder
   (SAE) for pretraining the weights in a deep neural network (DNN),
   aiming to achieve higher prediction accuracy in DTI, particularly those
   related to COVID-19. Furthermore, Fadli et al. [[54]10] conducted
   multilabel DTI prediction using the stack autoencoder-deep neural
   network (SAE-DNN) algorithm to address the limitations of binary
   classification models, which often overlook possible correlations
   between labels that could provide valuable information for enhancing
   the accuracy of DTI predictions.

   Thus, our study aimed to utilize ensemble method and deep learning
   techniques to predict peptide-protein interactions using peptides
   derived from Calloselasma rhodostoma venom, with a particular focus on
   cancer-associated proteins. The methodologies employed included
   ensemble methods such as Random Forest (RF) and XGBoost, as well as a
   deep learning approach utilizing a deep neural network (DNN) that was
   enhanced with stacked autoencoders for pre-training (SAE-DNN). The goal
   was to leverage these techniques to accurately predict interactions
   between the venom-derived peptides and the target proteins associated
   with cancer. RF and XGBoost are bagging and boosting tree-based
   algorithms known for their good classification performance in
   bioinformatics and can handle high-dataset issues with the feature
   selection process even with a higher number of variables [[55]11]. DNN
   is known for its ability to learn data representation and has achieved
   high performance in pattern recognition, computer vision, and
   bioinformatics [[56]12,[57]13]. We constructed two versions of the
   datasets: version 1 with a peptide sequence length of ≤50 and protein
   sequences length of ≤1500 and version 2 with a peptide sequence length
   of ≤19 (the average length of all peptide sequences) and protein
   sequences length of ≤1500. The feature profiles for peptides and
   proteins are constructed using intrinsic disorder, amino acid
   sequences, physicochemical properties, position-specific scoring
   matrix, amino acid composition, and dipeptide composition. We tested
   the peptide-protein interaction prediction model using the Calloselasma
   rhodostoma venom-derived peptides.

2. Materials and methods

2.1. Dataset

   Three datasets were collected: the cancer protein dataset, the
   peptide-protein interaction dataset, and the peptide extracted from the
   Calloselasma rhodostoma venom, which will be used as test data to
   obtain new peptide-protein interaction. The cancer protein dataset is
   taken from The Cancer Genome Atlas website
   ([58]https://www.cancer.gov/tcga accessed on October 3, 2022),
   resulting in a total of 571 proteins associated with cancer. The
   peptide data is collected from Protein Data Bank (PDB) [[59]14] and
   DrugBank [[60]15]. The peptide data in PDB is taken by first getting
   all the peptide and protein FASTA available, then using a
   protein-ligand interaction profiler (PLIP) to determine the
   interactions between peptide and protein. PLIP determines the
   interaction of peptide-protein by looking at seven types of
   interaction: hydrogen bonds, hydrophobic contact, pi-stacking,
   pi-caption interactions, salt bridges, water bridges, and halogen bonds
   [[61]16]. Peptide-protein pairs are considered to have interaction if
   at least one of the interaction types is detected. Then, the generated
   interactions were filtered by selecting the cancer-associated proteins.
   We screened the drug-target interactions with peptide-type drugs and
   cancer proteins in DrugBank. A total of 452 positive interactions were
   obtained, including 404 from PDB and 48 from DrugBank.

   Then, we generated the negative interactions to generate full datasets
   by randomly shuffling the peptides and proteins with unknown
   interactions. Five negative interactions were generated for each
   peptide by randomly sampling the non-interacting proteins. Overall,
   2712 peptide-protein pairs were obtained in the training dataset.

   We combined every unique venom peptide and cancer protein for the test
   data, which resulted in 23,925 peptide-protein pairs. The test data
   consisted of 145 venom peptides. Venom was directly milked from
   Calloselasma rhodostoma snake. Freeze-dried venom was dissolved in
   buffer and digested using Trypsin Gold. The hydrolysate was measured
   for absorbance with a UV–Vis spectrophotometer at a wavelength of
   585 nm. Then, trypsin-digested venom was subjected to high-resolution
   mass spectrometry. Hydrolysate was injected into the Acclaim® PepMap
   RSLC column and eluted using mobile phase A (water and 0.05 % formic
   acid) and B (water:acetonitrile at 20:80 and 0.1 % TFA). The MS and
   MS/MS were conducted using the NSI ionization method. Chromatograms of
   peptide fractions were obtained and analyzed using Xcalibur software.
   The dataset can be seen in [62]Supplementary File 1.

2.2. Feature extraction

   All peptide and protein sequences used have different lengths. We set
   the sequence to a certain length to make the same input feature size
   for all the peptides and proteins. We used two versions of data for the
   peptide data: peptide with a length of sequences of ≤50 [[63]4] and
   peptide with a length of sequences of ≤19 (the average length of all
   peptide sequences). We set the length of protein sequences at≥50
   and ≤ 1500 for the protein data. Peptides with sequence lengths of <50
   and 19 and proteins with <1500 amino acids were zero-padded to give the
   same input feature size. We then constructed the feature for both
   peptides and proteins using six feature profiles: amino acid sequences,
   physicochemical properties, intrinsic disorder, position-specific
   scoring matrix (PSSM), amino acid composition (AAC), and dipeptide
   composition (DPC).

   We encoded each residue using an integer between 1 and 21 (the number
   of amino acid types) for the amino acid sequences (AAS) feature. We
   encoded residue in sequence using an integer between 1 and 7, which is
   based on the combination of polarity and the hydropathy index, for
   physicochemical property representation. This index measures the energy
   of the transfer of an amino acid side chain from a hydrophobic solvent
   to water [[64]17]. [65]Table 1 shows the encoded rules.

Table 1.

   Encoded physicochemical feature.
   Polarity Positive/negative of hydropathy index Amino acid Encoded
   integer
   Non-polar Positive Ala, Phe, Ile, Met, Leu, Pro, Val 1
   Non-polar Negative Gly, Trp 2
   Polar-uncharged Positive Cys 3
   Polar-uncharged Negative Asn, Gln, Ser, Thr, Tyr 4
   Negatively-charged Negative Asp, Glu 5
   Positively-charged Negative Lys, His, Arg 6
   Unknown Unknown Otherwise 7
   [66]Open in a new tab

   Intrinsic disorder (ID) features represent the tendencies of disordered
   amino acid pairs to form contacts with other objects such as protein or
   RNA. Its value ranges from 0 (complete order) to 1 (complete disorder).
   Three forms of intrinsic disorder scores are used: long disorder score
   (considering the long disorder regions), short disorder score
   (considering the short stretches of disorder), and ANCHOR score (the
   probability of specific amino acids to be a part of the disorder
   region) [[67]18]. The use of intrinsic disorder for protein and peptide
   features can help increase the accuracy of peptide-protein interaction
   [[68]4]. This feature is generated using IUpred2 [[69]18].

   PSSM feature is one the representation of protein sequences that can
   detect the homology between other protein sequences. This feature
   consists of an array S of size N × 20, with N as the length of the
   sequence and every position in Si, j as the probability of residue j in
   position i. PSSM profile is generated using psi-blast [[70]19] with the
   number of iteration of 3 and e-value of 0.001 which have been proved to
   work well for generating effective feature profiles from the protein
   sequences [[71]20]. The profile is then transformed using auto
   cross-covariance (ACC) after getting PSSM the profile of each peptide
   and protein. ACC can measure the correlation between any two properties
   and transform the PSSM profile into the same matrix. ACC is the
   combination of auto covariance between the same residue and
   cross-covariance (CC) between two different residues [[72]21]. ACC
   transformation in PSSM profiles resulted in 400 features of peptide and
   protein. AAC and dipeptide composition represent the percentage of
   single amino acids and dipeptides (2 combinations of amino acids, such
   as AA, AR, and AN) in the sequences [[73]22]. AAC converts the
   sequences into 20 features, while DPC converts them into 400 features.

2.3. Prediction model

   The prediction model used three algorithms: deep neural network (DNN)
   with pre-training using stacked autoencoder (SAE), which will be
   referred to as SAE-DNN, RF, and XGBoost. SAE is used as pre-training to
   get the initial weight for DNN, which can help produce an optimal model
   compared to a model using random initial weights [[74]23]. SAE training
   resulted in the weight and bias which will be used in the DNN training
   process. The activation functions in the output layer of DNN uses
   sigmoid because the data class is binary, while the other parameter
   will be tuned to find its optimal value.

   RF is popular machine learning based on multiple decision trees
   (ensemble). RF achieves better prediction performance through the usage
   of bagging, randomly subsetting the variables, and a majority voting
   system [[75]24]. RF can handle the high-dimensional dataset in
   bioinformatics by performing feature selection processes while building
   the prediction rules [[76]25]. This study will tune several parameters
   of RF, namely: n_estimators (the number of trees built), max_samples
   (the number of variable samples used in the training process), and
   max_depth (the depth of individual trees).

   XGBoost is a tree-based algorithm which is an optimization model that
   uses both the linear model and boosting tree model. XGBoost can improve
   the efficiency of the optimal results and achieve global optimal faster
   through the usage of the first derivative and second derivative of the
   loss function for second-order derivation [[77]26]. XGBoost also
   performs a feature selection process while being trained through the
   use of a gain score in each split tree and has already achieved good
   performance in the bioinformatics area [[78]21].

   This study attempted several scenarios, with scenarios 1–7 searching an
   optimal dataset for the model while scenarios 8–13 compared SAE-DNN,
   RF, and XGBoost performance. [79]Table 2 shows the details of
   scenarios, with the only DrugBank data used from models 6–13. We used
   the peptide length of ≤50 and then compared the best model with the
   performance of the peptide length of ≤19 datasets to compare the
   performance of each scenario.

Table 2.

   Scenarios details.
   Scenario Prediction model Feature used
   Model 1[80]^a SAE-DNN ID, PSSM
   Model 2[81]^a SAE-DNN ID, PSSM, and Amino Acid Composition (AAC)
   Model 3[82]^a SAE-DNN ID, PSSM, and Dipeptide Composition (DPC)
   Model 4[83]^a SAE-DNN ID, PSSM, AAS, and PP
   Model 5[84]^a SAE-DNN ID, PSSM, AAS, PP, and AAC
   Model 6 SAE-DNN ID, PSSM, AAS, and PP
   Model 7 SAE-DNN ID, PSSM, AAS, PP, and AAC
   Model 8 SAE-DNN with RF feature selection ID, PSSM, AAS, PP, and AAC
   Model 9 SAE-DNN with XGBoost feature selection ID, PSSM, AAS, PP, and
   AAC
   Model 10 RF prediction model ID, PSSM, AAS, PP, and AAC
   Model 11 XGBoost prediction model ID, PSSM, AAS, PP, and AAC
   Model 12[85]^b RF prediction model with class_weight settings ID, PSSM,
   AAS, PP, and AAC
   Model 13[86]^b XGBoost prediction model with class_weight settings ID,
   PSSM, AAS, PP, and AAC
   [87]Open in a new tab
   ^a

   DrugBank dataset excluded.
   ^b

   The weights are calculated using the inverse proportion of class
   frequencies.

   Hyperparameter tuning was also conducted using Bayesian Optimization to
   produce the optimal model. [88]Table 3 shows the hyperparameter search
   space.

Table 3.

   Hyperparameter search space.
                Hyperparameter                         Values
   HL0 Node                                 100–2000
   HLi Node                                 0.5 × (HL i-1)–0.75 × (HLi-1)
   Hidden layer                             1–6
   Learning rate/eta[89]^b                  0.01–0.1
   Dropout rate                             0.2–0.7
   n_estimators[90]^a[91]^b                 100–500
   colsample_bytree[92]^b/max_samples[93]^a 0.1–1
   max_depth[94]^a                          3–6
   [95]Open in a new tab
   ^a

   for the RF model.
   ^b

   for XGBoost model.

2.4. Model evaluation

   The model is trained using StratifiedKFold Cross Validation to balance
   the number of positive and negative interactions, with the number of
   folds, K = 5. The evaluation metrics include accuracy, recall,
   precision, Fscore, receiver operating characteristic curve (ROC), and
   precision-recall curve (PRC). The accuracy value measures how well the
   test data predicts. Precision measures the percentage of positive
   predictions against a positive class. Recall measures the accuracy of
   the positive prediction of the model. Fscore measures the performance
   of the minority class. The ROC curve shows the trade-off between
   sensitivity (or TPR) and specificity (1 - FPR) [[96]9]. The PRC curve
   shows the comparison between precision and recall which aimed to
   measure how well the model predicted the class in imbalanced settings
   [[97]27].

2.5. Enrichment analysis

   Enrichr ([98]https://maayanlab.cloud/Enrichr/) [[99]28] is a database
   used to search for protein pathway enrichment. Pathway enrichment
   analysis consists of identifying the KEGG Pathway and Gene Ontology
   (Biological Processes, Molecular Functions, and Cellular Components).
   The online tool SRPlot ([100]http://www.bioinformatics.com.cn/).
   Cytoscape v3.9.1 ([101]https://cytoscape.org/) [[102]29] was used to
   build protein-enrichment and protein-peptide-enrichment network
   topologies to visualize predicted results of pathway enrichment.

3. Results and discussion

3.1. Prediction results of dataset version 1

   [103]Supplementary Table 4 shows all the prediction models trained
   using the optimal hyperparameters. [104]Table 4 shows all scenario
   comparison results.

Table 4.

   All scenario comparison results with peptide length of ≤50 dataset.
   Metrics Accuracy Recall Precision ROC-AUC Fscore
   Model 1 0.825 ± 0.005 0.136 ± 0.046 0.429 ± 0.040 0.647 ± 0.030 0.202 ±
   0.053
   2 0.831 ± 0.024 0.341 ± 0.062 0.509 ± 0.094 0.723 ± 0.036 0.402 ± 0.058
   3 0.832 ± 0.016 0.428 ± 0.047 0.502 ± 0.052 0.733 ± 0.031 0.460 ± 0.042
   4 0.825 ± 0.020 0.349 ± 0.062 0.494 ± 0.091 0.726 ± 0.013 0.397 ± 0.035
   5 0.822 ± 0.016 0.495 ± 0.014 0.485 ± 0.089 0.761 ± 0.024 0.486 ± 0.047
   6 0.804 ± 0.003 0.433 ± 0.079 0.428 ± 0.011 0.654 ± 0.023 0.428 ± 0.043
   7 0.818 ± 0.020 0.358 ± 0.118 0.480 ± 0.094 0.736 ± 0.025 0.393 ± 0.084
   8 0.828 ± 0.012 0.427 ± 0.072 0.499 ± 0.044 0.743 ± 0.013 0.457 ± 0.051
   9 0.830 ± 0.017 0.413 ± 0.100 0.518 ± 0.055 0.753 ± 0.017 0.448 ± 0.050
   10 0.842 ± 0.007 0.148 ± 0.012 0.709 ± 0.133 0.688 ± 0.039 0.243 ±
   0.021
   11 0.859 ± 0.006 0.369 ± 0.040 0.663 ± 0.055 0.697 ± 0.020 0.466 ±
   0.025
   12 0.800 ± 0.014 0.433 ± 0.067 0.419 ± 0.037 0.685 ± 0.042 0.455 ±
   0.032
   13 0.843 ± 0.009 0.400 ± 0.023 0.560 ± 0.038 0.690 ± 0.029 0.466 ±
   0.023
   [105]Open in a new tab

   All models generally produced fairly good average accuracy with a score
   of ≥80 %. All models have a decent result in an average ROC-AUC of
   >0.6. However, all models produced a bad recall and Fscore, with a
   score of under 0.4 in both metrics. A low recall score indicated a high
   number of false negatives in the models which is a common case in
   imbalanced class problems.

   The ROC curve is used for another measurement to evaluate the model.
   The ROC curve of each model can be seen in [106]Fig. 1a and b. The ROC
   curve closer to the top-left corner indicates the following. Higher AUC
   means that the model can distinguish classes between positive and
   negative. Hence, model 5 has the best AUC in one of its folds of 0.8,
   while the other models produced almost similar scores of AUC with only
   model 1 scoring below 0.7. This means that all scenarios, except model
   1, have at least a 73 % chance to differentiate the class between
   positive and negative.

Fig. 1a.

   [107]Fig. 1a
   [108]Open in a new tab

   ROC curve for model 1–5.

   [109]Fig. 1b. ROC curve for model 6-13.

   An imbalanced setting is PRC is another metric used to measure model
   performance, which can be seen in [110]Fig. 2a and b. PRC shows the
   trade-off between precision and recall, and a model is considered to
   perform better when moving closer to point (1,1) in the graph. Hence,
   no models tend to be close to the ideal point with some of the models
   having a high precision and low recall. A model with high precision but
   low recall means returning very few positive classes, but most of its
   predictions were correct.

Fig. 2a.

   [111]Fig. 2a
   [112]Open in a new tab

   Precision-Recall (PR) curve for model 1–5.

   [113]Fig. 2b. Precision-Recall (PR) curve for model 6-13.

   The scenarios were further compared by measuring the confusion matrix
   on the last fold of the CV. [114]Table 5 shows a comparison of the
   results.

Table 5.

   All scenarios’ confusion matrix comparison results.
   Metrics Model
     __________________________________________________________________
     __________________________________________________________________
     __________________________________________________________________
     __________________________________________________________________
     __________________________________________________________________
     __________________________________________________________________
     __________________________________________________________________

   1 2 3 4 5 6 7 8 9 10 11 12 13
   True Positive (TP) 11 32 29 31 40 37 17 41 31 13 32 34 33
   True Negative (TN) 392 387 374 374 360 390 428 408 418 437 427 390 413
   False Negative (FN) 69 48 51 49 40 53 73 49 59 77 58 56 57
   False Positive (FP) 12 17 30 30 44 48 10 30 20 1 11 48 25
   [115]Open in a new tab

   Models 5 and 8 had the highest number of TP (40 and 41 consecutively),
   but they also produced many FP which is something that should be
   avoided in peptide-protein interactions. Model 10 produced the lowest
   number of positive predictions with only 14. Hence, we considered
   models 5, 10, and 11 to be the better model among others with model 11
   being the best due to its high number of TP (32) and low FP (11) as
   well as its performance in all evaluation metrics in [116]Table 4.

3.2. Prediction comparison of dataset version 1 and version 2 using the three
best model

   Then, we compared the performance of models 5, 10, and 11 with the data
   version 2 which used the peptide sequence length of ≤19 (the average
   length of all peptide sequences). The comparison results can be seen in
   [117]Fig. 3.

Fig. 3.

   [118]Fig. 3
   [119]Open in a new tab

   Comparison results for dataset versions 1 and 2 using models 5, 10, and
   11.

   Overall, the performance of dataset versions 1 and 2 revealed no
   significant differences. Dataset version 2 has a slightly better
   performance in model 10 while dataset version 1 had a better
   performance in models 5 and 11. [120]Fig. 4 shows the comparison for
   the ROC curve, while [121]Fig. 5 shows the PR curve comparison of each
   dataset.

Fig. 4.

   [122]Fig. 4
   [123]Open in a new tab

   ROC curve comparison for dataset versions 1 and 2 using models 5, 10,
   and 11.

Fig. 5.

   [124]Fig. 5
   [125]Open in a new tab

   Precision-Recall curve comparison for dataset versions 1 and 2 using
   models 5, 10, and 11.

3.3. Venom peptide prediction results in dataset version 1

   The venom peptide prediction is performed using models 5, 10, and 11.
   Prediction using model 5 resulted in 23,532 pairs of peptide-protein
   with all peptides predicted to have interaction with cancer proteins.
   Only protein AFDN had no interaction with venom peptide.

   Prediction using model 10 produced 169 pairs of peptide-protein with
   113 unique peptides predicted to have an interaction with only four
   proteins (Estrogen Receptor 1 [ESR1], Golgi-associated PDZ and
   coiled-coil motif-containing protein [GOPC], Bromodomain-containing
   protein 4 [BRD4], and androgen receptor [AR]), with protein ESR1 having
   the most peptide interaction of 112 peptides, followed by GOPC (48),
   BRD4 (8), and AR (1).

   Prediction using model 11 produced 765 peptide-protein pairs with 140
   distinct peptides and 66 proteins. The top 5 protein with the most
   interaction with a peptide is ESR1 with 90 peptides, followed by EGFR
   (76), GOPC (69), BCL6 (47), and YWHAE/14-3-3 protein epsilon (42).

   Venom peptide prediction with these three models also yielded the same
   121 pairs of peptide-protein that consisted of 86 unique peptides and 3
   proteins (ESR1, GOPC, and BRD4). [126]Supplementary File 2 shows all of
   these overlap prediction results.

3.4. Venom peptide prediction results in dataset version 2

   Venom peptide prediction using dataset version 2 produced a slight
   difference result. Model 5 prediction resulted in 145 pairs of
   peptide-protein with all peptides predicted to have interaction with
   only 1 protein (CREBPP).

   Model 10 prediction yielded 121 pairs of peptide-protein with 108
   unique peptides predicted to have an interaction with only three
   proteins (ESR1, GOPC, and BRD4), with protein ESR1 having the most
   peptide interaction of 107 peptides, followed by GOPC (7) and BRD4 (7).

   Prediction using model 11 produced 1056 peptide-protein pairs with 140
   distinct peptides and 71 proteins. The top 5 protein with the most
   interaction with a peptide is ESR1 with 102 peptides, followed by AR
   (86), BRD4 (83), GOPC (73), and BCL6 (58). Peptide-protein pairs have
   no overlap among these three prediction model results. The overlap
   pairs were found in models 10 and 11 prediction results, with a total
   of 95 pairs that consisted of 86 unique peptides and 3 proteins (ESR1,
   GOPC, and BRD4). [127]Supplementary File 3 shows all of these overlap
   prediction results.

   A similarity was found in pairs of peptide-protein between the
   prediction of dataset versions 1 and 2, with 67 pairs which consisted
   of 64 unique peptides and 3 proteins (ESR1, GOPC, and BRD4).
   [128]Supplementary File 4 shows these overlap pairs.

3.5. Enrichment analysis results

   The identification results of protein overlap interactions with
   peptides included 83 peptides for ESR1, 6 for GOPC, and 6 for BRD4.
   Gene ontology and KEGG pathways are identified from ESR1, GOPC, and
   BRD4 proteins using the Enrichr database, as seen in [129]Fig. 6,
   [130]Fig. 7, [131]Fig. 8, [132]Fig. 9 and [133]Supplementary File 5.
   [134]Fig. 10 shows the topological network of
   protein-peptide-enrichment.

Fig. 6.

   [135]Fig. 6
   [136]Open in a new tab

   GO Biological Processes: a. Top 10 terms with the lowest p-value; b.
   network of protein-GO Biological Processes, darker colors indicate the
   terms with a p-value of <0.01(enrichment terms explanation can be seen
   in [137]Supplementary File 5). (For interpretation of the references to