Abstract
Influenza A virus (IAV) has the characteristics of high infectivity and
high pathogenicity, which makes IAV infection a serious public health
threat. Identifying protein-protein interactions (PPIs) between IAV and
human proteins is beneficial for understanding the mechanism of viral
infection and designing antiviral drugs. In this article, we developed
a sequence-based machine learning method for predicting PPI. First, we
applied a new negative sample construction method to establish a
high-quality IAV-human PPI dataset. Then we used conjoint triad (CT)
and Moran autocorrelation (Moran) to encode biologically relevant
features. The joint consideration utilizing the complementary
information between contiguous and discontinuous amino acids provides a
more comprehensive description of PPI information. After comparing
different machine learning models, the eXtreme Gradient Boosting
(XGBoost) model was determined as the final model for the prediction.
The model achieved an accuracy of 96.89%, precision of 98.79%, recall
of 94.85%, and F1-score of 96.78%. Finally, we successfully identified
3,269 potential target proteins. Gene ontology (GO) and pathway
analysis showed that these genes were highly associated with IAV
infection. The analysis of the PPI network further revealed that the
predicted proteins were classified as core proteins within the human
protein interaction network. This study may encourage the
identification of potential targets for the discovery of more effective
anti-influenza drugs. The source codes and datasets are available at
[34]https://github.com/HVPPIlab/IVA-Human-PPI/.
Keywords: Pathogen-host interaction (PHI), Protein-protein interaction
(PPI), Influenza A virus, XGBoost, Machine learning, GO and KEGG
Introduction
The influenza virus belongs to the genus Influenza within the family
Orthomyxoviridae and is a type of pathogen that causes influenza
([35]Ahmad et al., 2015). Due to its susceptibility to genetic
recombination leading to antigenic variation, it has caused several
pandemics and posed a serious threat to human public health ([36]Shaw,
2011). In addition to pandemic influenza outbreaks, seasonal influenza
epidemics occur annually in all regions of the world, resulting in
approximately three to five million severe cases and 250,000 to 500,000
deaths annually ([37]Huang et al., 2011). The influenza viruses are
classified into four types: A, B, C and D. Among them, type A (IAV) has
the greatest impact and the widest range of hosts ([38]Lee, Wang &
Park, 2018), being able to infect a wide range of mammals, including
humans. Meanwhile, IAV is responsible for seasonal flu and all previous
flu pandemics ([39]Nogales et al., 2021). Since a virus depends heavily
on the host cellular machinery for its replication, intermolecular
interactions between the virus and host must be established for their
propagation ([40]Ain et al., 2020). By gaining insight into virus-host
protein-protein interactions, scientists can identify important host
factors enriched in the infection pathway that may be used as potential
drug targets and develop antiviral drugs against these targets
([41]Marques et al., 2019). Hence, the investigation of the interaction
between proteins of the influenza A virus and human proteins holds
significant importance in the diagnosis of influenza infections and the
containment of influenza virus pandemics.
There are many traditional experimental methods for detecting PPIs,
with yeast two-hybrid (Y2H) and tandem affinity purification-mass
spectrometry (TAP-MS) being the two most prominent experimental
approaches ([42]Farooq et al., 2021). Although these traditional
experimental methods have helped people to detect many unknown protein
interactions, biological experiments can be affected by a number of
factors that can lead to false positives in the experimental results
([43]Deng et al., 2020), and they have the problems of high cost and
large consumption of manpower and material resources, so the available
and reliable virus-human PPI data are still insufficient ([44]Yang et
al., 2021). Therefore, validating traditional experimental methods
through computational approaches and supplementing PPI data with
prediction methods is imperative. The commonly employed computational
methods are protein structure-motif interactions-based methods (DMI)
([45]Evans et al., 2009), protein structure similarity-based methods
([46]Tiwari et al., 2021), and machine learning (ML)-based methods
([47]Zhang et al., 2017). Despite the gradual improvement of biological
databases, the coverage of structural information for proteins is still
incomplete. This makes the approach of using machine learning to study
PPIs only based on protein sequence information stand out, as protein
sequences are easily accessible. The human-virus PPI prediction task
can be treated as a binary classification problem on the positive and
negative samples. However, since there are no experimentally
established non-interacting protein pairs (negative samples), careful
selection of negative samples becomes crucial ([48]Murakami &
Mizuguchi, 2022). Another crucial challenge of ML-based methods is how
to encode variable-length protein sequences into fixed-length numerical
feature vectors that can be inputted into models to predict ([49]You,
Chan & Hu, 2015). Furthermore, the extraction of appropriate features
from protein sequences is of paramount importance for accurately
predicting PPIs ([50]You et al., 2013). An effective feature
representation should capture essential biochemical, structural, and
sequence-level information from protein sequences to ensure that the
model learns key interaction patterns. The more comprehensively the
representation characterizes protein interactions, the more insights
the model can extract, ultimately improving prediction accuracy.
Many researchers have conducted the construction of sequence-based
predictive PPI models. [51]Yang et al. (2020) applied doc2vec on the
dataset of known PPIs between all viruses and human to represent
protein sequences as low-dimensional but information-rich feature
vectors, which allowed it to capture more contextual information from
protein sequences, and obtained excellent prediction accuracy using
Random Forest (RF) as classifiers. [52]Liu et al. (2021) developed a
prediction model for PPIs between hepatitis C virus (HCV) and humans by
integrating features generated from the pseudo-amino acid composition.
[53]Dey & Mukhopadhyay (2019) used amino acid composition and conjoint
triad to extract features from protein sequences and used support
vector machine (SVM) model to predict PPIs between dengue and its human
hosts.
The proposal of these methods proves the feasibility of using a
sequence-based ML model for PPI prediction and shows the promise of
providing a practical reference for further in-depth research on
viruses in the future. However, current methods still have
deficiencies. First, few studies have explored how to construct
high-quality PPI datasets for training models, as there is no gold
standard for positive and negative samples. Positive samples are often
derived from high-throughput data, while negative samples can be
challenging. An excessive difference between positive and negative
samples can lead to model bias, whereas a lack of distinction between
them may degrade model performance. And there are almost no directly
accessible datasets of IAV-human PPIs, which hinders the further study
of influenza A through computational methods. Secondly, effective
feature representation to characterize protein sequence information is
another important issue in current methods. Many recent studies have
demonstrated that a single feature is insufficient to fully
characterize sequence information and have therefore combined
multiple features ([54]Chen et al., 2020;
[55]Yu et al., 2020; [56]Gao et al., 2022). However, most of these
studies have not detailed how the features were selected or the
biological significance of their combination. Finally, the
determination of classification algorithms and hyper-parameterization
also has a great impact on the overall method.
In this article, firstly, we proposed a new method of negative sample
construction to construct a high-quality IAV-human PPI dataset for
training prediction model. To better characterize PPI information,
protein sequence information was then extracted by integrating the
conjoint triad and Moran autocorrelation features. Furthermore,
five-fold cross-validation was employed to compare various machine
learning models, and the eXtreme Gradient Boosting (XGBoost) model, which
performed best, was selected to predict the interactions between IAV
and human proteins. Finally, based on this XGBoost model, a large
number of unknown influenza A virus-human protein interactions were
predicted. The GO and pathway analysis showed that these genes were
highly associated with influenza A virus infection process, indicating
the effectiveness of our method. We hope these predicted IAV-human PPIs
and the human target proteins will shed some light in the fields of the
research on infection mechanism and anti-virus drug development against
IAV. Portions of this text were previously published as part of a
preprint
([57]https://www.authorea.com/doi/full/10.22541/au.172115164.41006449/v1).
Materials and Methods
Here, we introduce a computational pipeline ([58]Fig. 1) that is based
on a protein sequence-based ML method, allowing us to predict IAV-human
PPIs. Firstly, in the data preparation phase, we collected IAV-human
PPIs from HPIDB and constructed a high-quality positive dataset through
a series of processing, and then applied our proposed degree and
dissimilarity-based negative sampling to construct a more realistic and
reliable negative dataset. As a result, a high-quality IAV-human PPI
dataset was generated, which will be used for validation and training.
We also prepared an independent dataset for predicting more IAV-human
PPIs. Subsequently, two sequence-based features: conjoint triads and
Moran autocorrelation, were used to extract PPI information from
protein sequences. Following that, the powerful XGBoost model was used
to predict IAV-human PPIs. Finally, a series of systems biology
analyses were carried out on the predicted results, which proved that
the human proteins in the predicted interactions were involved in the
process of influenza A virus infection, indicating the reliability of
our method. The overall workflow is shown in [59]Fig. 1.
Figure 1. Schematic framework of our research.
The specific steps are described as follows: Step 1: Data preparation.
A training dataset was gathered for model construction and training. An
independent dataset was constructed to predict potential PPIs. Step 2:
Feature extraction and feature fusion. Conjoint triad and Moran
autocorrelation descriptors were used to convert the protein sequences into
feature vectors and extract PPI information from the sequences. Step 3:
Model construction and training. Five-fold cross-validation was used to
train the XGBoost model with optimal parameters. Step 4: Prediction and
results analysis. The trained model was utilized to predict potential
PPIs and the systems biology analysis was performed on the predicted
results.
Dataset
Positive dataset
In this study, a total of 10,485 pairs of IAV and human PPIs were
acquired by searching the HPIDB3.0 (Host-Pathogen Interaction Database)
([62]Ammari et al., 2016). After applying several filtering criteria,
including removing duplicate values, limiting the host to humans,
restricting the pathogen to the IAV, setting a minimum protein sequence
length of 50 amino acids, and removing non-standard amino acids, a
total of 7,011 pairs of IAV and human PPIs were obtained as positive
samples for this study. The process is shown in [63]Fig. 2.
Figure 2. Flow chart of positive sample construction.
Negative dataset
Due to the difficulty in obtaining empirical evidence of
non-interaction between two proteins, there is no gold standard for
constructing negative samples so far. However, the quality of negative
samples affects the accuracy of PPI prediction. In response to this
problem, many scholars have proposed methods for constructing negative
samples. The mainstream methods for constructing negative samples are
random sampling, subcellular localization based sampling, and
dissimilarity-based sampling. Random sampling is the earliest
mainstream method ([66]Zhang et al., 2018). In this method, human
proteins and viral proteins from the real PPI are randomly paired
together, the original positive combinations are then removed from
these pairs, leaving just the remaining pairs as negative samples. This
method may take a significant number of positive samples as negative
samples. The approach of constructing negative samples based on
subcellular localization has been acknowledged and adopted by numerous
researchers since its introduction ([67]Jansen & Gerstein, 2004).
Subcellular localization data provides insight into the functional
position and biological role of proteins within a cell, making it
biologically significant for the creation of negative samples.
Nevertheless, this approach has its limitations. [68]Ben-Hur & Noble
(2006) experimentally demonstrated that although choosing negative
examples as pairs of proteins that are localized to different cellular
compartments creates high-quality negative examples, it also makes them
easier to distinguish from interacting proteins. Dissimilarity-based
sampling is proposed by [69]Eid, ElHefnawi & Heath (2015). They
hypothesized that viral proteins with high sequence similarity could
theoretically interact with a large number of similar host proteins.
That is, if there exist human proteins H1, H2 that interact with viral
proteins V1, V2 respectively, and if the sequence similarity between V1
and V2 is greater than 80%, then it can be assumed that H1 interacts
with V2, and H2 interacts with V1. Conversely, if the sequence
similarity between V1 and V2 is less than 20%, then H1-V2, H2-V1 can be
considered as non-interacting and can be used as negative samples. This
method greatly reduces false negatives and is more biologically
meaningful. This method is currently the most recognized method for
constructing negative samples and is still used by many researchers
today. In addition, [70]Dey, Chakraborty & Mukhopadhyay (2020) proposed
a degree-based negative sampling method, which is based on the
principle that human proteins with higher degrees are more likely to be
attacked by viral proteins. Here, “degree” refers to the number of
connections a protein has with other proteins in the human
protein-protein interaction network. The degrees of all human proteins
are calculated and sorted in ascending order, and proteins with low
degrees are selected as negative samples. In [71]Dey, Chakraborty &
Mukhopadhyay (2020) the dataset is single human proteins, and this
degree-based method of constructing negative samples has not been
applied to protein pairs. The prevalent methods for generating negative
PPI datasets continue to be subcellular localization-based sampling and
dissimilarity-based sampling.
Inspired by these previous works, here we propose a new method, namely,
degree and dissimilarity-based negative sampling. The candidate negative
pairs obtained by dissimilarity-based sampling are further filtered
by degree: only the pairs whose human proteins have the lowest degrees
are kept as the final negative samples, which makes the selection
more stringent than either criterion alone.
The following is how the degree and dissimilarity-based negative
sampling is carried out. First, global alignments of pathogen proteins
were carried out using the Needleman-Wunsch algorithm on an
all-versus-all basis. These alignments employed a linear gap penalty of
10 and utilized the BLOSUM62 scoring matrix ([72]Peris & Marzal, 2011).
The obtained comparison scores were then normalized between 0 and 1.
The sequence dissimilarity distance between viral proteins V1 and V2 is
obtained by subtracting the normalized scores from 1 and the
dissimilarity threshold was set at 0.8. For example, for a certain
viral protein, we found all other viral proteins that meet the
dissimilarity threshold, obtaining all viral proteins with less than
20% similarity to that viral protein. Then the human proteins that
originally interact with these viral proteins in the positive sample
were selected and randomly paired with the obtained set of viral
proteins. After that, we removed the protein pairs that already exist
in the positive sample. The remaining pairs were considered as
candidate negative samples. Once all candidate negative samples were
acquired, the degrees of the human proteins within those samples were
calculated. Finally, the protein pairs were sorted in ascending order
according to the degree of human proteins involved. An equal number of
protein pairs as in the positive samples were selected to determine the
negative sample. As a result, a total of 7,011 protein pairs were
designated as negative samples. [73]Figure 3 shows the workflow of
degree and dissimilarity-based negative sampling.
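The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the input layout, and the toy identifiers are hypothetical, the pairwise dissimilarities (1 minus the normalized Needleman-Wunsch score) are assumed to be precomputed, and the pairing rule follows the H1-V2/H2-V1 example given earlier rather than a random draw.

```python
def sample_negatives(positives, dissimilarity, human_degree, n_negatives,
                     threshold=0.8):
    """Sketch of degree and dissimilarity-based negative sampling.

    positives     : list of (virus_id, human_id) interacting pairs
    dissimilarity : dict mapping frozenset({v1, v2}) -> value in [0, 1]
                    (assumed precomputed from normalized alignment scores)
    human_degree  : dict mapping human_id -> degree in the human PPI network
    """
    positive_set = set(positives)

    # Human partners of each viral protein in the positive set.
    partners = {}
    for virus, human in positives:
        partners.setdefault(virus, set()).add(human)

    # Pair each viral protein with the human partners of viral proteins
    # that are sufficiently dissimilar to it (less than 20% similarity).
    candidates = set()
    for virus in partners:
        for other in partners:
            if other == virus:
                continue
            if dissimilarity[frozenset((virus, other))] > threshold:
                for human in partners[other]:
                    if (virus, human) not in positive_set:
                        candidates.add((virus, human))

    # Rank candidates by the degree of the human protein, lowest first,
    # and keep as many pairs as requested (ties broken by pair ID).
    ranked = sorted(candidates, key=lambda p: (human_degree.get(p[1], 0), p))
    return ranked[:n_negatives]


# Toy example with made-up identifiers.
toy_positives = [("V1", "H1"), ("V2", "H2"), ("V3", "H3")]
toy_dissimilarity = {
    frozenset(("V1", "V2")): 0.90,
    frozenset(("V1", "V3")): 0.10,
    frozenset(("V2", "V3")): 0.85,
}
toy_degree = {"H1": 10, "H2": 2, "H3": 5}
toy_negatives = sample_negatives(toy_positives, toy_dissimilarity,
                                 toy_degree, n_negatives=3)
```

In the toy example the pair (V2, H1) is dropped because H1 has the highest degree among the candidates, which mirrors the idea that highly connected human proteins are riskier choices for negative samples.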
Figure 3. The workflow of degree and dissimilarity-based negative sampling.
Independent dataset
An independent dataset was prepared to predict potential interactions
between human proteins and influenza A virus proteins and thereby identify
potential human target proteins. In this study, we downloaded the full set of IAV
and human protein sequences from the UniProt/Swiss-Prot ([76]Boutet et
al., 2007) database. After excluding non-standard amino acid sequences
and setting a minimum protein sequence length of 50 amino acids, which
aligns with the positive sample filtering criteria, we eventually
obtained a total of 1,384 IAV proteins and 20,304 human proteins. Every
IAV protein was paired with every human protein, V (1,384) × H
(20,304), generating a total of 28,100,736 pairs as full-set samples.
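The sequence filter and the size of the full pair set can be checked with a few lines of Python. This is only a sketch under stated assumptions: the helper name `passes_filter` is hypothetical, and "standard amino acids" is taken to mean the 20 canonical one-letter codes.

```python
# The 20 canonical amino acids (assumed meaning of "standard" in the text).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def passes_filter(sequence, min_length=50):
    """Keep sequences of at least min_length residues that contain
    only standard amino acids."""
    return len(sequence) >= min_length and set(sequence) <= STANDARD_AA


# Size of the exhaustive virus x human pairing reported in the text.
n_virus, n_human = 1384, 20304
n_pairs = n_virus * n_human
```

With the protein counts reported above, `n_pairs` equals 28,100,736, which matches the size of the full-set samples.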
Feature representation
Conjoint triad
Protein-protein interactions are mainly driven by hydrophobic effects,
with hydrogen bonding and electrostatic interactions playing a crucial
role ([77]Pontremoli et al., 2018). Electrostatic and hydrophobic
interactions are affected by the dipole and volume of the amino acid
side chain. The conjoint triad is a method proposed by [78]Shen et al.
(2007) to classify amino acids based on dipole and side chain volume in
order to extract useful information from protein sequences and convert
the sequences into feature vectors. It takes into account the
properties of an amino acid and its neighbors and regards any three
consecutive amino acids as a single unit. Therefore, it is possible to
distinguish triads based on the class of amino acids, and by
calculating the frequency of each triad type, the PPI information of
the protein sequence can be converted into a numerical vector. The
process of generating the descriptor vector is described as follows:
the 20 amino acids are categorized into seven categories depending on
dipole and side chain volumes. The seven categories are {AGV}, {DE},
{FILP}, {KR}, {MSTY}, {HNQW}, and {C}. Considering three consecutive
amino acids as one unit, we can get 7 * 7 * 7 = 343 triplet types, then
calculate the frequency of each triplet occurrence in the given protein
sequence, and finally obtain the 343-dimensional feature vector
according to the following formula.
$$d_i = \frac{f_i - \min\{f_1, f_2, \ldots, f_{343}\}}{\max\{f_1, f_2, \ldots, f_{343}\}}, \quad i = 1, 2, \ldots, 343$$
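The conjoint triad descriptor can be sketched in a few lines. The function and variable names below are illustrative, and the code assumes the input sequence contains only the 20 standard amino acids, as guaranteed by the dataset filtering; it is a sketch of the formula above rather than the authors' code.

```python
# Seven amino acid classes based on dipole and side chain volume.
CLASSES = ["AGV", "DE", "FILP", "KR", "MSTY", "HNQW", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(sequence):
    """Return the 343-dimensional normalized conjoint triad vector."""
    counts = [0] * 343
    # Slide a window of three consecutive residues along the sequence.
    for i in range(len(sequence) - 2):
        a, b, c = (AA_TO_CLASS[aa] for aa in sequence[i:i + 3])
        counts[a * 49 + b * 7 + c] += 1

    f_min, f_max = min(counts), max(counts)
    if f_max == 0:  # sequence shorter than three residues
        return [0.0] * 343
    # d_i = (f_i - min f) / max f
    return [(f - f_min) / f_max for f in counts]


ct_example = conjoint_triad("AGVAGV")
```

In this toy sequence every residue belongs to the first class, so all four triads fall into the same bin and only the first component of `ct_example` is non-zero.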
Moran autocorrelation
Protein-protein interactions can be categorized into four distinct
modes of interaction: electrostatic interactions, hydrophobic
interactions, spatial interactions, and hydrogen bond interactions
([79]Arenas et al., 2015). The conjoint triad approach considers the
local environment of residues, but it only considers the properties of
an amino acid and its two neighboring amino acids. However, information
between discontinuous amino acids is equally vital, as these discrete
amino acids may be brought spatially close by protein folding ([80]You et al.,
2014; [81]Wang et al., 2017). By considering this information, a more
comprehensive understanding of PPIs can be achieved ([82]Guo et al.,
2008), so we also use Moran autocorrelation to represent protein
sequences. The autocorrelation-based descriptors capture the spatial
distribution of local and global properties in proteins by calculating
correlations between the physicochemical properties of amino acids at
different locations in the protein ([83]Hosseinzadeh et al., 2012). The
autocorrelation descriptors also provide access to the physicochemical
information contained in the protein sequence, and here eight
physicochemical properties of amino acids ([84]Xiao et al., 2015) have
been chosen to reflect these interaction patterns as much as possible:
normalized average hydrophobicity scales (AccNo. CIDH920105; [85]Cid et
al., 1992), average flexibility indices (AccNo. BHAR880101;
[86]Bhaskaran & Ponnuswamy, 1988), polarizability parameter (AccNo.
CHAM820101; [87]Charton & Charton, 1982), free energy of solution in
water (AccNo. CHAM820102), residue accessible surface area in
tripeptide (AccNo. CHOC760101; [88]Chothia, 1976), residue volume
(AccNo. BIGC670101; [89]Bigelow, 1967), steric parameter (AccNo.
CHAM810101; [90]Charton, 1981), and relative mutability (AccNo.
DAYM780201; [91]Dayhoff, Schwartz & Orcutt, 1978). The Moran
autocorrelation descriptor is defined as follows:
$$I(d) = \frac{\dfrac{1}{L-d}\sum_{i=1}^{L-d}\left(P_i - \bar{P}'\right)\left(P_{i+d} - \bar{P}'\right)}{\dfrac{1}{L}\sum_{i=1}^{L}\left(P_i - \bar{P}'\right)^2}, \quad d = 1, 2, \ldots, 30$$
where $P_i$ and $P_{i+d}$ denote the physicochemical properties of the
$i$-th and $(i+d)$-th amino acids, and d denotes the interval between two amino acids.
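A minimal sketch of the Moran descriptor for a single property is given below. It assumes that L is the sequence length and that the mean term is the average of the property over the sequence, which is the usual reading of the formula. The Kyte-Doolittle hydropathy scale is used purely as a stand-in and is not one of the eight AAindex properties listed above; the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moran_autocorrelation(sequence, scale, max_lag=30):
    """Return [I(1), ..., I(max_lag)] for one physicochemical scale."""
    values = [scale[aa] for aa in sequence]
    length = len(values)
    mean = sum(values) / length
    deviations = [v - mean for v in values]
    variance = sum(x * x for x in deviations) / length

    descriptor = []
    for lag in range(1, max_lag + 1):
        if lag >= length or variance == 0:
            descriptor.append(0.0)  # undefined lag, reported as zero here
            continue
        covariance = sum(deviations[i] * deviations[i + lag]
                         for i in range(length - lag)) / (length - lag)
        descriptor.append(covariance / variance)
    return descriptor


moran_example = moran_autocorrelation("AR" * 30, KYTE_DOOLITTLE)
```

For the strictly alternating toy sequence, odd lags give values close to -1 and even lags values close to +1, which is the expected behaviour of an autocorrelation measure. With 30 lags and eight properties, each protein would be described by 240 Moran values.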
Feature selection
Due to the complexity of protein structures and functions, feature
extraction for proteins is more intricate than for DNA and RNA
sequences ([92]Tsubaki, Tomii & Sese, 2018). Protein sequence-based
features were categorized into five groups: amino acid composition
descriptors, such as amino acid composition (AAC) and conjoint triad
(CT); autocorrelation descriptors, such as Geary autocorrelation
(Geary) and Moran autocorrelation (Moran); pseudo amino acid
composition descriptors, such as pseudo amino acid composition (PseAAC)
and amphiphilic pseudo amino acid composition (APseAAC);
quasi-sequence-order descriptors, such as quasi-sequence-order (QSO)
and sequence-order-coupling number (SOCN); and CTD descriptors,
including composition (CTDC), transition (CTDT), and distribution
(CTDD) ([93]Liu, 2017). We evaluated features from these categories on
the constructed dataset using a wrapper-based feature selection method,
identifying the best-performing features within each group.
Subsequently, feature combination and ablation experiments were
conducted to determine the optimal feature set.
Optimization and training of the XGBoost model
XGBoost ([94]Chen & Guestrin, 2016) is an ensemble learning algorithm
based on gradient boosting trees, whose core idea is to build a
powerful prediction model by combining multiple decision trees. The
prediction accuracy of the model is improved by iteratively training
multiple decision trees. In each iteration, XGBoost calculates the
gradient and second-order derivative of the loss function of the
current model, and then constructs a new decision tree to fit the
current negative gradient residuals. The step-by-step approach employed
by XGBoost enables it to effectively adapt to complex binary
classification problems, making it highly suitable for such scenarios
([95]Wu, Li & Ma, 2021). The objective function of XGBoost can be
expressed as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{i=1}^{t} \Omega(f_i)$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value,
$\sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right)$ is the loss function and
$\Omega(f_i)$ is the regularization term.
To optimize the model’s performance and enhance its generalization
ability, an iterative process was employed to test different parameter
combinations. Specifically, the learning_rate parameter was tested
within the range of 0.01 to 0.2 in steps of 0.01, the max_depth
parameter within the range of 3 to 10 in steps of 1, and the
n_estimators parameter within the range of 50 to 500 in steps of 10.
For the optimization, 80% of the dataset obtained from the previous
steps was used for training while the remaining 20% was for model
validation. By employing a five-fold cross-validation approach with
grid search, the final optimal parameters determined were:
learning_rate = 0.1, max_depth = 7, and n_estimators = 310. These
parameter values have been identified as the most suitable for
achieving optimal performance in the classification task.
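To give a sense of the size of this search, the grid described above can be enumerated with the standard library. This is a sketch of the arithmetic only: it assumes an exhaustive search over the full Cartesian product of the three ranges, which the text does not state explicitly, and the variable names are illustrative.

```python
from itertools import product

# Parameter ranges as described in the text.
learning_rates = [round(0.01 * k, 2) for k in range(1, 21)]  # 0.01 ... 0.20
max_depths = list(range(3, 11))                               # 3 ... 10
n_estimators = list(range(50, 501, 10))                       # 50 ... 500

# Every combination of the three parameters.
grid = list(product(learning_rates, max_depths, n_estimators))

# With five-fold cross-validation, each combination is fitted five times.
n_combinations = len(grid)
n_fits = n_combinations * 5
```

Under that assumption the grid holds 20 × 8 × 46 = 7,360 combinations, or 36,800 model fits with five folds, and the reported optimum (0.1, 7, 310) is one of its points.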
To evaluate the performance of XGBoost, we compared it with other
machine learning models. Some widely-used PPI classifiers such as
Random Forest (RF) ([96]Chen & Liu, 2005; [97]Wang et al., 2018),
support vector machine (SVM) ([98]Chatterjee et al., 2011; [99]Cui,
Fang & Han, 2012; [100]You et al., 2015), logistic regression (LR)
([101]Qi, Bar-Joseph & Klein-Seetharaman, 2006), K nearest neighbors
(KNN) ([102]Guarracino & Nebbia, 2010) and other tree models such as
extremely randomized trees (ET) ([103]Yu et al., 2021; [104]Peng et
al., 2018) and adaptive boosting (AdaBoost) ([105]Mei & Zhu, 2014) were
compared. These model parameters and optimization are detailed in the
[106]Supplemental File.
Specifically, the training dataset in our study maintained a balanced
ratio of positive and negative samples at 1:1. We extracted protein
sequence information using two features, namely CT+Moran. The XGBoost
model was trained using a five-fold cross-validation technique, with
optimal parameters configured for the task. Subsequently, the final
trained model was saved.
Prediction of IAV-human protein-protein interactions
In this study, we not only evaluated the performance of our proposed
method through various tests but also applied it to prediction in
practice. We prepared an independent dataset to predict more PPIs. The
constructed independent dataset encompasses all IAV proteins and human
proteins. It consists of a total of 28,100,736 protein pairs obtained
by pairing every IAV protein with every human protein.
Similarly, these 28,100,736 protein pairs were characterized using
CT+Moran features. The model trained in the previous section was used
to make predictions on this independent dataset. The final predicted
results of the model were obtained in the form of the corresponding
protein pair ID along with its probability of being a positive PPI. Due
to the vast amount of data in the prediction, it is necessary to
establish probability thresholds to determine potential PPIs based on
the overall results, ensuring rigorous analysis. For instance,
[107]Dey, Chakraborty & Mukhopadhyay (2020) weighed the threshold
carefully when predicting potential human protein targets of
SARS-CoV-2 and finally set it to 0.7 in order to get a reasonable
number of predictions. To obtain potential human protein targets of
the IAV with high confidence, we further screened the human proteins
involved in the predicted PPIs and selected those with a degree
greater than 0.5 × the virus number (that is, human proteins
predicted to be targeted by more than half of the IAV proteins). The
corresponding high-confidence target genes were then verified using
gene ontology and KEGG pathway enrichment analysis, as well as
network topology analysis.
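The screening step can be sketched as follows. The function name, the input layout, and the toy data are hypothetical, and the probability cut-off is left as a parameter because the value used in this study is not stated here (the 0.7 in the toy call only echoes the example from Dey, Chakraborty & Mukhopadhyay).

```python
from collections import defaultdict


def high_confidence_targets(predictions, n_virus, prob_threshold):
    """Select human proteins predicted to interact with more than half
    of the viral proteins.

    predictions    : iterable of (virus_id, human_id, probability)
    n_virus        : total number of viral proteins considered
    prob_threshold : probability above which a pair counts as a predicted PPI
    """
    viral_partners = defaultdict(set)
    for virus, human, probability in predictions:
        if probability >= prob_threshold:
            viral_partners[human].add(virus)

    # Degree of a human protein = number of distinct predicted viral partners.
    return sorted(human for human, viruses in viral_partners.items()
                  if len(viruses) > 0.5 * n_virus)


toy_predictions = [
    ("V1", "H1", 0.95), ("V2", "H1", 0.91), ("V3", "H1", 0.88),
    ("V1", "H2", 0.93), ("V2", "H2", 0.97), ("V3", "H2", 0.40),
    ("V4", "H2", 0.10),
]
toy_targets = high_confidence_targets(toy_predictions, n_virus=4,
                                      prob_threshold=0.7)
```

Here H1 is kept because it has three predicted viral partners out of four, whereas H2 has exactly two and therefore does not exceed half of the viral proteins.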
Performance measures
In order to construct more effective models with more significant
features, a five-fold cross-validation approach was utilized for
training. Five-fold cross-validation involves randomly dividing the
samples into five subsets. Four of these subsets are used as the
training set, while the remaining subset serves as the test set. This
process is repeated five times, with each time employing a different
subset as the test set. Then the fully trained model is validated on an
independent dataset to obtain more robust results. Various
measurements, including accuracy, precision, specificity, F1-score,
recall, and the Matthews correlation coefficient (MCC), were calculated
using balanced datasets with equal numbers of positive and negative
samples. These measures are formulated as follows:
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
$$Specificity = \frac{TN}{FP + TN}$$
$$F1\_score = \frac{2 \times (Recall \times Precision)}{Recall + Precision}$$
$$Recall = \frac{TP}{TP + FN}$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
where TP, FP, TN, and FN are true positive, false positive, true
negative, and false negative, respectively.
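These measures follow directly from the four confusion-matrix counts, as the short sketch below shows. The function name and the toy counts are illustrative, and the code assumes that every denominator other than the one for MCC is non-zero.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (fp + tn)
    f1_score = 2 * recall * precision / (recall + precision)

    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1_score": f1_score,
        "mcc": mcc,
    }


toy_metrics = classification_metrics(tp=3, fp=1, tn=3, fn=1)
```

For the toy counts all five ratio-based measures equal 0.75 while the MCC is 0.5, which illustrates that MCC penalizes errors more strongly than accuracy on a balanced dataset.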
Systems biology analysis
Gene ontology and pathway enrichment analysis
The functional enrichment analysis of genes by gene ontology (GO) and
pathway analysis were performed on target genes. The GO and KEGG (Kyoto
Encyclopedia of Genes and Genomes) pathway enrichment analysis was
performed via the DAVID (Database for Annotation, Visualization and
Integrated Discovery) online website ([108]Sherman et al., 2022), and
the top 20 items were selected for visualization via the ggplot2
package in R ([109]Ito & Murphy, 2013). Significant GO and KEGG
pathways were selected using a statistical threshold criterion of
adjusted P-value < 0.05.
Identification of hub genes and PPI network analysis
The target genes obtained through our model prediction were mapped on
the STRING database ([110]Szklarczyk et al., 2022), and the PPI network was
constructed using a confidence threshold of >0.90. Network
analysis was performed using Cytoscape ([111]Smoot et al., 2010) to
filter out the genes with degree ≥8, which were identified as potential
hub genes ([112]Tian et al., 2019). To better identify hub genes, the
top 20 hub genes were screened from this network based on connectivity
by the CytoHubba plugin for Cytoscape. These hub genes are
considered high-confidence biomarkers of the human pathways involved in
influenza A virus infection.
Results and discussion
Performance of different negative sampling methods
In this study, we proposed a novel and rigorous negative sampling
method that produces a more reliable negative interaction dataset
compared to previous approaches. To validate the effectiveness of our
method, we compared the datasets generated by our proposed method with
those obtained from alternative negative sampling techniques using
various models. The alternative methods considered include random
sampling, dissimilarity-based sampling, degree-based sampling, and
subcellular localization-based sampling. To prevent statistical
discrepancies, we employed a balanced dataset with a 1:1 ratio of
positive to negative samples. Each dataset was divided into two parts:
80% was used for training with five-fold cross-validation, and the
remaining 20% was held out as an independent test set on which
performance was compared. The comparison is reported in terms of
accuracy in [113]Table 1. It
demonstrates that our negative sampling approach not only improves
prediction performance but also exhibits biological relevance.
Specifically, this method is based on two key principles: firstly,
viral proteins with low sequence similarity are less likely to
interact with the same host proteins, and secondly, host proteins with
lower degrees of connectivity have a lower probability of interacting
with viral proteins. Compared to other individual methods, our approach
avoids the bias commonly observed with subcellular localization
methods, while also outperforming several widely used approaches in
terms of performance.
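The evaluation protocol described above (an 80/20 split, with five-fold
cross-validation on the 80% portion) can be sketched as follows. This
is an illustrative index-level sketch using only the standard library;
the function names, the fixed seed and the absence of class
stratification are assumptions, not details taken from the study.

```python
import random

def split_and_fold(n_samples, test_fraction=0.2, n_folds=5, seed=0):
    """Shuffle sample indices, hold out a test set, and cut the rest into CV folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Fold k takes every n_folds-th training index, so fold sizes differ by at most one.
    folds = [train_idx[k::n_folds] for k in range(n_folds)]
    return train_idx, test_idx, folds

def cv_rounds(folds):
    """Yield (fit_indices, validation_indices) for each cross-validation round."""
    for k, val_idx in enumerate(folds):
        fit_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield fit_idx, val_idx

train_idx, test_idx, folds = split_and_fold(1000)
print(len(train_idx), len(test_idx), [len(f) for f in folds])
# prints: 800 200 [160, 160, 160, 160, 160]
```

In each round a model would be fitted on `fit_idx` and scored on
`val_idx`; the held-out `test_idx` is used only once, for the final
comparison.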
Table 1. Accuracy (%) of different machine learning algorithms on the
1:1 positive:negative training dataset, with negative samples prepared
by random sampling, dissimilarity-based sampling, degree-based
sampling, and combined degree- and dissimilarity-based sampling.
Model          Random    Dissimilarity-based  Degree-based  Degree and
               sampling  sampling             sampling      dissimilarity-based
XGBoost        80.80     75.83                95.72         96.84
Random forest  81.21     76.11                92.58         94.63
LightGBM       81.45     76.18                93.66         95.29
ExtraTrees     81.50     76.29                91.50         93.74
AdaBoost       81.09     75.46                89.89         92.11
SVM            81.11     75.45                89.83         91.92
LR             80.98     75.02                89.21         91.15
Performance of feature representation
We compared features from five different categories on our
constructed dataset. The results are shown in [115]Fig. 4.
Among all the categories, it is evident that amino acid composition and
autocorrelation provide the most accurate description of our dataset.
The conjoint triad feature within the amino acid composition category
demonstrates superior performance, while Moran autocorrelation stands
out as the optimal feature within the autocorrelation category.
Previous studies have consistently shown that the fusion of multiple
features offers a more comprehensive understanding of the information
associated with protein interactions ([116]Chen et al., 2020; [117]Yu
et al., 2020; [118]Chen et al., 2019; [119]Gao et al., 2022). The best
feature in each category was compared with the combined CT+Moran
feature, and the results are shown in [120]Fig. 5. CT+Moran
outperforms the best single feature from every category.
Figure 4. ACC, MCC, and F1-score of different features from different
categories on independent dataset.
Figure 5. The comparison of accuracy, recall, F1-score and MCC of
best-performing features in each category with CT+Moran.
In this study, we used two types of features to convert protein
sequences into numerical vectors: the conjoint triad and Moran
autocorrelation. Integrating these two features yields a more
comprehensive representation of protein interaction information, as
demonstrated in previous studies ([125]Chen et al., 2020; [126]Yu et
al., 2020; [127]Chen et al., 2019; [128]Gao et al., 2022). To further
validate the benefit of combining them, we conducted an ablation
experiment on the fused feature; the results are presented in
[129]Table 2. With both features fused, the accuracy is 91.37%; it
drops by 2.75 percentage points when CT is removed and by 1.07 points
when Moran is removed. Precision falls by 0.78 points without CT and
by 0.30 points without Moran. Recall falls by 4.49 points without CT
and by 1.40 points without Moran. The F1-score falls by 2.72 points
without CT and by 0.87 points without Moran. MCC decreases by 0.0533
without CT and by 0.0210 without Moran. Removing CT is therefore more
costly than removing Moran on every metric, but each feature adds
information that the other lacks. This suggests that the feature
combination we selected effectively represents both the contiguous and
discontinuous information of amino acids, and shows the most potential
among all the features we tested for predicting interactions between
influenza virus and human host proteins.
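The drops quoted above follow directly from Table 2: removing CT leaves
the Moran-only column, and removing Moran leaves the CT-only column.
The short sketch below recomputes them from the table values; the
variable and function names are illustrative.

```python
# Values copied from Table 2 (percentages, except MCC).
TABLE2 = {
    "Accuracy":  {"CT+Moran": 91.37, "CT": 90.30, "Moran": 88.62},
    "Precision": {"CT+Moran": 92.15, "CT": 91.85, "Moran": 91.37},
    "Recall":    {"CT+Moran": 90.15, "CT": 88.75, "Moran": 85.66},
    "F1-score":  {"CT+Moran": 91.14, "CT": 90.27, "Moran": 88.42},
    "MCC":       {"CT+Moran": 0.8275, "CT": 0.8065, "Moran": 0.7742},
}

def ablation_drops(table):
    """Drop in each metric when one feature is removed from the fused CT+Moran set."""
    drops = {}
    for metric, row in table.items():
        full = row["CT+Moran"]
        drops[metric] = {
            "without CT": round(full - row["Moran"], 4),     # only Moran remains
            "without Moran": round(full - row["CT"], 4),     # only CT remains
        }
    return drops

drops = ablation_drops(TABLE2)
for metric, d in drops.items():
    print(f"{metric:10s} without CT: {d['without CT']:.4f}   "
          f"without Moran: {d['without Moran']:.4f}")
```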
Table 2. Results of the ablation study of the CT+Moran fusion feature.
Metric         CT+Moran  CT      Moran
Accuracy (%)   91.37     90.30   88.62
Precision (%)  92.15     91.85   91.37
Recall (%)     90.15     88.75   85.66
F1-score (%)   91.14     90.27   88.42
MCC            0.8275    0.8065  0.7742
Determination of the ML model
To build an effective PPI prediction model, the choice of classifier
is crucial. In our study, we concatenated the protein pair vectors
encoded by CT and Moran, resulting in a 1,166-dimensional
representation of PPIs, and selected eXtreme Gradient Boosting
(XGBoost) as the final classifier. To evaluate the performance of
XGBoost, we compared it with other machine learning models on the
1,166-dimensional dataset: widely used PPI classifiers such as RF
([131]Chen & Liu, 2005; [132]Wang et al., 2018), SVM ([133]Chatterjee
et al., 2011; [134]Cui, Fang & Han, 2012; [135]You et al., 2015), LR
([136]Qi, Bar-Joseph & Klein-Seetharaman, 2006) and KNN
([137]Guarracino & Nebbia, 2010), as well as other tree models such as
ET ([138]Yu et al., 2021; [139]Peng et al., 2018) and AdaBoost
([140]Mei & Zhu, 2014). The radial basis function kernel was used in
SVM. L2 regularization with a regularization factor of 6 was used in
LR. The number of neighbors in KNN was set to 2. The n_estimators of
AdaBoost, RF and XGBoost were all set to 310. [141]Figure 6 shows the
accuracy of the different models for each fold of the five-fold
cross-validation. SVM and XGBoost perform best, with XGBoost slightly
ahead of SVM. To examine this further, we compared the average
accuracy, precision, recall, F1-score, and MCC of the different models
in the five-fold cross-validation ([143]Fig. 7). XGBoost outperforms
SVM on every metric, making it the top-performing model among all
those evaluated.
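To make the pair representation concrete, the sketch below encodes one
protein with the conjoint triad and concatenates two protein vectors
into a single PPI sample. It assumes the commonly used seven-class
amino acid grouping and a simple min-max scaling; both are illustrative
choices that may differ from the exact encoding used in this study, and
the Moran part is omitted. If each protein contributes 343 CT values,
the 1,166-dimensional pair vector would imply 240 Moran values per
protein (1,166 = 2 × (343 + 240)); that split is inferred from the
totals and is not stated in this section.

```python
from itertools import product

# Seven amino acid classes commonly used for the conjoint triad (assumed grouping).
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: idx for idx, group in enumerate(CT_CLASSES) for aa in group}
TRIADS = list(product(range(7), repeat=3))  # 7 * 7 * 7 = 343 class triads
TRIAD_INDEX = {t: i for i, t in enumerate(TRIADS)}

def conjoint_triad(sequence):
    """Return the 343 triad frequencies of a protein sequence, min-max scaled to [0, 1]."""
    counts = [0] * len(TRIADS)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    for window in zip(classes, classes[1:], classes[2:]):
        if None in window:
            continue  # skip windows containing non-standard residues
        counts[TRIAD_INDEX[window]] += 1
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0] * len(counts)
    return [(c - lo) / (hi - lo) for c in counts]

def pair_vector(virus_features, human_features):
    """Concatenate the two proteins' feature vectors into one PPI sample."""
    return list(virus_features) + list(human_features)

# Hypothetical toy sequences, not real proteins.
virus_ct = conjoint_triad("AGVAGV")
human_ct = conjoint_triad("MKRDECILFP")
sample = pair_vector(virus_ct, human_ct)
print(len(virus_ct), len(sample))  # prints: 343 686
```

In a full pipeline the Moran descriptors of each protein would be
appended to its CT vector before the two proteins are concatenated.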
Figure 6. Accuracy of each fold in the five-fold cross-validation of
different models.
Figure 7. Accuracy, Precision, Recall, F1-score and MCC for different models
on independent dataset.
Comparison with other methods
Applying our method to other datasets
To provide a more objective evaluation of the predictive performance of
our constructed model, we conducted a series of comparative
experiments. First, we obtained the datasets used by existing PPI
prediction methods in the literature. We then applied our method to
these datasets, using the CT+Moran features to encode protein
sequences and the XGBoost model for prediction. The results obtained
with our method were compared with those reported in the original
articles.
The two datasets published by [148]Zhou et al. (2018) have been widely
adopted as benchmarks for evaluating the performance of
state-of-the-art models in viral-human PPI prediction tasks. We refer
to these datasets as Zhou’s H1N1 and Zhou’s Ebola, with each dataset
named after the virus represented in the respective test set.
In Zhou’s H1N1 dataset, the training set contains 10,955 true PPIs
between humans and any virus other than H1N1, along with an equal
number (10,955) of negative interaction samples. The test set consists
of 381 real PPIs between humans and the H1N1 virus and 381 negative
interactions. As shown in [149]Table 3, our method achieves a
sensitivity of 91.60%, specificity of 70.07%, accuracy of 80.83%, and
MCC of 0.631 on Zhou’s H1N1, whereas the original method achieves
66.39%, 65.98%, 66.19%, and 0.324, respectively. Our method therefore
outperforms the original approach, particularly in terms of
sensitivity and MCC.
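The metrics used in these comparisons all derive from the four
confusion-matrix counts. The sketch below applies the standard
definitions; the counts in the example are hypothetical values
back-calculated from the 381 positive and 381 negative H1N1 test pairs
and the reported sensitivity and specificity, not figures published by
the authors. With these counts the results agree with the values
reported for our method on Zhou’s H1N1 to within rounding of the last
digit.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),            # recall on true interactions
        "specificity": tn / (tn + fp),            # recall on non-interactions
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Assumed counts: 349 of 381 positives and 267 of 381 negatives classified correctly.
m = binary_metrics(tp=349, fn=32, tn=267, fp=114)
for name, value in m.items():
    print(f"{name:12s} {value:.4f}")
```

Because the test sets here are balanced, accuracy is simply the mean of
sensitivity and specificity, which is a quick consistency check on any
reported row.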
Table 3. Comparison of our method and Zhou’s method on Zhou’s H1N1 dataset.
Method  Sensitivity (%)  Specificity (%)  Accuracy (%)  F1-score (%)  MCC
Zhou’s  66.39            65.98            66.19         –             0.324
Ours    91.60            70.07            80.83         82.70         0.631
In Zhou’s Ebola dataset, the training set contains 11,341 true PPIs
between humans and any virus other than Ebola and an equal number
(11,341) of negative interaction samples. The test set contains 150
true PPIs between humans and Ebola viruses and 150 negative
interactions. [152]Table 4 displays the performance metrics of our
method on Zhou’s Ebola dataset, including sensitivity, specificity,
accuracy, F1-score, and MCC. Our method achieves a sensitivity of
96.66%, specificity of 66.00%, accuracy of 81.33%, F1-score of 83.81%,
and MCC of 0.658. In comparison, the original method outlined by
[153]Zhou et al. (2018) reports a sensitivity of 90.67%, specificity of
65.33%, accuracy of 78.00%, and MCC of 0.579. Our method thus exceeds
the original method on all four metrics reported for both.
Table 4. Comparison of our method and Zhou’s method on Zhou’s Ebola dataset.
Method  Sensitivity (%)  Specificity (%)  Accuracy (%)  F1-score (%)  MCC
Zhou’s  90.67            65.33            78.00         –             0.579
Ours    96.66            66.00            81.33         83.81         0.658
We also assessed the performance of our method on the dataset developed
by [155]Prasasty et al. (2021), referred to as Prasasty’s bacteria.
Prasasty’s bacteria covers three human pathogens: Bacillus anthracis,
Yersinia pestis, and Francisella tularensis. The original authors used
two of the bacteria as the training set and the remaining one as the
test set. In our evaluation, we used Bacillus anthracis and Yersinia
pestis for training and Francisella tularensis for testing. The
training set therefore contains 6,354 positive interactions between
humans and bacteria, and the corresponding negative interactions. The
test set contains 1,187 positive interactions between humans and
Francisella tularensis, as well as 1,187 negative interactions. The
outcomes are shown in [156]Table 5. The method of [157]Prasasty et al.
(2021) achieved a sensitivity of 74.56%, specificity of 97.83%,
accuracy of 95.84%, and precision of 76.36% on this dataset. On the
same dataset, our method achieves a sensitivity of 93.11%, specificity
of 95.28%, accuracy of 93.25%, and precision of 95.07%. Although our
model exhibits slightly lower specificity and accuracy, it displays
substantially higher sensitivity and precision, so we consider it
superior overall.
Table 5. Comparison of our method and Prasasty’s method on Prasasty’s
bacteria dataset.
Method      Sensitivity (%)  Specificity (%)  Accuracy (%)  F1-score (%)  MCC    Precision (%)
Prasasty’s  74.56            97.83            95.84         –             –      76.36
Ours        93.11            95.28            93.25         93.38         0.870  95.07
Additionally, it is worth mentioning that none of the three external
datasets is based on a single pathogen. In these datasets, the training
and test sets involve different viruses or bacteria, leading to low
sequence similarity between their proteins. In contrast, our dataset
consists entirely of influenza A virus-human PPI data, which might
suggest potential dataset bias. Nevertheless, our method generalized
well, matching or outperforming the original methods on most metrics
across these diverse datasets. This robustness highlights its
potential as a foundational model for PPI prediction across viruses.
Comparison with other methods on our dataset
To evaluate our method more comprehensively, we also applied a
previously published method to our dataset and compared its results
with our own. Specifically, we used the method presented in Denovo
([159]Eid, ElHefnawi & Heath, 2015), which uses CT for protein feature
extraction and SVM for prediction. The comparison is summarized in
[160]Table 6. Our method outperforms Denovo on every metric: it
achieves an accuracy of 96.89%, precision of 98.79%, recall of 94.85%,
F1-score of 96.78%, and MCC of 0.9386, which are 6.41, 7.29, 6.39, and
6.83 percentage points and 0.1291 higher than Denovo, respectively.
Table 6. Comparison of our method and Denovo on our training dataset.
Method  Accuracy (%)  Precision (%)  Recall (%)  F1-score (%)  MCC
Denovo  90.48         91.50          88.46       89.95         0.8095
Ours    96.89         98.79          94.85       96.78         0.9386
Gene ontology and pathway enrichment analysis
In this study, we applied a threshold of 0.95 to the model’s
prediction score to call a predicted interaction. To better assess the
reliability of the predicted influenza virus-specific human protein
targets, we also required that a human protein be targeted by more
than half of the total number of viruses; this cutoff of 0.5 is
commonly used in statistical analysis. As a result, we identified a
total of 32,855 interactions involving 3,269 potential human target
proteins, which correspond to 2,995 high-confidence target genes. The
predicted results are provided in the [162]Supplemental Files. To
further understand and validate the high-confidence target genes
obtained from the model, GO and KEGG pathway analyses were performed.
The corresponding results are shown in [163]Fig. 8. The enriched
biological processes are mainly replication, transcription and
translation (Fig. 8A), which are closely associated with viral
processes. Viral processes are the biological processes in which
viruses are involved, including infection of host cells, replication
of the viral genome and assembly of progeny viral particles ([164]Huang
et al., 2019). IAV enters host cells mainly through transmembrane
transport, such as endocytosis and vesicular transport. The influenza
virus genome undergoes transcription and replication in the host cell
nucleus, followed by exit from the nucleus and transit, and final
assembly and release ([165]Nuwarda, Alharbi & Kayser, 2021). The
enriched cellular components are mainly the nucleus, cytoplasm and
extracellular vesicles (Fig. 8B), which closely match the compartments
involved in the influenza A life cycle in the infected host. The
enriched molecular functions are mainly binding-related, especially
protein binding (Fig. 8C), which further supports the plausibility of
these high-confidence target genes.
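The two filters described at the start of this subsection (a score
threshold of 0.95, then keeping human proteins targeted by more than
half of the viruses) can be sketched as below. The input layout, the
function name, the use of ≥ for the score cutoff and the toy
identifiers are assumptions made for illustration.

```python
from collections import defaultdict

def high_confidence_targets(predictions, n_viruses, score_cutoff=0.95, virus_fraction=0.5):
    """Filter (virus_id, human_protein, score) predictions in two steps.

    Returns the interactions passing the score cutoff and the set of human
    proteins targeted by more than virus_fraction of all viruses.
    """
    kept = [(v, h, s) for v, h, s in predictions if s >= score_cutoff]
    viruses_per_protein = defaultdict(set)
    for v, h, _ in kept:
        viruses_per_protein[h].add(v)
    targets = {h for h, vs in viruses_per_protein.items()
               if len(vs) > virus_fraction * n_viruses}
    return kept, targets

# Toy predictions for four hypothetical viruses V1..V4.
toy = [
    ("V1", "P1", 0.99), ("V2", "P1", 0.98), ("V3", "P1", 0.96),  # 3 of 4 viruses
    ("V1", "P2", 0.97), ("V2", "P2", 0.97),                      # exactly half
    ("V1", "P3", 0.90), ("V2", "P3", 0.90), ("V3", "P3", 0.90), ("V4", "P3", 0.90),
]
kept, targets = high_confidence_targets(toy, n_viruses=4)
print(len(kept), targets)  # prints: 5 {'P1'}
```

P2 is dropped because it is targeted by exactly half of the viruses,
not more than half, and P3 is dropped because none of its scores
reaches the cutoff.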
Figure 8. (A–D) Results of GO and KEGG pathway analysis.
The KEGG pathway is significantly enriched in infection-related
pathways (viral carcinogenesis, prion diseases, coronavirus
disease-COVID-19), neurological disease-related pathways (amyotrophic
lateral sclerosis, Alzheimer disease, Huntington disease and Parkinson
disease), immune-related pathways (systemic lupus erythematosus,
neutrophil extracellular trap formation and necroptosis), alcoholism
and signaling-related pathways (D). In terms of viral infection-related
pathways, IAV and SARS-CoV-2 share similar infection pathways and
associated pathogenesis ([168]Ilyicheva & Gureyev, 2021). Meanwhile,
recent studies have suggested a possible association between IAV and a
variety of neurological disorders ([169]Levine et al., 2023),
particularly between influenza and Parkinson’s disease ([170]Hoffman &
Vilensky, 2017). In addition to this, IAV infection is often
accompanied by an immune and inflammatory response ([171]Wan et al.,
2022), in which IAV can cause apoptosis of epithelial cells in the
upper respiratory tract, triggering immunity in the respiratory mucosa
and causing an inflammatory response in vivo ([172]Zhu et al., 2022).
The above GO and pathway analyses suggest that these genes are highly
associated with IAV infection.
PPI network analysis with identified hub genes
By uploading the obtained target genes to the STRING database, 427
potential hub genes (PPIhub) were obtained based on degree ≥8. To
narrow these down, a PPI network of the top 20 most highly connected
genes was constructed with Cytoscape’s CytoHubba plugin, and the 20
pivotal genes were ranked by degree ([173]Table 7). All genes except
UBA52 appeared in the positive sample dataset. However, previous
studies have shown that knockdown of UBA52 in chicken cells reduced
progeny viral titers, confirming the important function of UBA52 in
H5N1 influenza A virus infection ([174]Ghobadi et al., 2019). The
other 19 genes all encode ribosomal proteins. RPS11 and RPS8 have been
shown to affect influenza A infection ([175]Murray, Sheng & Rubin,
2014), and RPS3 is thought to be involved in viral transcription and
translation ([176]Cui et al., 2020). Most of these proteins are thought
to be ribosomal proteins involved in influenza virus RNA transcription
and viral mRNA translation, and they also have high connectivity in
the human protein interaction network ([177]Hegde et al., 2012). This
suggests that our model may be biased towards learning sequence
features of human host proteins that affect transcription and
translation of influenza A virus, which may be related to the dataset
we constructed, particularly the original positive sample dataset.
Furthermore, it is also possible that the degree-based filtering we
performed during negative sample construction biased the majority of
our positive predictions towards the core proteins of the human
protein interaction network.
Table 7. Ranking of the importance of the 20 pivotal genes.
Rank  Gene symbol  UniProt      Degree
1     RPS27A       [178]P62979  140
2     RPS11        [179]P62280  134
3     RPS5         [180]P46782  128
4     UBA52        [181]P62987  126
5     RPS18        [182]P62269  124
6     RPS6         [183]P62753  122
7     RPS8         [184]P62241  121
8     RPS23        [185]P62266  120
8     RPS9         [186]P46781  120
8     RPS7         [187]P62081  120
8     RPS16        [188]P62249  120
8     RPS3A        [189]P61247  120
13    RPS14        [190]P62263  119
13    RPS15A       [191]P62244  119
15    RPS13        [192]P62277  118
15    RPS24        [193]P62847  118
17    RPS28        [194]P62857  117
18    RPS2         [195]P15880  115
18    RPS27        [196]P42677  115
18    RPS3         [197]P23396  115
Conclusion
In our study, we constructed a high-quality IAV-human PPI dataset and
employed XGBoost to predict interactions between influenza A virus
proteins and human proteins. By leveraging CT and Moran autocorrelation
features, our method effectively captured PPI information from protein
sequences. Extensive experimental comparisons demonstrated that our
approach outperforms other methods for this task, and its strong
performance across diverse datasets further underscores its potential
as a foundational tool for predicting PPIs involving various viruses.
In a five-fold cross-validation of our benchmark dataset, the model
achieved an accuracy of 96.89%, precision of 98.79%, recall of 94.85%,
F1-score of 96.78%, and MCC of 0.9386. Our model ultimately predicted
32,855 PPIs, involving 3,269 potential target proteins corresponding to
2,995 target genes. The GO and pathway analysis showed that these genes
were highly associated with influenza A virus infection. In the network
topology analysis, the predicted proteins exhibited high connectivity
within the human protein interactions network. This finding further
reinforces the credibility and reliability of our prediction results.
While this study achieved promising prediction results, we acknowledge
that the current approach primarily relies on sequence-based feature
representations, which are inherently limited in capturing the full
complexity of protein information. To address this, future work will
focus on integrating structural data, multi-omics approaches, and
phenotypic information to develop more robust and biologically
meaningful predictive models. Ultimately, we hope that this research
will help biologists recognize possible associations between influenza
A virus and human proteins, and facilitate the development of antiviral
drugs.
Supplemental Information
Supplemental Information 1. Appendix A-Parameter setting and
optimization of different models.
[199]peerj-13-18863-s001.xlsx (9.3KB, xlsx)
DOI: 10.7717/peerj.18863/supp-1
Supplemental Information 2. Appendix_B-Predicted_ppi.
[200]peerj-13-18863-s002.xlsx (571.1KB, xlsx)
DOI: 10.7717/peerj.18863/supp-2
Supplemental Information 3. Appendix_C-Predicted_human_uniprotid.
[201]peerj-13-18863-s003.xlsx (54.7KB, xlsx)
DOI: 10.7717/peerj.18863/supp-3
Supplemental Information 4. Appendix_D-Potential_hub_genes.
[202]peerj-13-18863-s004.xlsx (30.4KB, xlsx)
DOI: 10.7717/peerj.18863/supp-4
Funding Statement
This work was supported by the Fundamental Research Funds for the
Central Universities of China (2662018JC034). The funders had no role
in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.
Additional Information and Declarations
Competing Interests
The authors declare that they have no competing interests.
Author Contributions
Binghua Li conceived and designed the experiments, performed the
experiments, authored or reviewed drafts of the article, and approved
the final draft.
Xin Li conceived and designed the experiments, performed the
experiments, authored or reviewed drafts of the article, and approved
the final draft.
Xiaoyu Li analyzed the data, authored or reviewed drafts of the
article, and approved the final draft.
Li Wang analyzed the data, prepared figures and/or tables, and approved
the final draft.
Jun Lu analyzed the data, prepared figures and/or tables, and approved
the final draft.
Jia Wang conceived and designed the experiments, authored or reviewed
drafts of the article, and approved the final draft.
Data Availability
The following information was supplied regarding data availability:
The source codes and datasets are available at GitHub and Zenodo:
- [203]https://github.com/HVPPIlab/IVA-Human-PPI/.
- Li, Binghua. (2024). IVA-Human-PPI. Zenodo.
[204]https://doi.org/10.5281/zenodo.14273568.
References