Abstract
Background
Target drugs play an important role in the clinical treatment of virus
diseases. Virus-encoded proteins are widely used as targets for target
drugs. However, they cannot cope with the drug resistance caused by a
mutated virus and ignore the importance of host proteins for virus
replication. Some methods use interactions between viruses and their
host proteins to predict potential virus–target host proteins, which
are less susceptible to mutated viruses. However, these methods only
consider the network topology between the virus and the host proteins,
ignoring the influences of protein complexes. Therefore, we introduce
protein complexes that are less susceptible to drug resistance of
mutated viruses, which helps recognize the unknown virus–target host
proteins and reduce the cost of disease treatment.
Results
Since protein complexes contain virus–target host proteins, it is
reasonable to predict virus–target human proteins from the perspective
of the protein complexes. We propose a coverage
clustering-core-subsidiary protein complex recognition method named
CCA-SE that integrates the known virus–target host proteins, the human
protein–protein interaction network, and the known human protein
complexes. The proposed method aims to obtain the potential unknown
virus–target human host proteins. We list part of the targets after
proving our results effectively in enrichment experiments.
Conclusions
Our proposed CCA-SE method consists of two parts: one is CCA, which is
to recognize protein complexes, and the other is SE, which is to select
seed nodes as the core of protein complexes by using seed expansion.
The experimental results validate that CCA-SE achieves efficient
recognition of the virus–target host proteins.
Supplementary Information
The online version contains supplementary material available at
10.1186/s12859-022-04792-x.
Keywords: Virus–target protein, Host protein, Protein complexes
recognition, Protein–protein interaction network
Introduction
The novel coronavirus (2019-nCov for short) outbroke at the end of
2019, and has been causing varying degrees of disease symptoms in
people of different ages, such as gastrointestinal symptoms in children
while other ages experienced pneumonia, stroke, and blood clots
[[39]1]. 2019-nCov belongs to the
[MATH: β :MATH]
coronavirus, which is a single-stranded RNA virus. In addition to the
discovery of 2019-nCov in coronavirus, six human coronaviruses have
been found and studied in depth. There are four types of
[MATH: β :MATH]
coronavirus group. In 2002, there were 8000 cases of severe acute
respiratory syndrome coronavirus (SARS-CoV) worldwide, and the
mortality rate was about 10%. The Middle East respiratory syndrome
coronavirus (MERS-CoV) produced in 2012 had a higher fatality rate of
36% [[40]2, [41]3]. The other two beta-viruses (HCov-HKU1 and
HCov-OC43) are related to respiratory diseases, but with low incidence.
There are two kinds of
[MATH: α :MATH]
coronavirus groups [[42]4], such as HCoV-229E in 2000, but the
pathogenicity and mild symptoms. The HCov-NL63 virus, found in the
Netherlands in 2004, had little damage to humans and can be ignored.
2019-nCov can be transmitted by face-to-face communication or direct
contact of faces [[43]5]. For patients infected with 2019-nCov, there
are few drugs available to treat the virus, such as Lopinavir,
Remdesivir, and symptom-based treatment [[44]6].
Proteins with similar functions tend to form protein complexes and work
together, as do virus proteins. Therefore, it is very important to
recognize protein complexes in predicting virus–target host proteins.
Detecting complexes from Protein–Protein Interaction (PPI) has become
an important research field in proteomics. Many graph clustering
algorithms use it to find community structures in networks, and
recognize protein complexes. Nepusz et al. [[45]7] proposed ClusterONE,
which for detecting potentially overlapping protein complexes from PPI
data. Liu et al. developed CMC [[46]8] to discover complexes from a
weighted PPI network. Min Wu et al. presented COACH [[47]9] which
detected protein complexes in two stages. COACH first detects
protein-complex cores as the “hearts” of protein complexes and then
includes attachments into these cores to form biologically meaningful
structures.
According to the World Health Organization, over 17 million people die
of infectious diseases every year. It is necessary to identify
interactions between viruses and human proteins in virus disease
treatment. Ho-Joon Lee [[48]10] presented an approach to predict
virus–host PPI by multi-label machine learning classifiers of random
forests and XGBoost using amino acid composition profiles of virus and
human proteins, which predict that histone H2A components are targeted
by multiple 2019-nCov proteins. Jack Lanchantin [[49]11] proposed
DeepVHPPI, combining a self-attention-based transformer architecture
and a transfer learning training strategy to predict interactions
between human proteins and virus proteins that have novel sequence
patterns, achieving great results in predicting virus–human protein
interactions for H1N1 and Ebola. Babak Khorsand [[50]12] adopted
ensemble learning methods to predicting PPI between human proteins and
Alphainfluenzavirus proteins, he extracted several features from
physicochemical properties of amino acids, combined with different
centralities of human PPI.
Studying PPI between the new virus and its known host proteins can
accurately predict the virus. Fatma-Elzahraa Eid [[51]13] proposed
DeNovo, a sequence-based negative sampling and machine learning
framework, he learns from the PPI of different viruses and predicts a
new virus using shared host proteins. Experiments show that this method
can better predict the PPI of human virus infection. Dyer et al.
[[52]14] proposed a method to predict the physical interaction between
human proteins and HIV proteins based on a variety of features, such as
protein domains, sequence information, and the properties of human
proteins in human PPI networks. Zahiri predicted the HIV1–human PPI
network [[53]15]. Ray predicted HCV–human PPI network [[54]16]. Chen
revealed the west Nile virus–human PPI network [[55]17].
However, most existing methods often ignore the inherent biological
structure of the complexes, and many only consider the structure as
dense subgraphs. Few of them consider the modularity of PPI, let alone
search for potential virus–target proteins from protein complexes, such
as Babak Khorsand [[56]18]. Here we adopt CCA to recognize protein
complexes, showing the feasibility of seed selection, and then we used
CCA-SE to predict potential virus–target human proteins. At first, a
network representation learning technique called node2vec is utilized
to learn a dense vector for each vertex to represent the topological
information. Secondly, based on edge clustering coefficient (ECC) and
degree, we introduce a new seed selection strategy, and the core
structure of protein complexes is detected based on SE. Then according
to network topology, we design a fitness function to identify protein
complexes with various densities and modularity. At last, we apply
CCA-SE to predict the 2019-nCov-target proteins, which play a
fundamental role in detecting virus drug targets.
Result
Principle of the CCA-SE method
We developed the CCA-SE method based on the principle of coverage
clustering-core-subsidiary structure, which contained two parts, CCA
was used to recognizing protein complexes, and SE was applied to select
seed nodes. Especially, We demonstrated a new selection strategy for
seed nodes. We clustered seed nodes to obtain the nuclear structure of
protein complexes. Then, based on the density and modularity, we
expanded the core structure and formed protein complexes. Next,
integrate downloaded complexes data with our results. At last, we
calculated the similarity between unknown human proteins and known
virus–target proteins in the same protein complexes. We defined a
scoring function, which was helpful in getting potential virus–target
human proteins. Figure [57]1 shows the algorithm.
Fig. 1.
Fig. 1
[58]Open in a new tab
Process of CCA-SE
Construction of virus–host PPI network
We constructed a virus–target human PPI network based on the human
protein interaction database HIPPIE and the human protein dataset of
2019-nCov infection. Researchers often regard direct neighbor proteins
as potential host proteins in some typical virus–host proteins research
[[59]19]. Therefore, G = (V, E) represents human PPI. V includes not
only host proteins attached by the 2019-nCov directly but also direct
neighbors of those host proteins. E shows human PPI. After removing
noisy data, we got 2308 human proteins. We reported details in
“[60]Methods” section.
Node2Vec [[61]20] not only makes similar nodes closer in the vector
space but also retains network structure, captures the diversity
connection between nodes. We embedded Node2Vec into the PPI network to
extract hidden information and used a low-dimensional vector to
represent each node.
We extracted the hidden layer weight of PPI and demonstrated each
protein node with a 128-dimensional vector [[62]21]. In Fig. [63]2,
nodes included not only the proteins attached by virus but also those
known interacted with virus–host proteins directly.
Fig. 2.
[64]Fig. 2
[65]Open in a new tab
An illumination of virus–host PPI Network
Selection of seed nodes
In a real protein network, many protein complexes are overlapped and
share multiple nodes, so that the core nodes and the overlapped nodes
cannot be distinguished simply by degree [[66]22]. Therefore we use two
topological properties, ECC and nodes degree, which can evaluate the
importance of nodes. The score of each protein is obtained by the sum
of degree and ECC. According to an existing work [[67]9], if the score
of a protein node is greater than the average score, we consider it as
a seed protein that would achieve the best results. The scoring
function for each protein u is defined in Eq. ([68]1). We proved Eq.
([69]1) feasibility in Table [70]1.
[MATH: score
(u)=sumEEC(u)degree(u) :MATH]
1
where sum
[MATH: EEC :MATH]
(u) indicates summation of ECC and degree(u) indicates degree of node
u. Equation ([71]1) considers network topological structure and reduces
the overlapped nodes that include the sum of ECC by dividing each node
degree. Meanwhile, it avoids overlapped nodes mistaken as seed nodes
and improves seed selection accuracy.
Table 1.
Effects in different scores
Score Seed Precision Recall F-score
0.1398 1035 0.9317 0.6399 0.7587
0.1498 982 0.9350 0.6352 0.7565
0.1598 939 0.9390 0.6381 0.7598
0.1698 868 0.9414 0.6406 0.7624
0.1798 821 0.9383 0.6334 0.7563
0.1898 781 0.9272 0.6119 0.7373
0.1998 726 0.9399 0.6078 0.7382
[72]Open in a new tab
We formed Table [73]1 based on our results. “Score” means the standard
score when we choose seed nodes. “Seed” means the number of seeds. We
use 0.1698 as the seed selection standard also is the average score. As
we can see from Table [74]1, whether the score is higher than average
or lower, their precision, recall, and F-score are not good as the
average score’s effect.
We can use Gene Ontology (GO) database to predict and analyze gene
function. Usually, a gene or gene product is annotated by one or more
GO terms. We can calculate the similarity between genes to analyze and
predict gene function. The traditional coverage clustering algorithms
(CCA) [[75]23] are less time-consuming, handle large amounts of data,
and have no overlapping region among obtained clusters. However, these
methods would generate much false positive data without a reasonable
clustering radius. Therefore, we use GO function similarity to obtain a
reasonable radius and improve clustering accuracy. The clustering
radius is computed as follows.
[MATH: dx=rx1+G
mi>Osimilarity :MATH]
2
[MATH: radiu
s=∑x<
mo>∈Ddx∑x∈D1 :MATH]
3
We denote rx, the distance of all unlearned seed nodes to the
clustering center based on the Euclidean distance. We redefine the
distance as shown in Eq. ([76]2). The clustering radius of each final
cluster is obtained by Eq. ([77]3). D indicates the seed nodes that
have not learned yet.
It shows that in PPI networks, high-density subgraphs incline to form
protein complexes [[78]24]. Also, in a subgraph, if the internal weight
is much greater than the external weight, it is more likely to form
protein complexes. Therefore, density [[79]25] and modularity [[80]26]
are two important factors, determining whether the subgraph could form
protein complexes or not. We demonstrate a new method to evaluate
protein complexes.
[MATH: score
(cu)=t∗density(cu)+(1-t)∗modularity(cu) :MATH]
4
[MATH: densi
ty(cu)=2∗|Ecu||Vcu|(Vcu-1) :MATH]
5
[MATH: modularity(cu)=degr<
/mi>eein(cu)-degreeout(cu)degre<
/mi>ein(cu)+degreeout(cu) :MATH]
6
degree
[MATH: in :MATH]
(c
[MATH: u :MATH]
) is the sum of internal edges of the complexes, degree
[MATH: out :MATH]
(c
[MATH: u :MATH]
) is the sum number of other edges connected with the complexes, E
[MATH: Cu :MATH]
is the number of all edges in the complexes, and V
[MATH: Cu :MATH]
is the number of all nodes in the complexes. Moreover, we use a
parameter t to balance the weight of subgraph modularity and subgraph
density, as shown in Eq. ([81]4).
Obtaining potential virus–host proteins
We obtain a set of protein complexes cores. To form complete protein
complexes, we add subsidiary structures to each core with the following
steps and name them SE: (i) The first-order neighbor nodes of each
protein complexes core are obtained from the network diagram as
candidate proteins of the complexes. (ii) Calculating the sum of
functional similarity between the core and their neighbor nodes based
on edge weight. (iii) ranking nodes according to the functional
similarity score, completing protein complexes by adding affiliated
nodes. The local score of the protein complexes calculates when a new
node adds. (iv) Repeating (iii) until no more nodes can improve the
local score, then terminating the expansion of the protein complexes
core.
Protein complexes are molecular polymers that participate in the same
functional region at the same time and space and have direct or
indirect effects [[82]27]. Studies have shown that the network is
modular, and proteins in the same complexes are more likely to
undertake the same life activity or in the same pathway [[83]28]. At
the same time, the more host proteins contained by the complexes
indicate that the complexes are more likely to participate in life
activities such as virus replication [[84]29].
We obtained 255 protein complexes on the human PPI network dataset.
Considering incomplete data, we downloaded all human complexes data
from the CORUM database, including 1948 protein complexes. Then we
integrated them with our results, choosing protein complexes data with
nodes than three and known host proteins while deleting the redundant
data. Therefore we got 455 human complexes, with complexes that contain
more than two known proteins having sixty. The more known host proteins
contained in the complexes, the more likely it participate in viral
survival and reproduction. Since the proteins phenotypes in the same
protein complexes are similar, the remaining proteins may also become
the proteins required for virus replication.
Table [85]2 lists some protein complexes, in which Complex_size
represents the size of the module, Host_protein is the host protein of
the virus, and Complexes is the collection of all proteins.
Table 2.
Partial protein complex data
ID Complex_size Host_protein Complex
1 4 [86]P67870; [87]P19784; [88]P09429; [89]P19784; [90]P67870;
[91]P68400;
2 5 [92]Q15370; [93]P62877; [94]Q15369; [95]Q13617; [96]P40337;
[97]Q15370; [98]P62877; [99]Q15369; [100]Q13617;
[101]Q15369; [102]Q13617; [103]Q15369; [104]Q13617;
3 5 [105]Q15370; [106]Q15369; [107]Q15369; [108]Q15370; [109]Q93034;
[110]Q9UBF6; [111]Q9Y576;
4 7 [112]Q92769; [113]O00422; [114]O75446; [115]Q09028; [116]Q13547;
[117]Q16576; [118]Q92769; [119]Q96ST3;
5 3 [120]P13861; [121]P13861; [122]P24588; [123]Q08209;
6 5 [124]Q9UHD2; [125]P18124; [126]P08238; [127]Q04206; [128]Q9UHD2;
[129]Q00653;
7 3 [130]Q86VM9; [131]Q86VM9; [132]Q9BXB5; [133]Q15287;
8 5 [134]P11940; [135]O75534; [136]P11940; [137]O75534; [138]Q16549;
[139]Q14103;
[140]O60506;
9 16 [141]O43633; [142]Q70EL1; [143]Q9BY43; [144]O95630; [145]Q9UN37;
[146]Q70EL1;
[147]Q96FZ7; [148]Q96CF2; [149]Q9NZZ3; [150]Q9H444;
[151]Q70EL1; [152]Q9HD42; [153]O43633; [154]O75351;
[155]Q9UN37; [156]Q7LBR1; [157]Q9H444; [158]Q9BY43;
10 9 [159]O60885; [160]Q7L2J0; [161]O94992; [162]O60563; [163]O60583;
[164]O60563;
[165]P06400; [166]O94992; [167]O60885; [168]P50750; [169]Q7L2J0;
[170]Open in a new tab
Based on Table [171]2, we can speculate that the unknown human proteins
in the same complexes are closely related to the known virus–host
proteins. The table also shows that not only other proteins in the
module are related to the virus, but also the module itself is related
to the virus, which jointly completes some biological functions and
promotes the reproduction of the virus. Therefore, these remaining
proteins are potential virus–target proteins we need.
The above descriptions show that each of the complexes contains
virus–target host proteins and is more or less associated with the
virus. Then we use the protein complexes and Node2Vec to calculate the
similarity between the rest of the unknown proteins and known host
proteins, then score them as potential target proteins. The score
computes as follows.
[MATH: scorec=wc
∗simc
msub>(complex) :MATH]
7
w
[MATH: c :MATH]
indicates the proportion of known host proteins in the complexes to
which potential target protein C belongs. n
[MATH: c :MATH]
indicates the number of known proteins in the complexes. Let N
indicates the total number of proteins in the complexes. w
[MATH: c :MATH]
in Eq. ([172]8) can be obtained as follows.
[MATH: wc=ncN :MATH]
8
The second term in Eq. ([173]7) indicates the sum of the similarities
between target protein C and the known proteins in its complexes. The
formula is as follows.
[MATH: simc(complex)=∑i=1
nsim(c,Pi
msub>) :MATH]
9
P
[MATH: i :MATH]
indicates each known virus–target protein in the complexes of candidate
protein C, in which the complexes in the Equation select the complexes
corresponding to the maximum value in Eq. ([174]8).
Methods
Statistical model
We define the functional similarity between the two interacting
proteins a and b as GO
[MATH: similarity :MATH]
, and as follows.
[MATH: GOsimilarity(a,b)=|GOsum(a)∩GOsum(b)| :MATH]
10
Based on the definition of protein functional similarity, the adjacency
matrix A
[MATH: ij :MATH]
of graph G can express as follows.
[MATH: Aij=0,other
wiseeij(i,j∈E) :MATH]
11
e
[MATH: ij :MATH]
equals GO
[MATH: similarity :MATH]
.
Dataset
Based on the latest study of the interaction map of 2019-nCov
virus–host protein [[175]30], we obtained 332 high reliability
2019-nCov human PPI data. We mapped the proteins with the Uniprot ID
through the Uniprot database and got 256 key host factors in our
experiments. To reduce the impact of data redundancy on results, we
used the human PPI data in HIPPIE, which integrates interactive data of
multiple databases, including MINT, HPRD, and BioGrid. The HIPPIE is
the most commonly used PPI database. Therefore, we download 65,536 PPI
data from the HIPPIE database, involving 11,564 human proteins.
Weighting protein networks by GO annotations
GO is an internationally standardized gene function classification
system. It consists of a predefined set of GO terms, which can limit
and describe the function of gene products. GO terms provide the
logical structure and correlation of biological processes and classify
biological process (BP), molecular function (MF), and cellular
component (CC). GO annotations [[176]31] are responsible for describing
GO terms function. We use G = (V, E) to represent the proteins network,
V is the set of proteins, and E represents the set of protein–protein
interactions. The specific steps are as follows: (i) We assume that
protein a contains N GO annotation sets on BP.
[MATH: GOBP(a)=go1,go2,⋯,<
mi>gon :MATH]
12
M GO annotation sets on MF.
[MATH: GOMF(a)=go1,go2,⋯,<
mi>gom :MATH]
13
K GO annotation sets on CC.
[MATH: GOCC(a)=go1,go2,⋯,<
mi>gok :MATH]
14
(ii) According to the tree structure of GO, we can calculate all
parental annotation sets of protein a under different categories, then
add to the original annotation set of GO(a), and remove the redundant
data. The function annotation of protein a obtains as follows.
[MATH: GOsum(a)=GOBPsum(a)+GOMFsum(a)+GOCCsum(a)3 :MATH]
15
Cross-validation method
We adopt Cross-Validation to adjust the parameters reasonably under the
condition of moderate source datasets and apply them to practical
problems. Furthermore, we use the K-fold Cross-Validation method to
evaluate the experimental results.
Firstly, we divide the identified host protein complexes into a
training set and validation set according to the ratio of 8:2. To know
how many validation data are in top k, we conduct ten groups of control
experiments to verify the results and use the ratio as the final target
proteins classification standard. We consider deleting the candidate
protein with zero scores in the final ranking process to reduce the
data redundancy. Moreover, k ranges from 0 to 100 with a step of 10,
The Cross-Validation shows that the prediction results can be divided
into four categories, as shown in Table [177]3.
Table 3.
Classification table of predict results
Type of host protein Result (whether host protein) Type of evaluation
Host protein Positive TP
Host protein Negative FN
Not host protein Positive FP
Not host protein Negative TN
[178]Open in a new tab
In this section, the true-positive rate (TPR) and false-positive rate
(FPR) values are obtained according to the four results in
Table [179]3, as shown in Eqs. ([180]16) and ([181]17), TPR indicates
the prediction coverage of our method.
[MATH: TPR=TPP :MATH]
16
[MATH: FPR=FPN :MATH]
17
N represents candidate proteins that are not in the validation data. P
represents candidate proteins that are in the validation data. TP means
the recognized correctly by CCA-SE. FP means the recognized quantity
wrongly.
Considering the unbalanced data distribution between known and unknown
proteins, we use the values of receiver operating characteristic (ROC)
and area under the receiver operating characteristic curve (AUC) as
evaluation indexes, then plot the ROC curves under different thresholds
of TPR and FPR. The abscissa represented the FPR while the ordinate
represented the TPR.
Experiments
Selection of parameter k
We analyze the influence of parameter k, then select the best k value.
Let the value of k vary from 0 to 100. The ROC curve shows in
Fig. [182]3. The prediction results divide into two categories, the
first k% data considered as the predicted potential target proteins,
and the second (100-k)% data not considered as the predicted potential
target proteins. It can be seen from Fig. [183]3 that when k is 40,
CCA-SE successfully predicts 22 known host proteins. According to the
above host PPI network, 80% of the proteins are non-structural protein
interactions encoded by the 2019-nCov. In addition, this experimental
result also shows that the predicted protein scores are high when the
value of k increases, indicating that the use of Eq. ([184]7) helps to
improve the sorting performance of candidate proteins. We plot the AUC
curve to show the advantages and disadvantages of each group of
algorithms. As shown in Fig. [185]4, the AUC curve is relatively stable
in the ten groups of control experiments, which is basically around
0.81, indicating that the algorithm still shows good performance even
when the amount of data is less than the actual biological network
data.
Fig. 3.
[186]Fig. 3
[187]Open in a new tab
The influence of parameter k
Fig. 4.
[188]Fig. 4
[189]Open in a new tab
Comparison of AUC values in multiple control experiments
In summary, in the subsequent prediction of potential virus–target
proteins, we set the value of k as 40 to achieve the best experimental
results.
Comparative experiment of data integration
To make a horizontal comparison and evaluate the applicability of the
integrated data [[190]32], we compare the results of adding complexes
data based on CCA-SE with only containing public human protein
complexes data set. We also use TPR and FPR as evaluation indicators.
The following table shows that the TPR containing public human protein
complexes is only 68.67%, which indicates that the integrated protein
complexes data can improve accuracy and make the biological data more
comprehensive. It also shows that the complexes recognized by CCA-SE
are more biological (Table [191]4).
Table 4.
Experimental comparison of whether the complex is integrated or not
Comparision algorithm Number of recognition TPR (%) FPR (%)
Unintegrated data 17 68.67 38.08
Integrated data 22 72 40.31
[192]Open in a new tab
Performing GO enrichment analysis on prediction results
After we obtain biological data, to read the genetic information,
differential genetic analysis is a necessary experiment between
different samples. Therefore we need to annotate these genes because
the number of these genes are maybe large and difficult to compare. A
common method divides these genes or proteins into several categories,
one category is equivalent to a GO term. This process is called
enrichment analysis. Commonly used enrichment analysis methods include
GO analysis and KEGG analysis.
We integrate known virus–target host proteins with their directly
interacted human proteins and apply CCA-SE to the whole network. Our
predicted results are listed in “Additional file [193]1” and named “new
targets”. In our results, eight proteins have been proved among the top
ten, which are [194]Q13617, [195]Q15370, [196]P62877, [197]Q15369,
[198]P09884, [199]Q14181, [200]P35658, and [201]P78406. It reported
that 87% of them combined with a virus of non-structural proteins. At
the same time, these non-structural proteins are cleaved by 3CLPro, and
this protease is one of the organic substances necessary for the
reproduction of 2019-nCov. On the other hand, researchers have not
recognized similar restriction sites in the human body. It is a medical
value that we target 3CLPro as a drug target. Based on the above
analysis, CCA-SE can recognize virus–target host proteins. Table [202]5
is part of the potential target proteins.
Table 5.
Some of the potential virus–target host proteins
Uniprot ID Protein name MF CC BP
[203]O15264 Mitogen-activated protein GO:0005515; GO:0005829;
GO:0071347;
Kinase 13 (MAPK13) GO:0005524; GO:0005737; GO:0018105
GO:0004674; GO:0005634; GO:0070301;
GO:0004707; GO:0007049;
GO:0050729;
GO:1903936;
GO:0006970;
GO:0051403;
[204]Q92598 Heat shock protein family H GO:0005515; GO:0005654;
GO:1900034;
(Hsp110) member 1 (HSPH1) GO:0005524; GO:0005829; GO:0051135;
GO:0000774; GO:0005737; GO:0045345;
GO:0043014; GO:0005634; GO:0051085;
GO:0070062; GO:0006986;
GO:0032991; GO:0050790;
GO:0071682; GO:0006898;
GO:0005874;
GO:0005576;
[205]Q13177 p21 (RAC1) activated kinase GO:0005515; GO:0005829;
GO:0071407;
2(PAK2) GO:0042802; GO:0005737; GO:0050770;
GO:0019901; GO:0005634; GO:0051497;
GO:0005524; GO:0098978; GO:0018105;
GO:0045296; GO:0005911; GO:0031295;
GO:0004674; GO:0014069; GO:0040008;
GO:0030296; GO:0048471; GO:0043066;
GO:0031267; GO:0005886; GO:0006469;
GO:0004672; GO:0150105;
GO:0034333;
GO:0006468;
GO:0046777;
GO:0050852;
GO:0031098;
[206]Q15311 ralA binding protein GO:0005515; GO:0005829; GO:0043547;
1(RALBP1) GO:0042910; GO:0016020; GO:0007264;
GO:0005096; GO:1990961;
GO:0042626; GO:0051056;
GO:0022857; GO:0043087;
GO:0006935;
GO:0006897;
GO:0055085;
[207]P78545 E74 like ETS transcription GO:0005515; GO:0005654;
GO:0045892;
Factor 3(ELF3) GO:0000978; GO:0005829; GO:0045944;
GO:0001228; GO:0005634; GO:0006366;
GO:1990837; GO:0005794; GO:0045747;
GO:0003700; GO:0006357;
GO:0000981; GO:0030855;
GO:0060056;
GO:0006954;
GO:0001824;
GO:0030198;
GO:0030154;
[208]Open in a new tab
To verify the accuracy of the CCA-SE in predicting candidate
virus–target proteins, we only conduct gene enrichment analysis for the
top 50 potential target proteins. Table [209]6 lists the enrichment
analysis results of proteins that are high scores. These GO terms meet
the p value of less than 0.05. The analysis results in Table [210]6
showed that most predicted GO annotations of target proteins are
related to biological processes such as protein binding, enzyme
binding, transcription factor binding, transcription factor activity,
protein kinase binding, apoptosis, and proliferation. For example,
GO:0005515, which belongs to MF. Previous studies have found that the
spike protein RBD encoded by 2019-nCov contains six amino acids,
including L455, F486, Q493, S494, N501, and Y505. Meantime, RBD can
integrate with the ACE2 protein of human lung epithelial cells, so we
can infer that the host protein corresponding to this functional
annotation has a huge correlation with these residues. Another example,
GO:0016032, which plays an important role in virus affection, relates
to the viral genome replication and the assembly of progeny virus
particles. Moreover, Babak Khorsand [[211]18] listed the most central
nodes in human interactions of 2019-nCov in his paper, which are
[212]Q86VP6, [213]Q92905, [214]Q13573, and [215]P01106. Our results
included [216]Q92905, [217]Q13573, and [218]P01106, and we both
performed experiments in the same datasets. The prediction of
2019-nCov-target potential host proteins shows a significant enrichment
effect. Demonstrating the accuracy of our prediction based on a
molecular network.
Table 6.
Results of GO enrichment analysis
GO ID GO terminology p value
GO:0016032 Viral process 1.23e−4
GO:0003677 DNA binding 1.40e−4
GO:0004842 Ubiquitin-protein transferase activity 1.32e−4
GO:0005515 Protein binding 1.33e−4
GO:0031625 Ubiquitin protein ligase binding 1.09e−05
[219]Open in a new tab
KEGG pathway analysis of prediction results
Kyoto Encyclopedia of Genes and Genomes (KEGG) is a database with
functional information about each gene. The core of the KEGG database
is KEGG PATHWAY and KEGG ORTHOLOGY. In KEGG PATHWAY, biological
metabolic pathways are divided into six categories, namely Cellular
Processes, Environmental Information Processing, Genetic Information
Processing, Human Diseases, Metabolism, and Organismal Systems. Once we
get the differential gene information, in order to learn their
functions more clearly, gene enrichment analysis may be used to
discover biological pathways that play a key role in biological
processes, so that we can better understand the molecular mechanisms of
biological processes. KEGG pathway analysis [[220]33] selects pathway
databases and human-related pathways to analyze the predicted proteins.
In Fig. [221]5, C1–C6 represents the result processed by KEGG pathway
enrichment analysis. C1 means Annotation Cluster 1, C2 means Annotation
Cluster 2, and so on. We sort the results in descending order of
Enrichment Score, and “other” includes those genes that do not belong
to any of the clusters. These genes have not shown their functional
characteristics in our pathway analysis, we consider them less
important factors in our potential virus–target experiments. According
to Fig. [222]5, a total of 42 pathways (p-value 0.05) are obtained by
screening proteins with high scores, mainly including the TNF signaling
pathway, T cell receptor signaling pathway (TCR) pathway, and MAPK
pathway related to cell cycle and inflammatory immune regulation,
PI3K-Akt signaling pathway and HIF-1 signaling pathway related to
pulmonary fibrosis regulation, renal cell carcinoma pathway related to
viral diseases and human immunodeficiency virus 1 infection pathway.
The above pathways demonstrate that although some predicted host
proteins do not directly interact with virus-encoded proteins, they are
closely related to the pathogenesis of the virus.
Fig. 5.
[223]Fig. 5
[224]Open in a new tab
Enrichment function analysis of KEGG signaling pathway
The existing studies have shown that the MAPK pathway is related to
cell growth and mutations. TNF is mainly produced by T cells and NK
cells, and both are closely related to inflammation. They can release
signals to interact with specific receptors on the cell surface, making
them conservative, so that MAPK-JNK, 5-lipoxygenase, and other
signaling pathways are activated, making cytokines related to
inflammation disorders, such as abnormal expression of gp130 and IL-1,
and promoting human inflammatory response [[225]3, [226]4].
The PI3K-Akt signaling pathway is closely related to the renal cell
carcinoma pathway, as shown in Fig. [227]6. Tyrosine kinase receptors
can activate phosphatidylinositol 3-kinases (PI3Ks) signaling pathways,
which are related to cell proliferation and apoptosis. When PI3Ks is
activated, it produces a messenger that binds to the signal protein
PDK1 containing the PH domain. Through phosphorylation, the Akt
signaling protein is activated to form PI3K-Akt. This protein can also
phosphorylate and regulate the downstream factor mammalian target of
rapamycin (mTOR) [[228]34], thereby activating the mTORC1 pathway,
which participates in the expression of T cytokines in the immune
system and promotes the enhancement of the immune system.
Fig. 6.
[229]Fig. 6
[230]Open in a new tab
Renal cell carcinoma pathway
Conclusion
In this paper, we proposed a protein complexes recognition method
CCA-SE. The protein complexes obtained by CCA-SE were integrated with
the human protein complexes to obtain a more reliable protein complexes
dataset, then we defined a score function to get potential target
proteins. The scoring function takes into account not only the
relationship between the protein complexes and the virus-encoded
proteins but also the protein itself to predict the virus–target human
proteins. Moreover, we verified the effectiveness of CCA-SE on the
biological network under different parameter settings. At the same
time, the selected target proteins were imported into the DAVID v6.7
database ([231]https://david.ncifcrf.gov/). We conducted GO function
enrichment analysis and KEGG signal pathway enrichment analysis. The
analysis explained the correlation between the predicted results
obtained by the CCA-SE and the life process of virus infection and
replication and proved the accuracy from the biological perspective.
The experimental results showed that CCA-SE can effectively recognize
human proteins targeted by the 2019-nCov and play a fundamental role in
detecting virus drug targets.
Supplementary Information
[232]12859_2022_4792_MOESM1_ESM.xls^ (23.5KB, xls)
Additional file 1. The Additional file 1 contains “new targets”. “new
targets” lists our predicted results and shows our method of predicting
virus–target human proteins does achieve good results.
Acknowledgements