Abstract
Background
Bacterial invasive infection and host immune response is fundamental to
the understanding of pathogen pathogenesis and the discovery of
effective therapeutic drugs. However, there are very few experimental
studies on the signaling cross-talks between bacteria and human host to
date.
Methods
In this work, taking M. tuberculosis H37Rv (MTB) that is co-evolving
with its human host as an example, we propose a general computational
framework that exploits the known bacterial pathogen protein
interaction networks in STRING database to predict pathogen-host
protein interactions and their signaling cross-talks. In this
framework, significant interlogs are derived from the known pathogen
protein interaction networks to train a predictive l[2]-regularized
logistic regression model.
Results
The computational results show that the proposed method achieves
excellent performance of cross validation as well as low predicted
positive rates on the less significant interlogs and non-interlogs,
indicating a low risk of false discovery. We further conduct gene
ontology (GO) and pathway enrichment analyses of the predicted
pathogen-host protein interaction networks, which potentially provides
insights into the machinery that M. tuberculosis H37Rv targets human
genes and signaling pathways. In addition, we analyse the pathogen-host
protein interactions related to drug resistance, inhibition of which
potentially provides an alternative solution to M. tuberculosis H37Rv
drug resistance.
Conclusions
The proposed machine learning framework has been verified effective for
predicting bacteria-host protein interactions via known bacterial
protein interaction networks. For a vast majority of bacterial
pathogens that lacks experimental studies of bacteria-host protein
interactions, this framework is supposed to achieve a general-purpose
applicability. The predicted protein interaction networks between M.
tuberculosis H37Rv and Homo sapiens, provided in the Additional files,
promise to gain applications in the two fields: (1) providing an
alternative solution to drug resistance; (2) revealing the patterns
that M. tuberculosis H37Rv genes target human immune signaling
pathways.
Electronic supplementary material
The online version of this article (10.1186/s12864-018-4873-9) contains
supplementary material, which is available to authorized users.
Keywords: Pathogen-host protein interaction networks, Pathogen-host
coevolution, Signaling pathways, Transfer learning, l[2]-regularized
logistic regression
Background
Bacterial invasive infection and host immune response is fundamental to
the understanding of pathogen pathogenesis and the discovery of
effective therapeutic drugs. As an example, Mycobacterium tuberculosis
is the causative agent of tuberculosis, an infectious disease that
causes millions of deaths each year [[29]1]. In recent years, M.
tuberculosis H37Rv has attracted much attention partly due to its
co-infection with HIV [[30]2] and drug resistance [[31]3–[32]6]. From
the point of view of interactome, bacterial-host protein interaction
networks can be viewed as the interface/cross-talks between pathogen
protein-protein interaction (PPI) networks and host PPI protein-protein
networks. Bacteria-host signaling cross-talks potentially help us
understand the underlying mechanism of M. tuberculosis infection and
human defence.
To date, most of the experimental work focuses on detecting
protein-protein interactions within bacterial cells. The database
STRING [[33]7] ([34]https://string-db.org/) has curated massive PPI
networks of 1678 bacterial pathogens such as M. tuberculosis, B.
anthracis, F. tularensis, Y. pestis, etc. However, there are very few
experimental studies on protein interactions between bacteria and their
host. From a computational view of point, M. tuberculosis H37Rv has
been extensively studied in recent years in terms of drug resistance
analysis [[35]4–[36]6, [37]8] and PPI networks reconstruction [[38]9,
[39]10]. In [[40]9, [41]10], interlogs are derived as M. tuberculosis
H37Rv PPIs from the known PPIs of other source species. In [[42]9], the
known M. tuberculosis H37Rv PPIs are laid aside unused and instead the
E. coli PPIs are used as training data to predict M. tuberculosis H37Rv
PPIs. In [[43]10], the interlogs derived from lostridium difficile are
used to expand the known M. tuberculosis H37Rv PPI networks, and the
expanded PPI networks are further used as training data to train a
random forest model for the discovery of novel M. tuberculosis H37Rv
PPIs.
Bacterial pathogen PPI networks are useful to study the signaling
mechanism and drug resistance machinery within bacteria cell. However,
we need further reconstruct bacteria-host PPI networks to understand
the cross-talk mechanism of bacterial infection and host immunity. In
recent years, pathogen-host PPI networks reconstruction and
pathogen-host signaling cross-talk modeling have attracted much
attention from computational biologists [[44]11–[45]18], most of which
focus on virus-host protein interactions. Comparatively, t the
experimental studies on bacteria-host protein interactions are much
less than that on virus-host protein interactions, partly because of
the complex bacterial cell wall, which forms a strong permeability
barrier to the mutual access of the bacterial genome and the host
genome [[46]19]. The two genomes could come across to physically
interact only if bacterial proteins are located at the surface or
membrane of bacterial cell, or bacterial proteins could transport or
secret into the host cell. To our knowledge, experimental studies on
bacteria-host protein interactions have been conducted for a very
limited number of species such as Salmonella [[47]20], Bacillus
anthracis, Francisella tularensis, and Yersinia pestis [[48]21]. For
these bacterial pathogens, the known pathogen-host PPIs can be used as
training data of machine learning modeling or be treated as templates
to infer interlogs [[49]22]. In [[50]22], the known 62 Salmonella-human
PPIs are used to derive interlogs as novel Salmonella-human PPIs.
Nevertheless, no experimental studies on pathogen-host protein
interactions have been conducted for the overwhelming majority of
bacteria, e.g. M. tuberculosis H37Rv. To study the signaling
cross-talks between bacteria and host, two solutions to inferring
bacteria-host protein interactions have been proposed, one solution is
ortholog knowledge transfer that transfers ortholog knowledge between
two different hosts, e.g. knowledge transfer between human and plant to
infer Salmonella-plant PPIs from the known Salmonella-human PPIs
[[51]23]; and the other solution is interlog knowledge transfer that
infers interlogs from known PPIs of different bacteria and different
hosts [[52]24, [53]25]. These two solutions are effective for
cross-species knowledge transfer, especially when no experimental data
are available to the species to be studied. Nevertheless, these two
methods both resort to third-party species that may not physically
co-exist with parasitic relationships, e.g. Homo sapiens versus plant
[[54]23]. Ortholog or interlog knowledge transfer across widely-variant
species are prone to yield a certain level of noise and false
interactions due to a large evolutionary divergence.
Actually, the parasitic or co-evolution relationships between bacteria
and host indicate that bacterial protein-protein interaction alone is
sufficient for us to infer bacteria-host protein interaction without
resorting to third-party distant species. Knowledge transfer between
two co-evolving species is more credible than that between two
evolutionarily distant species. The bacterial pathogen protein
interaction networks of 1678 bacteria in STRING [[55]7] provide rich
information for us to study bacteria-host protein interactions, because
many evidences have demonstrated that bacterial genome is co-evolving
with its host genome [[56]19, [57]26, [58]27]. In [[59]27], it has been
concluded that Mycobacterium tuberculosis complex (MTBC) has been
anatomically co-evolving with modern humans for tens of thousands of
years on the basis of the evidences of its origin in Africa, the
congruence in phylogeography and the dating of major branching events.
Moreover, the drug resistance of Mycobacterium tuberculosis (MTB)
strains is also evolving with the host genome. In [[60]19], it has been
claimed that human genetic factors may play important roles in MTB drug
resistance and different MTB lineages (e.g. lineage 1: Indo-Oceanic;
lineage 2: East Asian; lineage 3: Central Asian; lineage 4: Euro
American; lineage 7: Ethiopia) acquire different levels of drug
resistance. Molecular interactions are an effective way to unravel
bacteria-host co-evolution relationship and the progression of
bacterial drug resistance, which is at present hampered by the limited
knowledge of bacteria-host interactions. For instance, the molecular
mechanism involved in sensing of extracellular signals for inducing its
metabolic adaptation still remains unclear [[61]4].
Furthermore, the interaction between bacteria and host also somewhat
contributes to bacterial drug resistance. As suggested in [[62]19], the
interaction of MTB with its macrophage microenvironment may play an
important role in the risk of progression to drug-resistant TB.
Meanwhile from the host side, polymorphisms within genes involved in
macrophage activity (SLC11A1, VDR and HLA genes) have been reported to
be associated with susceptibility to MTB drug resistance. As reviewed
in [[63]28], the amino acid residues from the PPI interfaces are more
conserved than those from other parts of the protein surface, and PPI
inhibitors can perhaps be more resistant to spontaneous mutations at
their binding site versus inhibitors of the active site, thus
bacteria-host PPI inhibitors may be of particular interest as
antimicrobial drugs that induce less risk of drug resistance.
Therefore, the exploration of protein interaction networks between
co-evolving bacterial pathogens and host could potentially achieve two
goals: (1) deriving more reliable interlogs to study pathogen-host
signaling cross-talks; (2) choosing pathogen-host PPI inhibitors to
provide an alternative solution to bacterial drug resistance.
In this work, taking M. tuberculosis H37Rv as an example, we propose a
general computational framework that transfers the knowledge of
bacterial pathogen protein interaction networks to predict pathogen
targeted human genes and immune signaling pathways. Due to lack of
experimental studies, we take advantage of the co-evolution
relationship between M. tuberculosis H37Rv and Homo sapiens [[64]19,
[65]26, [66]27] to derive interlogs as the training data from M.
tuberculosis H37Rv protein interactions alone. We confine the search of
M. tuberculosis H37Rv ortholog genes within its human host without
resorting to a third-party species. The interlogs derived in this way
are presumably more reliable than those derived from remote species.
Given two interacting M. tuberculosis H37Rv genes (m[1,] m[2]) and
their corresponding Homo sapiens ortholog genes (h[1], h[2]), we deem
(m[1], h[2]) and (m[2], h[1]) as two interlogs, since the human
ortholog gene products are functionally or structurally similar to M.
tuberculosis H37Rv gene products. To ensure the quality of data, only
the significant interlogs are used as training data, and the less
significant interlogs need to be further validated by the trained
model. Ortholog and interlog knowledge transfer are prone to introduce
a certain level of noise as well as increase the computational
complexity. To solve this problem, we adopt theoretically
well-established l[2]-regularized logistic regression as the base
machine learning model. Finally, we further conduct gene ontology (GO)
and pathway enrichment analyses on the predicted interactions to
provide insights into the machinery of M. tuberculosis H37Rv infection
and host response. As a major concern in recent years, bacterial drug
resistance is also discussed in terms of bacteria-host PPI inhibition
to provide a potential alternative solution to M. tuberculosis H37Rv
drug resistance.
Methods
Overview flowchart of the proposed framework
As shown in Fig. [67]1, this work is divided into three phases: (I)
data construction; (II) model training and prediction; (III) analyses.
The critical phase is to derive interlogs as training data from the
known M. tuberculosis H37Rv PPI networks in STRING [[68]9]. This phase
takes advantage of the co-evolution relationship between M.
tuberculosis H37Rv and human host to construct training data that are
experimentally not available. The second phase is to construct feature
representation, train an l[2]-regularized logistic regression model,
and then predict interactions from less significant interlogs and
non-interlogs. The final phase is to comprehensively analyse the
reconstructed MTB-human PPI networks for further understanding of the
machinery of bacterial infection and host response. Especially, we
analyse the MTB-human PPIs related to M. tuberculosis H37Rv drug
resistance. Proper selection or design of PPI inhibitors could
potentially provide an alternative solution to bacterial drug
resistance.
Fig. 1.
Fig. 1
[69]Open in a new tab
Overview flowchart of the proposed framework
Data construction via interlog knowledge transfer from M. tuberculosis H37Rv
PPI networks
The M. tuberculosis H37Rv protein interaction networks in STRING
[[70]9] contain 309,715 interactions. Unfortunately, the PPI networks
have been estimated to be of low quality [[71]24, [72]25, [73]29]. To
choose quality M. tuberculosis H37Rv protein interactions, we take the
following three measures. Firstly, we only choose the PPIs with
experimental evidences. Secondly, we remove the proteins that have no
gene names. Lastly, we only choose the well-studied genes that have
been annotated with at least one specific gene ontology (GO) term of
molecular function or biological process, except the generic root GO
terms (GO:0005575, GO:0008150, GO:0003674) in the GO directed acyclic
graph (DAG). As results, we obtain 5224 well-studied M. tuberculosis
H37Rv genes that correspond to 7835 gene products/proteins and
28,347 M. tuberculosis H37Rv protein-protein interactions. In the same
way, we obtain 20,081 well-studied human genes that correspond to
60,126 gene products/proteins.
Now we exploit the knowledge of M. tuberculosis H37Rv PPI networks via
orthologous relationships to construct the training data for MTB-human
PPI predictions. To formally formulate the method of data construction,
we denote the obtained M. tuberculosis H37Rv PPI networks as G, which
contains 28,347 interactions and 1469 well-studied M. tuberculosis
H37Rv proteins. After removing the M. tuberculosis H37Rv proteins that
have no human orthologs, we obtain 1359 well-studied M. tuberculosis
H37Rv proteins in total. We search the human ortholog genes in
SwissProt [[74]30] simply using PSI-BLAST [[75]31] with default E-value
(E-value = 10). Orthologs are defined as homologous genes diverging
after a speciation event [[76]32]. Actually, there are some advanced
methods to search or predict orthologs, e.g. Reciprocal Best Hits (RBH)
[[77]32, [78]33], which relies on BLAST [[79]34] to identify pairwise
orthologs between two species [[80]33], that’s, two genes residing in
two different genomes are deemed orthologs if their protein products
find each other as the best hit in the opposite genome [[81]32]. For
each ortholog pair (A, A’), RBH algorithm needs to run BLAST twice in
two directions, one direction is against A query genome, and the
opposite direction is against A’ query genome. RBH algorithm yields
quality orthologs at the cost of high computational intensity. In this
work, we adopt simple homolog-search method instead of RBH algorithm
for the two reasons: (1) we need more orthologs including the distant
orthologs to derive interlogs as training data, because no experimental
data are available to computational modeling; (2) the RBH algorithm
would computationally worsen the efficiency of the sophisticated
framework as illustrated in Fig. [82]1. If we choose lower BLAST
E-value cutoff, e.g. 1e-50 versus 1e-6 [[83]32], we still could obtain
quality orthologs. The noise from distant orthologs or non-orthologs
could be counteracted using the regularization technique that is
discussed in the next subsection.
From G, we derive the interlogs as follows. Given two interacting M.
tuberculosis H37Rv genes (m[i], m[j]), we use H[i], H[j] to denote the
sets of ortholog genes m[i] and m[j,] respectively. In particular, if
an ortholog gene yields more than one ortholog protein, only one
ortholog protein is randomly chosen as the interacting partner. Given
the cutoff of ortholog significance δ, e.g. E-value of PSI-BLAST, we
further split H[i], H[j] into two subsets
[MATH: Hi≤δ<
/mrow>Hi>δ
,Hj≤δ<
/mrow>Hj>δ
:MATH]
, respectively. Here we define
[MATH:
Hi≤δ,Hj≤δ :MATH]
as the sets of significant ortholog genes and
[MATH:
Hi>δ,Hj>δ :MATH]
as the set of less significant ortholog genes. For any two interacting
M. tuberculosis H37Rv genes (m[i], m[j]), we create the positive
training instances from the set of significant ortholog genes
[MATH:
Hi≤δ,Hj≤δ :MATH]
as follows.
[MATH: Posmimj
mi>=migg∈Hj≤δ∪mjgg∈Hi≤δ :MATH]
1
where (m[i], g) or (m[j], g) denotes significant interlog. Formula
([84]1) is based on the assumption that the M. tuberculosis H37Rv gene
m[i](m[j]) functionally or structurally co-evolves with its human host
ortholog genes
[MATH: gg∈Hi≤δ :MATH]
(
[MATH: gg∈Hj≤δ :MATH]
). The interaction of m[i] with m[j] to a great extent indicates the
interaction of m[i] with m[j] ‘s ortholog gene
[MATH: gg∈Hj≤δ :MATH]
, and the interaction of m[j] with m[i] ‘s ortholog genes
[MATH: gg∈Hi≤δ :MATH]
vice versa. All the positive training instances are then merged to
generate the whole positive training set.
[MATH: Upos=∪<
mfenced close=")" open="("
separators=",">mimj
mi>∈GPosmimj
mi> :MATH]
2
From the evolutionary point of view, the less significant interlogs are
not so reliable as the significant interlogs, so that they need to be
further validated by the predictive model, which is trained on the
significant interlogs. The set of less significant interlogs is defined
as follows.
[MATH: Valpos=∪<
mfenced close=")" open="("
separators=",">mimj
mi>∈Gmigg∈Hj>δ∪mjgg∈Hi>δ :MATH]
3
where (m[i], g) or (m[j], g) denotes insignificant interlog. For each
M. tuberculosis H37Rv gene m[i], let P[i] = {g| {m[i], g} ∈ U[pos]}
denote the set of its human partner genes, M[i] = {g| {m[i], g} ∈ G}
denote the set of its M. tuberculosis H37Rv partner genes, and
[MATH: All_orthi=∪mj∈MiHj
:MATH]
denote the set of human ortholog genes of all the genes in M[i]. Then
we randomly sample the human genes that potentially do not interact
with genem[i] from the set non-ortholog genes
N[i] = {g| g ∉ All_orth[i] ^ g ∈ Homo[well]} to construct the negative
training data, where Homo[well] denotes the well-studied human genes.
To obtain well-balanced training data, we impose the constraint
∣N[i] ∣ = ∣ P[i]∣ on the sampling of negative training data. Then the
whole negative training set is defined as follows.
[MATH: Uneg=∪<
msub>mi∈Mmigg∈Ni :MATH]
4
where M denotes the set of M. tuberculosis H37Rv genes in the M.
tuberculosis H37Rv PPI networks G.
Model evaluation is a hard problem in the case of lack of experimental
data. Nevertheless, random sampling of a tiny fraction of data in the
huge space of non-interlogs is convincingly to capture non-interactions
with a large probability. Hence we obtain the negative data to estimate
the model as follows. Let
[MATH:
Ni'=gg∉All_orthi^g<
mo>∉Ni^g∈Homowell :MATH]
denote the set of human genes, from which the negative data to be
validated are defined as follows.
[MATH: Valneg=∪<
msub>mi∈Mmigg∈Ni',s.t.∣Valpos∣=∣Valneg∣ :MATH]
5
For each M. tuberculosis H37Rv gene m[i], we sample the human genes
from the set
[MATH:
Ni''=gg∉All_orthi^g<
mo>∉Ni^g∉Ni'^g<
mo>∈Homowell :MATH]
to obtain the prediction set as follows.
[MATH: Pred=∪mi∈Mmigg∈Ni'' :MATH]
6
The prediction set is further reduced for the sake of computational
complexity by imposing a constraint on the space as follows.
[MATH: Pred=∪mi∈Mmigg∈Ni'',s.
t.∣Ni'<
/mo>'∣≤300 :MATH]
7
Formula ([85]7) means that no more than 300 human proteins are randomly
sampled for each M. tuberculosis H37Rv protein. The data and analyses
are referred to the section [86]Results.
Multi-instance GO feature construction via homolog knowledge transfer
State-of-art feature construction is a critical step of machine
learning modeling in solving specific problems. Gene ontology (GO) has
been widely used as features to predict protein-protein interactions
[[87]14–[88]18, [89]35–[90]40]. In [[91]35], GO has been claimed to be
the most discriminative feature for PPI prediction [[92]35]. Gene
ontology is a hierarchically organized and controlled vocabulary that
characterizes gene products [[93]41], and the annotations of genes or
gene products are provided in terms of GO terms in GOA [[94]42].
Despite its powerful predictive capability, GO feature representation
could encounter a serious problem for those less-studied or novel
genes, because the sparsity of GO terms potentially yields null feature
vectors. In this work, homolog knowledge transfer is conducted via
independent homolog instances to solve this problem, that is, each
gene/protein is depicted with two instances, namely target instance and
homolog instance. The target instance depicts the GO knowledge of the
gene/protein itself, and the homolog instance depicts the GO knowledge
of the homologs. The homologs are extracted from SwissProt [[95]30]
using PSI-BLAST [[96]31] (E-value = 10) against all species. We treat
all types of evidence codes equally including ISS (Inferred from
Sequence or structural Similarity), IEA (Inferred from Electronic
Annotation), etc. The reason that we choose the default E-value is that
we need to capture distant homologs. Similarly, the reason that we do
not filter out those indirectly-derived or uncurated annotations is to
overcome the sparsity and enlarge the coverage of GO terms.
Undoubtedly, a certain level of noise would be introduced into the
computational framework, which will be discussed in the next
subsection. The GO terms are extracted from GOA [[97]42]. For each
protein i in the training set U, we obtain two sets of GO terms, one
set denoted as
[MATH: SHi :MATH]
contains the GO terms of the homologs, and the other set denoted as
[MATH: STi :MATH]
contains the GO terms of the protein itself. Then the whole set of GO
terms of the training set is defined as follows:
[MATH:
S=∪i∈U
mi>STi∪
mo>SHi
:MATH]
8
For each protein pair (i[1], i[2]), the target instance and the homolog
instance are formally defined as follows:
[MATH: VTi1i2
mn>g=0,g∉
STi1∧g∉STi
22,
mo>g∈STi1∧g∈STi2;
1,otherwiseVHi1i2
mn>g=0,g∉SHi1<
mo>∧g∉SHi
mi>22,g∈SHi1
∧g∈S
Hi2<
mtd>1,otherwise
:MATH]
9
For each GO term g ∈ S,
[MATH: VTi1i2
mn>g :MATH]
denotes the component g of the target instance
[MATH: VTi1i2
mn> :MATH]
and
[MATH: VHi1i2
mn>g :MATH]
denotes the component g of the homolog instance
[MATH: VHi1i2
mn> :MATH]
. Those GO terms g ∉ S are discarded. Formula ([98]9) means that if the
protein pair (i[1], i[2]) shares the same GO term g, then the
corresponding component in the feature vector
[MATH: VTi1i2
mn> :MATH]
or
[MATH: VHi1i2
mn> :MATH]
is set 2; if neither protein in the protein pair possesses the GO term
g, then the value is set 0; otherwise the value is set 1. The GO terms
of the protein pair (i[1], i[2]) that do not belong to the whole set of
GO terms of the training set, formally defined as
[MATH: gg∈S<
mi>Ti∨g∈SHi∧g∉
S :MATH]
, are ignored in the feature construction.
L[2]-regularized logistic regression for large data training and noise
tolerance
The existing interlog modeling methods [[99]22–[100]25] have
demonstrated two major disadvantages. First, they do not discriminate
less significant interlogs from significant interlogs, less significant
interlogs need to be further validated; second, they cannot detect
those interactions that exist inexplicitly in the form of interlogs. To
solve the two problems, we combine machine learning approach with
interlog modeling via two-level knowledge transfer, namely
interaction-level interlog knowledge transfer and protein−/gene-level
homolog knowledge transfer. The former level of knowledge transfer is
to derive the significant interlogs as the training data, and the
second level of knowledge transfer is to make up for the sparsity of GO
terms. The two-level knowledge transfer demands that the machine
learning methods we choose should be resistant to noise. SVM support
vector machine (SVM) [[101]43] is a theoretically established machine
learning method known for its regularization technique resistant to
noise/outliers. Unfortunately, SVM is not applicable to large training
data due to its time complexity O(n^2). Here we adopt the
l[2]-regularized logistic regression method [[102]44], implemented in
the toolbox LIBLINEAR [[103]45], to counteract the ortholog/homolog
noise and fit the large training data in linear time.
Given a set of instance-label pairs (x[i], y[i]), i = 1, 2, …, l;
x[i] ∈ R^n; y[i] ∈ {−1, +1}, linear regression attempts to derive a
decision function f(x[i]) = ω^Tx[i] + b, which is further converted to
probability via the logistic function
p(y = ± 1| ω, x[i]) = 1/1 + exp(−y[i](ω^Tx[i] + b)). The weight vector
and bias (ω, b) could be estimated by minimizing the negative
log-likelihood
[MATH:
minω,b∑i=1
llog1+e−yiωTxi
mi>+b :MATH]
. L[2]-regularized logistic regression imposes a constraint on the
l[2]-norm of the weight vector ω to solve the following unconstrained
optimization problem [[104]44].
[MATH:
minω12ωTω+C∑i=1
llog1+e−yiωTxi
mi>+b :MATH]
10
where C denotes the penalty parameter/regularizer that balances the two
terms in Formula ([105]10) to achieve good generalization ability. The
second term could penalize potential noise/outlier fitting. The
optimization of objective function ([106]10) is solved via its dual
form.
[MATH: minα12αTQα+∑i:α<
mi>i>0lαilogαi+∑i:α<
mi>i<CC−αilogC−αi−∑ilC
logCsubjectto0≤α
mi>i≤C,i=<
mn>1,…,l
:MATH]
11
where α[i] denotes Lagrangian operator and
[MATH: Qij=yi
yjxiTxj :MATH]
.
In the test and prediction phase, the decision function f(x) yields two
outputs
[MATH: fVTi1i2
mn>,fVHi1i2
mn> :MATH]
for each protein-protein pair (i[1], i[2]), which are further combined
into one final decision as follows.
[MATH: FVTi1i2
mn>VHi1i2
mn>=fVTi1i2
mn>,if∣fVTi1i2
mn>∣>∣<
mi>fVHi1i2
mn>∣fVHi1i2
mn>,otherwise
:MATH]
12
where ∣Δ∣ denotes the absolute value of Δ. The final label for protein
pair (i[1], i[2]) is defined as below.
[MATH: Li1i2
mn>=1,ifFVTi1i2
mn>VHi1i2
mn>>ζ0,otherwise
:MATH]
13
The threshold ζ is used to filter out those weak positive predictions.
Experimental setting and model evaluation
As described above, each protein pair (i[1], i[2]) is represented with
two instances, the target instance
[MATH: VTi1i2
mn> :MATH]
and the homolog instance
[MATH: VHi1i2
mn> :MATH]
, so that the proposed framework yields three outputs for decision,
i.e.
[MATH: fVTi1i2
mn> :MATH]
,
[MATH: fVHi1i2
mn> :MATH]
,
[MATH: FVTi1i2
mn>VHi1i2
mn> :MATH]
, respectively. Accordingly, we design three experimental settings,
namely combined-instance (
[MATH: FVTi1i2
mn>VHi1i2
mn> :MATH]
), homolog-instance (
[MATH: fVHi1i2
mn> :MATH]
) and target-instance (
[MATH: fVTi1i2
mn> :MATH]
), to validate the effectiveness of homolog knowledge transfer. The
combined-instance setting combines the outputs of the target instance
and the homolog instance, the homolog-instance setting uses the homolog
instance alone to evaluate the model robustness to GO term sparsity,
and the target-instance setting uses the target instance alone to yield
the baseline performance, equivalence to or excellence over which
indicates that homolog knowledge transfer is effective.
Five performance metrics, i.e. ROC-AUC (Receiver Operating
Characteristic AUC), SE (sensitivity), SP (specificity), MCC (Matthews
correlation coefficient) and F1 score, are used to evaluate the
proposed model via 5-fold cross validation. The dataset is randomly
split into five disjoint parts. For five folds, each fold treats one
part as test set and the other four parts are merged as training set.
For each test example, the true label and the predicted label are
recorded into the confusion matrix M. When the five folds complete, we
use M to calculate the performance metrics. Except ROC-AUC, all the
other metrics are derived from the confusion matrix M. From M, we
define several intermediate variables as formula ([107]14). Based on
these intermediate variables, we further define SP[l], SE[l] and MCC[l]
for each label as formula ([108]15) and the overall MCC as formula
([109]16).
[MATH: pl=Ml,l,ql=∑i=1,
mo>i≠lL∑j=1,
mo>j≠lLMi,j,rl=∑i=1,
mo>i≠lLMi,l,sl=∑j=1,
mo>j≠lLMl,j<
mtr>p=∑l=1
Lpl,q
=∑l=1
Lql,r
=∑l=1
Lrl,s
=∑l=1
Lsl :MATH]
14
[MATH: SPl=plp
l+rl,l=1,2…
mo>,LSEl=plp
l+sl,l=1,2…
mo>,LMCCl=plql
mi>−rlsl/pl+r
mi>lpl+s
mi>lql+r
mi>lql+s
mi>l,l=1,2…,L
:MATH]
15
[MATH: Acc=∑l
mi>=1LMl,l/∑
mo>i=1L∑j=1LMi,jMCC=pq−rs/p+rp+sq+rq+s
msqrt> :MATH]
16
where the element of confusion matrix M[i, j] records the counts that
class i are classified to class j, and L denotes the number of labels.
AUC is calculated based on the decision values defined in formula
([110]12). F1 score is defined as follows:
[MATH: F1score=2×SPl×SElSPl+SEl,<
mi>l=1denotesthepositiveclass :MATH]
17
Results
Quality validation on the constructed data via GO enrichment analysis
The E-value cut-off for significant interlogs is set δ = 1e ‐ 50. To
date, there is no commonly-accepted standard to choose PSI-Blast
E-value cut-off. We are inclined to choose a small E-value cut-off to
obtain quality interlogs. As results, we obtain 15,287 significant
interlogs as the positive examples (see Additional file [111]1), 15,287
randomly sampled non-interlogs as the negative examples (see
Additional file [112]2), a set containing 98,187 less significant
interlogs, a set containing 98,187 non-interlogs that are potentially
negative examples, and a prediction set containing 1359,00 protein
pairs. The sampling ratio of negative examples to positive examples is
set 1:1 for the two reasons: (1) skewed distributions between the
positive class and the negative class (e.g. ratio 1:10, 1:100, etc.)
could increase the risk of model bias; (2) there is actually no direct
mapping from the biological problem space to the computational space,
so it is improper to simulate the huge negative space by sampling a
much larger negative data set to train a predictive model from a
computational point of view.
The significant interlogs are directly viewed as MTB-human protein
interactions, so we need to assess the quality of the derived interlogs
and their potential applications. As there is no independent benchmark
measure, we analyse the quality of interlogs only from the aspects of
similar GO terms. Of course, it is the feature vector of GO terms as
Formula ([113]9) defines that determines the predictive output. In
addition, we analyse the drug resistance related interlogs to reveal
the role of the host factors in the progression of bacterial antibiotic
resistance. The less significant interlogs need to be further validated
by the proposed framework and will be analysed in the following
subsection. Antimicrobial peptides (AMPs) represent a potential
alternative to available antibiotics. Raman et al. [[114]46] exploit
the M. tuberculosis H37R PPI networks to find so-called co-target genes
whose co-inhibition with the resistance genes could effectively
blockade M. tuberculosis H37RV signaling pathways. Nevertheless,
inhibitors of pathogen-host PPI interface could be more therapeutically
specific to bacterial infection with less risk of drug resistance and
drug side-effects. Figure [115]2a illustrates the significant interlogs
related to M. tuberculosis H37Rv drug-resistant genes that get involved
in cytochromes and other target-modifying enzymes [[116]46], which
could cause potential chemical modification of drug molecules (see
Additional file [117]3 for detailed GO enrichment analysis). Figure
[118]2b illustrates the significant interlogs related to M.
tuberculosis H37Rv drug-resistant genes that get involved in
SOS-response and DNA replication [[119]46], which could lead to
mutations in the gene or its regulatory region (see
Additional file [120]4 for detailed GO enrichment analysis).
Fig. 2.
[121]Fig. 2
[122]Open in a new tab
The derived interlogs between M. tuberculosis H37Rv drug-resistant
genes and human genes. a illustrates the interactions of M.
tuberculosis H37Rv drug-resistant genes involved in cytochromes and
other target-modifying enzymes that could cause potential chemical
modification of drug molecules [[123]46]. b illustrates the
interactions of M. tuberculosis H37Rv drug-resistant genes involved in
SOS-response and DNA replication that lead to mutations in the gene or
its regulatory region. The light blue circles denote M. tuberculosis
H37Rv genes and the red diamonds denote human genes [[124]46]
Interlog {Rv2193, MT-CO1}
The drug-resistant gene Rv2193 is involved in cytochromes and other
target-modifying enzymes that could cause potential chemical
modification of drug molecules [[125]46]. The interlog {Rv2193|I6Y8N5,
MT-CO1|P003951} is inferred from the known M. tuberculosis H37Rv
protein interaction {Rv2193|I6Y8N5, Rv3043c|I6YAZ7} [[126]7], where the
human protein P003951 (MT-CO1) is orthologous to the M. tuberculosis
H37Rv protein I6YAZ7 (Rv3043c) with E-value equal to 4e-094. From GO
enrichment analysis as partially provided in Table [127]1, the two
genes {Rv2193, MT-CO1} both get involved in the common biological
processes of oxidation-reduction process (GO:0055114) and hydrogen ion
transmembrane transport (GO:1902600). In addition, the two genes are
also both involved in aerobic cellular respiration, e.g. Rv2193
respiratory electron transport chain (GO:0022904) and MT-CO1 aerobic
respiration (GO:0009060). Besides, the two orthologous proteins
{Rv3043c|I6YAZ7, MT-CO1|P003951} are also involved in other highly
similar biological processes (see Additional file [128]3).
Table 1.
GO enrichment analysis of the significant interlog {Rv2193, MT-CO1}
LH57_11955 (Rv2193) is classified into the drug resistance type of
cytochromes and other target-modifying enzymes that could cause
potential chemical modification of drug molecules [[129]46]
GO term ID GO aspect GO term name
Common GO terms GO:0016020 C membrane
GO:0016021 C integral to membrane
GO:0016491 F oxidoreductase activity
GO:0055114 P oxidation-reduction process
GO:1902600 P hydrogen ion transmembrane transport
H57_11955|Rv2193 only GO:0022904 P respiratory electron transport chain
GO:0019646 P aerobic electron transport chain
GO:0015002 F heme-copper terminal oxidase activity
MT-CO1 only GO:0070469 C respiratory chain
GO:0045277 C respiratory chain complex IV
GO:0005751 C mitochondrial respiratory chain complex IV
GO:0006979 P response to oxidative stress
GO:0009060 P aerobic respiration
GO:0046688 P response to copper ion
GO:0051602 P response to electrical stimulus
GO:0020037 F heme binding
[130]Open in a new tab
C denotes cellular component, F denotes molecular function, and P
denotes biological process
Interlog {Rv2737c, ERCC6}
The drug-resistant gene Rv2737c is involved in SOS-response and DNA
replication that lead to mutations in the gene or its regulatory
region. The interlog {Rv2737c|I6YE90, ERCC6|[131]Q03468} is inferred
from the known M. tuberculosis H37Rv protein interaction
{Rv2737c|I6YE90, helZ|Rv2101|I6YCF3} [[132]7], where the human protein
[133]Q03468 (ERCC6) is orthologous to the M. tuberculosis H37Rv protein
I6YCF3 (Rv2101, gene name helZ) with E-value equal to 5e-058. The GO
enrichment analysis of the two genes {Rv2737c, ERCC6} is partially
provided in Table [134]2. We can see that these two genes get involved
in the common biological processes of DNA repair (GO:0006281) and
response to DNA damage stimulus (GO:0006974). Besides, the gene Rv2737c
protects microbial DNA from antibiotics (GO:0046677, response to
antibiotic), DNA damage (GO:0009432, SOS response), ultraviolet
radiation (GO:0009650, UV protection), ionizing radiation (GO:0010212,
response to ionizing radiation), etc. Accordingly, the human gene ERCC6
gets involved in response to gamma radiation (GO:0010332), response to
UV (GO:0009411), response to oxidative stress (GO:0006979), DNA damage
response, signal transduction resulting in induction of apoptosis
(GO:0008630), base-excision repair (GO:0006284), etc. The similar
biological processes suggest that the two genes {Rv2737c, ERCC6}
potentially interact.
Table 2.
GO enrichment analysis of the derived interlog {Rv2737c, ERCC6}. recA
(Rv2737c) is classified into the drug resistance type of SOS-response
and DNA replication that leads to mutations in the gene or its
regulatory region [[135]46]
GO term ID GO aspect GO term name
Common GO terms GO:0005515 F protein binding
GO:0006281 P DNA repair
GO:0006974 P response to DNA damage stimulus
GO:0016787 F hydrolase activity
GO:0008094 F DNA-dependent ATPase activity
GO:0003677 F DNA binding
recA|Rv2737c only GO:0046677 P response to antibiotic
GO:0009432 P SOS response
GO:0009650 P UV protection
GO:0010212 P response to ionizing radiation
GO:0006310 P DNA recombination
GO:0006259 P DNA metabolic process
GO:0000725 P recombinational repair
ERCC6 only GO:0000303 P response to superoxide
GO:0006283 P transcription-coupled nucleotide-excision repair
GO:0006284 P base-excision repair
GO:0006979 P response to oxidative stress
GO:0007256 P activation of JNKK activity
GO:0008630 P DNA damage response, signal transduction resulting in
induction of apoptosis
GO:0010332 P response to gamma radiation
GO:0009411 P response to UV
[136]Open in a new tab
C denotes cellular component, F denotes molecular function, and P
denotes biological process
Performance of 5-fold cross validation
The ROC curves of 5-fold cross validation are illustrated in Fig.
[137]3 and the detailed performance metrics are provided in
Table [138]3. The proposed method achieves fairly good performance in
terms of all the performance measures, and the performance varies very
little between the three experimental settings, which indicate that the
homolog instances are effective to substitute the target instances when
the GO knowledge of the genes concerned is not available.
Fig. 3.
Fig. 3
[139]Open in a new tab
ROC curves for 5-fold cross validation performance evaluation on the
artificially created significant interlogs between M. tuberculosis
H37Rv and H. sapiens
Table 3.
Performance estimation of 5-fold cross validation and performance
comparison with the existing methods
Size Combined-instance Homolog-instance Target-instance
SP SE MCC SP SE MCC SP SE MCC
Positive 15,287 0.9823 0.9966 0.9790 0.9684 0.9915 0.9601 0.9820 0.9976
0.9796
Negative 15,287 0.9965 0.9821 0.9790 0.9912 0.9676 0.9601 0.9975 0.9817
0.9796
[Acc; MCC] [98.93%; 0.9789] [97.95%; 0.9599] [97.95%; 0.9599]
[ROC-AUC] [0.9933] [0.9912] [0.9978]
F1 Score 0.9894 0.9798 0.9897
KMM-SVM [[140]23] SP SE F1 score
① human- > mouse 0.517 0.937 0.667
② E.coli- > human 0.257 0.161 0.199
[141]Open in a new tab
① denotes the work [[142]23] that transfers the knowledge of
Salmonella-human PPI networks to predict Salmonella-mouse protein
interactions;② denotes the work [[143]23] that transfers the knowledge
of Salmonella-Ecoli PPI networks to predict Salmonella-human protein
interactions
Furthermore, the proposed method achieves quite well-balanced
performance on the two classes, indicating that the positive class of
significant interlogs and the negative class of non-interlogs are well
separated. In the section Analysis of the constructed data, we have
analysed the quality of the derived significant interlogs via GO
enrichment analysis. To well interpret the good two-class separability,
we conduct further GO enrichment analysis on the positive and the
negative training data.
As illustrated in Fig. [144]4a, the protein pairs in the positive
training data (i.e. significant interlogs) show more significant common
patterns of subcellular localization, molecular functionality and
biological processes than those in the negative training data (i.e.
non-interlogs). Such a wide difference of common patterns between the
positive data and the negative data presumably contribute much to the
two-class separability, which then results in the good performance of
5-fold cross validation. The results as illustrated in Fig. [145]4a are
consistent with the observations that two proteins that interact are
more likely to reside in the same cellular compartments, fulfil similar
molecular functions and participate in similar biological processes. As
mentioned in the section [146]Background, the complex bacterial cell
wall that forms a strong permeability barrier to the mutual access of
the bacterial genome and the host genome [[147]19], the two partners of
significant interlogs may merely functionally interact if no transport
or secretion helps the two partners physically contact.
Fig. 4.
[148]Fig. 4
[149]Open in a new tab
Percentage of pathogen-host protein pairs in the training data whose
partners share common GO terms
Quality validation on the predicted interactions from less significant
interlogs and non-interlogs
At present there is no experimental data available as validation set to
evaluate the proposed model. Nevertheless, the less significant
interlogs would be more likely to be interacting partners than
non-interlogs. For the reasons, we explicitly study the potential
interactions from the less significant interlogs and the non-interlogs,
partly to evaluate the proposed model as well. The predicted
interactions from the less significant interlogs and the non-interlogs
are provided in Additional files [150]5 and [151]6, respectively. As
shown in Table [152]4, the predicted positive rates on the less
significant interlogs and non-interlogs are 18.78 and 1.41%,
respectively. We can see that the less significant interlogs are more
likely to interact than the non-interlogs. The result can be well
interpreted by the wide difference of patterns of common GO terms
between the less significant interlogs and the non-interlogs as
illustrated in Fig. [153]4b. Comparing Fig. [154]4a with Fig. [155]4b,
we see that the interlogs show much are similar distributions of GO
terms than non-interlogs. The low positive rate 18.78% indicates that
the less significant interlogs should not be equally treated as the
significant interlogs as the existing work does [[156]22–[157]25], and
need to be further validated by a machine learning framework. The low
positive rate 1.41% shows that the sampling method of negative data as
described in Formula ([158]4) is rational.
Table 4.
Predicted positive rates on less significant interlogs, non-interlogs
and the prediction set
Less significant interlogs Non-interlogs Prediction set
Size 98,187 98,187 407,700
Predicted positive rate 18.78% 1.41% 1.96%
[159]Open in a new tab
As reviewed in [[160]19], host factors play important roles in the
progression of bacterial drug resistance. Hence the interactions
between M. tuberculosis H37Rv drug-resistant genes and human host genes
are of special interest to us. Moreover, inhibitors of pathogen-host
PPI interface would be more therapeutically with less side-effect on
other human genes and pathways. The predicted interactions from less
significant interlogs related to drug resistance are illustrated in
Fig. [161]5. Figure [162]5a illustrates the interlogs that get involved
in antibiotic efflux pumps [[163]46] (see Additional file [164]7 for
detailed GO enrichment analysis), and Fig. [165]5b illustrates the
interlogs that get involved in target-modifying enzymes [[166]46] (see
Additional file [167]8 for detailed GO enrichment analysis).
Fig. 5.
[168]Fig. 5
[169]Open in a new tab
Validated less significant interlogs. a illustrates the interactions of
M. tuberculosis H37Rv drug-resistant genes that are involved in
antibiotic efflux pumps [[170]46]. b illustrates the interactions of M.
tuberculosis H37Rv drug-resistant genes that are involved in
target-modifying enzymes [[171]46]
Interlog {Rv0849, ABCB1}
The interlog {Rv0849|I6X9Y5, ABCB1|[172]P08183} is derived from the
known interaction {Rv0849|I6X9Y5, Rv1348|I6YAB3} [[173]7], wherein the
human protein P08183is orthologous to the M. tuberculosis H37Rv protein
I6X9Y5 with E-value equal to 6e-043. The interlog {Rv0849|I6X9Y5,
ABCB1|[174]P08183} is predicted to be a pathogen-host protein
interaction with probability 0.9987 (see Additional file [175]7). GO
enrichment analysis shows that the two genes {Rv0849, ABCB1} both are
located at membrane (GO:0005886, plasma membrane; GO:0016021, integral
to membrane) and participate the biological process of transport
(GO:0055085, transmembrane transport; GO:0006810, transport) (see
Table [176]5). In addition, the human gene ABCB1 also gets involved in
the biological processes of drug transmembrane transport (GO:0006855)
and xenobiotic transport (GO:0042908). The GO terms (GO:0009986, cell
surface; GO:0070062, extracellular vesicular exosome) indicate that the
human protein [177]P08183 could have physical contact with the M.
tuberculosis H37Rv membrane protein I6X9Y5 to induce immune response
(GO:0002485, antigen processing and presentation of endogenous peptide
antigen via MHC class I via ER pathway, TAP-dependent).
Table 5.
Gene ontology analysis of the predicted interactions {Rv0849, ABCB1}
and {Rv1988, PNP}. Rv0849 is classified into the drug resistance type
of antibiotic efflux pumps [[178]46]. Rv1988 is classified into the
drug resistance type of target-modifying enzymes [[179]46]
{Rv0849, ABCB1} GO term ID GO aspect GO term name
Common GO terms GO:0055085 P transmembrane transport
GO:0006810 P transport
GO:0005886 C plasma membrane
GO:0016021 C integral to membrane
ABCB1 only GO:0005215 F transporter activity
GO:0006855 P drug transmembrane transport
GO:0042908 P xenobiotic transport
GO:0009986 C cell surface
GO:0070062 C extracellular vesicular exosome
GO:0002485 P antigen processing and presentation of endogenous peptide
antigen via MHC class I via ER pathway, TAP-dependent
{Rv1988, PNP} GO term ID GO aspect GO term name
Common GO terms GO:0005737 C cytoplasm
GO:0016740 F transferase activity
Rv1988 only GO:0000154 P rRNA modification
GO:0008649 F rRNA methyltransferase activity
GO:0031167 P rRNA methylation
GO:0046677 P response to antibiotic
PNP only GO:0006139 P nucleobase-containing compound metabolic process
GO:0006148 P inosine catabolic process
GO:0006195 P purine nucleotide catabolic process
O:0006738 P nicotinamide riboside catabolic process
GO:0006955 P immune response
GO:0042493 P response to drug
GO:0070970 P interleukin-2 secretion
GO:0034356 P NAD biosynthesis via nicotinamide riboside salvage pathway
[180]Open in a new tab
C denotes cellular component, F denotes molecular function, and P
denotes biological process
Interlog {Rv1988, PNP}
The interlog {Rv1988|[181]Q10838, PNP|[182]P00491} is derived from the
known interaction {Rv1988|[183]Q10838, Rv0535|I6Y409} [[184]7]. The
human protein [185]P00491 is orthologous to the M. tuberculosis H37Rv
protein I6Y409 with E-value equal to 6e-034. The M. tuberculosis H37Rv
gene Rv1988 is classified into the drug resistance type of
target-modifying enzymes [[186]46]. As shown in Table [187]5, the M.
tuberculosis H37Rv Rv1988 gets involved in the biological processes of
rRNA modification (GO:0000154), rRNA methylation (GO:0031167) and
response to antibiotic (GO:0046677), while the human gene PNP also gets
involved in the biological processes of protein modifications, e.g. the
catabolic processes of nucleobase-containing compound (GO:0006139),
inosine (GO:0006148), purine nucleotide (GO:0006195), etc. In addition,
PNP is involved in the biological processes of immune response
(GO:0006955, GO:0070970) and response to drug (GO:0042493).
Predicted interactions on the prediction set
The prediction set that contains 407,700 MTB-human protein pairs is
derived from the huge space of non-interlogs according to Formula
([188]7). As shown in Tables [189]1, [190]4.96% of protein pairs are
predicted to be pathogen-host PPIs. Such a low positive rate is
presumably rational with a low risk of false positive predictions. The
predicted interactions on the prediction set are provided in
Additional file [191]9. For the convenience of analysis, we merge the
significant interlogs together with the predicted interactions from the
less significant interlogs, non-interlogs and the prediction set into
Additional file [192]10. We totally obtain 43,116 predicted protein
interactions between M. tuberculosis H37Rv and Homo sapiens, which is
still incomplete since the prediction set is only a small part of the
prediction space.
Taking advantage of the predicted MTB-human PPI networks, we need to
address two concerns (1) how many human genes a M. tuberculosis H37Rv
gene is likely to target; (2) what roles the targeted human genes play
in the human PPI networks. The two concerns are actually about two
kinds of degree distributions (1) the degree distribution of the M.
tuberculosis H37Rv genes in the MTB-human PPI networks (see Fig. [193]6
(left)); (2) the degree distribution of the human genes in human PPI
networks (see Fig. [194]6 (right)). We can see that the two degrees
show a tendency of power-law distribution. As shown in Fig. [195]6
(left), only a small portion of M. tuberculosis H37Rv genes are densely
connected by human genes, indicating that only a small number of M.
tuberculosis H37Rv genes intensively target dozens to several hundred
of human genes. For instance, M. tuberculosis H37Rv gene Rv0440 (groEL)
and Rv1436 (LH57_07850) interact with 209 and 206 human genes,
respectively. As shown in Fig. [196]6 (right), only a small of targeted
human genes are highly-connected hub genes and the long tail indicate
that many targeted human genes are orphan genes in human PPI networks.
It could be concluded that only a small number of M. tuberculosis H37Rv
genes target a small number of human hub genes. The human PPI networks
are constructed from HPRD [[197]47] and BioGRID [[198]48]. To further
reveal the signaling cross-talks between M. tuberculosis H37Rv and Homo
sapiens, we will discuss the patterns of M. tuberculosis H37Rv genes
manipulating the human immune signaling pathways in the section
[199]Discussions.
Fig. 6.
[200]Fig. 6
[201]Open in a new tab
The degree distribution of the M. tuberculosis H37Rv genes in the
derived MTB-human PPI networks (left); the degree distribution of M.
tuberculosis H37Rv targeted human genes in human physical PPI networks
(right)
Discussions
To date, there are very few experimental studies on protein
interactions between bacterial pathogens and their hosts. The latest
database STRING has curated 1678 bacterial pathogen protein-protein
interaction networks, but no work has been reported to exploit these
networks to predict pathogen-host protein interactions thus far.
Pathogen-host protein interactions play a critical role of signaling
cross-talks between pathogen PPI networks and host PPI networks, which
is of significance to understand the underlying mechanism of bacterial
invasive infection and host immune response.
Mycobacterium tuberculosis is an obligate pathogenic bacterial species
in the family of Mycobacteriaceae and the causative agent of
tuberculosis. The physiology of M. tuberculosis is highly aerobic and
requires high levels of oxygen. As primarily a pathogen of the
mammalian respiratory system, M. tuberculosis mainly infects the lungs
as well as other tissues. M. tuberculosis H37Rv has received much
attention in recent years partly due to its co-infection with HIV and
increasingly serious drug resistance. To date, the cross-talks or
interactions between M. tuberculosis and H. sapiens proteins are much
less understood than the individual genome of M. tuberculosis. To the
best of our knowledge, there is no experimental study on protein
interactions between M. tuberculosis H37Rv and Homo sapiens.
Methodology comparison with the related computational methods
The related computational methods generally infer interlogs via one or
more third-party species. For instances, Kshirsagar et al. [[202]23]
use the known Salmonella-human PPIs as templates to infer
Salmonella-plant PPIs via plant-human ortholog mapping (see Fig.
[203]7a). Zhou et al. [[204]24, [205]25] use the known
prokaryote-eukaryote PPIs as templates to infer M. tuberculosis-H.
sapiens PPIs via the ortholog mappings of prokaryote-M. tuberculosis
and eukaryote-H. sapiens (see Fig. [206]7b). However, the large gap
between the source species (e.g. plant) and the target species (e.g.
human) is to a large extent prone to yield false pathogen-host protein
interactions. In addition, the interlog-only methods [[207]24, [208]25]
neither validate the less significant interlogs nor train a predictive
model to predict the non-interlogs that also potentially interact.
Similar to the methods that combine interlog with machine learning
approach [[209]9, [210]23], we also use the derived interlogs as
training data since there are no experimental data available, but
differently we confine the search of M. tuberculosis H37Rv ortholog
genes within the human host without resorting to a third-party species,
meanwhile we do not need prior pathogen-host PPIs of other species as
templates (see Fig. [211]7c). As illustrated in Fig. [212]7c, the
ortholog genes of the two interacting M. tuberculosis H37Rv genes (A,
B) are searched within the human genome space, presumably A’ and B′,
respectively, then it is assumed that A interacts with B′ and B
interacts with A’. The assumption is based on the accumulated evidences
that the different strains of the obligate human pathogen M.
tuberculosis have co-evolved, migrated, and expanded with their human
hosts [[213]27]. Knowledge transfer between co-evolving species is more
credible than that between evolutionarily distant species.
Fig. 7.
[214]Fig. 7
[215]Open in a new tab
Illustration of the way of deriving interlogs. The ellipses in red and
blue denote the target species and the ellipses in other colors denote
the source species. The circles denote genes. The red full line denotes
the experimental protein-protein interactions. The red dotted line
denotes ortholog mapping. The blue full line with arrows at two ends
denotes the derived interlogs. The methods illustrated in (a) and (b)
exploited the pathogen-host PPIs of other species to derive the
interlogs of the target species, while the method illustrated in (c)
transferred the knowledge of intra-species M. tuberculosis H37Rv PPI
networks to predict protein interactions between M. tuberculosis H37Rv
and H. sapiens
It is noted that the quality of M. tuberculosis H37Rv protein
interaction networks in STRING [[216]9] directly affects the quality of
inferred interlogs between M. tuberculosis H37Rv and Homo sapiens.
Among the experimental data in STRING [[217]9], only 32 MTB PPIs are
actually derived by experiments. Obviously, such a small data size
cannot satisfy our needs, so we resort to the other experimental data
in STRING [[218]9] that are actually interlogs inferred from other
experimentally-verified PPIs. Yu et al. [[219]49] have testified the
feasibility of interolog mapping, i.e. the transfer of interaction
annotation from one organism to another using comparative genomics.
Performance comparison with the related computational methods
The interlog-only methods [[220]24, [221]25] do not provide baseline
performance for comparison. The method that combines interlog with
machine learning approach [[222]23] derives Salmonella-plant interlogs
from the known plant-human PPIs as the training data to train a KMM-SVM
model for novel Salmonella-plant PPI predictions. As shown in Table
[223]3, the knowledge of Salmonella-human PPI networks is transferred
in the first experiment to predict Salmonella-mouse protein
interactions, achieving SE 0.937 and SP 0.517; and the knowledge of
Salmonella-Ecoli PPI networks is transferred in the second experiment
to predict Salmonella-human protein interactions, achieving SE 0.257
and SP 0.161. The results in the second experiment are obviously much
poorer in that the gap of species between E.coli and human is much
larger than that between mice and human. Even in the first experiment,
the method [[224]23] neither achieves satisfactory performance partly
due to the other two reasons: (1) the less significant interlogs are
not explicitly excluded out of the positive training data; (2) the
two-class skew distribution also contributes to the low performance
(SP = 0.517). In this work, the knowledge is only transferred across
co-evolving pathogen and host, so that the proposed method achieves
much better performance (see Table [225]3). Nevertheless, the proposed
method also yields a certain level of bias and performance
overestimation for the two reasons (1) similar interlogs in the
training data could overestimate the performance of 5-fold cross
validation performance, though they do not affect the final trained
model and predictions; (2) the positive training data do not contain
non-interlogs that also potentially interact because there are no such
experimental data, so that the two classes of training data are easily
separated. How to choose the representative interlogs to more
objectively evaluate the proposed model is worth further consideration
in the future work.
GO enrichment analysis of the targeted human genes
Figure [226]8 illustrates the top 20 GO terms of human genes that are
predicted to be targeted by M. tuberculosis H37Rv genes. As shown in
Fig. [227]8 (left), M. tuberculosis H37Rv genes are inclined to target
those human genes located in the cellular compartments of membrane
(GO:0016020), integral to membrane (GO:0016021), cytoplasm
(GO:0005737), nucleus (GO:0005634), mitochondrion (GO:0005739), etc. As
shown in Fig. [228]8 (middle), the targeted human genes fulfil the
molecular functions of protein binding (GO:0005515), hydrolase activity
(GO:0016787), ATP binding (GO:0005524), metal ion binding (GO:0046872),
oxidoreductase activity (GO:0016491), etc. As shown in Fig. [229]8
(right), the targeted human genes get involved in the biological
processes of transport (GO:0006810), oxidation-reduction process
(GO:0055114), metabolic process (GO:0008152), ion transport
(GO:0006811), proteolysis (GO:0006508), regulation of transcription,
DNA-dependent (GO:0006355), etc.
Fig. 8.
[230]Fig. 8
[231]Open in a new tab
GO enrichment analysis of the human genes that are predicted to be
targeted by M. tuberculosis H37Rv drug-resistant genes
Pathway enrichment analysis of the targeted human genes
Bacterial invasion could induce host inflammatory response, for
instances, TNF-α is thought to play a role in the activation of resting
macrophages and inhibition of bacterial dissemination, and IL-10 might
play a role in controlling the trade-off between the anti-microbial
activity and host-derived tissue caseation [[232]50, [233]51]. We map
the targeted human genes onto the known human immune signaling pathways
curated in NetPath [[234]52] to study how M. tuberculosis genes
interact with human defence system. For simplicity, the pathways
IL1~IL11 in NetPath are merged into one IL signaling pathway, thus we
totally obtain 27 human immune signaling pathways. The predicted
MTB-human PPIs related to human immune signaling pathways are provided
in Additional file [235]11. As shown in Fig. [236]9, M. tuberculosis
H37Rv genes are inclined to target the human immune signaling pathways
of AR (Androgen receptor), TNF-alpha (Tumor necrosis factor alpha),
TGF-beta (Transforming growth factor beta receptor), IL (Interleukin),
BCR (B cell receptor), TSH (Thymic stromal lymphopoietin),, etc. In
most cases, many M. tuberculosis H37Rv genes are predicted to target
more than one human immune signaling pathways (see
Additional file [237]12), for instances, M. tuberculosis H37Rv gene
recA (Rv2737c) is predicted to target five signaling pathways
(TNF-alpha;IL2;BCR;AR;TSH). Partial GO enrichment analysis of the
targeted human genes on TNF-alpha and IL signaling pathways are given
in Table [238]6.
Fig. 9.
[239]Fig. 9
[240]Open in a new tab
Statistics of MTB-human PPIs and MTB genes that are predicted to target
specific human immune signaling pathways
Table 6.
GO and pathway enrichment analysis of the M. tuberculosis H37Rv genes
that are predicted to target human TNF-alpha and IL-1~IL11 signaling
pathways
TNF-Alpha GO term ID GO aspect GO term name % Shared
MTB genes that target human gene PSMC2 GO:0005524 F ATP binding 26.92
Yes
GO:0016020 C membrane 20.51 Yes
GO:0055114 P oxidation-reduction process 16.67 No
GO:0016310 P phosphorylation 5.98 No
GO:0006412 P translation 5.13 No
GO:0055085 P transmembrane transport 2.14 Yes
GO:0046933 F hydrogen ion transporting ATP synthase activity,
rotational mechanism 1.71 No
Human gene PSMC2 GO:0002479 P antigen processing and presentation of
exogenous peptide antigen via MHC class I, TAP-dependent – –
GO:0043687 P post-translational protein modification – –
GO:0045899 P positive regulation of RNA polymerase II transcriptional
preinitiation complex assembly – –
Targeted human genes
CASP10,BID,RUVBL2,DDX21,RPL4,BRINP1,PSMC2,PSMD1,FANCD2,MTIF2,PSMD2,PSMB
5,RPS11,PSMC3,GLUL,PDCD2,KTN1
IL-1 ~ IL-11 GO term ID GO aspect GO term name % Shared
M.TB genes that target the human gene VCP GO:0003824 F catalytic
activity 32.17 No
GO:0016021 C integral to membrane 16.78 No
GO:0005524 F ATP binding 27.27 Yes
GO:0055114 P oxidation-reduction process 9.79 No
GO:0006810 P transport 3.50 Yes
GO:0006281 P DNA repair 2.10 No
Human gene VCP GO:0005515 F protein binding – –
GO:0006914 P autophagy – –
GO:0016236 P macroautophagy – –
GO:0010918 P positive regulation of mitochondrial membrane potential –
–
GO:0006974 P cellular response to DNA damage stimulus – –
[241]Open in a new tab
Targeted human genes
BCL2L11,UNC119,IRS1,IL2,DOK2,PTPN6,BAD,IL11,VCP,IRS2
C denotes cellular component, F denotes molecular function, and P
denotes biological process
TNF-alpha signaling pathway
The tumor necrosis factor alpha (TNF-alpha) is a pro-inflammatory
cytokine that belongs to the TNF superfamily [[242]48]. M. tuberculosis
H37Rv is predicted to invade human TNF-alpha signaling pathway through
443 MTB-human PPIs and the targeted human genes {CASP10, BID, RUVBL2,
DDX21, RPL4, BRINP1, PSMC2, PSMD1, FANCD2, MTIF2, PSMD2, PSMB5, RPS11,
PSMC3, GLUL, PDCD2, KTN1} (see Additional file [243]11). Taking the
targeted human gene PSMC2 for example (see Table [244]6), among the M.
tuberculosis H37Rv genes that target the human gene PSMC2, 20.51% of
proteins are located at membrane (GO:0016020); 16.67% of genes are
involved in the biological process of oxidation-reduction process
(GO:0055114); 2.14% of genes are involved in the biological process of
transmembrane transport (GO:0055085). Besides the GO terms marked with
“shared” in Table [245]6, the human gene PSMC2 also gets involved in
translation (GO:0046933;hydrogen ion transporting ATP synthase
activity, rotational mechanism) and post-translational protein
modification (GO:0043687). Especially, the targeted human gene PSMC2
plays an important role in the host immune response (GO:0002479;
antigen processing and presentation of exogenous peptide antigen via
MHC class I, TAP-dependent).
IL1~IL11 signaling pathway
Interleukins are a group of cytokines (secreted proteins and signal
molecules) that were first seen to be expressed by white blood cells
(leukocytes). The function of immune system depends in a large part on
interleukins that promote the development and differentiation of T and
B lymphocytes, and hematopoietic cells [[246]53]. M. tuberculosis H37Rv
is predicted to interact with human IL-1~ 11 signaling pathways through
157 MTB-human PPIs and the targeted human genes {BCL2L11, UNC119, IRS1,
IL2, DOK2, PTPN6, BAD, IL11, VCP, IRS2} (see Additional file [247]11).
Taking the targeted human gene VCP for example (see Table [248]6),
among the M. tuberculosis H37Rv genes that target human gene VCP,
16.78% of genes are located at integral to membrane (GO:0016021);
27.27% of genes fulfil the molecular function of ATP binding
(GO:0005524); 9.79% of genes are involved in oxidation-reduction
process (GO:0055114); 3.50% of genes are involved in transport
(GO:0006810); 9.79% of genes are involved in DNA repair (GO:0006281),
etc. Besides ATP binding (GO:0005524) and transport (GO:0006810), the
targeted human gene VCP gets involved in autophagy (GO:0006914),
macroautophagy (GO:0016236) and cellular response to DNA damage
stimulus (GO:0006974), etc.
Conclusions
In this work, we provide a general computational framework to exploit
the knowledge of the pathogen protein interaction networks in the
database STRING for the rapid reconstruction of pathogen-host protein
interaction networks. We take full advantage of the co-evolution
relationship between M. tuberculosis H37Rv and H. sapiens to derive
significant interlogs, which are used as the training data to build a
predictive model. The knowledge transfer model effectively solves the
problem that no experimental bacteria-host protein interactions are
available as training data. The predicted protein interactions provided
in the Additional files promise to gain applications in the two fields
(1) providing an alternative solution to drug resistance; (2) revealing
the patterns that M. tuberculosis H37Rv genes target human immune
signaling pathways.
Additional files
[249]Additional file 1:^ (1.2MB, txt)
Text file contains the positive training data consisting of the derived
significant interlogs. (TXT 1201 kb)
[250]Additional file 2:^ (769.5KB, txt)
Text file contains the negative training data consisting of randomly
sampled non-interlogs. (TXT 769 kb)
[251]Additional file 3:^ (1.7MB, txt)
Text file contains the gene ontology analysis of the derived interlogs
that get involved in drug resistance of cytochromes and other
target-modifying enzymes. (TXT 1703 kb)
[252]Additional file 4:^ (1.1MB, txt)
Text file contains the gene ontology analysis of the derived interlogs
that get involved in drug resistance of SOS-response and DNA
replication. (TXT 1157 kb)
[253]Additional file 5:^ (1MB, txt)
Text file contains the predicted results on the positive independent
test set consisting of less significant interlogs. (TXT 1055 kb)
[254]Additional file 6:^ (80.4KB, txt)
Text file contains the predicted results on the negative independent
test set consisting of randomly sampled non-interlogs. (TXT 80 kb)
[255]Additional file 7:^ (234.8KB, txt)
Text file contains the gene ontology analysis of the validated less
significant interlogs that get involved in drug resistance of
antibiotic efflux pumps. (TXT 234 kb)
[256]Additional file 8:^ (21KB, txt)
Text file contains the gene ontology analysis of the validated less
significant interlogs that get involved in drug resistance of
target-modifying enzymes. (TXT 21 kb)
[257]Additional file 9:^ (552.1KB, txt)
Text file contains the predicted results on the prediction set
consisting of randomly sampled non-interlogs. (TXT 552 kb)
[258]Additional file 10:^ (2.8MB, txt)
Text file contains the summary of the derived or predicted M.TB-human
PPIs. (TXT 2888 kb)
[259]Additional file 11:^ (106KB, txt)
Text file contains the human cancer/immune signaling pathways that M.
tuberculosis H37Rv genes are predicted to target. (TXT 106 kb)
[260]Additional file 12:^ (21.3KB, txt)
Text file contains the M. tuberculosis H37Rv genes that target human
cancer/immune signaling pathways. (TXT 21 kb)
Acknowledgments