Highlights
* CIPHEN is an HN-based model to screen CPIs across the entire human
protein space
* CIPHEN is comparable to other methods and capable of unveiling
unrecorded CPIs
* CIPHEN highlights the mechanisms of NPs and helps in drug discovery
and development
__________________________________________________________________
Oncology; Machine learning
Introduction
Liver cancer is a global health challenge with an estimated incidence
of >1 million cases by 2025.[28]^1 Hepatocellular carcinoma (HCC) is
the most prevalent subtype of liver cancer and is caused by a variety of
factors, including genetic, environmental, and behavioral
factors.[29]^2 Multi-kinase and immune checkpoint inhibitors have
improved survival in patients with HCC, but only a subset of patients
respond clinically to these therapeutic agents. Natural
products (NPs) are one of the important resources for new drug
development in cancer and have attracted the attention of
pharmaceutical and pharmacological researchers due to their diverse
structures and significant anti-HCC activities.[30]^3^,[31]^4 Icaritin,
a multi-target immunomodulatory small molecule derived from plants of
Epimedium brevicornum, was approved for the treatment of HCC in
2022.[32]^5
The major difficulty in NP-based drug development is determining which
specific genes or proteins, and which corresponding signaling pathways,
are affected by a given compound. Usually, this procedure relies on a
large number of pharmacology, toxicology, and molecular biology
experiments, which are costly and
time-consuming.[33]^6^,[34]^7^,[35]^8 Based on multi-omics data,
computational methods can perform high-throughput screening of
compound-protein interactions (CPIs) more efficiently and at lower
cost.[36]^9^,[37]^10
Current computational approaches for CPI prediction can be
generally divided into two categories: guilt-by-association (GA)-based
and network-based methods. GA-based methods are built on the
assumption that similar compounds interact with the same proteins
or functionally related proteins and exert similar modes of action. In
most cases, “similar” means having close three-dimensional chemical
structures.[38]^11^,[39]^12^,[40]^13 CPIs were also predicted based on
the assumption that compounds with similar side effects would have the
same target proteins.[41]^14 Besides the chemical structures of
compounds, the protein sequences and other properties were also
integrated into the computational model to improve the
accuracy.[42]^15^,[43]^16^,[44]^17^,[45]^18 Network-based methods
incorporate multi-layer interaction networks, including compound-drug
interactions, drug-target interactions, and protein-protein interaction
(PPI) networks, to identify unreported
CPIs,[46]^19^,[47]^20^,[48]^21^,[49]^22 and are constructed on the
basis of network embedding, neural factorization machines, and other
graph models. Both GA-based and network-based methods can predict
target proteins that have known ligands, but cannot predict
proteins whose ligands are unknown. NPs have complicated
structures and may act through diverse molecular mechanisms by
interacting with unique proteins. Thus, methods that can predict
targets for NPs are highly desirable.
A heterogeneous network (HN) represents node and link information
through multiple types of nodes and edges.[50]^23^,[51]^24^,[52]^25
Compared with a homogeneous network, which contains only one type of
node and edge, an HN can describe the structure and function of
complicated real-world systems in a more accurate and
comprehensive way, accommodates the heterogeneity of
biological data, and offers good model interpretability. HNs have
been widely used to elaborate drug modes of action. For instance,
Chen et al. learned the drug-target-protein-disease relationship
through HN to reveal the repositioning of drugs[53]^26; Wei et al.
applied HN to infer drug-disease associations through integrating
multi-scale biomedical data resources[54]^27; An et al.[55]^28 proposed
an HN learning algorithm, Network EmbeDding framework in mulTiPlex
networks (NEDTP) to accurately predict CPIs by incorporating 15 types
of heterogeneous information. These works indicated the importance of
HN in analyzing the molecular mechanism of drugs in complex diseases,
including cancer, and its usefulness in integrating multi-source data
for large-scale prediction.
Herein, we proposed a new computational approach, named CIPHEN, to
predict CPIs across the entire protein space and to explore
promising NPs and their molecular mechanisms against HCC. Specifically,
an HN was established by integrating the compound-drug interaction,
drug-target interaction, and PPI network. The compound-drug
relationships were calculated by Tanimoto coefficients[56]^29 according
to compound fingerprints, the drug-protein interactions were
established based on two canonical CPI databases, DrugBank[57]^30 and
BindingDB,[58]^31 and the PPIs network was obtained from
HumanNet.[59]^32 Random walks were generated under the guidance of
pre-defined meta-paths of the HN, and the word2vec algorithm[60]^33 was
applied to obtain a low-dimensional representation of the HN. Finally,
well-studied machine learning (ML) algorithms, including support
vector machines (SVMs),[61]^34 random forest (RF),[62]^35 and deep
neural network (DNN),[63]^36 were introduced to learn the underlying
mechanism of drug-target interactions and identify potential CPIs.
Several procedures were implemented to validate the CIPHEN model. The
prediction performance of the CIPHEN model was evaluated by
leave-one-set-out validation on two benchmark datasets and compared
with state-of-the-art prediction tools and other network-based models.
The generalization ability of CIPHEN was tested on biomolecular
complexes with experimentally measured binding affinity data. The
CIPHEN model was subsequently applied to anti-HCC targets and to
sesquiterpenoid dimers (SDs) with antihepatoma activity from plants of
Artemisia species. Supporting evidence from the literature and
experiments suggested the effectiveness of the CIPHEN model in
predicting SDs that interact with known anti-HCC targets and in
predicting target proteins for SD candidates, which could promote the
development of SDs into drug candidate molecules. CIPHEN is freely
available at
[64]https://github.com/wangyc82/CIPHEN.
Results
CIPHEN was proposed to conduct high-throughput screening of CPIs and
predict the promising antihepatoma agents and their target proteins.
Firstly, an HN was constructed by integrating drug-target interactions
from canonical CPI databases, well-known human PPIs, and compound-drug
interactions ([65]Figure 1A). Secondly, a low-dimensional
representation of the HN was obtained through a modified embedding method
([66]Figure 1B). Thirdly, the underlying mechanism of drug/ligand-protein
interactions was learned by ML algorithms and unreported CPIs were
revealed. The CIPHEN model was validated on biomolecular complexes with
experimentally measured binding affinity data and SDs with obvious
antihepatoma activity ([67]Figure 1C). More details of CIPHEN are
given in the Methods.
Figure 1.
The framework of CIPHEN
(A) An HN was constructed by integrating known drug-target interactions
and PPIs, and compound-drug interactions that were established by
Tanimoto coefficients of compound path-based and substructure-based
fingerprints.
(B) The constructed HN was represented by low-dimensional network
embedding vectors through a modified metapath2vec model.
(C) The in-depth mechanism of drug-protein interactions was explored by
ML algorithms, and the prediction results were validated by
protein-ligand docking and biological experiments.
Various evaluation procedures were employed to assess the performance
of CIPHEN. It was found that CIPHEN could uncover known CPIs,
outperform the state-of-the-art approaches, and generate unrecorded
ligand-protein complexes. CIPHEN also facilitated the identification of
SD candidates against HCC.
Compound-protein interactions prediction based on heterogeneous network
accurately outputs known compound-protein interactions
The performance of CIPHEN was first evaluated on two benchmark
datasets[70]^30^,[71]^31 via leave-one-set-out validation. The primary
CIPHEN model was trained with four types of compound fingerprints
in OpenBabel (path-based FP2, and substructure-based FP3, FP4, and
MACCS) and three types of ML algorithms (SVMs, RF, and DNN). To
determine the best ML algorithm for prediction, 10 percent of the
drug-target interactions in the benchmark data were randomly selected as
testing data five times, and the predictions obtained by the
three ML algorithms with the FP2 fingerprint representation were compared
with the experimentally validated interactions. The ROC curves, PR
curves, AUCs, and AUPRs showed that both RF and SVMs were suitable for
prediction, with better ROC and PR curves and higher AUCs and
AUPRs ([72]Figure S1). RF was chosen as the final ML algorithm in the
CIPHEN model because the SVM model required extra cross-validation to
determine its optimal parameters.
The compound fingerprints were then evaluated on the two benchmark
datasets through leave-one-set-out validation. On both
datasets, the four types of fingerprints achieved comparable prediction
results, with no significant differences in ROC curves, PR curves, AUCs,
or AUPRs ([73]Figures 2 and [74]S2), suggesting that CIPHEN was robust
across fingerprint types. FP2 and MACCS performed slightly
better than FP3 and FP4 ([75]Figures 2 and [76]S2). Considering
that the well-known SwissTarget Prediction tool
also employs FP2 to represent compounds,[77]^13 FP2 was used in
the following evaluations.
Figure 2.
The performance of CIPHEN based on various compound fingerprints
through leave-one-set-out validation on DrugBank dataset
(A and B) ROC curves and AUCs obtained by various compound fingerprints
on DrugBank dataset.
(C and D) PR curves and AUPRs obtained by various compound fingerprints
on DrugBank dataset.
Unlike most ML-based CPI learning methods, CIPHEN is based on an
HN. A comparison of the CIPHEN model with ML-based methods[80]^16 on
the DrugBank dataset showed that the HN-based model performed
better ([81]Figure S3), suggesting the advantage of an HN in
heterogeneous data integration and CPI prediction.
Compound-protein interactions prediction based on heterogeneous network
improves the performance in predicting unreported compound-protein
interactions
To demonstrate the effectiveness of CIPHEN in uncovering undisclosed
target proteins of newly defined compounds, the CIPHEN model was first
compared with a well-known target prediction tool, the SwissTarget
Prediction tool.[82]^13 CIPHEN was trained and tested on the data used
in the SwissTarget Prediction tool,[83]^13 which came from another
canonical CPI dataset, ChEMBL.[84]^37 The prediction results were
compared with those reported in the original study.[85]^13 CIPHEN
performed better than the SwissTarget Prediction tool, with a higher AUC
([86]Figure 3A) and more true positives among the top 1% of predictions
([87]Figure 3B). That study[88]^13 reported a prediction example for
compound CHEMBL2325087 to illustrate the good performance of
SwissTarget. CIPHEN achieved much higher or comparable prediction scores
for the two validated targets of
CHEMBL2325087 ([89]Figure 3C), indicating that CIPHEN could improve
prediction performance in revealing new CPIs.
Figure 3.
Comparison and independent test of the CIPHEN model
(A) ROC curves and AUCs obtained by CIPHEN and SwissTarget prediction
tool.
(B) The percentage of true-positives in predictions obtained by CIPHEN
and SwissTarget.
(C) The prediction results for compound CHEMBL2325087 and its
experimentally validated targets.
(D) The ROC plot shows the prediction results on an independent dataset,
PDBbind.
(E) The prediction probability for protein-ligand pairs with different
koff values in PDBbind. ∗ indicates a Wilcoxon Signed Rank Test p-value
less than 0.05.
(F) The frequency of true positives among top predictions.
Besides the comparison with well-known prediction
tools, CIPHEN was also compared with other network-based methods in the
literature.[92]^22 In that work, 10-fold cross-validation on a small
subset of the DrugBank data was used to evaluate the performance of
several network-based prediction models, and the KGE_NMF model surpassed
the other models with the highest AUPR of 0.961.[93]^22 Using 10-fold
cross-validation on the DrugBank benchmark data, our CIPHEN model
achieved an AUPR of 0.965. Like other network-based models,
KGE_NMF focuses only on proteins with known ligands, while
CIPHEN can reveal unrecorded target proteins. These results suggested
that CIPHEN is comparable with the state-of-the-art prediction
model and is a powerful and robust framework for CPI prediction
across the whole protein space.
Compound-protein interactions prediction based on heterogeneous network
exhibits good generalization ability on independent datasets
The generalization ability of CIPHEN was tested on 680 protein-ligand
complexes with koff values in the PDBbind database,[94]^38 which
involve 680 small molecules and 406 proteins. According to the
histogram of koff values for these 680 protein-ligand complexes, 0.01
was set as the threshold to distinguish positive protein-ligand
pairs from negative ones. The training set came from the ChEMBL dataset,
which includes 14,410 ligand-receptor interactions between 1,753 ligands
and 14,410 proteins, and CIPHEN obtained an AUC of about 0.64
([95]Figure 3D). The prediction probability was significantly higher
for positive protein-ligand complexes than for negative ones
([96]Figure 3E, Wilcoxon Signed Rank Test p-value <0.01), and over 60%
true positives were achieved among the top 1%, top 5%, and top 10% of
predictions ([97]Figure 3F).
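The evaluation described above (thresholding koff, then scoring with AUC and the share of true positives among top-ranked predictions) can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names, the toy koff values, and the toy scores are hypothetical, and treating pairs with koff below 0.01 as positives follows the labeling given in the Limitations section.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_fraction_hit_rate(scores, labels, fraction):
    """Share of true positives among the top-scoring `fraction` of predictions."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Toy data (hypothetical): koff values and model prediction probabilities.
koff = [0.001, 0.5, 0.005, 0.02]
scores = [0.9, 0.4, 0.35, 0.2]

# Pairs with koff below the 0.01 threshold are treated as positives.
labels = [1 if k < 0.01 else 0 for k in koff]

auc = roc_auc(scores, labels)
hit_rate = top_fraction_hit_rate(scores, labels, 0.5)
```

The pairwise AUC used here is quadratic in the number of samples, which is acceptable for a few hundred complexes but would normally be replaced by a rank-based implementation on larger data.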
Another independent dataset came from NPASS (Natural Product Activity and
Species Source), which provides the experimental activity values and
species sources of 35,032 NPs from 25,041 species targeting 5,863
targets (2,946 proteins, 1,352 microbial species, and 1,227 cell lines).
This dataset contains 446,552 quantitative activity records (e.g.,
IC50, Ki, EC50, GI50, or MIC, mainly in units of nM) of 222,092
NP-target pairs and 288,002 NP-species pairs. The
NP-protein pairs with a Ki value less than 0.01 μM
were selected as true positives, and NP-protein pairs with Ki values
larger than 10 μM
were selected as true negatives. CIPHEN was trained on ChEMBL data and
tested on these positive and negative NP-protein pairs, and the
predictions were compared with the ground truth. The AUC of 0.61 and
the significant difference in prediction probability between positives
and negatives ([98]Figure S4) suggested good generalization of CIPHEN
to NPs, and that it could be applied to uncover unreported
CPIs.
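The labeling rule for the NPASS pairs can be written down compactly. The sketch below is a hedged Python illustration, not the study's code: the function names and the example records are hypothetical, and only nM and μM units are handled.

```python
def to_micromolar(value, unit):
    """Convert a Ki value to micromolar; only nM and uM are handled in this sketch."""
    factors = {"nM": 1e-3, "uM": 1.0}
    return value * factors[unit]


def label_by_ki(ki_um, positive_below=0.01, negative_above=10.0):
    """Return 1 for Ki < 0.01 uM, 0 for Ki > 10 uM, and None for the excluded middle."""
    if ki_um < positive_below:
        return 1
    if ki_um > negative_above:
        return 0
    return None


# Hypothetical NP-protein records: (compound, protein, Ki value, unit).
records = [
    ("np1", "P1", 5, "nM"),
    ("np2", "P2", 50, "uM"),
    ("np3", "P3", 1, "uM"),
]

# Keep only pairs that fall clearly on one side of the two thresholds.
labeled = []
for compound, protein, value, unit in records:
    label = label_by_ki(to_micromolar(value, unit))
    if label is not None:
        labeled.append((compound, protein, label))
```

Dropping the intermediate range (0.01 to 10 μM) trades sample size for cleaner labels, which is the design choice the text describes.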
Compound-protein interactions prediction based on heterogeneous network
reveals sesquiterpenoid dimers that could interact with anti-hepatocellular
carcinoma targets
The validation on benchmark datasets and independent datasets indicated
the good ability of CIPHEN to predict unreported CPIs. By
January 2023, 12 kinase and immune checkpoint inhibitors
had been approved in the United States and the People's Republic of
China for the treatment of HCC
([99]https://www.globecancer.com/azzx/show.php?itemid=16313), and
the proteins PDGFRB, KIT, RET, MET, and FLT3 were targets of the kinase
inhibitors, including sorafenib, regorafenib, lenvatinib, cabozantinib,
and donafenib. To test the ability of CIPHEN to predict compounds
that could interact with these well-known targets, the CIPHEN model was
trained on ChEMBL data from which the interactions involving the
above five targets and drugs had been excluded, and we tested whether
the ligands for these targets could be identified by CIPHEN. The known
ligands were accurately reported among the top three predictions by CIPHEN
([100]Figure 4A), meaning that CIPHEN could uncover the ligands of
anti-HCC targets. To disclose new ligands for these targets, 18 SDs
from plants of Artemisia species with obvious antihepatoma
activity[101]^39^,[102]^40^,[103]^41^,[104]^42 were screened by CIPHEN.
As a result, atermeriopodin G7 derived from A. eriopoda[105]^39 was
predicted as a ligand of RET (PDB: [106]2IVS) and FLT3 (PDB:
[107]1RJB), and artemyrianolide H derivative 25[108]^40 was predicted as
a ligand of MET (PDB: [109]1R0P) ([110]Figure 4B). The predictions
were confirmed by protein-ligand docking, with affinities below −9.0
kcal/mol ([111]Figure 4C). These results indicated that CIPHEN could
identify active SDs interacting with the given anti-HCC targets.
Figure 4.
CIPHEN reveals promising SDs interacting with known anti-HCC targets
(A) CIPHEN accurately outputs the small molecules for known anti-HCC
targets with the top 1% prediction probabilities.
(B) The prediction probabilities for the interactions between 18 active
SDs from Artemisia species and known anti-HCC targets.
(C) The protein-ligand docking results of predicted new ligands and
three receptor proteins of cabozantinib. More than two interactions and
an affinity below −9.0 kcal/mol confirmed the predictions of CIPHEN,
meaning CIPHEN could reveal active SDs interacting with anti-HCC
targets.
Compound-protein interactions prediction based on heterogeneous network
unveils sesquiterpenoid dimers’ target proteins and mode of actions
The identification of new compounds for given targets was one successful
application of CIPHEN. CIPHEN was then
applied to another scenario, namely, predicting the target
proteins of SDs with significant anti-HCC activities. After training on
the ChEMBL dataset, the target proteins of four SDs (artemyrianolide H
derivative 25, atermeriopodin G7, KGA-5203, and artemzhongdianolide B9)
from Artemisia species with obvious inhibitory activity in HCC
cells were predicted. Both the prediction results and biological
experiments indicated that the potential targets of these four SDs
belonged to the MAPK signaling pathway, including MAP2K2, PDGFRA, and
MAP2K3 ([114]Figure 5A). CIPHEN ranked these proteins within the top 1%
of predictions. The interactions between artemyrianolide H derivative 25
and MAP2K2,[115]^40 atermeriopodin G7 and PDGFRA,[116]^39 and KGA-5203
and PDGFRA[117]^42 were confirmed by Surface Plasmon Resonance (SPR)
experiments, while the effect of artemzhongdianolide B9 in increasing
the expression of p-p38, which is downstream of MAP2K3 in the MAPK
signaling pathway, was shown by a Western blot experiment[118]^41
([119]Figure 5B). All three MAPK genes displayed significantly
different expression between TCGA LIHC tumors and normal tissues
([120]Figure 5C) and had ROC curves close to 0–1 baseline
([121]Figure S5). Moreover, PDGFRA was closely associated with the
age at initial diagnosis and gender of patients with HCC, MAP2K2 and
MAP2K3 were related to tumor stage, and MAP2K2 was linked to cancer
survival ([122]Figure S5), suggesting a close relationship between the
MAPK signaling pathway and HCC development. All four compounds displayed
significant activities against HCC and could induce HCC cell apoptosis
and inhibit cell invasion ([123]Figure 5D). The above results indicated
that CIPHEN could identify potential targets for SDs with obvious
anti-HCC effects, which provides insight into the molecular mechanism
of SDs against HCC and may accelerate the development of SDs into drug
candidates.
Figure 5.
CIPHEN reveals anti-HCC candidates and their mode of actions
(A) CIPHEN outputs four SDs against HCC and their experimentally
validated targets.
(B) The SPR and Western blot experiments confirm the predictions of
CIPHEN.
(C) The expression of predicted targets in TCGA patients with HCC.
∗∗∗∗ indicates a Wilcoxon Signed Rank Test p-value less than 1e-4.
(D) The experiments show that the molecular mechanisms of SDs against
HCC involve cell apoptosis and invasion.
Discussion
The HN is a useful tool for unveiling the underlying rules in multiple
heterogeneous data sources and has had many successful applications in
biomedicine, including the prediction of miRNA-disease
associations,[126]^43 the inference of regulatory networks,[127]^44
and the discovery of PPIs associated with SARS-CoV-2.[128]^45 To uncover
potential target proteins across the entire protein space, we
developed an HN-based model named CIPHEN. Using a modified
metapath2vec method, a low-dimensional representation of the HN was
generated, which was applied to predict the possible associations
between new compounds and proteins via ML algorithms. The validation on
benchmark datasets indicated the good ability of CIPHEN to reveal
known CPIs, the comparisons with other state-of-the-art methods showed
the competitive performance of CIPHEN in predicting a
compound’s mode of action, and the independent tests showed the
capability of CIPHEN to identify unreported protein-ligand
complexes. CIPHEN was applied to uncover the SDs interacting with the
given anti-HCC targets and the target proteins binding to SDs with
significant anti-HCC activities, and the evidence from
molecular docking and biological experiments indicated the effectiveness
of CIPHEN in the discovery of anti-HCC agents. In conclusion, CIPHEN
provides a promising opportunity to elucidate the mechanisms of action
of compounds, which will help to accelerate new drug
development in HCC treatment.
Most CPI prediction models only focus on the known targets, while our
CIPHEN can reveal target proteins interacting with unknown ligands.
Among the top 100 predictions for four SDs, there were 68, 75, 78, and
83 new proteins, respectively, indicating that the undisclosed
mechanism of compounds might be discovered through CIPHEN.
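Counting new proteins among the top-ranked predictions amounts to a simple filter over a ranked list. The Python sketch below is only illustrative: it assumes that "new" means a protein absent from the set of targets already recorded in the training drug-target network, which is our reading rather than a definition given in the text, and the function name and toy scores are hypothetical.

```python
def novel_in_top_k(scores, known_targets, k=100):
    """Return the top-k proteins by score that are not already recorded targets."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [protein for protein in ranked if protein not in known_targets]


# Toy prediction scores for one compound (hypothetical protein names).
scores = {"P1": 0.9, "P2": 0.8, "P3": 0.7, "P4": 0.1}
known_targets = {"P2"}

novel = novel_in_top_k(scores, known_targets, k=3)
```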
In the CIPHEN model, ML algorithms were used to
predict possible CPIs, which differs from using correlation
analysis to mine the possible relationships in an HN.[129]^46 The
competitive performance of CIPHEN against state-of-the-art methods
illustrated the benefit of ML algorithms. In the future, we will design
a more suitable strategy to decode the relationships in the HN and
discover the features that are important in determining CPIs.
Building the CIPHEN model involves many parameters, such
as the compound similarity index and the parameters of the word2vec
model and ML algorithms. To avoid bias and simplify the whole procedure
of CIPHEN, the default parameters of the word2vec model and ML
algorithms were applied. The AUCs obtained by leave-one-set-out
validation with the Dice coefficient and with different numbers of
negative samples in the word2vec model on the BindingDB data suggested
that the performance of CIPHEN was robust to these parameters
([130]Figure S6).
Graph convolutional networks (GCNs) are usually applied to homogeneous
networks. Some heterogeneous graph convolutional network-based models
have been developed for HNs, such as MHGCN[131]^47 and MHGCN+.[132]^48
Wang et al. identified two deficiencies of existing GCN methods for
heterogeneous information networks (HINs): (1) they cannot flexibly
explore all possible meta-paths and
extract the most useful ones for each target object, which hinders both
effectiveness and interpretability; (2) before performing aggregation,
heterogeneous graph convolutional network-based models often require
some additional time-consuming pre-processing operations, which
increase the computational complexity. Wang et al. then proposed an
interpretable and efficient Heterogeneous Graph Convolutional Network
(ie-HGCN) to learn the representations of heterogeneous information
networks.[133]^49 MHGCN+ achieved Macro-F1 and Micro-F1 of 0.903 and
0.908 on the AMiner dataset for node classification,
respectively.[134]^2 The metapath2vec model obtained Macro-F1 and
Micro-F1 of 0.921 and 0.926 on the AMiner dataset for node
classification, respectively,[135]^23 indicating that
metapath2vec-based embedding methods yielded comparable results to the
GCN-based ones.
Limitations of the study
Compound fingerprints were used to represent small molecules in the
CIPHEN model. The validation on benchmark datasets suggested the good
performance of CIPHEN using compound fingerprints. Recently, the
success of graph neural networks (GNNs)[136]^50 and natural language
processing (NLP) techniques[137]^51 in drug discovery has been
reported. We therefore evaluated the efficacy of a GNN for molecular
representation on the PDBbind data. In particular, three graph
convolutional layers were included, the LeakyReLU function was used as
the activation function, and the cross-entropy function was applied as
the loss function. Low-dimensional representations for atoms of a given
compound were obtained, and the row-rank representation of the matrix
was introduced to represent that compound. The GNN algorithm was
implemented in the PyTorch framework. The binary prediction results
obtained by 5-fold cross-validation on PDBbind positive protein-ligand
pairs (Koff < 0.01 μM) and negative protein-ligand pairs (Koff > 0.01 μM)
are shown in [138]Figure S7. The values of AUC, AUPR, accuracy,
sensitivity, specificity, precision, and F-measure did not vary
much, and the GNN performed slightly better, with higher AUC (0.90 vs.
0.88), sensitivity (0.86 vs. 0.83), and AUPR (0.90 vs. 0.88). These
results indicated that a GNN-based model might be a good choice for
molecular representation. We will try this type of method, such
as a graph convolutional network with an attention mechanism, in
future work.
STAR★Methods
Key resources table
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
Drug-target network | DrugBank uniportlinks.csv | NA
Protein-ligand network | BindingDB BindingDB_BindingDB_Inhibition.tsv; ChEMBL alltargetinfo.csv | NA
Protein-ligand complex | PDBbind koff_dataset | NA
Human PPI network | HumanNet HumanNet-PI.tsv | NA
Software and algorithms
R studio (R version 4.2.3) | [139]https://www.rstudio.com/ | NA
CIPHEN | [140]https://github.com/wangyc82/CIPHEN | NA
Resource availability
Lead contact
Further information and requests for resources should be directed to
the lead contact, Chen Ji-Jun (chenjj@mail.kib.ac.cn).
Materials availability
No new materials were generated in this study.
Data and code availability
The benchmark and independent test data are available at DrugBank,
BindingDB, ChEMBL, PDBbind, and HumanNet websites.
Code is available at [142]https://github.com/wangyc82/CIPHEN.
Additional information is available from the [143]lead contact upon
request.
Method details
Resources for benchmark and independent test data
The two benchmark datasets used to evaluate the performance of CIPHEN
were obtained from two canonical CPIs databases, DrugBank[144]^30 and
BindingDB.[145]^31 The drug-target interactions were extracted from
uniportlinks.csv and BindingDB_BindingDB_Inhibition.tsv files in
DrugBank and BindingDB in February 2023, respectively. As a result, a
total of 11,100 drug-target interactions between 2,227 approved drugs
and 2,868 target proteins were obtained from DrugBank, and a total of
38,522 ligand-receptor interactions between 26,950 ligands and 1,214
protein receptors were obtained from BindingDB. The independent data
used to test the generalization of CIPHEN was from PDBbind, which
offers a comprehensive collection of experimentally measured binding
affinity data for all biomolecular complexes deposited in the Protein
Data Bank (PDB).[146]^38 The dataset used to train the prediction model
for uncovering anti-HCC SDs was extracted from ChEMBL alltargetinfo.csv
file in February 2023,[147]^37 which contained 14,410 ligand-receptor
interactions between 1,753 ligands and 14,410 protein receptors. To
expand the search space of predictions, the PPI network from
HumanNet[148]^32 was introduced, providing a total of 316,998 PPIs
among 15,351 human proteins extracted from the HumanNet-PI.tsv file. On
the basis of four types of compound fingerprints in OpenBabel, including
the path-based fingerprint FP2 and the substructure-based fingerprints
FP3, FP4, and MACCS, Tanimoto coefficients[149]^29 were used to
calculate the similarities between the given compounds and known drugs
or ligands. The three-dimensional structural data files (SDFs) of
compounds and ligands were downloaded from PubChem, BindingDB, and
ChEMBL. The TCGA LIHC RNA-seq data and clinical
properties were downloaded from GDAC Firehose
([150]https://gdac.broadinstitute.org).
HN construction
CIPHEN was proposed for CPI prediction on the basis of an HN with four
types of nodes (newly defined compounds, drugs or ligands, target
proteins, and proteins) and three types of edges (compound-drug
interactions, drug-target interactions, and PPIs). The newly defined
compounds and drugs/ligands were first represented by path-based and
substructure-based fingerprints, and the compound-drug/ligand
interactions were established by calculating the Tanimoto coefficients.
The drug-protein or ligand-receptor interactions obtained from DrugBank
and BindingDB were used to establish the drug-target interaction network.
The PPIs were then combined with the compound-drug interactions and
drug-target interactions to set up the compound-drug-target-protein HN.
Proteins in DrugBank and BindingDB were matched by their UniProt IDs,
and gene names obtained with the UniProt ID mapping tool were used to
match proteins in the PPI network. Compounds could be referred to either
by their common names or by their IDs in curated databases, because only
similarity was required for construction of the HN. For
DrugBank, an HN with a drug-target network between 2,227 drugs and 2,072
target proteins and a target-protein network between 2,072 target
proteins and 15,351 human proteins was constructed. For BindingDB, an HN
with a drug-target network among 29,460 ligands and 1,069 target
proteins and a target-protein network among 1,069 target proteins and
15,351 human proteins was constructed.
To control the sparsity of the network, the threshold for the similarity
index was determined from the histogram of Tanimoto coefficients
obtained with each of the four fingerprints. For instance, 0.4 was
selected as the threshold for the Tanimoto index obtained with MACCS
keys on the BindingDB data ([151]Figure S8).
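The construction of compound-drug edges from fingerprint similarity can be sketched as follows. This is a minimal Python illustration rather than the R/ChemmineR pipeline used in the study: fingerprints are represented as sets of on-bit indices, the function names and toy bit positions are hypothetical, and keeping edges at or above the threshold (rather than strictly above) is an assumption. The Tanimoto coefficient is taken as the number of shared on-bits divided by the number of on-bits in either fingerprint.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def compound_drug_edges(compounds, drugs, threshold=0.4):
    """Return (compound, drug, similarity) edges whose similarity reaches the threshold."""
    edges = []
    for c_name, c_fp in compounds.items():
        for d_name, d_fp in drugs.items():
            sim = tanimoto(c_fp, d_fp)
            if sim >= threshold:
                edges.append((c_name, d_name, sim))
    return edges


# Toy fingerprints (hypothetical bit positions, not real MACCS keys).
new_compounds = {"cpd1": {1, 4, 7, 9}}
known_drugs = {"drugA": {1, 4, 7, 12}, "drugB": {2, 3, 5}}

edges = compound_drug_edges(new_compounds, known_drugs, threshold=0.4)
```

Raising the threshold makes the compound-drug layer sparser, which is the sparsity control the paragraph above refers to.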
Network embedding
Once the above compound-drug-target-protein HN was constructed, a
modified version of metapath2vec, a network embedding method initially
designed by Dong et al.,[152]^23 was used to learn the representations
of the HN. In particular, five meta-paths were defined to describe the
topological properties of the HN, including
compound-drug/ligand-compound, drug/ligand-target-drug/ligand,
target-protein-target, compound-drug/ligand-target, and
compound-drug/ligand-target-protein-target-drug/ligand-compound. Guided
by these meta-paths, random walks were generated, and the word2vec
model with the Skip-gram algorithm[153]^33 was applied to calculate the
representations of the HN. As a result, the newly defined compounds,
drugs/ligands, targets, and proteins were represented by network
embedding vectors. The details of this procedure are provided in
[154]Method S1: Network embedding learning, related to [155]Figure S1.
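A meta-path-guided random walk of the kind described above might look like the following Python sketch. It is an illustration under stated assumptions, not the modified metapath2vec used in the study: the graph, node names, and function name are hypothetical, and the sketch handles only symmetric meta-paths (first and last node type identical) so that the pattern can be repeated cyclically. The resulting walks would then be passed as "sentences" to a Skip-gram word2vec model, a step omitted here.

```python
import random


def metapath_walk(graph, node_type, start, metapath, walk_length, rng):
    """Random walk that only steps to neighbors whose type matches the meta-path.

    `metapath` is a list of node types whose first and last entries coincide,
    e.g. ["compound", "drug", "compound"], so it can be cycled.
    """
    walk = [start]
    pattern = metapath[:-1]  # drop the repeated end type before cycling
    step = 0
    while len(walk) < walk_length:
        wanted = pattern[(step + 1) % len(pattern)]
        candidates = [n for n in graph[walk[-1]] if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(candidates))
        step += 1
    return walk


# Toy heterogeneous network (hypothetical nodes) stored as adjacency lists.
graph = {"c1": ["d1"], "d1": ["c1", "t1"], "t1": ["d1", "p1"], "p1": ["t1"]}
node_type = {"c1": "compound", "d1": "drug", "t1": "target", "p1": "protein"}

rng = random.Random(0)
walk_cdc = metapath_walk(graph, node_type, "c1", ["compound", "drug", "compound"], 5, rng)
walk_tpt = metapath_walk(graph, node_type, "t1", ["target", "protein", "target"], 3, rng)
```

Constraining each step by node type is what distinguishes these walks from uniform random walks on the same graph: the walk can only visit node sequences that follow the chosen meta-path.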
CPIs prediction
Through the above network embedding procedure, the representations for
the four types of nodes ($x_c, x_d, x_t, x_p$) were obtained. Three
well-studied ML algorithms (SVM, RF, and DNN)
were introduced to learn CPIs. In particular, suppose that
there is a total of $n$ drug-target interactions in the above HN, which
are used as training positives. $n$ randomly selected drug-protein pairs
were introduced as training negatives. The training
set $Trn$ was defined by
\[ Trn = \{(x_i, y_i),\ i = 1, \cdots, 2n\}, \]
where $x_i = (x_d, x_t)$ and
\[ y_i = \begin{cases} 1, & \text{if drug } d \text{ and target } t \text{ have an interaction} \\ -1, & \text{otherwise.} \end{cases} \]
Given $s$ newly defined compounds and $l$ human proteins in the given
PPI network, the testing set was represented by
\[ Tst = \{x_k,\ k = 1, \cdots, s \times l\}, \]
where $x_k = (x_c, x_p)$.
More information can be found in [156]Method S2: Novel CPIs
prediction, related to [157]Figure S1.
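The definitions of the training and testing sets can be made concrete with a short Python sketch. It is illustrative only: the function names and toy embeddings are hypothetical, embeddings are plain lists that are concatenated to form each sample, and excluding known interactions when drawing negatives is an assumption (the text says only that the negatives were randomly selected).

```python
import random


def build_training_set(positives, drugs, proteins, emb, rng):
    """Pair n known interactions (label 1) with n random non-interacting pairs (label -1)."""
    known = set(positives)
    negatives = set()
    # Assumes at least n drug-protein pairs exist outside the known interactions.
    while len(negatives) < len(positives):
        pair = (rng.choice(drugs), rng.choice(proteins))
        if pair not in known:
            negatives.add(pair)
    samples = [(emb[d] + emb[t], 1) for d, t in positives]
    samples += [(emb[d] + emb[t], -1) for d, t in sorted(negatives)]
    return samples


def build_test_set(compounds, proteins, emb):
    """All s x l compound-protein pairs, each with a concatenated embedding."""
    return [(c, p, emb[c] + emb[p]) for c in compounds for p in proteins]


# Toy two-dimensional embeddings (hypothetical values).
emb = {
    "d1": [0.1, 0.2], "d2": [0.3, 0.4],
    "t1": [0.5, 0.6], "t2": [0.7, 0.8],
    "c1": [0.9, 1.0], "p1": [1.1, 1.2],
}
positives = [("d1", "t1"), ("d2", "t2")]

train = build_training_set(positives, ["d1", "d2"], ["t1", "t2"], emb, random.Random(0))
test = build_test_set(["c1"], ["t1", "t2", "p1"], emb)
```

Because the test set pairs every new compound with every protein in the PPI network, it grows as s × l, which is why the approach can score proteins that have no recorded ligand.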
Model implementation and leave-one-set-out validation
The path-based and substructure-based fingerprints in OpenBabel[158]^52
were extracted via the ChemmineR R package from the compounds’
three-dimensional SDF files. The random walks guided by the five
pre-defined meta-paths were generated in R, and
word2vec was implemented via the word2vec R package with default
parameters. The ML algorithms were carried out via the e1071,
randomForest, and h2o R packages for SVMs, RF, and DNN, respectively. RF
and DNN were run with default parameters, while the SVM used a
Gaussian kernel function, and its model parameters were
determined via 3-fold cross-validation.
Leave-one-set-out validation was conducted to evaluate the
performance of CIPHEN. Specifically, 10 percent of the drugs or ligands
were removed from the known drug-target network and regarded as
newly defined compounds, and the remaining drug-target interactions
were used to establish the drug-target interactions of the HN. The
Tanimoto coefficients between the new compounds and the remaining
drugs/ligands were used to construct the compound-drug network, and the
HN was set up by integrating the PPI network. For instance, in the
DrugBank data, the drug-target network contains 2,227 drugs and 2,072
target proteins. The validation was repeated five times, and for each
'leave-one-set-out' validation, 227 randomly selected drugs and their
interactions with target proteins were removed from the drug-target
network and used as the testing data. Specifically, an HN with a
compound-drug network among 227 compounds and 2,000 drugs, a drug-target
network among 2,000 drugs and 2,072 target proteins, and a PPI network
among 2,072 target proteins and 15,351 human proteins was constructed
for each round of validation.
The prediction results were compared with the true interactions, and
the ROC curve,[159]^53 precision-recall (PR) curve,[160]^54 and the areas
under the ROC curve (AUC) and PR curve (AUPR) were used as evaluation
criteria. KEGG pathway enrichment analysis was performed via the DAVID
Bioinformatics tools. AutoDock Vina with default parameters was
used to perform protein-ligand docking. The protein and compound
structures were pre-processed via OpenBabel. More details of the
implementation and evaluation are provided in [161]Method S3: CIPHEN
implementation and evaluation, related to [162]Figure S1.
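The leave-one-set-out split described above can be sketched in a few lines of Python. This is a hedged illustration, not the study's R implementation: the function name and the toy network are hypothetical, and the held-out set is sized as a fraction of the drugs, whereas the text reports a fixed count of 227 drugs for DrugBank.

```python
import random


def leave_one_set_out(interactions, fraction=0.1, rng=None):
    """Hold out a random fraction of drugs together with all of their interactions."""
    rng = rng or random.Random()
    drugs = sorted({d for d, _ in interactions})
    n_held = max(1, round(fraction * len(drugs)))
    held_out = set(rng.sample(drugs, n_held))
    train = [(d, t) for d, t in interactions if d not in held_out]
    test = [(d, t) for d, t in interactions if d in held_out]
    return held_out, train, test


# Toy drug-target network: 20 hypothetical drugs with two targets each.
interactions = [(f"drug{i}", f"target{j}") for i in range(20) for j in range(2)]

# Five repeats, as in the text, each with its own seeded generator.
splits = [leave_one_set_out(interactions, 0.1, random.Random(seed)) for seed in range(5)]
```

Holding out whole drugs, rather than individual interactions, means the held-out compounds have no edges in the training drug-target layer and can reach targets only through similarity edges, which mimics the newly defined compound scenario.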
Quantification and statistical analysis
p values in the scatterplots were calculated with the Wilcoxon signed
rank test (∗p < 5e-2, ∗∗p < 1e-2, ∗∗∗p < 1e-3, and ∗∗∗∗p < 1e-4 were
considered statistically significant). R was used to analyze the raw
imaging data, and the plots and statistical significance tests were done
in R with the ggplot2 package.
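The mapping from p values to the asterisk annotations used in the figures can be expressed as a small helper. The Python function below is a hypothetical illustration of that convention (the study produced its plots in R); only the four cutoffs come from the text.

```python
def significance_stars(p_value):
    """Map a p value to the asterisk annotation defined by the four cutoffs above."""
    for cutoff, stars in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")):
        if p_value < cutoff:
            return stars
    return ""  # not significant at the 0.05 level


annotations = [significance_stars(p) for p in (0.2, 0.03, 0.005, 5e-5)]
```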
SPR and western blot experiments
SPR experiments for atermeriopodin G7, KGA-5203, and artemyrianolide H
derivative 25 were performed in previous
studies,[163]^39^,[164]^40^,[165]^42 and the Western blot experiment for
artemzhongdianolide B9 was conducted in a previous study.[166]^41
Acknowledgments