Highlights

     * CIPHEN is an HN-based model for screening CPIs across the entire
       human protein space
     * CIPHEN is comparable to other methods and capable of unveiling
       unrecorded CPIs
     * CIPHEN highlights the mechanisms of NPs and aids drug discovery
       and development
     __________________________________________________________________

   Oncology; Machine learning

Introduction

   Liver cancer is a global health challenge with an estimated incidence
   of >1 million cases by 2025.[28]^1 Hepatocellular carcinoma (HCC) is
   the most prevalent subtype of liver cancer and is caused by a variety
   of factors, including genetic, environmental, and behavioral
   factors.[29]^2 Multi-kinase inhibitors and immune checkpoint
   inhibitors have provided survival benefits in patients with HCC, but
   only a subset of patients respond clinically to these therapeutic
   agents. Natural products (NPs) are an important resource for new drug
   development in cancer and have attracted the attention of
   pharmaceutical and pharmacological researchers due to their diverse
   structures and significant anti-HCC activities.[30]^3^,[31]^4
   Icaritin, a multi-target immunomodulatory small molecule derived from
   plants of Epimedium brevicornum, was approved for the treatment of
   HCC in 2022.[32]^5

   The major obstacle in NP-based drug development is determining which
   specific genes or proteins, and which corresponding signaling
   pathways, are affected by a given compound. Usually, this procedure
   relies on a large number of pharmacology, toxicology, and molecular
   biology experiments, which are costly and
   time-consuming.[33]^6^,[34]^7^,[35]^8 Based on multi-omics data,
   computational methods can perform high-throughput screening of
   compound-protein interactions (CPIs) more efficiently and at low
   cost.[36]^9^,[37]^10

   Current computational approaches for CPI prediction can be generally
   divided into two categories: guilt-by-association (GA)-based and
   network-based methods. GA-based methods are built on the assumption
   that similar compounds interact with the same or functionally related
   proteins and exert similar modes of action. In most cases, “similar”
   means having close three-dimensional chemical
   structures.[38]^11^,[39]^12^,[40]^13 CPIs have also been predicted on
   the assumption that compounds with similar side effects share the
   same target proteins.[41]^14 Besides the chemical structures of
   compounds, protein sequences and other properties have also been
   integrated into computational models to improve
   accuracy.[42]^15^,[43]^16^,[44]^17^,[45]^18 Network-based methods
   incorporate multi-layer interaction networks, including compound-drug
   interactions, drug-target interactions, and protein-protein
   interactions (PPIs), to identify unreported
   CPIs;[46]^19^,[47]^20^,[48]^21^,[49]^22 these methods are constructed
   on the basis of network embedding, neural factorization machines, and
   other graph models. Both GA-based and network-based methods can
   predict target proteins that interact with known ligands, but cannot
   predict proteins whose ligands are unknown. NPs have complicated
   structures and may act through diverse molecular mechanisms by
   interacting with unique proteins. Thus, constructing methods that
   predict targets for NPs is of considerable interest.

   A heterogeneous network (HN) represents node and link information
   through multiple types of nodes and edges.[50]^23^,[51]^24^,[52]^25
   Compared with a homogeneous network, which contains only one type of
   node and one type of edge, an HN can describe the structure and
   function of complicated real-world systems more accurately and
   comprehensively, accommodates the heterogeneity of biological data,
   and offers good model interpretability. HNs have therefore been
   widely used to elucidate drug modes of action. For instance,
   Chen et al. learned drug-target-protein-disease relationships
   through an HN for drug repositioning[53]^26; Wei et al.
   applied an HN to infer drug-disease associations by integrating
   multi-scale biomedical data resources[54]^27; and An et al.[55]^28
   proposed an HN learning algorithm, Network EmbeDding framework in
   mulTiPlex networks (NEDTP), to accurately predict CPIs by
   incorporating 15 types of heterogeneous information. These works
   indicate the importance of HNs in analyzing the molecular mechanisms
   of drugs in complex diseases, including cancer, and their usefulness
   in integrating multi-source data for large-scale prediction.

   Herein, we propose a new computational approach, named CIPHEN, to
   predict CPIs across the entire protein space and to explore
   promising NPs and their molecular mechanisms against HCC.
   Specifically, an HN was established by integrating compound-drug
   interactions, drug-target interactions, and a PPI network. The
   compound-drug relationships were calculated by Tanimoto
   coefficients[56]^29 of compound fingerprints, the drug-protein
   interactions were taken from two canonical CPI databases,
   DrugBank[57]^30 and BindingDB,[58]^31 and the PPI network was
   obtained from HumanNet.[59]^32 Random walks were generated under the
   guidance of pre-defined meta-paths of the HN, and the word2vec
   algorithm[60]^33 was applied to obtain a low-dimensional
   representation of the HN. Finally, well-studied machine learning (ML)
   algorithms, including support vector machines (SVMs),[61]^34 random
   forest (RF),[62]^35 and deep neural network (DNN),[63]^36 were
   used to learn the underlying mechanism of drug-target interactions
   and identify potential CPIs.

   Several procedures were implemented to validate the CIPHEN model. The
   prediction performance of CIPHEN was evaluated by leave-one-set-out
   validation on two benchmark datasets and compared with
   state-of-the-art prediction tools and other network-based models.
   The generalization ability of CIPHEN was tested on biomolecular
   complexes with experimentally measured binding affinity data. The
   CIPHEN model was subsequently applied to anti-HCC targets and to
   sesquiterpenoid dimers (SDs) with antihepatoma activity from plants
   of Artemisia species. Supporting evidence from the literature and
   from experiments suggested that CIPHEN is effective in predicting SDs
   that interact with known anti-HCC targets and in predicting target
   proteins for SD candidates, which could promote the development of
   SDs into drug candidates. CIPHEN is freely available at
   [64]https://github.com/wangyc82/CIPHEN.

Results

   CIPHEN was proposed to conduct high-throughput screening of CPIs and
   to predict promising antihepatoma agents and their target proteins.
   First, an HN was constructed by integrating drug-target interactions
   from canonical CPI databases, well-known human PPIs, and compound-drug
   interactions ([65]Figure 1A). Second, a low-dimensional
   representation of the HN was obtained through a modified embedding
   method ([66]Figure 1B). The underlying mechanism of
   drug/ligand-protein interactions was then learned by ML algorithms,
   and unreported CPIs were revealed. The CIPHEN model was validated on
   biomolecular complexes with experimentally measured binding affinity
   data and on SDs with marked antihepatoma activity ([67]Figure 1C).
   More details of CIPHEN are given in Methods.

Figure 1.

   The framework of CIPHEN

   (A) An HN was constructed by integrating known drug-target interactions
   and PPIs, and compound-drug interactions that were established by
   Tanimoto coefficients of compound path-based and substructure-based
   fingerprints.

   (B) The constructed HN was represented by low-dimensional network
   embedding vectors through a modified metapath2vec model.

   (C) The in-depth mechanism of drug-protein interactions was explored by
   ML algorithms, and the prediction results were validated by
   protein-ligand docking and biological experiments.

   Various evaluation procedures were employed to assess the performance
   of CIPHEN. It was found that CIPHEN could uncover known CPIs,
   outperform the state-of-the-art approaches, and generate unrecorded
   ligand-protein complexes. CIPHEN also facilitated the identification of
   SD candidates against HCC.

Compound-protein interactions prediction based on heterogeneous network
accurately outputs known compound-protein interactions

   The performance of CIPHEN was first evaluated on two benchmark
   datasets[70]^30^,[71]^31 via leave-one-set-out validation. The primary
   CIPHEN model was trained with four types of compound fingerprints
   in OpenBabel (path-based FP2, and substructure-based FP3, FP4, and
   MACCS) and three types of ML algorithms (SVMs, RF, and DNN). To
   determine the best ML algorithm for prediction, 10% of the
   drug-target interactions in the benchmark data were randomly selected
   as testing data, five times, and the predictions obtained by the
   three ML algorithms with the FP2 fingerprint representation were
   each compared with the experimentally validated interactions. The ROC
   curves, PR curves, AUCs, and AUPRs indicated that both RF and SVMs
   were suitable for prediction, with better ROC and PR curves and
   higher AUCs and AUPRs ([72]Figure S1). RF was chosen as the final ML
   algorithm in the CIPHEN model because the SVM model required extra
   cross-validation to determine its optimal parameters.
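
   The evaluation protocol described above (repeated random hold-out,
   then AUC and AUPR on the held-out pairs) can be sketched in plain
   Python. This is an illustrative sketch, not the authors' code: the
   function names, the fixed random seed, and the toy labels and scores
   are assumptions, and AUPR is approximated here by average precision.

```python
import random


def holdout_splits(pairs, fraction=0.1, repeats=5, seed=0):
    """Yield (train, test) lists, holding out a random fraction each time."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    n_test = max(1, round(len(pairs) * fraction))
    for _ in range(repeats):
        shuffled = list(pairs)
        rng.shuffle(shuffled)
        yield shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Toy example: four candidate pairs with known labels and predicted scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(roc_auc(labels, scores))            # 0.75
print(average_precision(labels, scores))  # approximately 0.833
```

   In practice the scores would come from the trained classifier, and
   the metrics would be averaged over the five hold-out repeats.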

   The compound fingerprints were then evaluated on the two benchmark
   datasets through leave-one-set-out validation. On both datasets, the
   four types of fingerprints achieved comparable prediction results,
   with no significant differences in ROC curves, PR curves, AUCs,
   or AUPRs ([73]Figures 2 and [74]S2), suggesting that CIPHEN is robust
   across fingerprint types. FP2 and MACCS performed slightly better
   than FP3 and FP4 ([75]Figures 2 and [76]S2). Because the well-known
   SwissTarget Prediction tool also employs FP2 to represent
   compounds,[77]^13 FP2 was used in the following evaluations.

Figure 2.

   The performance of CIPHEN based on various compound fingerprints
   through leave-one-set-out validation on DrugBank dataset

   (A and B) ROC curves and AUCs obtained by various compound fingerprints
   on DrugBank dataset.

   (C and D) PR curves and AUPRs obtained by various compound fingerprints
   on DrugBank dataset.

   Unlike most ML-based CPI learning methods, CIPHEN is based on an
   HN. A comparison of the CIPHEN model with ML-based methods[80]^16 on
   the DrugBank dataset showed that the HN-based model performed
   better ([81]Figure S3), suggesting the advantage of HNs in
   heterogeneous data integration and CPI prediction.

Compound-protein interactions prediction based on heterogeneous network
improves the performance in predicting unreported compound-protein
interactions

   To demonstrate the effectiveness of CIPHEN in uncovering undisclosed
   target proteins of newly defined compounds, the CIPHEN model was first
   compared with a well-known target prediction tool, the SwissTarget
   Prediction tool.[82]^13 CIPHEN was trained and tested on the data used
   by the SwissTarget Prediction tool,[83]^13 which came from another
   canonical CPI dataset, ChEMBL.[84]^37 The prediction results were
   compared with those reported for that tool.[85]^13 CIPHEN performed
   better than the SwissTarget Prediction tool, with a higher AUC
   ([86]Figure 3A) and more true positives among the top 1% of
   predictions ([87]Figure 3B). The SwissTarget study[88]^13 reported a
   prediction example for compound CHEMBL2325087 to illustrate the good
   performance of SwissTarget. For the two validated targets of
   CHEMBL2325087, CIPHEN achieved prediction scores that were much
   higher or comparable ([89]Figure 3C), indicating that CIPHEN could
   improve prediction performance in revealing new CPIs.

Figure 3.

   Comparison and independent test of the CIPHEN model

   (A) ROC curves and AUCs obtained by CIPHEN and SwissTarget prediction
   tool.

   (B) The percentage of true-positives in predictions obtained by CIPHEN
   and SwissTarget.

   (C) The prediction results for a compound and its experimentally
   validated targets.

   (D) The ROC plot shows the prediction results on an independent
   dataset, PDBbind.

   (E) The prediction probability for protein-ligand pairs with different
   koff values in PDBbind. ∗ indicates Wilcoxon Signed Rank Test p-value
   less than 0.05.

   (F) The frequency of true positives among top predictions.

   Besides the comparison with well-known prediction tools, CIPHEN was
   also compared with other network-based methods in the
   literature.[92]^22 In that work, 10-fold cross-validation on a small
   subset of the DrugBank data was used to evaluate several
   network-based prediction models, and the KGE_NMF model surpassed
   the other models with the highest AUPR of 0.961.[93]^22 Using 10-fold
   cross-validation on the DrugBank benchmark data, our CIPHEN model
   achieved an AUPR of 0.965. Like other network-based models,
   KGE_NMF focuses only on proteins with known ligands, whereas
   CIPHEN can reveal unrecorded target proteins. These results suggest
   that CIPHEN is comparable with the state-of-the-art prediction
   model and is a powerful and robust framework for CPI prediction
   across the whole protein space.

Compound-protein interactions prediction based on heterogeneous network
exhibits good generalization ability on independent datasets

   The generalization ability of CIPHEN was tested on 680 protein-ligand
   complexes with koff values in the PDBbind database,[94]^38 which
   comprise 680 small molecules and 406 proteins. Based on the
   histogram of koff values for these 680 complexes, 0.01
   was set as the threshold to distinguish positive protein-ligand
   pairs from negative ones. The training set came from the ChEMBL
   dataset, which includes 14,410 ligand-receptor interactions between
   1,753 ligands and 14,410 proteins, and CIPHEN obtained an AUC of
   about 0.64 ([95]Figure 3D). The prediction probability was
   significantly higher for positive protein-ligand complexes than for
   negative ones ([96]Figure 3E, Wilcoxon Signed Rank Test p-value
   <0.01), and over 60% true positives were achieved among the top 1%,
   top 5%, and top 10% of predictions ([97]Figure 3F).
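
   The "true positives among the top k% of predictions" measure used
   here can be illustrated with a short Python sketch. The function name
   top_fraction_hit_rate, the rounding rule for k, and the toy data are
   assumptions made for illustration; the source does not specify how
   the cut-off was computed.

```python
def top_fraction_hit_rate(labels, scores, fraction):
    """Fraction of true positives among the top-scoring `fraction` of pairs."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))  # keep at least one prediction
    return sum(label for _, label in ranked[:k]) / k


# Toy example: ten scored pairs, three of which are true positives.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
for fraction in (0.1, 0.2, 0.5):
    print(fraction, top_fraction_hit_rate(labels, scores, fraction))
```

   On this toy input the hit rate is 1.0 for the top 10% and top 20%
   and 0.6 for the top 50%.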

   Another independent dataset came from NPASS (Natural Product Activity
   and Species Source), which provides the experimental activity values
   and species sources of 35,032 NPs from 25,041 species targeting 5,863
   targets (2,946 proteins, 1,352 microbial species, and 1,227 cell
   lines). This dataset contains 446,552 quantitative activity records
   (e.g., IC50, Ki, EC50, GI50, or MIC, mainly in units of nM) of
   222,092 NP-target pairs and 288,002 NP-species pairs. NP-protein
   pairs with a Ki value less than 0.01 μM were selected as true
   positives, and NP-protein pairs with Ki values larger than 10 μM
   were selected as true negatives. CIPHEN was trained on ChEMBL data
   and tested on these positive and negative NP-protein pairs, and the
   predictions were compared with the ground truth. The AUC of 0.61 and
   the significant difference in prediction probability between
   positives and negatives ([98]Figure S4) suggested that CIPHEN
   generalizes to NPs and could be applied to uncover unreported
   CPIs.
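
   The labeling rule above (Ki below 0.01 μM as positive, Ki above
   10 μM as negative, intermediate values left out) can be written as a
   small Python sketch. The record layout, the helper names nm_to_um and
   label_by_ki, and the placeholder compound and protein names are
   assumptions for illustration only.

```python
def nm_to_um(value_nm):
    """Convert a concentration from nM to uM."""
    return value_nm / 1000.0


def label_by_ki(records, positive_below_um=0.01, negative_above_um=10.0):
    """Split (compound, protein, Ki in uM) records into positives and negatives.

    Pairs whose Ki lies between the two thresholds are assigned to neither set.
    """
    positives, negatives = [], []
    for compound, protein, ki_um in records:
        if ki_um < positive_below_um:
            positives.append((compound, protein))
        elif ki_um > negative_above_um:
            negatives.append((compound, protein))
    return positives, negatives


# Placeholder records; Ki values are given in nM and converted to uM.
records = [
    ("NP_a", "P1", nm_to_um(5.0)),      # 0.005 uM -> positive
    ("NP_b", "P1", nm_to_um(500.0)),    # 0.5 uM   -> neither
    ("NP_c", "P2", nm_to_um(25000.0)),  # 25 uM    -> negative
]
positives, negatives = label_by_ki(records)
print(positives)  # [('NP_a', 'P1')]
print(negatives)  # [('NP_c', 'P2')]
```

   Leaving a gap between the two thresholds keeps ambiguous, weakly
   active pairs out of both classes.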

Compound-protein interactions prediction based on heterogeneous network
reveals sesquiterpenoid dimers that could interact with anti-hepatocellular
carcinoma targets

   The validation on benchmark and independent datasets indicated
   the good ability of CIPHEN to predict unreported CPIs. By
   January 2023, 12 kinase and immune checkpoint inhibitors
   had been approved in the United States and the P.R. China for the
   treatment of HCC
   ([99]https://www.globecancer.com/azzx/show.php?itemid=16313), and
   the proteins PDGFRB, KIT, RET, MET, and FLT3 are targets of kinase
   inhibitors including sorafenib, regorafenib, lenvatinib, cabozantinib,
   and donafenib. To test the ability of CIPHEN to predict compounds
   that interact with these well-known targets, the CIPHEN model was
   trained on ChEMBL data from which the interactions involving the
   above five targets and drugs had been excluded, and was then tested
   on whether it could identify the ligands of these targets. The known
   ligands were accurately reported within the top three predictions by
   CIPHEN ([100]Figure 4A), meaning that CIPHEN could uncover the
   ligands of anti-HCC targets. To discover new ligands for these
   targets, 18 SDs from plants of Artemisia species with marked
   antihepatoma activity[101]^39^,[102]^40^,[103]^41^,[104]^42 were
   screened by CIPHEN. As a result, atermeriopodin G7 derived from
   A. eriopoda[105]^39 was predicted as a ligand of RET (PDB:
   [106]2IVS) and FLT3 (PDB: [107]1RJB), and artemyrianolide H
   derivative 25[108]^40 was predicted as a ligand of MET (PDB:
   [109]1R0P) ([110]Figure 4B). The predictions were supported by
   protein-ligand docking, with affinities below −9.0 kcal/mol
   ([111]Figure 4C). These results indicated that CIPHEN could
   identify active SDs that interact with the given anti-HCC targets.
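
   The hold-out design used in this test (removing every interaction
   that involves a held-out target or drug before training, then
   checking where the known ligand ranks) can be sketched as follows.
   The function names, the placeholder compound names, and the scores
   are hypothetical; only the target symbol RET comes from the text.

```python
def split_by_held_out(interactions, held_out_targets, held_out_drugs):
    """Separate interactions that involve any held-out target or drug."""
    targets, drugs = set(held_out_targets), set(held_out_drugs)
    train, held_out = [], []
    for compound, target in interactions:
        if target in targets or compound in drugs:
            held_out.append((compound, target))
        else:
            train.append((compound, target))
    return train, held_out


def rank_of(compound, scores):
    """1-based rank of `compound` when candidates are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(compound) + 1


# Placeholder interactions: (compound, target) pairs.
interactions = [
    ("drug_A", "RET"),
    ("drug_A", "T9"),
    ("ligand_X", "RET"),
    ("ligand_Y", "T9"),
]
train, held_out = split_by_held_out(interactions, {"RET"}, {"drug_A"})
print(train)  # [('ligand_Y', 'T9')]

# Hypothetical prediction scores for one held-out target.
scores_for_ret = {"drug_A": 0.9, "ligand_Y": 0.2, "ligand_Z": 0.5}
print(rank_of("drug_A", scores_for_ret))  # 1
```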

Figure 4.

   CIPHEN reveals promising SDs interacting with known anti-HCC targets

   (A) CIPHEN accurately outputs the small molecules for known anti-HCC
   targets with the top 1% prediction probabilities.

   (B) The prediction probabilities for the interactions between 18 active
   SDs from Artemisia species and known anti-HCC targets.

   (C) The protein-ligand docking results of predicted new ligands and
   three receptor proteins of cabozantinib. More than two interactions
   and an affinity below −9.0 kcal/mol supported the predictions of
   CIPHEN, meaning CIPHEN could reveal active SDs interacting with
   anti-HCC targets.

Compound-protein interactions prediction based on heterogeneous network
unveils sesquiterpenoid dimers’ target proteins and modes of action

   The identification of new compounds for given targets is one
   successful application of CIPHEN. CIPHEN was then
   applied to another scenario, namely, predicting the target
   proteins of SDs with significant anti-HCC activities. After training
   on the ChEMBL dataset, CIPHEN predicted the target proteins of four
   SDs (artemyrianolide H derivative 25, atermeriopodin G7, KGA-5203,
   and artemzhongdianolide B9) from Artemisia species with marked
   inhibitory activity in HCC cells. Both the prediction results and
   biological experiments indicated that the potential targets of these
   four SDs belong to the MAPK signaling pathway, including MAP2K2,
   PDGFRA, and MAP2K3 ([114]Figure 5A). CIPHEN ranked these proteins
   within the top 1% of predictions. The interactions between
   artemyrianolide H derivative 25 and MAP2K2,[115]^40 atermeriopodin G7
   and PDGFRA,[116]^39 and KGA-5203 and PDGFRA[117]^42 were confirmed by
   surface plasmon resonance (SPR) experiments, while a Western blot
   experiment[118]^41 suggested that artemzhongdianolide B9 increases
   the expression of p-p38, which is downstream of MAP2K3 in the MAPK
   signaling pathway ([119]Figure 5B). All three MAPK genes displayed
   significantly different expression between TCGA LIHC tumors and
   normal tissues ([120]Figure 5C) and had ROC curves close to the 0–1
   baseline ([121]Figure S5). Moreover, PDGFRA was closely associated
   with the age at initial diagnosis and gender of patients with HCC,
   MAP2K2 and MAP2K3 were related to tumor stage, and MAP2K2 was linked
   to cancer survival ([122]Figure S5), suggesting a close relationship
   between the MAPK signaling pathway and HCC development. All four
   compounds displayed significant activities against HCC and could
   induce HCC cell apoptosis and inhibit cell invasion
   ([123]Figure 5D). These results indicated that CIPHEN could identify
   potential targets for SDs with marked anti-HCC effects, which
   provides insight into the molecular mechanism of SDs against HCC and
   could accelerate the development of SDs into drug candidates.

Figure 5.

   CIPHEN reveals anti-HCC candidates and their modes of action

   (A) CIPHEN outputs four SDs against HCC and experimentally validated
   targets.

   (B) The SPR and Western blot experiments confirm predictions of
   CIPHEN.

   (C) The expression of predicted targets in TCGA patients with HCC.
   ∗∗∗∗ indicates Wilcoxon Signed Rank Test p-value less than 1e-4.

   (D) The experiments exhibit the molecular mechanisms of SDs against HCC
   involved in cell apoptosis and invasion.

Discussion

   HNs are a useful tool for uncovering the underlying rules in multiple
   heterogeneous datasets and have had many successful applications in
   biomedicine, including the prediction of miRNA-disease
   associations,[126]^43 inference of regulatory networks,[127]^44
   and discovery of PPIs associated with SARS-CoV-2.[128]^45 To uncover
   potential target proteins across the entire protein space, we
   developed an HN-based model named CIPHEN. Using a modified
   metapath2vec method, a low-dimensional representation of the HN was
   generated and used to predict possible associations
   between new compounds and proteins via ML algorithms. The validation
   on benchmark datasets indicated the good ability of CIPHEN to recover
   known CPIs, the comparisons with other state-of-the-art methods
   showed the competitive performance of CIPHEN in predicting a
   compound’s mode of action, and the independent tests showed the
   capability of CIPHEN to identify unreported protein-ligand
   complexes. CIPHEN was applied to uncover the SDs interacting with
   given anti-HCC targets and the target proteins binding to SDs with
   significant anti-HCC activities, and the evidence from
   molecular docking and biological experiments indicated the
   effectiveness of CIPHEN in the discovery of anti-HCC agents. In
   conclusion, CIPHEN provides a promising opportunity to elucidate the
   mechanisms of action of compounds, which will help to accelerate
   new drug development for HCC treatment.

   Most CPI prediction models only focus on the known targets, while our
   CIPHEN can reveal target proteins interacting with unknown ligands.
   Among the top 100 predictions for four SDs, there were 68, 75, 78, and
   83 new proteins, respectively, indicating that the undisclosed
   mechanism of compounds might be discovered through CIPHEN.

   In the CIPHEN model, ML algorithms were used to
   predict possible CPIs, in contrast to approaches that use correlation
   analysis to mine the possible relationships in an HN.[129]^46 The
   competitive performance of CIPHEN relative to state-of-the-art
   methods illustrates the benefit of ML algorithms. In the future, we
   will design a more suitable strategy to decode the relationships in
   the HN and to discover the features that are important in determining
   CPIs.

   The CIPHEN model involves many parameters, such
   as the compound similarity index and the parameters of the word2vec
   model and ML algorithms. To avoid bias and simplify the whole
   procedure, the default parameters of the word2vec model and ML
   algorithms were applied. The AUCs obtained by leave-one-set-out
   validation on the BindingDB data with the Dice coefficient and with
   different numbers of negative samples in the word2vec model suggested
   that CIPHEN is robust to these parameters ([130]Figure S6).
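
   For reference, the Tanimoto and Dice similarity indices are related
   as follows for two binary fingerprints A and B, viewed as sets of
   on-bits. These are the standard definitions, given here as background
   rather than taken from the source.

```latex
T(A,B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}, \qquad
D(A,B) = \frac{2\,|A \cap B|}{|A| + |B|}, \qquad
D = \frac{2T}{1 + T}.
```

   Because D is a monotonically increasing function of T, the two
   indices rank compound pairs identically and only the numerical
   threshold changes (for example, T = 0.4 corresponds to D ≈ 0.571).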

   Graph convolutional networks (GCNs) are usually applied to homogeneous
   networks. Some heterogeneous graph convolutional network-based models
   have been developed for HNs, such as MHGCN[131]^47 and
   MHGCN+.[132]^48 Wang et al. identified two deficiencies of existing
   GCN methods for heterogeneous information networks (HINs): (1) they
   cannot flexibly explore all possible meta-paths and
   extract the most useful ones for each target object, which hinders both
   effectiveness and interpretability; and (2) before performing
   aggregation, they often require additional time-consuming
   pre-processing operations, which increase the computational
   complexity. Wang et al. then proposed an
   interpretable and efficient Heterogeneous Graph Convolutional Network
   (ie-HGCN) to learn the representations of
   HINs.[133]^49 MHGCN+ achieved Macro-F1 and Micro-F1 of 0.903 and
   0.908, respectively, on the AMiner dataset for node
   classification.[134]^2 The metapath2vec model obtained Macro-F1 and
   Micro-F1 of 0.921 and 0.926, respectively, on the AMiner dataset for
   node classification,[135]^23 indicating that
   metapath2vec-based embedding methods yield results comparable to
   those of GCN-based ones.

Limitations of the study

   Compound fingerprints were used to represent small molecules in the
   CIPHEN model. The validation on benchmark datasets suggested that
   CIPHEN performs well with compound fingerprints. Recently, graph
   neural networks (GNNs)[136]^50 and natural language processing (NLP)
   techniques[137]^51 have been reported to be successful in drug
   discovery. We therefore evaluated the efficacy of a GNN for molecular
   representation on the PDBbind data. In particular, three graph
   convolutional layers were included, the LeakyReLU function was used
   as the activation function, and the cross-entropy function was
   applied as the loss function. Low-dimensional representations of the
   atoms of a given compound were obtained, and a low-rank
   representation of the resulting matrix was used to represent that
   compound. The GNN was implemented in the PyTorch framework. The
   binary prediction results obtained by 5-fold cross-validation on
   PDBbind positive protein-ligand pairs (koff <0.01 μM) and negative
   protein-ligand pairs (koff >0.01 μM) are shown in [138]Figure S7. The
   values of AUC, AUPR, accuracy, sensitivity, specificity, precision,
   and F-measure did not vary much, and the GNN performed slightly
   better, with higher AUC (0.90 vs. 0.88), sensitivity (0.86 vs. 0.83),
   and AUPR (0.90 vs. 0.88). These results indicate that a GNN-based
   model might be a good choice for molecular representation. We will
   explore this type of method, such as a graph convolutional network
   with an attention mechanism, in future work.

STAR★Methods

Key resources table

   REAGENT or RESOURCE          SOURCE                                         IDENTIFIER
     __________________________________________________________________

   Deposited data
   Drug-target network          DrugBank uniportlinks.csv                      NA
   Protein-ligand network       BindingDB BindingDB_BindingDB_Inhibition.tsv;  NA
                                ChEMBL alltargetinfo.csv
   Protein-ligand complex       PDBbind koff_dataset                           NA
   Human PPI network            HumanNet HumanNet-PI.tsv                       NA
     __________________________________________________________________

   Software and algorithms
   R studio (R version 4.2.3)   [139]https://www.rstudio.com/                  NA
   CIPHEN                       [140]https://github.com/wangyc82/CIPHEN        NA

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   the lead contact, Chen Ji-Jun (chenjj@mail.kib.ac.cn).

Materials availability

   No new materials were generated in this study.

Data and code availability

   The benchmark and independent test data are available at DrugBank,
   BindingDB, ChEMBL, PDBbind, and HumanNet websites.

   Code is available at [142]https://github.com/wangyc82/CIPHEN.

   Additional information is available from the [143]lead contact upon
   request.

Method details

Resources for benchmark and independent test data

   The two benchmark datasets used to evaluate the performance of CIPHEN
   were obtained from two canonical CPI databases, DrugBank[144]^30 and
   BindingDB.[145]^31 The drug-target interactions were extracted from
   the uniportlinks.csv and BindingDB_BindingDB_Inhibition.tsv files of
   DrugBank and BindingDB, respectively, in February 2023. As a result, a
   total of 11,100 drug-target interactions between 2,227 approved drugs
   and 2,868 target proteins were obtained from DrugBank, and a total of
   38,522 ligand-receptor interactions between 26,950 ligands and 1,214
   protein receptors were obtained from BindingDB. The independent data
   used to test the generalization of CIPHEN were from PDBbind, which
   offers a comprehensive collection of experimentally measured binding
   affinity data for all biomolecular complexes deposited in the Protein
   Data Bank (PDB).[146]^38 The dataset used to train the prediction model
   for uncovering anti-HCC SDs was extracted from the ChEMBL
   alltargetinfo.csv file in February 2023,[147]^37 and contained 14,410
   ligand-receptor interactions between 1,753 ligands and 14,410 protein
   receptors. To expand the search space of predictions, the PPI network
   from HumanNet[148]^32 was introduced, providing a total of 316,998
   PPIs among 15,351 human proteins extracted from the HumanNet-PI.tsv
   file. On the basis of four types of compound fingerprints in
   OpenBabel, namely the path-based fingerprint FP2 and the
   substructure-based fingerprints FP3, FP4, and MACCS, Tanimoto
   coefficients[149]^29 were used to calculate the similarities between
   the given compounds and known drugs or ligands. The three-dimensional
   structural data files (SDFs) of compounds and ligands were downloaded
   from PubChem, BindingDB, and ChEMBL. The TCGA LIHC RNA-seq data and
   clinical properties were downloaded from GDC Firehose
   ([150]https://gdac.broadinstitute.org).

HN construction

   CIPHEN predicts CPIs on the basis of an HN with four
   types of nodes (newly defined compounds, drugs or ligands, target
   proteins, and proteins) and three types of edges (compound-drug
   interactions, drug-target interactions, and PPIs). The newly defined
   compounds and drugs/ligands were first represented by path-based and
   substructure-based fingerprints, and the compound-drug/ligand
   interactions were established by calculating Tanimoto coefficients.
   The drug-protein or ligand-receptor interactions obtained from DrugBank
   and BindingDB were used to establish the drug-target interaction
   network. The PPIs were then combined with the compound-drug and
   drug-target interactions to set up the compound-drug-target-protein
   HN. Proteins in DrugBank and BindingDB were matched by their UniProt
   IDs, and gene names obtained with the UniProt ID mapping tool were
   used to match proteins in the PPI network. Compounds could be referred
   to either by their common names or by their IDs in curated databases,
   because only similarity was required for construction of the HN. For
   DrugBank, the HN comprised a drug-target network between 2,227 drugs
   and 2,072 target proteins and a target-protein network between 2,072
   target proteins and 15,351 human proteins. For BindingDB, the HN
   comprised a drug-target network between 29,460 ligands and 1,069
   target proteins and a target-protein network between 1,069 target
   proteins and 15,351 human proteins.

   To control the sparsity of the network, the threshold for the
   similarity index was determined from the histogram of Tanimoto
   coefficients obtained with the four fingerprints. For instance, 0.4 was
   selected as the threshold for the Tanimoto index obtained with MACCS
   keys on BindingDB data ([151]Figure S8).
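
   The similarity step can be illustrated with a minimal sketch. It
   assumes each fingerprint is given as a set of on-bit positions and that
   a compound-drug edge is kept when the Tanimoto coefficient reaches the
   threshold; the function names, the toy bit positions, and the use of
   ">=" rather than ">" are illustrative assumptions, not details taken
   from the CIPHEN implementation.

```python
from itertools import product

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def similarity_edges(compounds, drugs, threshold=0.4):
    """Return (compound, drug, score) edges whose score reaches the threshold."""
    edges = []
    for (c_name, c_fp), (d_name, d_fp) in product(compounds.items(), drugs.items()):
        score = tanimoto(c_fp, d_fp)
        if score >= threshold:  # assumption: inclusive cutoff
            edges.append((c_name, d_name, score))
    return edges

# Toy fingerprints (hypothetical bit positions, not real MACCS keys).
compounds = {"cmpd_1": {1, 4, 7, 9}}
drugs = {"drug_A": {1, 4, 7, 12}, "drug_B": {2, 3, 5}}
edges = similarity_edges(compounds, drugs, threshold=0.4)
```

   Raising the threshold removes weak compound-drug edges and so makes the
   compound-drug layer of the network sparser.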

Network embedding

   Once the above compound-drug-target-protein HN was constructed, a
   modified version of metapath2vec, a network embedding method initially
   designed by Dong et al.,[152]^23 was used to learn the representations
   of the HN. In particular, five meta-paths were defined to describe the
   topological properties of the HN:
   compound-drug/ligand-compound, drug/ligand-target-drug/ligand,
   target-protein-target, compound-drug/ligand-target, and
   compound-drug/ligand-target-protein-target-drug/ligand-compound. Guided
   by these meta-paths, random walks were generated, and the word2vec
   model with the Skip-gram algorithm[153]^33 was applied to calculate the
   representations of the HN. As a result, the newly defined compounds,
   drugs/ligands, targets, and proteins were represented by network
   embedding vectors. The details of this subsection are provided in
   [154]Method S1: Network embedding learning, related to [155]Figure S1.
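
   A meta-path-guided random walk can be sketched as follows. The toy
   network, the node names, and the early stop at dead ends are
   illustrative assumptions; the original walks were generated in R, and
   the modifications made to metapath2vec are described only in Method S1.

```python
import random

# Toy heterogeneous network (all node names are hypothetical).
node_type = {"c1": "compound", "c2": "compound", "d1": "drug", "d2": "drug",
             "t1": "target", "t2": "target", "p1": "protein"}
adjacency = {
    "c1": ["d1"], "c2": ["d1", "d2"],
    "d1": ["c1", "c2", "t1"], "d2": ["c2", "t2"],
    "t1": ["d1", "p1"], "t2": ["d2", "p1"],
    "p1": ["t1", "t2"],
}

def metapath_walk(start, metapath, rng):
    """Walk from `start`, moving at each step to a random neighbour of the next type."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first type of the meta-path")
    walk = [start]
    for wanted in metapath[1:]:
        candidates = [n for n in adjacency[walk[-1]] if node_type[n] == wanted]
        if not candidates:  # assumption: stop early at a dead end
            break
        walk.append(rng.choice(candidates))
    return walk

# The longest of the five meta-paths, written as a sequence of node types.
cdtptdc = ["compound", "drug", "target", "protein", "target", "drug", "compound"]
rng = random.Random(0)
walks = [metapath_walk("c1", cdtptdc, rng) for _ in range(5)]
```

   Each walk is a sequence of node names that can be treated as a
   sentence and passed to a Skip-gram word2vec model, so that nodes
   appearing in similar walk contexts receive similar vectors.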

CPIs prediction

   Through the above network embedding procedure, the representations for
   the four types of nodes ($x_c, x_d, x_t, x_p$) were obtained. Three
   well-studied ML algorithms (SVM, RF, and DNN) were introduced to learn
   CPIs. In particular, suppose that there is a total of $n$ drug-target
   interactions in the above HN, which were used as training positives. A
   further $n$ randomly selected drug-protein pairs were introduced as
   training negatives. The training set $Trn$ was defined by

   $$Trn = \{(x_i, y_i),\ i = 1, \cdots, 2n\},$$

   where $x_i = (x_d, x_t)$ and

   $$y_i = \begin{cases} 1, & \text{if drug } d \text{ and target } t \text{ have an interaction} \\ -1, & \text{otherwise.} \end{cases}$$

   Given $s$ newly defined compounds and $l$ human proteins in the given
   PPI network, the testing set was represented by

   $$Tst = \{x_k,\ k = 1, \cdots, s \times l\},$$

   where $x_k = (x_c, x_p)$. More information is provided in [156]Method
   S2: Novel CPIs prediction, related to [157]Figure S1.
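
   The construction of the training and testing sets can be sketched as
   below. Two points are assumptions made for illustration and are not
   stated in the text: a pair is encoded by concatenating its two
   embedding vectors, and negatives are drawn only from pairs without a
   recorded interaction. All names and embedding values are hypothetical.

```python
import random
from itertools import product

def build_training_set(embedding, positives, drugs, proteins, rng):
    """Label known drug-target pairs 1 and an equal number of random other pairs -1."""
    known = set(positives)
    # Assumption: negatives are sampled only from pairs without a recorded interaction.
    candidates = [pair for pair in product(drugs, proteins) if pair not in known]
    negatives = rng.sample(candidates, len(positives))
    # Assumption: a pair is encoded by concatenating its two embedding vectors.
    samples = [(embedding[d] + embedding[t], 1) for d, t in positives]
    samples += [(embedding[d] + embedding[p], -1) for d, p in negatives]
    return samples

def build_testing_set(embedding, compounds, proteins):
    """Encode every (compound, protein) pair, giving s * l unlabeled samples."""
    return [embedding[c] + embedding[p] for c, p in product(compounds, proteins)]

# Toy two-dimensional embeddings (hypothetical values and names).
embedding = {
    "d1": [0.1, 0.2], "d2": [0.3, 0.1],
    "c1": [0.2, 0.2], "c2": [0.0, 0.4],
    "p1": [0.5, 0.1], "p2": [0.4, 0.6], "p3": [0.9, 0.3],
}
positives = [("d1", "p1"), ("d2", "p2")]  # n = 2 known interactions
trn = build_training_set(embedding, positives, ["d1", "d2"],
                         ["p1", "p2", "p3"], random.Random(0))
tst = build_testing_set(embedding, ["c1", "c2"], ["p1", "p2", "p3"])
```

   Here the training set holds 2n = 4 labeled samples and the testing set
   holds s × l = 2 × 3 = 6 unlabeled compound-protein pairs. For networks
   of realistic size, enumerating every candidate pair is wasteful, and
   rejection sampling of negatives would be a cheaper alternative.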

Model implementation and leave-one-set-out validation

   The path-based and substructure-based fingerprints in OpenBabel[158]^52
   were extracted via the ChemmineR R package using the compounds’
   three-dimensional SDF files. The random walks guided by the five
   pre-defined meta-paths were generated in R, and word2vec was
   implemented via the word2vec R package with default parameters. The ML
   algorithms were carried out via the e1071, randomForest, and h2o R
   packages for SVM, RF, and DNN, respectively. RF and DNN were run with
   default parameters, while SVM used a Gaussian kernel, with the model
   parameters determined via 3-fold cross-validation.

   Leave-one-set-out validation was conducted to evaluate the performance
   of CIPHEN. Specifically, 10 percent of the drugs or ligands were
   removed from the known drug-target network and regarded as the newly
   defined compounds, and the remaining drug-target interactions were used
   to establish the drug-target interactions of the HN. The Tanimoto
   coefficients between the new compounds and the remaining drugs/ligands
   were applied to construct the compound-drug network, and the HN was set
   up by integrating the PPI network. For instance, in the DrugBank data,
   the drug-target network contains 2,227 drugs and 2,072 target proteins.
   The validation was repeated five times, and for each
   'leave-one-set-out' validation, 227 randomly selected drugs and their
   interactions with target proteins were removed from the drug-target
   network and used as the testing data. Thus, for each round of
   validation, an HN was constructed with a compound-drug network between
   227 compounds and 2,000 drugs, a drug-target network between 2,000
   drugs and 2,072 target proteins, and a PPI network between 2,072 target
   proteins and 15,351 human proteins.
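
   The hold-out step of this validation can be sketched as follows. The
   function name and toy data are hypothetical, and the five held-out sets
   are drawn independently here, which is an assumption: the text does not
   say whether the sets were kept disjoint.

```python
import random

def leave_one_set_out(drug_targets, n_held_out, rng):
    """Hold out `n_held_out` random drugs together with all their target interactions."""
    held_out = set(rng.sample(sorted(drug_targets), n_held_out))
    # Remaining drugs keep their interactions and form the drug-target layer of the HN.
    remaining = {d: t for d, t in drug_targets.items() if d not in held_out}
    # Held-out drugs play the role of newly defined compounds; their targets are the truth.
    testing = {d: t for d, t in drug_targets.items() if d in held_out}
    return remaining, testing

# Toy drug-target network: 20 drugs, each with one target (hypothetical names).
drug_targets = {f"drug_{i}": {f"target_{i % 4}"} for i in range(20)}

# Five rounds, each holding out 2 of the 20 drugs (10 percent).
splits = [leave_one_set_out(drug_targets, 2, random.Random(seed)) for seed in range(5)]
```

   In each round the held-out drugs would be reconnected to the network
   only through their fingerprint similarity to the remaining drugs, so
   the model never sees their recorded targets during training.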

   The prediction results were compared with the true interactions, and
   the ROC curve,[159]^53 the precision-recall (PR) curve,[160]^54 and the
   areas under the ROC curve (AUC) and the PR curve (AUPR) were used as
   evaluation criteria. KEGG pathway enrichment analysis was performed via
   the DAVID Bioinformatics tools. AutoDock Vina with default parameters
   was used to perform protein-ligand docking. The protein and compound
   structures were pre-processed via OpenBabel. More details of the
   implementation and evaluation are provided in [161]Method S3: CIPHEN
   implementation and evaluation, related to [162]Figure S1.
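
   For readers unfamiliar with the two criteria, the sketch below computes
   them from scratch on a toy example. It is not the evaluation code used
   for CIPHEN: AUC is computed as the fraction of positive-negative pairs
   ranked correctly, and AUPR is approximated by average precision, which
   is one common estimator among several. Labels follow the 1 / -1
   convention used for the training set.

```python
def roc_auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits

# Toy predictions: two true interactions and two non-interactions.
labels = [1, -1, 1, -1]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)             # 3 of 4 positive-negative pairs ordered correctly
aupr = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

   When scores are tied, the average precision computed this way depends
   on the order of the tied items, so a library implementation is
   preferable for reported results.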

Quantification and statistical analysis

   p values in the scatterplots were calculated with the Wilcoxon signed
   rank test, with ∗p < 5e-2, ∗∗p < 1e-2, ∗∗∗p < 1e-3, and ∗∗∗∗p < 1e-4
   considered statistically significant. R was used to analyze the raw
   imaging data, and the plots and statistical significance tests were
   done with the R ggplot2 package.
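
   The star notation maps onto the p value cutoffs as in the short sketch
   below. The function name and the "ns" label for non-significant values
   are illustrative assumptions, and ASCII asterisks stand in for the
   symbols used in the figures.

```python
def significance_stars(p_value):
    """Return the star label for a p value, checking the strictest cutoff first."""
    for cutoff, stars in ((1e-4, "****"), (1e-3, "***"), (1e-2, "**"), (5e-2, "*")):
        if p_value < cutoff:
            return stars
    return "ns"  # assumption: label for p >= 0.05

labels = [significance_stars(p) for p in (0.2, 0.03, 0.005, 5e-4, 1e-5)]
```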

SPR and western blot experiments

   SPR experiments for atermeriopodin G7, KGA-5203, and artemyrianolide H
   derivative 25 were performed in previous
   studies,[163]^39^,[164]^40^,[165]^42 and the Western blot experiment for
   artemzhongdianolide B9 was conducted in a previous study.[166]^41

Acknowledgments