Highlights

• CIPHEN is an HN-based model that screens CPIs across the entire human protein space
• CIPHEN is comparable to other methods and capable of unveiling unrecorded CPIs
• CIPHEN highlights the mechanisms of NPs and helps in drug discovery and development

Oncology; Machine learning

Introduction

Liver cancer is a global health challenge with an estimated incidence of >1 million cases by 2025.[28]^1 Hepatocellular carcinoma (HCC) is the most prevalent subtype of liver cancer and is caused by a variety of factors, including genetic, environmental, and behavioral factors.[29]^2 Multi-kinase and immune checkpoint inhibitors have achieved survival benefits in patients with HCC, but only a subset of patients respond clinically to these therapeutic agents. Natural products (NPs) are an important resource for new drug development in cancer and have attracted the attention of pharmaceutical and pharmacological researchers because of their diverse structures and significant anti-HCC activities.[30]^3^,[31]^4 Icaritin, a multi-target immunomodulatory small molecule derived from Epimedium brevicornum, was approved for the treatment of HCC in 2022.[32]^5 The major obstacle in NP-based drug development is determining which specific genes or proteins, and which corresponding signaling pathways, are affected by a given compound.
Usually, this procedure relies on a large number of pharmacology, toxicology, and molecular biology experiments, which are costly and time-consuming.[33]^6^,[34]^7^,[35]^8 Based on multi-omics data, computational methods can perform high-throughput screening of compound-protein interactions (CPIs) more efficiently and at low cost.[36]^9^,[37]^10

The current computational approaches for CPI prediction can generally be divided into two categories: guilt-by-association (GA)-based and network-based methods. GA-based methods are built on the assumption that similar compounds interact with the same proteins or functionally related proteins and exert similar modes of action. In most cases, "similar" means having close three-dimensional chemical structures.[38]^11^,[39]^12^,[40]^13 CPIs have also been predicted on the assumption that compounds with similar side effects share target proteins.[41]^14 Besides the chemical structures of compounds, protein sequences and other properties have been integrated into computational models to improve accuracy.[42]^15^,[43]^16^,[44]^17^,[45]^18 Network-based methods incorporate multi-layer interaction networks, including compound-drug interaction, drug-target interaction, and protein-protein interaction (PPI) networks, to identify unreported CPIs,[46]^19^,[47]^20^,[48]^21^,[49]^22 and are constructed on the basis of network embedding, neural factorization machines, and other graph models. Both GA-based and network-based methods can predict target proteins that already have known ligands, but cannot predict proteins that have no known ligands. NPs have complicated structures and may act through diverse molecular mechanisms by interacting with unique proteins. Thus, constructing methods that predict targets for NPs is of considerable interest.
A heterogeneous network (HN) represents node and link information through multiple types of nodes and edges.[50]^23^,[51]^24^,[52]^25 Compared with a homogeneous network, which contains only one type of node and edge, an HN can describe the structure and function of complicated real-world systems more accurately and comprehensively, accommodates the heterogeneity of biological data, and offers good model interpretability. HNs have therefore been widely used to elucidate drug modes of action. For instance, Chen et al. learned drug-target-protein-disease relationships through an HN for drug repositioning[53]^26; Wei et al. applied an HN to infer drug-disease associations by integrating multi-scale biomedical data resources[54]^27; and An et al.[55]^28 proposed an HN learning algorithm, Network EmbeDding framework in mulTiPlex networks (NEDTP), to accurately predict CPIs by incorporating 15 types of heterogeneous information. These works indicate the importance of HNs in analyzing the molecular mechanisms of drugs in complex diseases, including cancer, and their usefulness in integrating multi-source data for large-scale prediction.

Herein, we propose a new computational approach, named CIPHEN, to predict CPIs across the entire protein space and to explore promising NPs and their molecular mechanisms against HCC. Specifically, an HN was established by integrating the compound-drug interaction, drug-target interaction, and PPI networks. The compound-drug relationships were calculated as Tanimoto coefficients[56]^29 of compound fingerprints, the drug-protein interactions were taken from two canonical CPI databases, DrugBank[57]^30 and BindingDB,[58]^31 and the PPI network was obtained from HumanNet.[59]^32 Random walks were generated under the guidance of pre-defined meta-paths of the HN, and the word2vec algorithm[60]^33 was applied to obtain a low-dimensional representation of the HN.
Finally, well-studied machine learning (ML) algorithms, including support vector machines (SVMs),[61]^34 random forest (RF),[62]^35 and deep neural network (DNN),[63]^36 were introduced to learn the underlying mechanism of drug-target interactions and identify potential CPIs.

Several procedures were implemented to validate the CIPHEN model. The prediction performance of CIPHEN was evaluated by leave-one-set-out validation on two benchmark datasets and compared with state-of-the-art prediction tools and other network-based models. The generalization ability of CIPHEN was tested on biomolecular complexes with experimentally measured binding affinity data. The CIPHEN model was subsequently applied to anti-HCC targets and to antihepatoma-active sesquiterpenoid dimers (SDs) from plants of Artemisia species. Supporting evidence from the literature and from experiments suggested that CIPHEN is effective in predicting SDs that interact with known anti-HCC targets and in predicting target proteins for SD candidates, which could promote the development of SDs into drug candidate molecules. CIPHEN is freely available at [64]https://github.com/wangyc82/CIPHEN.

Results

CIPHEN was proposed to conduct high-throughput screening of CPIs and to predict promising antihepatoma agents and their target proteins. First, an HN was constructed by integrating drug-target interactions from canonical CPI databases, well-known human PPIs, and compound-drug interactions ([65]Figure 1A). Second, a low-dimensional representation of the HN was obtained through a modified embedding method ([66]Figure 1B). The underlying mechanism of drug/ligand-protein interactions was then learned by ML algorithms, and unreported CPIs were revealed. The CIPHEN model was validated on biomolecular complexes with experimentally measured binding affinity data and on SDs with obvious antihepatoma activity ([67]Figure 1C). More details of CIPHEN are given in Methods.

Figure 1.
The framework of CIPHEN
(A) An HN was constructed by integrating known drug-target interactions and PPIs with compound-drug interactions established from Tanimoto coefficients of path-based and substructure-based compound fingerprints.
(B) The constructed HN was represented by low-dimensional network embedding vectors through a modified metapath2vec model.
(C) The in-depth mechanism of drug-protein interactions was explored by ML algorithms, and the prediction results were validated by protein-ligand docking and biological experiments.

Various evaluation procedures were employed to assess the performance of CIPHEN. CIPHEN was found to uncover known CPIs, outperform state-of-the-art approaches, and generate unrecorded ligand-protein complexes. CIPHEN also facilitated the identification of SD candidates against HCC.

Compound-protein interactions prediction based on heterogeneous network accurately outputs known compound-protein interactions

The performance of CIPHEN was first evaluated on two benchmark datasets[70]^30^,[71]^31 via leave-one-set-out validation. The primary CIPHEN model was trained with four types of compound fingerprints in OpenBabel (path-based FP2, and substructure-based FP3, FP4, and MACCS) and three types of ML algorithms (SVMs, RF, and DNN). To determine the best ML algorithm for prediction, 10 percent of the drug-target interactions in the benchmark data were randomly selected as testing data, this was repeated five times, and the predictions of the three ML algorithms with the FP2 fingerprint representation were each compared with the experimentally validated interactions. The ROC curves, PR curves, AUCs, and AUPRs indicated that both RF and SVMs were suitable for prediction, with better ROC and PR curves and higher AUCs and AUPRs than DNN ([72]Figure S1).
RF was chosen as the final ML algorithm in the CIPHEN model because the SVM model required extra cross-validation to determine its optimal parameters.

The various compound fingerprints were evaluated on the two benchmark datasets through leave-one-set-out validation. On both benchmark datasets, the four types of fingerprints achieved comparable prediction results, with no significant differences in ROC curves, PR curves, AUCs, or AUPRs ([73]Figures 2 and [74]S2), suggesting that CIPHEN is robust across fingerprint types. FP2 and MACCS performed slightly better than FP3 and FP4 ([75]Figures 2 and [76]S2). Because the well-known SwissTarget Prediction tool also employs FP2 to represent compounds,[77]^13 FP2 was used in the following evaluation.

Figure 2. The performance of CIPHEN based on various compound fingerprints through leave-one-set-out validation on the DrugBank dataset
(A and B) ROC curves and AUCs obtained by various compound fingerprints on the DrugBank dataset.
(C and D) PR curves and AUPRs obtained by various compound fingerprints on the DrugBank dataset.

Unlike most ML-based CPI learning methods, CIPHEN is based on an HN. A comparison of the CIPHEN model with ML-based methods[80]^16 on the DrugBank dataset showed that the HN-based model performed better ([81]Figure S3), suggesting the superiority of the HN in heterogeneous data integration and CPI prediction.
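The AUC and AUPR criteria used in this evaluation can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names are hypothetical, the AUC is computed as the probability that a positive pair outranks a negative pair, and average precision is used as a step-wise estimate of AUPR, which may differ slightly from the interpolation used by a given plotting package.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision values at each true positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive pair")
    # Rank pairs from highest to lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy example: four held-out compound-protein pairs (illustrative values only).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

The pairwise AUC is quadratic in the number of pairs, which is acceptable for a small illustration; a rank-based formulation would be preferable for full benchmark sets.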
Compound-protein interactions prediction based on heterogeneous network improves the performance in predicting unreported compound-protein interactions

To demonstrate the effectiveness of CIPHEN in uncovering undisclosed target proteins of newly defined compounds, the CIPHEN model was first compared with a well-known target prediction tool, the SwissTarget Prediction tool.[82]^13 CIPHEN was trained and tested on the data used for the SwissTarget Prediction tool,[83]^13 which came from another canonical CPI dataset, ChEMBL.[84]^37 The prediction results were compared with those reported previously.[85]^13 CIPHEN performed better than the SwissTarget Prediction tool, with a higher AUC ([86]Figure 3A) and more true positives among the top 1% of predictions ([87]Figure 3B). The SwissTarget study[88]^13 reported a prediction example for compound CHEMBL2325087 to illustrate the good performance of SwissTarget. For the two validated targets of CHEMBL2325087, CIPHEN achieved prediction scores that were much higher than or comparable to those of SwissTarget ([89]Figure 3C), indicating that CIPHEN could improve prediction performance in revealing new CPIs.

Figure 3. Comparison and independent test of the CIPHEN model
(A) ROC curves and AUCs obtained by CIPHEN and the SwissTarget Prediction tool.
(B) The percentage of true positives in predictions obtained by CIPHEN and SwissTarget.
(C) The prediction results for the compound and its experimentally validated targets.
(D) ROC plot of the prediction results on an independent dataset, PDBbind.
(E) The prediction probability for protein-ligand pairs with different koff values in PDBbind. ∗ indicates a Wilcoxon Signed Rank Test p-value less than 0.05.
(F) The frequency of true positives among top predictions.
Besides the comparison with well-known prediction tools, CIPHEN was also compared with other network-based methods in the literature.[92]^22 In that work, 10-fold cross-validation on a small subset of the DrugBank data was used to evaluate several network-based prediction models, and the KGE_NMF model surpassed the other models with the highest AUPR of 0.961.[93]^22 Using 10-fold cross-validation on the DrugBank benchmark data, our CIPHEN model achieved an AUPR of 0.965. Like other network-based models, KGE_NMF focuses only on proteins with known ligands, whereas CIPHEN can reveal unrecorded target proteins. These results suggest that CIPHEN is comparable to the state-of-the-art prediction model and is a powerful and robust framework for CPI prediction across the whole protein space.

Compound-protein interactions prediction based on heterogeneous network exhibits good generalization ability on independent datasets

The generalization ability of CIPHEN was tested on 680 protein-ligand complexes with koff values in the PDBbind database,[94]^38 comprising 680 small molecules and 406 proteins. According to the histogram of koff values for these 680 protein-ligand complexes, 0.01 was set as the threshold to distinguish positive protein-ligand pairs from negative ones. The training set came from the ChEMBL dataset, which includes 14,410 ligand-receptor interactions between 1,753 ligands and 14,410 proteins, and CIPHEN obtained an AUC of about 0.64 ([95]Figure 3D). The prediction probability was significantly higher for positive protein-ligand complexes than for negative ones ([96]Figure 3E, Wilcoxon Signed Rank Test p-value <0.01), and over 60% true positives were achieved in the top 1%, top 5%, and top 10% of predictions ([97]Figure 3F).
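The two evaluation steps in this paragraph, labeling pairs by a koff threshold and counting true positives among the top-ranked predictions, can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, pairs with koff below the threshold are treated as positives (as stated for the PDBbind pairs in the Limitations section), and rounding the top fraction up to at least one prediction is an assumption.

```python
import math


def label_by_koff(koff_values, threshold=0.01):
    """Label a protein-ligand pair positive (1) when its koff is below the threshold."""
    return [1 if k < threshold else 0 for k in koff_values]


def top_fraction_hit_rate(labels, scores, fraction):
    """Fraction of true positives among the top `fraction` of ranked predictions."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Assumption: round up so that at least one prediction is always inspected.
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k


# Toy example with made-up koff values and prediction probabilities.
koff = [0.001, 0.5, 0.005, 0.02]
labels = label_by_koff(koff)          # positives are the pairs with koff < 0.01
scores = [0.9, 0.8, 0.7, 0.1]
rate_top_half = top_fraction_hit_rate(labels, scores, 0.5)
```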
Another independent dataset came from NPASS (Natural Product Activity and Species Source), which provides the experimental activity values and species sources of 35,032 NPs from 25,041 species targeting 5,863 targets (2,946 proteins, 1,352 microbial species, and 1,227 cell lines). This dataset contains 446,552 quantitative activity records (e.g., IC50, Ki, EC50, GI50, or MIC, mainly in units of nM) of 222,092 NP-target pairs and 288,002 NP-species pairs. NP-protein pairs with a Ki value less than 0.01 μM were selected as true positives, and NP-protein pairs with a Ki value larger than 10 μM were selected as true negatives. CIPHEN was trained on ChEMBL data and tested on these positive and negative NP-protein pairs, and the predictions were compared with the ground truth. The AUC of 0.61 and the significant difference in prediction probability between positives and negatives ([98]Figure S4) suggested that CIPHEN generalizes to NPs and could be applied to uncover unreported CPIs.

Compound-protein interactions prediction based on heterogeneous network reveals sesquiterpenoid dimers that could interact with anti-hepatocellular carcinoma targets

The validation on benchmark and independent datasets indicated the good ability of CIPHEN to predict unreported CPIs. By January 2023, 12 kinase and immune checkpoint inhibitors had been approved in the United States and P.R. China for the treatment of HCC ([99]https://www.globecancer.com/azzx/show.php?itemid=16313), and the proteins PDGFRB, KIT, RET, MET, and FLT3 are targets of kinase inhibitors including sorafenib, regorafenib, lenvatinib, cabozantinib, and donafenib.
To test the ability of CIPHEN to predict compounds that could interact with these well-known targets, the CIPHEN model was trained on ChEMBL data from which the interactions involving the above five targets and drugs had been excluded, and we tested whether CIPHEN could identify the ligands of these targets. The known ligands were accurately reported among the top three predictions by CIPHEN ([100]Figure 4A), meaning that CIPHEN could uncover the ligands of anti-HCC targets. To discover new ligands for these targets, 18 SDs from plants of Artemisia species with obvious antihepatoma activity[101]^39^,[102]^40^,[103]^41^,[104]^42 were screened by CIPHEN. As a result, atermeriopodin G7 derived from A. eriopoda[105]^39 was predicted as a ligand of RET (PDB: [106]2IVS) and FLT3 (PDB: [107]1RJB), and artemyrianolide H derivative 25[108]^40 was predicted as a ligand of MET (PDB: [109]1R0P) ([110]Figure 4B). The predictions were confirmed by protein-ligand docking, with affinities below −9.0 kcal/mol ([111]Figure 4C). These results indicated that CIPHEN could identify active SDs interacting with the given anti-HCC targets.

Figure 4. CIPHEN reveals promising SDs interacting with known anti-HCC targets
(A) CIPHEN accurately outputs the small molecules for known anti-HCC targets with the top 1% prediction probabilities.
(B) The prediction probabilities for the interactions between 18 active SDs from Artemisia species and known anti-HCC targets.
(C) The protein-ligand docking results of predicted new ligands and three receptor proteins of cabozantinib. More than two interactions and an affinity below −9.0 kcal/mol confirmed the predictions of CIPHEN, meaning that CIPHEN could reveal active SDs interacting with anti-HCC targets.
Compound-protein interactions prediction based on heterogeneous network unveils sesquiterpenoid dimers’ target proteins and mode of actions

The identification of new compounds for given targets is one successful application of CIPHEN. CIPHEN was then applied to a second scenario, namely, predicting the target proteins of SDs with significant anti-HCC activities. After training on the ChEMBL dataset, CIPHEN predicted the target proteins of four SDs (artemyrianolide H derivative 25, atermeriopodin G7, KGA-5203, and artemzhongdianolide B9) from Artemisia species with marked inhibitory activity in HCC cells. Both the prediction results and biological experiments indicated that the potential targets of these four SDs belong to the MAPK signaling pathway, including MAP2K2, PDGFRA, and MAP2K3 ([114]Figure 5A). CIPHEN ranked these proteins in the top 1% of predictions. The interactions between artemyrianolide H derivative 25 and MAP2K2,[115]^40 atermeriopodin G7 and PDGFRA,[116]^39 and KGA-5203 and PDGFRA[117]^42 were confirmed by surface plasmon resonance (SPR) experiments, while a Western blot experiment[118]^41 suggested that artemzhongdianolide B9 increases the expression of p-p38, which is downstream of MAP2K3 in the MAPK signaling pathway ([119]Figure 5B). All three MAPK genes were expressed at significantly different levels in TCGA LIHC tumors and normal tissues ([120]Figure 5C) and had ROC curves close to the 0–1 baseline ([121]Figure S5). Moreover, PDGFRA was closely associated with the age at initial diagnosis and the gender of patients with HCC, MAP2K2 and MAP2K3 were related to tumor stage, and MAP2K2 was linked to cancer survival ([122]Figure S5), suggesting a close relationship between the MAPK signaling pathway and HCC development. All four compounds displayed significant activities against HCC and could induce HCC cell apoptosis and inhibit cell invasion ([123]Figure 5D).
The above results indicated that CIPHEN could identify potential targets for SDs with obvious anti-HCC effects, which will provide insight into the molecular mechanisms of SDs against HCC and accelerate the development of SDs into drug candidates.

Figure 5. CIPHEN reveals anti-HCC candidates and their mode of actions
(A) CIPHEN outputs four SDs against HCC and their experimentally validated targets.
(B) The SPR and Western blot experiments confirm the predictions of CIPHEN.
(C) The expression of the predicted targets in TCGA patients with HCC. ∗∗∗∗ indicates a Wilcoxon Signed Rank Test p-value less than 1e-4.
(D) The experiments show that the molecular mechanisms of SDs against HCC involve cell apoptosis and invasion.

Discussion

HNs are a useful tool for unveiling the underlying rules in multiple heterogeneous data sources and have many successful applications in biomedicine, including the prediction of miRNA-disease associations,[126]^43 the inference of regulatory networks,[127]^44 and the discovery of PPIs associated with SARS-CoV-2,[128]^45 among others. To uncover potential target proteins across the entire protein space, we developed an HN-based model named CIPHEN. Using a modified metapath2vec method, a low-dimensional representation of the HN was generated and then used to predict possible associations between new compounds and proteins via ML algorithms. The validation on benchmark datasets indicated the good ability of CIPHEN to reveal known CPIs, the comparisons with other state-of-the-art methods showed the competitive performance of CIPHEN in predicting compounds' modes of action, and the independent tests showed the capability of CIPHEN to identify unreported protein-ligand complexes.
CIPHEN was applied to uncover SDs interacting with given anti-HCC targets and, separately, the target proteins binding to SDs with significant anti-HCC activities, and the evidence from molecular docking and biological experiments indicated the effectiveness of CIPHEN in the discovery of anti-HCC agents. In conclusion, CIPHEN provides a promising opportunity to elucidate the mechanisms of action of compounds, which will help to accelerate new drug development for HCC treatment.

Most CPI prediction models focus only on known targets, whereas CIPHEN can reveal target proteins that have no known ligands. Among the top 100 predictions for the four SDs, there were 68, 75, 78, and 83 new proteins, respectively, indicating that undisclosed mechanisms of compounds might be discovered through CIPHEN.

In the CIPHEN model, ML algorithms were used to predict possible CPIs, which differs from using correlation analysis to mine possible relationships in the HN.[129]^46 The competitive performance of CIPHEN relative to state-of-the-art methods illustrated the benefit of ML algorithms. In the future, we will design a more suitable strategy to decode the relationships in the HN and discover the features that are important in determining CPIs.

Constructing the CIPHEN model involves many parameters, such as the compound similarity index and the parameters of the word2vec model and the ML algorithms. To avoid bias and simplify the whole procedure, the default parameters of the word2vec model and the ML algorithms were applied. The AUCs obtained by leave-one-set-out validation on the BindingDB data with the Dice coefficient and with different numbers of negative samples in the word2vec model suggested that the performance of CIPHEN is robust to these parameters ([130]Figure S6).

Graph convolutional networks (GCNs) are usually applied to homogeneous networks.
Some heterogeneous graph convolutional network-based models have been developed for HNs, such as MHGCN[131]^47 and MHGCN+.[132]^48 Wang et al. addressed two deficiencies of existing GCN methods for heterogeneous information networks (HINs): (1) they cannot flexibly explore all possible meta-paths and extract the most useful ones for each target object, which hinders both effectiveness and interpretability; (2) before performing aggregation, heterogeneous graph convolutional network-based models often require additional time-consuming pre-processing operations, which increase the computational complexity. Wang et al. then proposed an interpretable and efficient Heterogeneous Graph Convolutional Network (ie-HGCN) to learn the representations of heterogeneous information networks.[133]^49 MHGCN+ achieved Macro-F1 and Micro-F1 of 0.903 and 0.908, respectively, on the AMiner dataset for node classification.[134]^2 The metapath2vec model obtained Macro-F1 and Micro-F1 of 0.921 and 0.926, respectively, on the AMiner dataset for node classification,[135]^23 indicating that metapath2vec-based embedding methods yield results comparable to those of GCN-based ones.

Limitations of the study

Compound fingerprints were used to represent small molecules in our CIPHEN model. The validation on benchmark datasets suggested that CIPHEN performs well with compound fingerprints. Recently, graph neural networks (GNNs)[136]^50 and natural language processing (NLP) techniques[137]^51 have been reported to be successful in drug discovery. The efficacy of a GNN in molecular representation was validated on the PDBbind data. In particular, three graph convolutional layers were included, the LeakyReLU function was used as the activation function, and the cross-entropy function was applied as the loss function. Low-dimensional representations of the atoms of a given compound were obtained, and the row-rank representation of the matrix was introduced to represent that compound.
The GNN algorithm was implemented with the PyTorch framework. The binary prediction results obtained by 5-fold cross-validation on PDBbind positive protein-ligand pairs (Koff <0.01 μM) and negative protein-ligand pairs (Koff >0.01 μM) are shown in [138]Figure S7. The values of AUC, AUPR, accuracy, sensitivity, specificity, precision, and F-measure did not vary much, and the GNN performed slightly better, with higher AUC (0.90 vs. 0.88), sensitivity (0.86 vs. 0.83), and AUPR (0.90 vs. 0.88). These results indicated that a GNN-based model might be a good choice for molecular representation. We will try this type of method, such as a graph convolutional network with an attention mechanism, in our future work.

STAR★Methods

Key resources table

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
Drug-target network | DrugBank uniportlinks.csv | NA
Protein-ligand network | BindingDB BindingDB_BindingDB_Inhibition.tsv; ChEMBL alltargetinfo.csv | NA
Protein-ligand complex | PDBbind koff_dataset | NA
Human PPI network | HumanNet HumanNet-PI.tsv | NA
Software and algorithms
R studio (R version 4.2.3) | [139]https://www.rstudio.com/ | NA
CIPHEN | [140]https://github.com/wangyc82/CIPHEN | NA

Resource availability

Lead contact

Further information and requests for resources should be directed to the lead contact, Chen Ji-Jun (chenjj@mail.kib.ac.cn).

Materials availability

No new materials were generated in this study.

Data and code availability

The benchmark and independent test data are available at the DrugBank, BindingDB, ChEMBL, PDBbind, and HumanNet websites. Code is available at [142]https://github.com/wangyc82/CIPHEN. Additional information is available from the [143]lead contact upon request.
Method details

Resources for benchmark and independent test data

The two benchmark datasets used to evaluate the performance of CIPHEN were obtained from two canonical CPI databases, DrugBank[144]^30 and BindingDB.[145]^31 The drug-target interactions were extracted from the uniportlinks.csv and BindingDB_BindingDB_Inhibition.tsv files of DrugBank and BindingDB, respectively, in February 2023. As a result, a total of 11,100 drug-target interactions between 2,227 approved drugs and 2,868 target proteins were obtained from DrugBank, and a total of 38,522 ligand-receptor interactions between 26,950 ligands and 1,214 protein receptors were obtained from BindingDB. The independent data used to test the generalization of CIPHEN came from PDBbind, which offers a comprehensive collection of experimentally measured binding affinity data for all biomolecular complexes deposited in the Protein Data Bank (PDB).[146]^38 The dataset used to train the prediction model for uncovering anti-HCC SDs was extracted from the ChEMBL alltargetinfo.csv file in February 2023,[147]^37 and contained 14,410 ligand-receptor interactions between 1,753 ligands and 14,410 protein receptors. To expand the search space of predictions, the PPI network from HumanNet[148]^32 was introduced, providing a total of 316,998 PPIs among 15,351 human proteins extracted from the HumanNet-PI.tsv file. On the basis of four types of compound fingerprints in OpenBabel, namely the path-based fingerprint FP2 and the substructure-based fingerprints FP3, FP4, and MACCS, Tanimoto coefficients[149]^29 were used to calculate the similarities between the given compounds and known drugs or ligands. The three-dimensional structural data files (SDFs) of compounds and ligands were downloaded from PubChem, BindingDB, and ChEMBL. The TCGA LIHC RNA-seq data and clinical properties were downloaded from GDC Firehose ([150]https://gdac.broadinstitute.org).
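The Tanimoto similarity between fingerprints can be illustrated with a minimal Python sketch. It is not the authors' R implementation: each fingerprint is assumed to be given as the set of indices of its "on" bits, the function names are hypothetical, and the similarity cut-off is a free parameter here (the HN construction subsection describes how the threshold was chosen from histograms).

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: similarity is defined as 0 here by convention.
        return 0.0
    return len(a & b) / union


def similarity_edges(new_compounds, known_drugs, threshold):
    """Compound-drug edges for pairs whose Tanimoto coefficient reaches the threshold.

    Both arguments map a name (hypothetical identifiers) to its on-bit set.
    """
    edges = []
    for c_name, c_fp in new_compounds.items():
        for d_name, d_fp in known_drugs.items():
            score = tanimoto(c_fp, d_fp)
            if score >= threshold:
                edges.append((c_name, d_name, score))
    return edges


# Toy fingerprints (made-up bit indices, not real FP2 or MACCS output).
compounds = {"c1": {1, 2, 3}}
drugs = {"d1": {2, 3, 4}, "d2": {9}}
edges = similarity_edges(compounds, drugs, threshold=0.4)
```

Only similarity is needed to connect a new compound to the network, so any fingerprint that can be reduced to a bit set fits this sketch.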
HN construction

CIPHEN predicts CPIs on the basis of an HN with four types of nodes (newly defined compounds, drugs or ligands, target proteins, and proteins) and three types of edges (compound-drug interactions, drug-target interactions, and PPIs). The newly defined compounds and the drugs/ligands were first represented by path-based and substructure-based fingerprints, and the compound-drug/ligand interactions were established by calculating Tanimoto coefficients. The drug-protein or ligand-receptor interactions obtained from DrugBank and BindingDB were used to establish the drug-target interaction network. The PPIs were then combined with the compound-drug and drug-target interactions to set up the compound-drug-target-protein HN. Proteins in DrugBank and BindingDB were matched by their UniProt IDs, and the gene names obtained with the UniProt ID mapping tool were used to match proteins in the PPI network. Compounds could be referred to either by their common names or by their IDs in curated databases, because only similarity was required for construction of the HN. For DrugBank, an HN was constructed with a drug-target network between 2,227 drugs and 2,072 target proteins and a target-protein network between 2,072 target proteins and 15,351 human proteins. For BindingDB, an HN was constructed with a drug-target network between 29,460 ligands and 1,069 target proteins and a target-protein network between 1,069 target proteins and 15,351 human proteins. To control the sparsity of the network, the threshold for the similarity index was determined from the histogram of Tanimoto coefficients obtained with each of the four fingerprints. For instance, 0.4 was selected as the threshold for the Tanimoto index obtained with MACCS keys on the BindingDB data ([151]Figure S8).

Network embedding

Once the above compound-drug-target-protein HN was constructed, a modified version of metapath2vec, a network embedding method initially designed by Dong et al.,[152]^23 was used to learn the representations of the HN.
In particular, five meta-paths were defined to describe the topological properties of the HN: compound-drug/ligand-compound, drug/ligand-target-drug/ligand, target-protein-target, compound-drug/ligand-target, and compound-drug/ligand-target-protein-target-drug/ligand-compound. Guided by these meta-paths, random walks were generated, and the word2vec model with the Skip-gram algorithm[153]^33 was applied to calculate the representations of the HN. As a result, the newly defined compounds, drugs/ligands, targets, and proteins were represented by network embedding vectors. Details of this subsection are provided in [154]Method S1: Network embedding learning, related to [155]Figure S1.

CPIs prediction

Through the above network embedding procedure, the representations of the four types of nodes ($x_c$, $x_d$, $x_t$, $x_p$) were obtained. Three well-studied ML algorithms (SVM, RF, and DNN) were introduced to learn CPIs. Suppose that the above HN contains a total of $n$ drug-target interactions, which were used as training positives. A further $n$ randomly selected drug-protein pairs were introduced as training negatives. The training set $Trn$ was defined as $Trn = \{(x_i, y_i),\ i = 1, \ldots, 2n\}$, where $x_i = (x_d, x_t)$ and

$$y_i = \begin{cases} 1, & \text{if drug } d \text{ and target } t \text{ have an interaction} \\ -1, & \text{otherwise.} \end{cases}$$

Given $s$ newly defined compounds and $l$ human proteins in the PPI network, the testing set was represented by $Tst = \{x_k,\ k = 1, \ldots, s \times l\}$, where $x_k = (x_c, x_p)$. More information is given in [156]Method S2: Novel CPIs prediction, related to [157]Figure S1.

Model implementation and leave-one-set-out validation

The path-based and substructure-based fingerprints in OpenBabel[158]^52 were extracted with the ChemmineR R package from the compounds' three-dimensional SDF files.
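The meta-path-guided random walks described in the Network embedding subsection can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: the graph is a plain adjacency dictionary, the node-type labels ("compound", "drug", "target", "protein") and the function name are hypothetical, closed meta-paths (same first and last type) are repeated cyclically, and open meta-paths are traversed once.

```python
import random


def metapath_walk(adj, node_type, start, metapath, walk_length, rng):
    """Generate one random walk whose node types follow the given meta-path.

    adj: dict mapping each node to a list of neighbouring nodes.
    node_type: dict mapping each node to its type label.
    metapath: list of type labels, e.g. ["drug", "target", "drug"].
    """
    if node_type[start] != metapath[0]:
        raise ValueError("start node type must match the first meta-path type")
    closed = metapath[0] == metapath[-1]
    # Assumption: an open meta-path cannot be repeated, so it is walked only once.
    max_len = walk_length if closed else min(walk_length, len(metapath))
    cycle = len(metapath) - 1
    walk = [start]
    step = 0
    while len(walk) < max_len:
        wanted = metapath[(step % cycle) + 1]
        candidates = [n for n in adj.get(walk[-1], []) if node_type[n] == wanted]
        if not candidates:
            break  # dead end: no neighbour of the required type
        walk.append(rng.choice(candidates))
        step += 1
    return walk


# Toy HN: one new compound, two drugs, one target (made-up identifiers).
node_type = {"c1": "compound", "d1": "drug", "d2": "drug", "t1": "target"}
adj = {"c1": ["d1"], "d1": ["c1", "t1"], "d2": ["t1"], "t1": ["d1", "d2"]}
rng = random.Random(0)
walk_dtd = metapath_walk(adj, node_type, "d1", ["drug", "target", "drug"], 5, rng)
walk_cdt = metapath_walk(adj, node_type, "c1", ["compound", "drug", "target"], 5, rng)
```

Walks produced this way would then be treated as "sentences" of node names and passed to a Skip-gram word2vec implementation to obtain the embedding vectors.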
The random walks guided by the five pre-defined meta-paths were generated in R, and word2vec was run with the word2vec R package using default parameters. The ML algorithms were run with the e1071, randomForest, and h2o R packages for SVMs, RF, and DNN, respectively. RF and DNN were run with default parameters, while the SVM used a Gaussian kernel, with model parameters determined via 3-fold cross-validation.

Leave-one-set-out validation was conducted to evaluate the performance of CIPHEN. Specifically, 10 percent of the drugs or ligands were removed from the known drug-target network and regarded as newly defined compounds, and the remaining drug-target interactions were used to establish the drug-target interactions of the HN. The Tanimoto coefficients between the new compounds and the remaining drugs/ligands were used to construct the compound-drug network, and the HN was set up by integrating the PPI network. For instance, in the DrugBank data, the drug-target network contains 2,227 drugs and 2,072 target proteins. Validation was repeated five times, and in each leave-one-set-out round, 227 randomly selected drugs and their interactions with target proteins were removed from the drug-target network and used as the testing data. Thus, for each round of validation, an HN was constructed with a compound-drug network between 227 compounds and 2,000 drugs, a drug-target network between 2,000 drugs and 2,072 target proteins, and a PPI network between 2,072 target proteins and 15,351 human proteins. The prediction results were compared with the true interactions, and the ROC curve,[159]^53 the precision-recall (PR) curve,[160]^54 and the areas under the ROC curve (AUC) and PR curve (AUPR) were used as evaluation criteria.

The KEGG pathway enrichment analysis was performed with the DAVID Bioinformatics tools. AutoDock Vina with default parameters was used to perform protein-ligand docking.
The protein and compound structures were pre-processed with OpenBabel. More details of the implementation and evaluation are provided in [161]Method S3: CIPHEN implementation and evaluation, related to [162]Figure S1.

Quantification and statistical analysis

p values in the scatterplots were calculated with the Wilcoxon signed rank test; ∗p < 5e-2, ∗∗p < 1e-2, ∗∗∗p < 1e-3, and ∗∗∗∗p < 1e-4 were considered statistically significant. R was used to analyze the raw imaging data, and the plots and statistical significance tests were produced with the R ggplot2 package.

SPR and western blot experiments

SPR experiments for atermeriopodin G7, KGA-5203, and artemyrianolide H derivative 25 were performed in previous studies,[163]^39^,[164]^40^,[165]^42 and the Western blot experiment for artemzhongdianolide B9 was conducted previously.[166]^41

Acknowledgments