Abstract

   Many of genes mediating Known Drug-Disease Association (KDDA) are
   escaped from experimental detection. Identifying of these genes (hidden
   genes) is of great significance for understanding disease pathogenesis
   and guiding drug repurposing. Here, we presented a novel computational
   tool, called KDDANet, for systematic and accurate uncovering the hidden
   genes mediating KDDA from the perspective of genome-wide functional
   gene interaction network. KDDANet demonstrated the competitive
   performances in both sensitivity and specificity of identifying genes
   in mediating KDDA in comparison to the existing state-of-the-art
   methods. Case studies on Alzheimer’s disease (AD) and obesity uncovered
   the mechanistic relevance of KDDANet predictions. Furthermore, when
   applied with multiple types of cancer-omics datasets, KDDANet not only
   recapitulated known genes mediating KDDAs related to cancer, but also
   revealed novel candidates that offer new biological insights.
   Importantly, KDDANet can be used to discover the shared genes mediating
   multiple KDDAs. KDDANet can be accessed at [32]http://www.kddanet.cn
   and the code can be freely downloaded at
   [33]https://github.com/huayu1111/KDDANet.

   Subject terms: Data mining, Drug discovery

Introduction

   The conventional development of novel promising drugs for treating
   specific diseases is a time-consuming and effort-costing process,
   including discovery of new chemical entities, target detection and
   verification, preclinical and clinical trials and so on^[34]1. In
   addition, only about 10% of new drugs are approved by FDA per year,
   meaning most new drugs are never approved and taken to market, which
   prevents them from being used in practice^[35]1. The decrease in
   pharmaceutical research productivity towards new drug development has
   left a gap between the therapeutic needs of patients and the available
   treatment paradigms^[36]1. Compared with traditional drug development,
   drug repositioning, i.e., finding the novel indications of existing
   drugs, offers an alternative approach for safer and faster drug
   development because of several procedures of traditional drug
   development workflow are not involved during repurposing efforts^[37]2.
   There are multiple examples of repositioned drugs that are on the
   market now including Minoxidil, a drug designed to treat hypertension
   but is now used to treat hair loss^[38]3, and Sildenafil, a drug
   originally developed for patients with heart disease but is now
   commonly used to treat erectile dysfunction^[39]4. However, these
   examples of repositioned drugs were primarily based on clinical
   observations of the side effects of the drug^[40]5. Thanks to the
   advance in next-generation omics sequencing and qualification
   technologies, a large volume of biomedical data, for example, the
   pharmacogenomics datasets produced by Connective Map project, The
   Cancer Genome Atlas (TCGA) project, Cancer Cell Line Encyclopedia
   (CCLE) project, Genomics of Drug Sensitivity in Cancer 1000 human
   cancer cell lines (GDSC1000) project and Library of Integrated
   Network-based Cellular Signatures project, has been rapidly accumulated
   for enabling drug repurposing^[41]6–[42]14. Based on these datasets,
   various computational methods have been designed for facilitating the
   process of drug repurposing (see Supplementary Note [43]1 for a mini
   review). To infer pharmacokinetic and pharmacodynamic drug–drug
   interactions and their associated recommendations, for example,
   Gottlieb et al. designed a similarity measure-based logistic regression
   classifier^[44]15. For predicting drug side effects, Tatonetti et al.
   presented an adaptive data-driven approach^[45]16. To identify novel
   drug combinations, Zhao et al. integrated the molecular and
   pharmacological data and developed a novel computational
   approach^[46]17. In addition, network-based method has also been
   employed to achieve the similar goal^[47]18. Interestingly, Kuenzi et
   al. developed a deep-learning model of visible neural network, called
   DrugCell, to predict drug response and synergy in human cancer
   cells^[48]19. For discovering novel drug indication, Gottlieb et al.
   developed PREDICT, which scored a possible drug-disease link by
   combining multiple drug–drug and disease–disease similarity
   measures^[49]20. With the similar motivation and aim, an unsupervised
   and unbiased network-based proximity measure has also been
   designed^[50]21. Moreover, Cheng et al. showed that the further
   integration of network proximity-based approach with large-scale
   patient-level longitudinal data can offer an effective platform for
   validating drug indications^[51]22. For predicting drug-target
   interaction, Paolini et al. presented a global mapping of
   pharmacological space and probabilistic models by integrating multiple
   medicinal chemistry data^[52]23. Different from the approach employed
   by Paolini et al., Campillos et al. used the phenotypic side-effect
   similarity to determine whether two drugs share a target^[53]24.
   Besides, we and others also employed machine learning and network
   integration approaches for achieving the same goal^[54]25,[55]26. To
   discover disease-related genes, Gottlieb et al. designed “PRINCIPLE”,
   which employed classical network propagation algorithm^[56]27. Wu et
   al. developed CIPHER that integrated human protein–protein
   interactions, disease phenotype similarities, and known gene-phenotype
   associations to capture the complex relationships between phenotypes
   and genotypes^[57]28. In addition, Ghiassian et al. identified disease
   gene modules based on a systematic analysis of connectivity patterns of
   disease proteins in the human interactome^[58]29. Furthermore, Zhou et
   al. and Menche et al. constructed disease-symptom and disease–disease
   relationship networks based on the biomedical literature databases and
   incomplete interactome, respectively^[59]30,[60]31. Excitingly, Hofree
   et al. developed a new computational approach which employed
   network-based stratification to uncover tumor subtypes by integrating
   somatic tumor genomes with gene networks^[61]32. Collectively, these
   methods have effectively exploited and integrated multilevel biomedical
   and omics data sources for understanding the pathology of diseases and
   mechanisms of drug actions and thus accelerated drug repurposing.

   Theoretically, drug repurposing has been proposed based on two
   molecular aspects. (1) On one hand, complex diseases often involve
   multiple genetic and environmental determinants, including multifactor
   driven alterations and dysregulation of a series of
   genes^[62]33,[63]34, which will propagate and perturb certain
   biological processes by the interactions among molecules, leading to
   the onset of diseases. (2) On the other hand, one drug can exert
   impacts on many targets and perturb multiple biological
   processes^[64]34,[65]35. As a result, the genes of shared biological
   pathways in the cellular network manifested in certain disease state
   and induced by a known drug administration suggest potential drug
   repurposing^[66]33–[67]38. However, many of these genes mediating Known
   Drug-Disease Associations (KDDAs) across various types of diseases have
   not yet been identified (see the “Results” section and Supplementary
   Fig. [68]S1a, [69]b for supporting data of this claim). Therefore,
   developing the appropriate theoretical computational tools from the
   perspective of molecular interaction network to unveil the KDDA genes
   missed from experiments (hidden genes) is of great significance for
   understanding disease pathogenesis and guiding drug repurposing. The
   network-based computational methods have been designed for facilitating
   drug repurposing which linked drugs to targets or connected diseases to
   genes or associated drugs with
   diseases^[70]18,[71]21,[72]22,[73]26–[74]29,[75]39. Nevertheless, the
   publicly available computational tools specially tailored for
   simultaneously bridging drugs, genes, and diseases have not been fully
   developed. To our knowledge, some computational tools have been
   designed to identify drug-gene-disease co-module. Kutalik et al. and
   Chen et al. developed ping-pong algorithm (PPA) and sparse
   network-regularized partial least square (SNPLS) to identify co-modules
   related to specific cancer cell lines of NCI-60 and Cancer Genome
   Projects, respectively^[76]40,[77]41. However, these two methods need
   to integrate gene-expression and drug-response data of cancer cell
   lines for constructing models and do not carry out prediction for other
   types of diseases. The other two methods of comCIPHER and DGPsubNet
   have been designed to identify the coherent subnetworks linking drugs
   and diseases (not limited in cancer) with the related
   genes^[78]42,[79]43. The common shortcomings of these methods are the
   identified co-modules including multiple drugs and diseases and thus
   are unable to uncover genes specifically mediating individual KDDA. To
   this end, we designed a novel computational pipeline, called KDDANet,
   which uses known functional gene interaction network to identify hidden
   genes of cellular pathways mediating KDDA in a genome-wide scale. Our
   KDDANet pipeline depends on three existing network algorithms: minimum
   cost network flow optimization, depth-first searching, and graph
   clustering algorithm. The minimum cost network flow optimization has
   been effectively employed to identify cellular response subnetwork
   connecting genetic hits and differentially expressed genes, including
   components of the response that are otherwise hidden or missed from
   experiments^[80]44. KDDANet can be applied to two contexts: (1)
   uncovering hidden genes mediating the association between a query drug
   and its related disease (SDrTDi); (2) unveiling hidden genes mediating
   the association between a query disease and its related drug (SDiTDr).
   The computational procedure of KDDANet in SDrTDi context was showed in
   Fig. [81]1 (see “Method” section for details). KDDANet first built a
   unified flow model by integrating query drug, genes, and all related
   diseases into a heterogeneous network (Fig. [82]1a). Then, the minimum
   cost flow optimization was designed and implemented to identify gene
   subnetwork mediating the association between the query drug to all its
   related diseases (Fig. [83]1b). Finally, depth-first searching, and
   Markov clustering (MCL) algorithm were adopted to further uncover gene
   modules mediating the association between the query drug and each its
   related disease (Fig. [84]1c–[85]e). The outputs of KDDANet were
   validated against existing literature and multi-omics’ datasets (Fig.
   [86]1f). To apply KDDANet in SDiTDr context, what the user need was
   just to simply rebuilt the unified flow network model (see “Methods”
   section for details).

Fig. 1. Schematic illustration of KDDANet computational pipeline in SDrTDi
context.

   [87]Fig. 1
   [88]Open in a new tab

   a Mapping query drug (QD) and all its related diseases (RDs) into the
   weighted gene network (WGN) through known drug-target relationships and
   gene-disease associations. b Constructing a unified flow network model
   for each QD and RDs (multiple KDDAs). c Identifying the
   highest-probability gene subnetwork mediating multiple KDDAs by minimum
   cost flow optimization. d Identifying gene subnetwork mediating
   individual KDDA by depth-first searching. e Identifying gene
   interaction modules mediating individual KDDA by Markov clustering
   (MCL). f Validating the prediction results of KDDANet by existing
   knowledge bases.

   The key novelty of the KDDANet method lay in that the original
   designation of a unified flow network model and effective implementing
   multiple network-based algorithms on this unified flow network model.
   We demonstrated that KDDANet showed competitive performance in
   discovering known and novel genes mediating KDDA through a comparison
   with existing methods. KEGG pathway enrichment analysis on KDDANet
   resulting subnetworks and case studies on Alzheimer’s disease (AD) and
   obesity further showed the mechanistic relevance of KDDANet
   predictions. Validated with multiple types of cancer-omics’ datasets,
   KDDANet did not only revealed known genes mediating KDDAs associating
   drug with cancer, but also uncovered new candidates that offer novel
   biological insights. Particularly, our results demonstrated that
   KDDANet can reveal the shared genes mediating multiple KDDAs. These
   outcomes showed the importance of incorporating hidden genes in drug
   discovery pipelines. For facilitating biomedical researchers to explore
   the molecular mechanism of KDDA and guiding drug repurposing, an online
   web server, [89]http://www.kddanet.cn, was provided for user to access
   the subnetwork of genes mediating KDDA, and the source codes of KDDANet
   were freely available at [90]https://github.com/huayu1111/KDDANet. In
   summary, we developed an effective and universal computational tool and
   an online web source for accurate and systematic discovering hidden
   genes mediating KDDA and thus providing novel insights into mechanism
   basis of drug repurposing and disease treatments. We believed that
   KDDANet can provide additional contributions to the development of new
   therapies.

Results

Evaluation of the performance and general applicability of KDDANet method

   We first checked whether the potential genes mediating KDDA across
   various types of diseases have been experimentally identified. For a
   given KDDA, we proposed a hypothesis that the Known Drug Target Genes
   (KDTGs) and Known Disease-Related Genes (KDRGs) should be highly
   overlapped if the genes mediating this KDDA have been fully identified.
   We analyzed the overlap between KDTGs and KDRGs of 53124 KDDAs obtained
   from Comparative Toxicogenomics Database (CTD)^[91]12. For a KDDA, we
   defined the overlap ratio as the number of shared genes between KDTGs
   and KDRGs divided by the number of total KDTG and KDRG genes. We
   observed that the most gene sets demonstrated extremely low overlap
   ratio, with the increasing of gene number, the overlap ratio was
   sharply decreased (Supplementary Fig. [92]1a). We then checked the
   overlap ratio in different types of diseases and found the overlaps
   between KDTGs and KDRGs were small for all 19 disease types in our
   dataset (Supplementary Fig. [93]1b). The discrepancy between KDTGs and
   KDRGs indicated that each gene set alone provided only a limited and
   biased view of KDDA, many of true genes in the cellular pathways
   mediating KDDA were not identified from the experiments but otherwise
   hidden. To address this, we designed a novel computational tool,
   KDDANet, which effectively integrated minimum cost flow optimization,
   combined with depth-first searching and graph clustering to
   systematically discover the hidden genes of cellular pathways mediating
   KDDA (see “Methods” section for details).

   To examine whether KDDANet can capture true genes mediating KDDA, we
   introduced two new concepts: “known true KDDA genes” (KTKGs) and “novel
   true KDDA genes” (NTKGs). For a given KDDA, KTKGs were defined as the
   shared genes between KDTGs and KDRGs inputted for constructing KDDANet
   flow network model (see “Dataset” section for details). To obtain
   NTKGs, we collected a set of drug’s non-target genes (A) from SMPDB 2.0
   database^[94]45 that were included in the drug’s ADME pathways and were
   responsible for mediating KDDA. Meanwhile, we collected a recently
   updated set of disease-related genes (B) from DisGeNet v6.0
   database^[95]46, which were not included in KDDANet flow network model.
   Based on these, we defined, for each KDDA, the NTKGs as the shared
   genes between A and B. A parameter γ effected the size and quality of
   KDDANet output subnetwork, higher γ values will identify more gene
   links mediating KDDA but with lower confidence. Using gene set
   enrichment analysis, we observed that KDDANet can consistently and
   effectively capture the KTKGs and NTKGs mediating KDDA under different
   γ settings (see Supplementary Notes [96]2 and [97]3, and Supplementary
   Fig. [98]1c–l for details).

   Based on this, we compiled a standard set of positive and negative KDDA
   genes for each KDDA to unbiasedly evaluate the capability of KDDANet
   method on uncovering the true genes mediating KDDA using Receiver
   Operating Characteristic (ROC) and Precision-Recall (PR) curves (see
   “Methods” section for details). We observed that the performance of
   KDDANet was obviously better than random permutation across a widely
   settings of γ (Fig. [99]2a, [100]b). We next selected γ = 6 for SDrTDi
   and γ = 8 for SDiTDr for subsequent evaluation (see Supplementary Note
   [101]4 for reasons). With this setting, we further compared KDDANet
   with other existing methods for drug-gene-disease co-module discovery,
   including SNPLS^[102]40, PPA^[103]41, comCHIPER^[104]42, and
   DGPsubNet^[105]43 (see “Methods” section for the details of performance
   comparison). Among all the methods tested, KDDANet demonstrated the
   competitive performance with the averages of AUROC 0.733 and 0.715 and
   AUPRC 0.793 and 0.825 in SDrTDi and SDiTDr contexts, respectively (Fig.
   [106]2c and Table [107]1). These results suggested that KDDANet was an
   effective tool for achieving the goal of uncovering true genes
   mediating KDDA genome wide. We next tested whether KDDANet had the
   general application value across a variety of disease types and found
   that KDDANet can make effective capturing of true genes mediating KDDA
   for all 19 types of diseases (Supplementary Fig. [108]2a–d). We further
   tested KDDANet using different types of networks, including
   HINT + HI2012, iRefIndex, MultiNet, and STRINGv10^[109]47. We found
   that the performances of KDDANet were consistent well across all these
   networks (Fig. [110]2d, Supplementary Fig. [111]2e, f). Collectively,
   these results fully demonstrated that KDDANet was an effective and
   general computational tool for uncovering hidden genes mediating KDDA
   across broad types of diseases.

Fig. 2. Performance evaluation of KDDANet method.

   [112]Fig. 2
   [113]Open in a new tab

   a The density curve of AUROC and AUPRC of KDDANet with different γ
   setting and permutation test in SDrTDi context. b The density curve of
   AUROC and AUPRC of KDDANet with different γ setting and permutation
   test in SDiTDr context. c Comparison of AUROC and AUPRC of KDDANet with
   PPA, SNPLS, comCHIPER, and DGPsubNet. d AUROC and AUPRC of KDDANet with
   different types of gene interaction networks.

Table 1.

   Average AUC values of different computational tools for identifying
   hidden genes mediating KDDA.
      Method     KDDANet(SDrTDi) KDDANet(SDiTDr) SNPLS  PPA  comCHIPER DGPsubNet
   Average AUROC 0.733           0.715           0.712 0.713 0.703     0.706
   Average AUPRC 0.793           0.825           0.775 0.776 0.778     0.777
   [114]Open in a new tab

Mechanistic relevance of KDDANet predictions

   We examined whether the enriched pathways of KDDANet resulting
   subnetworks have the mechanistic relevance with KDDA by carrying out a
   global enrichment of all predicted KDDA subnetwork genes against 53
   classical KEGG pathways. The obtained enrichment results can be
   validated by existing knowledge (see Supplementary Note [115]5 and
   Supplementary Fig. [116]3a, b for details). We aimed to provide two
   cases to intuitively describe the mechanistic relevance of KDDANet
   resulting subnetwork. Phylloquinone (DB01022)-AD (104300) association
   has been reported in previous study^[117]48. A subnetwork including 46
   genes and 44 links were uncovered mediating this association (Fig.
   [118]3a). Interestingly, for this subnetwork, two separated gene
   modules (M1 and M2) were detected (Fig. [119]3a). The AUROC and AUPRC
   values of for this KDDANet resulting subnetwork were 0.864 and 0.795,
   respectively (Supplementary Fig. [120]3c). Two known targets of
   phylloquinone and 18 AD-related genes were identified in this
   subnetwork. Against with genome background, this subnetwork captured 4
   NTKGs with ~86-fold enrichment and adjusted p value of 9.716e−09
   (Hypergeometric test and Bonferroni correction). The top ten enriched
   KEGG terms of this subnetwork, such as Phospholipase D signaling
   pathway and Neurotrophin signaling pathway^[121]49,[122]50, were
   closely related with AD (Fig. [123]3b). The module M1 mainly functioned
   in Insulin signaling pathway, ErbB signaling pathway, FoxO signaling
   pathway and growth hormone synthesis, secretion, and action, which
   played important roles in neural system development and the onset and
   development of AD^[124]51–[125]53 (Supplementary Fig. [126]3d). The
   enriched KEGG pathways of M2 genes, including Complement and
   coagulation cascades, Glycolysis and AGE-RAGE signaling pathway in
   diabetic complications, were dysfunction in AD^[127]54–[128]56
   (Supplementary Fig. [129]3d). We further analyzed the published RNA-seq
   data to detect the expressional change of these two modules in normal
   individuals and AD patients^[130]57. We observed that the averaged
   expression level of M1 genes was significantly upregulated in AD
   (Supplementary Fig. [131]3e). Interestingly, a predicted novel gene,
   STAT3, was obviously activated in AD patients (Fig. [132]3c). Two
   newest studies published in years 2019 and 2020 reported that STAT3 was
   a potential therapeutic target for cognitive impairment in
   AD^[133]58,[134]59. Another predicted novel gene, GNAI1, was also
   obviously activated in AD patients. This gene was contained in the
   causal pathways associated with an imaging endophenotype characteristic
   of longitudinal structural change in the brains of patients with
   AD^[135]60. For module M2 genes, the obviously expressional changes
   between normal individuals and AD patients were not observed
   (Supplementary Fig. [136]3e). However, we found that a predicted novel
   KDDA gene, GC, was activated in AD patients (Fig. [137]3c). This gene
   encoded vitamin D binding protein, which was recently evidenced as a
   potential therapeutic agent for the treatment of AD^[138]61.

Fig. 3. Mechanistic relevance of KDDANet prediction results.

   [139]Fig. 3
   [140]Open in a new tab

   a KDDANet resulting subnetwork mediating phylloquinone
   (DB01022)-Alzheimer’s disease (AD, 104300) association. b Top ten
   enriched KEGG terms of subnetwork genes mediating phylloquinone-AD
   association. c Expression level of STAT3, GNAI1, and GC in normal
   individuals and AD patients, **p value < 0.01, calculated by
   Mann–Whitney U test. d KDDANet resulting subnetwork for heparin
   (DB01109)-obesity (601665) association. e Top ten enriched KEGG terms
   of subnetwork genes mediating heparin-obesity association. f ROC and PR
   curves of KDDANet gene subnetwork mediating heparin-obesity
   association. In the subnetworks, the size of a gene node was
   proportional to its network degree; The thickness of a network edge was
   proportional to its flow amount. FPKM Fragments Per Kilobase Of Exon
   Per Million Fragments Mapped.

   The association between heparin (DB01109) and obesity (601665) was
   inferred by multiple genes as described in CTD database. For this
   association, KDDANet predicted a subnetwork containing 168 edges
   connecting 169 genes (Fig. [141]3d). We found that two KDTGs and 67
   disease-related genes were captured in this subnetwork. Particularly,
   eight genes were the NTKGs with ~52-fold enrichment and adjusted p
   value of 1.485e−10 (Hypergeometric test and Bonferroni correction).
   Three genes used to infer this KDDA, including ARK1, PARP1, and TNF,
   were also effectively captured in the resulting subnetwork. The top ten
   enriched functions of this subnetwork were showed in Fig. [142]3e. As
   expected, insulin resistance, type II diabetes mellitus, insulin
   signaling pathway, and adipocytokine signaling pathway were the
   frequently reported events and molecular processes associated with
   obesity^[143]62,[144]63. In addition, proteoglycans, lipolysis, and
   AMPK signaling pathway were also highly related with Insulin
   resistance^[145]64–[146]66. In consistent with this, the AUROC and
   AUPRC values of KDDANet for this subnetwork were 0.854 and 0.849,
   respectively (Supplementary Fig. [147]3e). We next delineated the
   subnetwork into gene modules. The enriched functions of top three gene
   modules were demonstrated in Supplementary Fig. [148]3f. The genes of
   module M1 mainly participated in Glycolysis and Carbon metabolism.
   Further analysis of public RNA-seq data of normal individuals and
   obesity patients^[149]67,[150]68 demonstrated the significantly
   repressed expression of M1 genes in patients with obesity, such as
   GADPH (Supplementary Fig. [151]3g, h). This was consistent with the
   fact that enhancing the level of glycolysis reduced obesity^[152]69.
   The mainly enriched pathways of M2 genes were related to Insulin
   resistance, a frequently happened event in obesity patients.
   Interestingly, Cell adhesion molecules was a significantly enriched
   KEGG term of M2 genes (Supplementary Fig. [153]3f), which was elevated
   in patients with obesity^[154]70. The M3 genes functioned in Complement
   and coagulation cascades and Platelet activation (Supplementary Fig.
   [155]3f). As reported, these two terms were closely associated with
   obesity^[156]71,[157]72. In support with these, the expression of genes
   in M2 and M3 were activated in patients with obesity (Supplementary
   Fig. [158]3g, h). Together, these results indicated that KDDANet can
   serve as a useful tool to unveil the molecular basis of KDDA.

KDDANet provided novel molecular insights on KDDAs related to cancer

   Cancer was a frequently happened complex genetic disease caused by DNA
   abnormalities^[159]73. For this reason, substantial genetic, genomic
   and pharmacogenomics efforts, including TCGA, CCLE, and
   GDSC1000^[160]7–[161]9, have been undertaken to improve existing
   therapies or to guide early-phase clinical trials of compounds under
   development. With these efforts, an increasing amount of available
   high-throughput datasets at both levels of genomic data and
   pharmacogenomics data were produced at recent years. In addition,
   COSMIC, the world’s largest and most comprehensive resource for
   exploring the impact of somatic mutations in human cancer, collected a
   catalog of genes with mutations that were causally implicated in cancer
   ([162]https://cancer.sanger.ac.uk/cosmic). With these datasets, we
   observed that the genes of KDDANet resulting subnetworks mediating the
   associations between drugs and cancer were significantly enriched in
   COSMIC Cancer Gene Census, and these genes harbored more oncogenic
   alterations in tumor samples than randomly selected genes
   (Supplementary Note [163]6, Supplementary Fig. [164]4a, b). Moreover,
   we found that oncogenic alterations of genes in KDDANet resulting
   subnetworks mediating the associations between drugs and cancer were
   more correlated with the responses of cancer cell lines under
   anti-cancer drug treatment than randomly selected genes (see
   Supplementary Note [165]5, Supplementary Fig. [166]4c–f). We provided
   two detailed examples to describe the potential values of KDDANet in
   revealing novel genes mediating the associations between drugs and
   cancer.

   Sotalol (DB00489) was normally used to treat life-threatening
   ventricular arrhytmias. It has been reported that sotalol was
   associated with decreased prostate cancer (176807) risk^[167]74. For
   sotalol-prostate cancer association, KDDANet predicted a subnetwork
   consisting of 31 genes and 28 links (Fig. [168]4a). By applying MCL
   with default parameters, this subnetwork was further decomposed into
   three gene modules, M1, M2, and M3. All three known target genes of
   sotalol were included in this subnetwork. Meanwhile, this subnetwork
   also captured 12 prostate cancer-related genes with a novel NTKG of
   PRKACA. The top ten enriched KEGG terms of this subnetwork were showed
   in Fig. [169]4b. Among these, PI3K-Akt signaling pathway, MAPK
   signaling pathway and FoxO signaling pathway were related to cancer
   formation and development. The relationships between EGFR tyrosine
   kinase inhibitor resistance, Relaxin signaling pathway, AGE-RAGE
   signaling pathway, and prostate cancer have been widely investigated
   and reported in the previous works^[170]75–[171]77. In addition,
   Autophagy and Focal adhesion were two widely observed processes in
   cancer^[172]78,[173]79. As expected, the enriched KEGG signaling
   pathways of M1 were closely related to tumorigenesis (Supplementary
   Fig. [174]4g). The expression of M1 genes were activated in tumor
   samples, such as an oncogene YWHAE (Supplementary Fig. [175]4h, i).
   Interestingly, M1 captured that IGF1R, a gene encoding insulin-like
   growth factor receptor, harbored SNVs and CNVs in TCGA prostate cancer
   samples, and has been reported to be oncogenic genes of prostate
   cancer^[176]80. The genes of M2 were not enrich to any KEGG term and
   but have lower expression in TCGA tumor samples (Supplementary Fig.
   [177]4h). Surprisingly, SPARC, a reported prostate cancer-related
   gene^[178]81 with significantly lower expression in TCGA tumor samples
   was captured in this module (Supplementary Fig. [179]4i). Moreover, we
   found that both SPARC and COL1A1 harbor SNVs and CNVs in TCGA prostate
   cancer samples and their expressions were obviously correlated to the
   survival of patients (Fig. [180]4c). The M3 genes mainly participated
   in ErbB signaling pathways, a biological process involved in prostate
   cancer progression^[181]82. The expression of M3 genes were not
   obviously changed in TCGA prostate cancer samples (Supplementary Fig.
   [182]4h). However, we found that GRB2 was overexpressed in TCGA
   prostate cancer samples and obviously correlated with the survival of
   patients at the later stage of disease (Supplementary Fig. [183]4i,
   [184]c).

Fig. 4. KDDANet provided novel molecular insights on KDDAs related to cancer.

   [185]Fig. 4
   [186]Open in a new tab

   a KDDANet resulting subnetwork mediating sotalol (DB00489)-prostate
   cancer (176807) association. b Top ten enriched KEGG terms of
   subnetwork genes mediating sotalol-prostate cancer association. c Gene
   expression-based survival analysis for CASP2, SPARC, and GRB2 in
   patients with prostate cancer, obtained by GEPIA online analysis
   ([187]http://gepia.cancer-pku.cn/about.html). d KDDANet gene subnetwork
   mediating nebularine (DB04440)-lung cancer (211980) association. e Top
   ten enriched KEGG terms of subnetwork genes mediating nebularine-lung
   cancer association. f Gene expression-based survival analysis for BYSL
   and BOP1 in patients with TCGA LUAD lung cancer obtained by GEPIA
   online analysis ([188]http://gepia.cancer-pku.cn/about.html) and their
   expression levels in lung cancer tumor samples and adjacent normal
   tissue samples, ***p value < 0.001, calculated by Wilcox signed rank
   test. In the subnetworks, the size of a gene node was proportional to
   its network degree; the thickness of a network edge was proportional to
   its flow amount.

   Another example was the association between nebularine (DB04440)-lung
   cancer (211980) that was inferred by ADA targeted by
   nebularine^[189]83. For this association, KDDANet predicted 237 genes
   connected by 238 links that constituted a subnetwork without apparent
   modular structure (Fig. [190]4d). The only known target gene ADA of
   nebularine was connected to 119 lung cancer-related genes and 117
   predicted novel KDDA genes in this subnetwork. As expected, KEGG
   enrichment demonstrated that the genes in this subnetwork were involved
   in various cancers (Fig. [191]4e). Interestingly, we found that two
   highly connected novel genes BYSL and BOP1 were significantly
   overexpressed in TCGA lung cancer tumor samples and their expression
   levels were obviously correlated with the survival of lung cancer
   patients (Fig. [192]4f). These results indicated that KDDANet not only
   captured known genes mediating KDDAs linking drug with cancer, but also
   uncovered novel candidates that offered novel biological insights.

KDDANet uncovered the shared genes mediating multiple KDDAs

   Comprehensive analysis above fully demonstrated that KDDANet can
   uncover true genes mediating individual KDDA. We further asked whether
   KDDANet can reveal the shared genes mediating multiple KDDAs. We
   answered this from two aspects as follow: (1) Multiple Diseases
   associating with One Drug (MDiODr); (2) Multiple Drugs associating with
   One Disease (MDrODi). Considering the practical merits for the first
   one analysis, we required multiple diseases belongs to the same type of
   diseases. To evaluate the capability of KDDANet for revealing the
   shared genes mediating multiple KDDAs, we produced meta-subnetworks by
   integrating multiple KDDANet resulting subnetworks for 12386 MDiODr
   combinations and 773 MDrODi combinations produced in SDrTDi context,
   and 12189 MDiODr combinations and 773 MDrODi combinations in SDiTDr
   context. The weight of an edge in the meta-subnetwork was defined as
   the number of KDDA resulting subnetworks containing this edge divided
   by the total number of KDDA resulting subnetworks. The higher weight
   value of a link in the meta-subnetwork indicated more conservation and
   commonality. Thus, we used the weight value to evaluate the capacity of
   KDDANet in unveiling the shared gene interactions mediating multiple
   KDDAs. We carried out a permutation test by producing random
   meta-subnetworks with the same number of edges for comparing with
   random background. As shown in Fig. [193]5a, Supplementary Fig.
   [194]5a, b, the weights of KDDANet meta-subnetworks were significantly
   higher than random meta-subnetworks across different types of diseases
   in SDrTDi context. This indicate that the genes tend to be shared in
   KDDANet meta-subnetworks than random one. We also conducted the same
   analysis in SDiTDr context and obtained the similar results (Fig.
   [195]5b, Supplementary Fig. [196]5c, d). Collectively, these results
   indicated that KDDANet can effectively uncover the shared genes
   mediating multiple KDDAs.

Fig. 5. KDDANet uncovered the shared genes mediating multiple KDDAs.

   [197]Fig. 5
   [198]Open in a new tab

   a Boxplots demonstrating the distributions of weight values in KDDANet
   meta-subnetwork and random meta-subnetwork for MDiODr and MDrODi in
   SDrTDi context, ***p value < 0.001, calculated by Wilcox signed rank
   test. b Boxplots demonstrating the distributions of weight values of
   KDDANet meta-subnetwork and random meta-subnetwork for MDiODr and
   MDrODi in SDiTDr context, ***p value < 0.001, calculated by Wilcox
   signed rank test. c Shared meta-subnetwork mediating profenamine
   (DB00392)-neurological disease associations. d Top ten enriched KEGG
   terms of shared meta-subnetwork genes mediating
   profenamine-neurological diseases associations. e Shared
   meta-subnetwork mediating the associations between GRACILE syndrome
   (603358) and 13 drugs. f Top ten enriched KEGG terms of shared
   meta-subnetwork genes mediating the associations between GRACILE
   syndrome and multiple drugs; In the meta-subnetworks, the size of a
   gene node was proportional to its network degree; the thickness of a
   network edge was proportional to its weight value.

   We presented some examples for describing the capability of KDDANet in
   identifying shared genes mediating multiple KDDAs. For MDiODr, we
   selected two cases: (1) profenamine (DB00392) and neurological disease
   associations and (2) mirtazapine (DB00370) and cancer associations. As
   reported in CTD database, profenamine was associated with three
   neurological diseases, including Parkinsonian Disorders, Multiple
   Sclerosis, and Alzheimer Disease. The shared meta-subnetwork contained
   75 edges linking 43 unknown genes with three profenamine’s target genes
   and 30 neurological disease-related genes (Fig. [199]5c). The top ten
   enriched KEGG terms were demonstrated in Fig. [200]5d. A majority of
   enriched KEGG terms, such as Cholinergic synapse and Glutamatergic
   synapses, have been reported dysfunction in neurological disorders and
   diseases^[201]84,[202]85. Interestingly, we found that GNAI2 and GNA11
   were two mostly shared genes linking profenamine with neurological
   diseases. These two genes were recently discovered involving in the
   pathological pathways of neurological diseases^[203]86,[204]87.
   Mirtazapine was associated with nine types of cancers, including
   Colorectal Neoplasms, Breast Neoplasms, Neuroblastoma, Glioma, Urinary
   Bladder Neoplasms, Stomach Neoplasms, Esophageal Neoplasms, Lung
   Neoplasms, and Prostatic Neoplasms. Supplementary Figure [205]5e showed
   the shared meta-subnetwork that included 97 edges connecting 21
   mirtazapine’s target genes with 33 cancer-related genes and 54 unknown
   genes. As expected, KEGG enrichment demonstrated that these genes were
   involved in cancer-related signaling pathways and played important
   roles in various cancers (Supplementary Fig. [206]5f). Intriguingly,
   DRD4 and GRB2 were two mostly shared genes mediating the associations
   between mirtazapine and cancers (Supplementary Fig. [207]5e). These two
   genes were involved in the oncogenesis of multiple types of
   cancers^[208]88,[209]89.

   For MDrODi, some interesting cases were also observed. For example,
   GRACILE syndrome, a metabolic disease, was associated with 13 drugs as
   recorded in CTD database. A shared meta-subnetwork including 66 genes
   and 61 edges was obtained for this disease (Fig. [210]5e). This
   meta-subnetwork contained 14 drug target genes and 13 GRACILE
   syndrome-related genes. The mostly shared gene was ATP5B, and the
   mostly significant enriched KEGG term was Oxidative phosphorylation
   (Fig. [211]5f). This was expected as GRACILE syndrome was a fatal
   inherited disorder caused by a mutation in an oxidative
   phosphorylation-related gene, BCS1L^[212]90. It was also not surprising
   that the neurological diseases related genes were also enriched in this
   meta-subnetwork as patients with GRACILE syndrome had severe
   neurological problems^[213]90. Another example was the Keratoconus
   (148300), an ophthamological disease, which was associated with three
   different drugs, including acetaminophen, valproic acid and
   theophylline. Keratoconus and these three drugs shared a
   meta-subnetwork consisting of 54 genes and 51 edges (Supplementary Fig.
   [214]5g). This meta-subnetwork included 7 Keratoconus-related genes, 11
   drug target genes, and 36 unknown genes. It was expected that HDAC2 had
   high weights with its partner genes in this meta-subnetwork as it was
   involved in notch signaling pathway which is downregulated in
   keratoconus^[215]91. Consistent with the associations between collagen
   genes and keratoconus^[216]92, a novel collagens coding gene, COL1A1,
   was captured in this meta-subnetwork as a highly shared gene. The
   enriched KEGG terms of this meta-subnetwork included Hippo signaling
   pathway (Supplementary Fig. [217]5h) which has been reported involving
   in keratoconus corneas^[218]93. Collectively, these results indicated
   that KDDANet can discover the shared genes mediating multiple KDDAs.
   The highly shared unknown genes can serve as potential candidate
   targets for drug repurposing.

Discussion

   To facilitate drug repurposing, various computational tools have been
   developed to uncover novel drug-disease associations^[219]94. However,
   the potential genes mediating KDDA have been still not fully explored.
   Unveiling the hidden genes (missed from experiments) mediating KDDA
   become a great challenge for guiding novel target discovery and drug
   repurposing. In this work, we developed a novel computational tool,
   KDDANet, which integrated minimum cost network flow optimization,
   depth-first searching and graph clustering algorithm to reveal hidden
   genes and modules mediating KDDA. KDDANet allowed for a global and
   systematic exploration of the hidden genes mediating KDDA. We applied
   KDDANet to unravel the subnetworks of genes mediating 53124 KDDAs. The
   comprehensive and system-level evaluations fully demonstrated the
   effective prediction capability and general applicability of KDDANet.
   Case studies on both AD and obesity showed that the subnetworks of
   genes identified by KDDANet were reliable and useful. Further validated
   by integrating analysis of genomic, transcriptomic, pharmacogenomic and
   survival data on primary tumors and cancer cell lines highlighted that
   KDDANet captured novel candidates from interactome mediating the
   associations between drugs and different types of cancer. Based on
   these, we concluded that the inferred subnetworks mediating KDDA can
   serve as genome-wide molecular landscapes for guiding drug repurposing
   and disease treatment. Insights learned from our predictions would also
   enable to help researchers to design repurposing drugs to reverse
   disease phenotypes via targeting key genes in the subnetwork mediating
   KDDA. An important capability of KDDANet method was that it can reveal
   the shared genes mediating multiple KDDAs. This provided more valuable
   guides for drug repositioning and disease treatment since the shared
   genes mediating multiple KDDAs were closely linked to the molecular
   basis of drug repurposing. We constructed a user-friendly online web
   tool ([220]http://www.kddanet.cn), which allowed users to explore the
   subnetwork of genes mediating individual KDDA and the meta-subnetworks
   mediating multiple KDDAs (See Supplementary Note [221]7 as well as the
   Help section of our website for detailed description of the utility of
   an online web version of KDDANet). In summary, we presented a novel
   computational tool KDDANet and an online web source to decode the
   hidden genes mediating KDDA that had broad utility and application
   value in biomedical studies.

   The hidden genes mediating KDDA predicted by KDDANet highlighted the
   power of integrative approaches to illuminate underexplored molecular
   processes mediating KDDA. In the future, the application value of our
   KDDANet tool in drug repurposing can be further improved from three
   aspects as follow: Firstly, the gene interactome used in KDDANet did
   not contain enough interactions between genome elements. In further
   studies, we would integrate other non-coding genome elements,
   especially long non-coding RNA and microRNA for constructing a
   comprehensive interactome. Secondly, integrating the biological
   networks from other omics layers, such as epigenomics, might have also
   further enhance the accuracy of KDDANet in discovering subnetworks and
   key genes mediating KDDA, and help us better to understand KDDA at
   multi-omics levels. The intrinsically capability of KDDANet to analyze
   large-scale heterogeneous interactome data containing tens of thousands
   of nodes and edges make it can well be suited to analyzing the
   accumulating data from ‘multi-omics’ technologies and biomedical
   research. Finally, KDDANet did not carry out predictions for KDDA when
   a drug’s target genes were unknown or when a disease has not been
   related to any known gene. For this, we planned to calculate the
   similarity scores between drugs and the similarity scores between
   diseases, and then integrate drugs without any target gene and diseases
   without any related gene to our flow network model by similarity
   scores.

Methods

   Datasets

   Five different types of gene networks, including HumanNet,
   HINT + HI2012, iRefIndex, MultiNet, and STRINGv10, were used in our
   current study^[222]47,[223]95, in which the nodes were represented by
   gene ID and connected by bidirectional edges (Supplementary Data
   [224]1). Drugs and their target genes were obtained from DrugBank 5.0.9
   database ([225]http://www.drugbank.ca/). In this study, we selected
   4861 drugs with at least one known target that was contained in the
   gene networks for further analysis. In total, 2196 KDTGs included in
   the gene networks were connected to these drugs by 12014 interactions
   (Supplementary Data [226]2). KDRGs and classification of diseases were
   obtained by manually collecting Human Disease Network (HDN) from the
   previous study of HDN^[227]33 and DisGeNET v5.0 database^[228]96. We
   focused on 1441 diseases with at least one related gene which was
   included in the gene networks for our study. In total, 16,712
   associations link these diseases to 1521 genes which existed in the
   gene networks (Supplementary Data [229]3). The KDDAs were extracted
   from CTD^[230]12 (Supplementary Data [231]4). In this study, 53124
   KDDAs were analyzed in which the drug had at least one target gene and
   the disease had at least one related gene contained in HumanNet. For
   simplicity and consistency, we converted different types of drugs,
   diseases, and gene nomenclatures to DrugBank drug ID, OMIM disease ID,
   and NCBI Entrez gene ID for subsequent modeling and analysis.

   The primary RNA-seq datasets of AD patients, obesity patients, and
   normal individuals were downloaded from NCBI GEO Datasets under the
   accession number of [232]GSE53697, [233]GSE81965, and [234]GSE63887.
   After the SRA files were gathered, the archives were extracted and
   saved in FASTQ format using the SRA Toolkit. RNA-seq reads were trimmed
   using Trimmomatic software
   ([235]http://www.usadellab.org/cms/?page=trimmomatic) with the
   following parameters “ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3
   TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36” (Version 0.36), and were
   further quality-filtered using FASTX Toolkit’s fastq_quality_trimmer
   command ([236]http://hannonlab.cshl.edu/fastx_toolkit/) (Version
   0.0.13) with the minimum quality score 20 and minimum percent of 80%
   bases that had a quality score larger than this cutoff value. The
   high-quality reads were mapped to the hg38 genome by HISAT2, a fast and
   sensitive spliced alignment program for mapping RNA-seq reads, with
   -dta paramenter ([237]http://daehwankimlab.github.io/hisat2/). PCR
   duplicate reads were removed using Picard tools
   ([238]http://broadinstitute.github.io/picard/) and only uniquely mapped
   reads were kept for further analysis. The expression levels of genes
   were calculated by StringTie
   ([239]http://ccb.jhu.edu/software/stringtie/) (Version v1.3.4d) with -e
   -B -G parameters using Release 29 (GRCh38.p12) gene annotations
   downloaded from GENCODE data portal
   ([240]https://www.gencodegenes.org/). To obtain comparable expression
   abundance estimation for each gene, reads mapped to hg38 were counted
   as Fragments Per Kilobase Of Exon Per Million Fragments Mapped based on
   their genome locations. Differential expression analysis of genes was
   performed by DESeq2 using the reads count matrix produced from a python
   script “prepDE.py” provided in StringTie website
   ([241]http://ccb.jhu.edu/software/stringtie/).

   TCGA cancer genomics datasets were directly downloaded via the UCSC
   Xena project data portal ([242]https://xenabrowser.net/datapages/). As
   the DNA methylation datasets were quantified as beta value in the DNA
   probe level, we mapped Illumina Human Methylation 450 probe ID to gene
   name using HumanMethylation450 annotation file. If a gene was mapped by
   multiple probes, we considered the averaged signals of these probes as
   the methylation level of this gene. We used Wilcox signed rank test for
   differential analysis of gene expression and DNA methylation of KDDANet
   resulting subnetwork genes mediating the associations between drug and
   cancer. For this analysis, we only used the tumor samples which have
   adjacent normal tissue samples as control. GDSC1000 cancer cell line
   pharmacogenomic datasets were downloaded from GDSC website
   ([243]https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.
   html). Since this website provided only the CpG island methylation
   data, we downloaded the beta value matrix of probe-level DNA
   methylation from NCBI GEO Dataset under accession number of
   [244]GSE68379 and then converted it to gene-level beta value matrix
   using the same method as TCGA DNA methylation data. CCLE
   pharmacogenomic datasets were directly downloaded from Broad Institute
   data portal ([245]https://portals.broadinstitute.org/ccle/data).

Construction of a unified flow network model

   For each query drug and all its related diseases, the unified flow
   network model in SDrTDi context was built by integrating the query drug
   and all its related diseases into the gene network based on known
   drug-target relationships and gene-disease associations. As shown in
   Fig. [246]1, for the given query drug, we used it as source node (S)
   and integrated it into gene network by introducing the directed edges
   from it point to its target genes. For each its related disease, we
   mapped the disease to gene network by introducing the directed edges
   from its related genes point to it. We incorporated a sink node T and
   introduced a directed edge pointing from each disease to it. With these
   definitions, a unified flow network model was constructed as a complex
   heterogeneous graph G = (V, E), where V was the set of vertices and E
   was the set of edges. This graph included two types of edges
   (bidirectional and directed) and three types of nodes (drugs, genes,
   and diseases). Each edge was assigned with a weight and a capacity.
   Flow goes from a source node to a sink node through the graph edges.
   The assigning scheme of weight and capacity was illustrated as follow:

Weight and capacity assigning scheme for network edges

Edges between gene nodes

   Edges between gene nodes were weighted (W[ij]) to reflect the
   probability that two genes g[i] and g[j] were functionally linked in
   the biological processes. The weight value between g[i] and g[j] was
   derived from a Bayesian statistics approach by integrating diverse
   functional genomics datasets^[247]47. Briefly, each dataset was
   benchmarked for its real capability of reconstructing known cellular
   pathways by measuring the likelihood that pairs of genes (linkages)
   were functionally connected conditioned on the experimental evidence,
   calculated as a log likelihood score LLS^[248]97:
   [MATH: <mrow><mi>L</mi><mi>L</mi><mi>S</mi><mspace
   width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mo>ln</mo><mfenced close="}"
   open="{"><mrow><mfrac><mrow><mi>F</mi><mrow><mo>(</mo><mrow><mi>L</mi><
   mo>∣</mo><mi>E</mi></mrow><mo>)</mo></mrow><mo>/</mo><mi
   mathvariant="normal">~</mi><mi>F</mi><mrow><mo>(</mo><mrow><mi>L</mi><m
   o>∣</mo><mi>E</mi></mrow><mo>)</mo></mrow></mrow><mrow><mi>F</mi><mrow>
   <mo>(</mo><mrow><mi>L</mi></mrow><mo>)</mo></mrow><mo>/</mo><mi
   mathvariant="normal">~</mi><mi>F</mi><mrow><mo>(</mo><mrow><mi>L</mi></
   mrow><mo>)</mo></mrow></mrow></mfrac></mrow></mfenced></mrow> :MATH]
   1

   Where F(L|E) and ~F(L|E) represented the observed numbers of linkages L
   appear in the given experiment E between functionally annotated human
   genes interacting within the same pathway and between different
   pathways, respectively, F(L) and ~F(L) denoted the total observed
   numbers of linkages between all annotated human genes interacting
   within the same pathway and between different pathways, respectively.
   The weight value (W) of a given linkage between two genes was produced
   by combining LLS score of each dataset using the formula:
   [MATH: <mrow><mi>W</mi><mspace
   width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mi>L</mi><mi>L</mi><msub><mrow><mi>S</mi></mro
   w><mrow><mi>b</mi><mi>e</mi><mi>s</mi><mi>t</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>+</mo><mspace
   width="0.25em"></mspace><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mspace
   width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mn>1</mn></mrow><mrow><mi>N</mi></mrow></munde
   rover><mfrac><mrow><mi>L</mi><mi>L</mi><msub><mrow><mi>S</mi></mrow><mr
   ow><mi>i</mi></mrow></msub></mrow><mrow><mi>D</mi><mo>⋅</mo><mi>i</mi><
   /mrow></mfrac><mo>,</mo><mspace
   width="0.25em"></mspace><mi>f</mi><mi>o</mi><mi>r</mi><mspace
   width="0.25em"></mspace><mi>a</mi><mi>l</mi><mi>l</mi><mspace
   width="0.25em"></mspace><mi>L</mi><mspace
   width="0.25em"></mspace><mo>≥</mo><mspace
   width="0.25em"></mspace><mi>T</mi></mrow> :MATH]
   2

   Where LLS[best] represented the highest LLS score for a linkage between
   two genes, D determined decay rate of the LLS score for additional
   evidence, and i was the order index of LLS scores for a given linkage
   between two genes, ranking starting from the second maximum LLS score
   with descending order of magnitude for all N remaining LLS scores. T
   represented a minimum threshold of LLS score to be considered. The
   values of D and T were empirically optimized to maximize overall
   performance on known GO annotations measured by AUPRC^[249]98.

Edges between drug and drug’s target gene nodes

   Edges between each drug and the drug’s target gene nodes were weighted
   (w[Si]) to reflect the normalized reliability of the interaction
   between drug and target protein based on experimental and computational
   evidence. The weighting scheme was based on the predicted score of drug
   and target protein interaction^[250]25. The weight value (W[Si]) was
   calculated as:
   [MATH:
   <mrow><msub><mrow><mi>W</mi></mrow><mrow><mi>S</mi><mi>i</mi></mrow></m
   sub><mspace width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mfrac><mrow><msub><mrow><mi>P</mi></mrow><mrow
   ><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><mo>∑</mo></mrow><mro
   w><mi>j</mi><mo>∈</mo><mi>T</mi></mrow></msub><msub><mrow><mi>P</mi></m
   row><mrow><mi>j</mi></mrow></msub></mrow></mfrac></mrow> :MATH]
   3

   Where T denoted the set of each drug’s targets, [Pi] denotes the
   predicted score between drug S and target i, P[j] denoted the predicted
   score between drug S and target j.

Edges between disease-related gene and disease nodes

   Edges between disease-related gene and disease nodes were weighted
   (w[jd]) to reflect the normalized reliability of the linkage between
   disease and gene based on experimental and computational evidence. The
   weighting scheme was based on the predicted score of gene-disease
   association derived from MAXIF algorithm^[251]99. We calculated the
   weight value (W[jd]) as follow:
   [MATH:
   <mrow><msub><mrow><mi>W</mi></mrow><mrow><mi>j</mi><mi>d</mi></mrow></m
   sub><mspace width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mfrac><mrow><msub><mrow><mi>F</mi></mrow><mrow
   ><mi>i</mi></mrow></msub></mrow><mrow><msub><mrow><mo>∑</mo></mrow><mro
   w><mi>j</mi><mo>∈</mo><mi>D</mi></mrow></msub><msub><mrow><mi>F</mi></m
   row><mrow><mi>j</mi></mrow></msub></mrow></mfrac></mrow> :MATH]

   Where D denoted the set of genes linked to disease d, F[i] denoted the
   predicted score between gene i and disease d, P[j] denoted the
   predicted score between gene j and disease d.

Edges between disease and sink nodes

   For each edge linking each disease d to sink node T, we assigned it a
   same weight value W[dT] = 1/N, where N denoted the number of diseases
   linking to sink node.

   We further defined for each edge in this heterogeneous network a
   capacity value that limited the flow quantity. For each edge connecting
   the query drug S to its target gene i, we assigned it a capacity C[Si]
   equal to W[Si]. For each edge linking the disease-related gene j to
   disease d, we assigned it a capacity C[jd] equal to W[jd]. For each
   edge linking the disease d to sink node T, we assigned it a capacity
   C[dT] equal to W[dT.] For other edges, we assigned them a capacity
   C[ij] = 1.

Minimum cost flow optimization algorithm

   With the purpose of identifying hidden genes mediating KDDA, we search
   for an possible solution that would (1) capture the subset of the query
   drug’s target genes which closely modulate all its related diseases by
   the disease-related genes without restrict to the prior KDDA genes, (2)
   determine hidden genes that were likely to be part of cellular pathways
   connecting the query drug’s target genes to all its related diseases
   but escaped detection by experiments, (3) give high priority to genes
   that lie on paths with highest probability connecting the query drug to
   all its related diseases without making constraints on the network
   structure. The rationality for proposing this solution including two
   aspects: (1) this needed less computational time than that finding the
   highest-probability subnetwork connects a query drug to each its
   related diseases at a time; (2) this can effectively find the shared
   genes mediating multiple KDDAs. Similar to how a fluid flows through
   the path of least resistance, we formulated this goal as a minimum cost
   flow optimization problem^[252]44,[253]100,[254]101. Cost was defined
   as the negative log of the probability of an edge. Thus, minimizing the
   cost given preference to the highest-probability paths. Given the
   unified flow network, this problem can be expressed as a linear
   programming formula that minimizes the total cost of the flow network
   while diffusing the most flow from query drug node to hypothetical sink
   node. Let W[ij], F[ij], and C[ij] referred to the weight, flow, and
   capacity from node i to node j, respectively. The linear programming
   formula can be written as follow.
   [MATH: <mrow><mi
   mathvariant="normal">Minimize</mi><munder><mrow><mo>∑</mo></mrow><mrow>
   <mi>i</mi><mo>∈</mo><mi>V</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>V</mi><
   /mrow></munder><mfenced close=")"
   open="("><mrow><mo>−</mo><mi>log</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi><mi>j</mi><
   /mrow></msub></mrow></mfenced><mspace
   width="0.25em"></mspace><mo>*</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>F</mi></mrow><mrow><mi>i</mi><m
   i>j</mi></mrow></msub></mrow></mfenced><mspace
   width="0.25em"></mspace><mo>−</mo><mspace
   width="0.25em"></mspace><mfenced close=")"
   open="("><mrow><mi>γ</mi><mspace
   width="0.25em"></mspace><mo>*</mo><mspace width="0.25em"></mspace><mo
   mathsize="big">∑</mo><msub><mrow><mi>F</mi></mrow><mrow><mi>S</mi><mi>v
   </mi></mrow></msub></mrow></mfenced></mrow> :MATH]
   5
   [MATH: <mrow><mi mathvariant="normal">Subject</mi><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">to</mi><munder><mrow><mo>∑</mo></mrow><mrow><mi>j<
   /mi><mo>∈</mo><mi>V</mi></mrow></munder><msub><mrow><mi>F</mi></mrow><m
   row><mi>i</mi><mi>j</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>−</mo><mspace
   width="0.25em"></mspace><munder><mrow><mo>∑</mo></mrow><mrow><mi>j</mi>
   <mo>∈</mo><mi>V</mi></mrow></munder><msub><mrow><mi>F</mi></mrow><mrow>
   <mi>j</mi><mi>i</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>∀</mo><mi>i</mi><mo>∈</mo><mi>V</mi><mspace
   width="0.25em"></mspace><mo>−</mo><mspace
   width="0.25em"></mspace><mfenced close="}"
   open="{"><mrow><mi>S</mi><mo>,</mo><mi>T</mi></mrow></mfenced></mrow>
   :MATH]
   6
   [MATH:
   <mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>S</mi><mi>v</mi></mrow></m
   sub><mspace width="0.25em"></mspace><mo>−</mo><mspace
   width="0.25em"></mspace><munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi>
   <mo>∈</mo><mi>D</mi></mrow></munder><msub><mrow><mi>F</mi></mrow><mrow>
   <mi>i</mi><mi>T</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mn>0</mn></mrow> :MATH]
   7
   [MATH: <mrow><mn>0</mn><mspace
   width="0.25em"></mspace><mo>≤</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>F</mi></mrow><mrow><mi>i</mi><m
   i>j</mi></mrow></msub><mspace width="0.25em"></mspace><mo>≤</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>C</mi></mrow><mrow><mi>i</mi><m
   i>j</mi></mrow></msub></mrow> :MATH]
   8

   Where V denoted a set of nodes included in the flow network; S denoted
   the query drug and v denoted its target gene. The parameter gamma (
   [MATH: <mi>γ</mi> :MATH]
   ) controlled the size and the quality of the optimized subnetwork. The
   first component of this formula,
   [MATH:
   <mrow><msub><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>V</mi>
   <mo>,</mo><mi>j</mi><mo>∈</mo><mi>V</mi></mrow></msub><mfenced
   close=")" open="("><mrow><mo>−</mo><mi>log</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>W</mi></mrow><mrow><mi>i</mi><mi>j</mi><
   /mrow></msub></mrow></mfenced><mspace
   width="0.25em"></mspace><mo>*</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>F</mi></mrow><mrow><mi>i</mi><m
   i>j</mi></mrow></msub></mrow></mfenced></mrow> :MATH]
   , ensured minimizing the network cost that given priority to obtain
   highest-probability gene subnetwork, at the same time, the second
   component,
   [MATH:
   <mrow><mo>−</mo><mrow><mo>(</mo><mrow><mi>γ</mi><mo>*</mo><mo>∑</mo><ms
   ub><mrow><mi>F</mi></mrow><mrow><mi>S</mi><mi>v</mi></mrow></msub></mro
   w><mo>)</mo></mrow></mrow> :MATH]
   , indicated maximizing the total flow across entire network. This
   optimization problem can be efficiently solved using primal simplex
   method provided in Mixed Integer Linear Programming solver
   ([255]http://lpsolve.sourceforge.net/). The solution
   [MATH:
   <mrow><mi>a</mi><mi>r</mi><mi>g</mi><mi>m</mi><mi>i</mi><msub><mrow><mi
   >n</mi></mrow><mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi><mi>j<
   /mi></mrow></msub><mspace width="0.25em"></mspace><mo>></mo><mspace
   width="0.25em"></mspace><mn>0</mn></mrow></msub><msub><mrow><mo>∑</mo><
   /mrow><mrow><mi>i</mi><mo>∈</mo><mi>V</mi><mo>,</mo><mi>j</mi><mo>∈</mo
   ><mi>V</mi></mrow></msub><mfenced close=")"
   open="("><mrow><mo>−</mo><mi>log</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>w</mi></mrow><mrow><mi>i</mi><mi>j</mi><
   /mrow></msub></mrow></mfenced><mspace
   width="0.25em"></mspace><mo>*</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi><m
   i>j</mi></mrow></msub></mrow></mfenced><mspace
   width="0.25em"></mspace><mo>−</mo><mspace
   width="0.25em"></mspace><mfenced close=")"
   open="("><mrow><mi>γ</mi><mspace
   width="0.25em"></mspace><mo>*</mo><mspace width="0.25em"></mspace><mo
   mathsize="big">∑</mo><msub><mrow><mi>f</mi></mrow><mrow><mi>S</mi><mi>v
   </mi></mrow></msub></mrow></mfenced></mrow> :MATH]
   obtained the highest-probability subnetwork mediating the associations
   between query drug and all its related diseases.

Depth-first searching and Markov clustering (MCL)

   Once the highest-probability subnetwork mediating the query drug
   (disease) with all its related diseases (drugs) was obtained, we
   implemented depth-first searching^[256]102 on this subgraph to find the
   subnetwork made up of all paths linking the query drug (disease) to
   each it’s related disease (drug). All genes in the solution were ranked
   by the amount of flow they carry. The more flow that passes through a
   protein, the more important it was in mediating KDDA. After obtaining
   the subnetwork mediating individual KDDA, MCL
   ([257]https://micans.org/mcl/) was employed to further discover gene
   modules mediating KDDA by using the flow quantities through edges of
   subnetwork as weight values.

Application KDDANet to SDiTDr context

   To apply KDDANet in SDiTDr context, the query disease, and a set of all
   its related drugs were mapped into gene network by disease-related
   genes and drug target genes. For constructing flow network model, the
   weights and capacities of network edges can be assigned using the
   similar method as described in SDrTDi context by substituting query
   drug with query disease and using query disease as source node (S),
   substituting query drug-related diseases with query disease-related
   drugs, substituting query drug’s target genes with query
   disease-related genes, substituting disease-related genes with drug’s
   target genes, and incorporating a sink node T and introducing a
   directed edge pointing from each drug to it. Implementing minimum cost
   flow optimization, depth-first searching, and MCL on the unified flow
   network was same as SDrTDi context.

Performance evaluation

   As the true genes mediating KDDA are poorly understood, there was no
   perfect way to assess the prediction results. The predictive
   performance of KDDANet was evaluated as follow: (1) Based on the
   hypothesis that the larger functional similarity between a gene and
   known KDDA genes, the higher probability this gene was positive one
   mediating KDDA, we compiled a standard set of positive and negative
   KDDA genes for each KDDA resulting subnetwork to unbiasedly evaluate
   the performance of KDDANet using the following strategy:

   (1) For each gene in the KDDA resulting subnetwork, we first calculated
   the mean functional similarity scores of it with KDTGs and KDRGs by our
   previous published method using gene ontology (GO), KEGG pathways, and
   InterPro annotation as functional terms^[258]103.

   (2) We performed a permutation test by randomly producing KDTGs and
   KDRGs 1000 times to compute the empirical significance level of
   functional similarity. We selected the genes having significant
   functional similarities with both KDTGs and KDRGs from KDDA resulting
   subnetwork as positive KDDA genes using the criterion that the
   similarity score was larger than 95th percentiles of the simulated
   background distributions. The other genes were considered as negative
   KDDA genes.

   Based on this gold standard, ROC and PR curves were produced and the
   areas of under curve of ROC and PR (AUROC and AUPRC) were calculated
   for gene list ranked by flow amount of each KDDANet resulting
   subnetwork. For comparison with PPA, SNPLS, comCHIPER, and DGPsubNet,
   we used the default parameters as mentioned in their published papers.
   We considered that all genes contained in a co-module mediating KDDA of
   each drug and disease pair in the co-module. The genes were ranked by
   probability score mediating a KDDA. For PPA and SNPLS, the probability
   score of a gene g mediating the association between drug i and a type
   of cancer j was defined as averaged prediction score across all cell
   lines of this type of cancer^[259]40,[260]41. For comCHIPER, the
   probability score of a gene g mediating the association between drug i
   and disease j was defined as the sum of the products of posterior
   indicator probabilities across all co-modules^[261]42. For DGPsubNet,
   we defined the probability score of a gene g mediating the association
   between drug i and disease j as the calculated z score z[ij]^[262]43.
   The areas of under curve of ROC and PR (AUROC and AUPR) were calculated
   using R package of PRROC. KDDANet resulting subnetwork visualization
   was carried out using Cytoscape software ([263]https://cytoscape.org/).
   KEGG pathway enrichment analysis was performed by clusterProfiler R
   package^[264]104. The visualization of results was carried out in R
   software. Case studies were conducted in SDrTDi context unless
   specifically indicated.

Reporting summary

   Further information on research design is available in the [265]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [266]Supplementary Information^ (2.5MB, pdf)
   [267]Reporting Summary^ (70.3KB, pdf)
   [268]Supplementary Data 1^ (17.7MB, xls)
   [269]Supplementary Data 2^ (674.5KB, xls)
   [270]Supplementary Data 3^ (780.5KB, xls)
   [271]Supplementary Data 4^ (1.3MB, xls)

Acknowledgements