Abstract
Many of genes mediating Known Drug-Disease Association (KDDA) are
escaped from experimental detection. Identifying of these genes (hidden
genes) is of great significance for understanding disease pathogenesis
and guiding drug repurposing. Here, we presented a novel computational
tool, called KDDANet, for systematic and accurate uncovering the hidden
genes mediating KDDA from the perspective of genome-wide functional
gene interaction network. KDDANet demonstrated the competitive
performances in both sensitivity and specificity of identifying genes
in mediating KDDA in comparison to the existing state-of-the-art
methods. Case studies on Alzheimer’s disease (AD) and obesity uncovered
the mechanistic relevance of KDDANet predictions. Furthermore, when
applied with multiple types of cancer-omics datasets, KDDANet not only
recapitulated known genes mediating KDDAs related to cancer, but also
revealed novel candidates that offer new biological insights.
Importantly, KDDANet can be used to discover the shared genes mediating
multiple KDDAs. KDDANet can be accessed at [32]http://www.kddanet.cn
and the code can be freely downloaded at
[33]https://github.com/huayu1111/KDDANet.
Subject terms: Data mining, Drug discovery
Introduction
The conventional development of novel promising drugs for treating
specific diseases is a time-consuming and effort-costing process,
including discovery of new chemical entities, target detection and
verification, preclinical and clinical trials and so on^[34]1. In
addition, only about 10% of new drugs are approved by FDA per year,
meaning most new drugs are never approved and taken to market, which
prevents them from being used in practice^[35]1. The decrease in
pharmaceutical research productivity towards new drug development has
left a gap between the therapeutic needs of patients and the available
treatment paradigms^[36]1. Compared with traditional drug development,
drug repositioning, i.e., finding the novel indications of existing
drugs, offers an alternative approach for safer and faster drug
development because of several procedures of traditional drug
development workflow are not involved during repurposing efforts^[37]2.
There are multiple examples of repositioned drugs that are on the
market now including Minoxidil, a drug designed to treat hypertension
but is now used to treat hair loss^[38]3, and Sildenafil, a drug
originally developed for patients with heart disease but is now
commonly used to treat erectile dysfunction^[39]4. However, these
examples of repositioned drugs were primarily based on clinical
observations of the side effects of the drug^[40]5. Thanks to the
advance in next-generation omics sequencing and qualification
technologies, a large volume of biomedical data, for example, the
pharmacogenomics datasets produced by Connective Map project, The
Cancer Genome Atlas (TCGA) project, Cancer Cell Line Encyclopedia
(CCLE) project, Genomics of Drug Sensitivity in Cancer 1000 human
cancer cell lines (GDSC1000) project and Library of Integrated
Network-based Cellular Signatures project, has been rapidly accumulated
for enabling drug repurposing^[41]6–[42]14. Based on these datasets,
various computational methods have been designed for facilitating the
process of drug repurposing (see Supplementary Note [43]1 for a mini
review). To infer pharmacokinetic and pharmacodynamic drug–drug
interactions and their associated recommendations, for example,
Gottlieb et al. designed a similarity measure-based logistic regression
classifier^[44]15. For predicting drug side effects, Tatonetti et al.
presented an adaptive data-driven approach^[45]16. To identify novel
drug combinations, Zhao et al. integrated the molecular and
pharmacological data and developed a novel computational
approach^[46]17. In addition, network-based method has also been
employed to achieve the similar goal^[47]18. Interestingly, Kuenzi et
al. developed a deep-learning model of visible neural network, called
DrugCell, to predict drug response and synergy in human cancer
cells^[48]19. For discovering novel drug indication, Gottlieb et al.
developed PREDICT, which scored a possible drug-disease link by
combining multiple drug–drug and disease–disease similarity
measures^[49]20. With the similar motivation and aim, an unsupervised
and unbiased network-based proximity measure has also been
designed^[50]21. Moreover, Cheng et al. showed that the further
integration of network proximity-based approach with large-scale
patient-level longitudinal data can offer an effective platform for
validating drug indications^[51]22. For predicting drug-target
interaction, Paolini et al. presented a global mapping of
pharmacological space and probabilistic models by integrating multiple
medicinal chemistry data^[52]23. Different from the approach employed
by Paolini et al., Campillos et al. used the phenotypic side-effect
similarity to determine whether two drugs share a target^[53]24.
Besides, we and others also employed machine learning and network
integration approaches for achieving the same goal^[54]25,[55]26. To
discover disease-related genes, Gottlieb et al. designed “PRINCIPLE”,
which employed classical network propagation algorithm^[56]27. Wu et
al. developed CIPHER that integrated human protein–protein
interactions, disease phenotype similarities, and known gene-phenotype
associations to capture the complex relationships between phenotypes
and genotypes^[57]28. In addition, Ghiassian et al. identified disease
gene modules based on a systematic analysis of connectivity patterns of
disease proteins in the human interactome^[58]29. Furthermore, Zhou et
al. and Menche et al. constructed disease-symptom and disease–disease
relationship networks based on the biomedical literature databases and
incomplete interactome, respectively^[59]30,[60]31. Excitingly, Hofree
et al. developed a new computational approach which employed
network-based stratification to uncover tumor subtypes by integrating
somatic tumor genomes with gene networks^[61]32. Collectively, these
methods have effectively exploited and integrated multilevel biomedical
and omics data sources for understanding the pathology of diseases and
mechanisms of drug actions and thus accelerated drug repurposing.
Theoretically, drug repurposing has been proposed based on two
molecular aspects. (1) On one hand, complex diseases often involve
multiple genetic and environmental determinants, including multifactor
driven alterations and dysregulation of a series of
genes^[62]33,[63]34, which will propagate and perturb certain
biological processes by the interactions among molecules, leading to
the onset of diseases. (2) On the other hand, one drug can exert
impacts on many targets and perturb multiple biological
processes^[64]34,[65]35. As a result, the genes of shared biological
pathways in the cellular network manifested in certain disease state
and induced by a known drug administration suggest potential drug
repurposing^[66]33–[67]38. However, many of these genes mediating Known
Drug-Disease Associations (KDDAs) across various types of diseases have
not yet been identified (see the “Results” section and Supplementary
Fig. [68]S1a, [69]b for supporting data of this claim). Therefore,
developing the appropriate theoretical computational tools from the
perspective of molecular interaction network to unveil the KDDA genes
missed from experiments (hidden genes) is of great significance for
understanding disease pathogenesis and guiding drug repurposing. The
network-based computational methods have been designed for facilitating
drug repurposing which linked drugs to targets or connected diseases to
genes or associated drugs with
diseases^[70]18,[71]21,[72]22,[73]26–[74]29,[75]39. Nevertheless, the
publicly available computational tools specially tailored for
simultaneously bridging drugs, genes, and diseases have not been fully
developed. To our knowledge, some computational tools have been
designed to identify drug-gene-disease co-module. Kutalik et al. and
Chen et al. developed ping-pong algorithm (PPA) and sparse
network-regularized partial least square (SNPLS) to identify co-modules
related to specific cancer cell lines of NCI-60 and Cancer Genome
Projects, respectively^[76]40,[77]41. However, these two methods need
to integrate gene-expression and drug-response data of cancer cell
lines for constructing models and do not carry out prediction for other
types of diseases. The other two methods of comCIPHER and DGPsubNet
have been designed to identify the coherent subnetworks linking drugs
and diseases (not limited in cancer) with the related
genes^[78]42,[79]43. The common shortcomings of these methods are the
identified co-modules including multiple drugs and diseases and thus
are unable to uncover genes specifically mediating individual KDDA. To
this end, we designed a novel computational pipeline, called KDDANet,
which uses known functional gene interaction network to identify hidden
genes of cellular pathways mediating KDDA in a genome-wide scale. Our
KDDANet pipeline depends on three existing network algorithms: minimum
cost network flow optimization, depth-first searching, and graph
clustering algorithm. The minimum cost network flow optimization has
been effectively employed to identify cellular response subnetwork
connecting genetic hits and differentially expressed genes, including
components of the response that are otherwise hidden or missed from
experiments^[80]44. KDDANet can be applied to two contexts: (1)
uncovering hidden genes mediating the association between a query drug
and its related disease (SDrTDi); (2) unveiling hidden genes mediating
the association between a query disease and its related drug (SDiTDr).
The computational procedure of KDDANet in SDrTDi context was showed in
Fig. [81]1 (see “Method” section for details). KDDANet first built a
unified flow model by integrating query drug, genes, and all related
diseases into a heterogeneous network (Fig. [82]1a). Then, the minimum
cost flow optimization was designed and implemented to identify gene
subnetwork mediating the association between the query drug to all its
related diseases (Fig. [83]1b). Finally, depth-first searching, and
Markov clustering (MCL) algorithm were adopted to further uncover gene
modules mediating the association between the query drug and each its
related disease (Fig. [84]1c–[85]e). The outputs of KDDANet were
validated against existing literature and multi-omics’ datasets (Fig.
[86]1f). To apply KDDANet in SDiTDr context, what the user need was
just to simply rebuilt the unified flow network model (see “Methods”
section for details).
Fig. 1. Schematic illustration of KDDANet computational pipeline in SDrTDi
context.
[87]Fig. 1
[88]Open in a new tab
a Mapping query drug (QD) and all its related diseases (RDs) into the
weighted gene network (WGN) through known drug-target relationships and
gene-disease associations. b Constructing a unified flow network model
for each QD and RDs (multiple KDDAs). c Identifying the
highest-probability gene subnetwork mediating multiple KDDAs by minimum
cost flow optimization. d Identifying gene subnetwork mediating
individual KDDA by depth-first searching. e Identifying gene
interaction modules mediating individual KDDA by Markov clustering
(MCL). f Validating the prediction results of KDDANet by existing
knowledge bases.
The key novelty of the KDDANet method lay in that the original
designation of a unified flow network model and effective implementing
multiple network-based algorithms on this unified flow network model.
We demonstrated that KDDANet showed competitive performance in
discovering known and novel genes mediating KDDA through a comparison
with existing methods. KEGG pathway enrichment analysis on KDDANet
resulting subnetworks and case studies on Alzheimer’s disease (AD) and
obesity further showed the mechanistic relevance of KDDANet
predictions. Validated with multiple types of cancer-omics’ datasets,
KDDANet did not only revealed known genes mediating KDDAs associating
drug with cancer, but also uncovered new candidates that offer novel
biological insights. Particularly, our results demonstrated that
KDDANet can reveal the shared genes mediating multiple KDDAs. These
outcomes showed the importance of incorporating hidden genes in drug
discovery pipelines. For facilitating biomedical researchers to explore
the molecular mechanism of KDDA and guiding drug repurposing, an online
web server, [89]http://www.kddanet.cn, was provided for user to access
the subnetwork of genes mediating KDDA, and the source codes of KDDANet
were freely available at [90]https://github.com/huayu1111/KDDANet. In
summary, we developed an effective and universal computational tool and
an online web source for accurate and systematic discovering hidden
genes mediating KDDA and thus providing novel insights into mechanism
basis of drug repurposing and disease treatments. We believed that
KDDANet can provide additional contributions to the development of new
therapies.
Results
Evaluation of the performance and general applicability of KDDANet method
We first checked whether the potential genes mediating KDDA across
various types of diseases have been experimentally identified. For a
given KDDA, we proposed a hypothesis that the Known Drug Target Genes
(KDTGs) and Known Disease-Related Genes (KDRGs) should be highly
overlapped if the genes mediating this KDDA have been fully identified.
We analyzed the overlap between KDTGs and KDRGs of 53124 KDDAs obtained
from Comparative Toxicogenomics Database (CTD)^[91]12. For a KDDA, we
defined the overlap ratio as the number of shared genes between KDTGs
and KDRGs divided by the number of total KDTG and KDRG genes. We
observed that the most gene sets demonstrated extremely low overlap
ratio, with the increasing of gene number, the overlap ratio was
sharply decreased (Supplementary Fig. [92]1a). We then checked the
overlap ratio in different types of diseases and found the overlaps
between KDTGs and KDRGs were small for all 19 disease types in our
dataset (Supplementary Fig. [93]1b). The discrepancy between KDTGs and
KDRGs indicated that each gene set alone provided only a limited and
biased view of KDDA, many of true genes in the cellular pathways
mediating KDDA were not identified from the experiments but otherwise
hidden. To address this, we designed a novel computational tool,
KDDANet, which effectively integrated minimum cost flow optimization,
combined with depth-first searching and graph clustering to
systematically discover the hidden genes of cellular pathways mediating
KDDA (see “Methods” section for details).
To examine whether KDDANet can capture true genes mediating KDDA, we
introduced two new concepts: “known true KDDA genes” (KTKGs) and “novel
true KDDA genes” (NTKGs). For a given KDDA, KTKGs were defined as the
shared genes between KDTGs and KDRGs inputted for constructing KDDANet
flow network model (see “Dataset” section for details). To obtain
NTKGs, we collected a set of drug’s non-target genes (A) from SMPDB 2.0
database^[94]45 that were included in the drug’s ADME pathways and were
responsible for mediating KDDA. Meanwhile, we collected a recently
updated set of disease-related genes (B) from DisGeNet v6.0
database^[95]46, which were not included in KDDANet flow network model.
Based on these, we defined, for each KDDA, the NTKGs as the shared
genes between A and B. A parameter γ effected the size and quality of
KDDANet output subnetwork, higher γ values will identify more gene
links mediating KDDA but with lower confidence. Using gene set
enrichment analysis, we observed that KDDANet can consistently and
effectively capture the KTKGs and NTKGs mediating KDDA under different
γ settings (see Supplementary Notes [96]2 and [97]3, and Supplementary
Fig. [98]1c–l for details).
Based on this, we compiled a standard set of positive and negative KDDA
genes for each KDDA to unbiasedly evaluate the capability of KDDANet
method on uncovering the true genes mediating KDDA using Receiver
Operating Characteristic (ROC) and Precision-Recall (PR) curves (see
“Methods” section for details). We observed that the performance of
KDDANet was obviously better than random permutation across a widely
settings of γ (Fig. [99]2a, [100]b). We next selected γ = 6 for SDrTDi
and γ = 8 for SDiTDr for subsequent evaluation (see Supplementary Note
[101]4 for reasons). With this setting, we further compared KDDANet
with other existing methods for drug-gene-disease co-module discovery,
including SNPLS^[102]40, PPA^[103]41, comCHIPER^[104]42, and
DGPsubNet^[105]43 (see “Methods” section for the details of performance
comparison). Among all the methods tested, KDDANet demonstrated the
competitive performance with the averages of AUROC 0.733 and 0.715 and
AUPRC 0.793 and 0.825 in SDrTDi and SDiTDr contexts, respectively (Fig.
[106]2c and Table [107]1). These results suggested that KDDANet was an
effective tool for achieving the goal of uncovering true genes
mediating KDDA genome wide. We next tested whether KDDANet had the
general application value across a variety of disease types and found
that KDDANet can make effective capturing of true genes mediating KDDA
for all 19 types of diseases (Supplementary Fig. [108]2a–d). We further
tested KDDANet using different types of networks, including
HINT + HI2012, iRefIndex, MultiNet, and STRINGv10^[109]47. We found
that the performances of KDDANet were consistent well across all these
networks (Fig. [110]2d, Supplementary Fig. [111]2e, f). Collectively,
these results fully demonstrated that KDDANet was an effective and
general computational tool for uncovering hidden genes mediating KDDA
across broad types of diseases.
Fig. 2. Performance evaluation of KDDANet method.
[112]Fig. 2
[113]Open in a new tab
a The density curve of AUROC and AUPRC of KDDANet with different γ
setting and permutation test in SDrTDi context. b The density curve of
AUROC and AUPRC of KDDANet with different γ setting and permutation
test in SDiTDr context. c Comparison of AUROC and AUPRC of KDDANet with
PPA, SNPLS, comCHIPER, and DGPsubNet. d AUROC and AUPRC of KDDANet with
different types of gene interaction networks.
Table 1.
Average AUC values of different computational tools for identifying
hidden genes mediating KDDA.
Method KDDANet(SDrTDi) KDDANet(SDiTDr) SNPLS PPA comCHIPER DGPsubNet
Average AUROC 0.733 0.715 0.712 0.713 0.703 0.706
Average AUPRC 0.793 0.825 0.775 0.776 0.778 0.777
[114]Open in a new tab
Mechanistic relevance of KDDANet predictions
We examined whether the enriched pathways of KDDANet resulting
subnetworks have the mechanistic relevance with KDDA by carrying out a
global enrichment of all predicted KDDA subnetwork genes against 53
classical KEGG pathways. The obtained enrichment results can be
validated by existing knowledge (see Supplementary Note [115]5 and
Supplementary Fig. [116]3a, b for details). We aimed to provide two
cases to intuitively describe the mechanistic relevance of KDDANet
resulting subnetwork. Phylloquinone (DB01022)-AD (104300) association
has been reported in previous study^[117]48. A subnetwork including 46
genes and 44 links were uncovered mediating this association (Fig.
[118]3a). Interestingly, for this subnetwork, two separated gene
modules (M1 and M2) were detected (Fig. [119]3a). The AUROC and AUPRC
values of for this KDDANet resulting subnetwork were 0.864 and 0.795,
respectively (Supplementary Fig. [120]3c). Two known targets of
phylloquinone and 18 AD-related genes were identified in this
subnetwork. Against with genome background, this subnetwork captured 4
NTKGs with ~86-fold enrichment and adjusted p value of 9.716e−09
(Hypergeometric test and Bonferroni correction). The top ten enriched
KEGG terms of this subnetwork, such as Phospholipase D signaling
pathway and Neurotrophin signaling pathway^[121]49,[122]50, were
closely related with AD (Fig. [123]3b). The module M1 mainly functioned
in Insulin signaling pathway, ErbB signaling pathway, FoxO signaling
pathway and growth hormone synthesis, secretion, and action, which
played important roles in neural system development and the onset and
development of AD^[124]51–[125]53 (Supplementary Fig. [126]3d). The
enriched KEGG pathways of M2 genes, including Complement and
coagulation cascades, Glycolysis and AGE-RAGE signaling pathway in
diabetic complications, were dysfunction in AD^[127]54–[128]56
(Supplementary Fig. [129]3d). We further analyzed the published RNA-seq
data to detect the expressional change of these two modules in normal
individuals and AD patients^[130]57. We observed that the averaged
expression level of M1 genes was significantly upregulated in AD
(Supplementary Fig. [131]3e). Interestingly, a predicted novel gene,
STAT3, was obviously activated in AD patients (Fig. [132]3c). Two
newest studies published in years 2019 and 2020 reported that STAT3 was
a potential therapeutic target for cognitive impairment in
AD^[133]58,[134]59. Another predicted novel gene, GNAI1, was also
obviously activated in AD patients. This gene was contained in the
causal pathways associated with an imaging endophenotype characteristic
of longitudinal structural change in the brains of patients with
AD^[135]60. For module M2 genes, the obviously expressional changes
between normal individuals and AD patients were not observed
(Supplementary Fig. [136]3e). However, we found that a predicted novel
KDDA gene, GC, was activated in AD patients (Fig. [137]3c). This gene
encoded vitamin D binding protein, which was recently evidenced as a
potential therapeutic agent for the treatment of AD^[138]61.
Fig. 3. Mechanistic relevance of KDDANet prediction results.
[139]Fig. 3
[140]Open in a new tab
a KDDANet resulting subnetwork mediating phylloquinone
(DB01022)-Alzheimer’s disease (AD, 104300) association. b Top ten
enriched KEGG terms of subnetwork genes mediating phylloquinone-AD
association. c Expression level of STAT3, GNAI1, and GC in normal
individuals and AD patients, **p value < 0.01, calculated by
Mann–Whitney U test. d KDDANet resulting subnetwork for heparin
(DB01109)-obesity (601665) association. e Top ten enriched KEGG terms
of subnetwork genes mediating heparin-obesity association. f ROC and PR
curves of KDDANet gene subnetwork mediating heparin-obesity
association. In the subnetworks, the size of a gene node was
proportional to its network degree; The thickness of a network edge was
proportional to its flow amount. FPKM Fragments Per Kilobase Of Exon
Per Million Fragments Mapped.
The association between heparin (DB01109) and obesity (601665) was
inferred by multiple genes as described in CTD database. For this
association, KDDANet predicted a subnetwork containing 168 edges
connecting 169 genes (Fig. [141]3d). We found that two KDTGs and 67
disease-related genes were captured in this subnetwork. Particularly,
eight genes were the NTKGs with ~52-fold enrichment and adjusted p
value of 1.485e−10 (Hypergeometric test and Bonferroni correction).
Three genes used to infer this KDDA, including ARK1, PARP1, and TNF,
were also effectively captured in the resulting subnetwork. The top ten
enriched functions of this subnetwork were showed in Fig. [142]3e. As
expected, insulin resistance, type II diabetes mellitus, insulin
signaling pathway, and adipocytokine signaling pathway were the
frequently reported events and molecular processes associated with
obesity^[143]62,[144]63. In addition, proteoglycans, lipolysis, and
AMPK signaling pathway were also highly related with Insulin
resistance^[145]64–[146]66. In consistent with this, the AUROC and
AUPRC values of KDDANet for this subnetwork were 0.854 and 0.849,
respectively (Supplementary Fig. [147]3e). We next delineated the
subnetwork into gene modules. The enriched functions of top three gene
modules were demonstrated in Supplementary Fig. [148]3f. The genes of
module M1 mainly participated in Glycolysis and Carbon metabolism.
Further analysis of public RNA-seq data of normal individuals and
obesity patients^[149]67,[150]68 demonstrated the significantly
repressed expression of M1 genes in patients with obesity, such as
GADPH (Supplementary Fig. [151]3g, h). This was consistent with the
fact that enhancing the level of glycolysis reduced obesity^[152]69.
The mainly enriched pathways of M2 genes were related to Insulin
resistance, a frequently happened event in obesity patients.
Interestingly, Cell adhesion molecules was a significantly enriched
KEGG term of M2 genes (Supplementary Fig. [153]3f), which was elevated
in patients with obesity^[154]70. The M3 genes functioned in Complement
and coagulation cascades and Platelet activation (Supplementary Fig.
[155]3f). As reported, these two terms were closely associated with
obesity^[156]71,[157]72. In support with these, the expression of genes
in M2 and M3 were activated in patients with obesity (Supplementary
Fig. [158]3g, h). Together, these results indicated that KDDANet can
serve as a useful tool to unveil the molecular basis of KDDA.
KDDANet provided novel molecular insights on KDDAs related to cancer
Cancer was a frequently happened complex genetic disease caused by DNA
abnormalities^[159]73. For this reason, substantial genetic, genomic
and pharmacogenomics efforts, including TCGA, CCLE, and
GDSC1000^[160]7–[161]9, have been undertaken to improve existing
therapies or to guide early-phase clinical trials of compounds under
development. With these efforts, an increasing amount of available
high-throughput datasets at both levels of genomic data and
pharmacogenomics data were produced at recent years. In addition,
COSMIC, the world’s largest and most comprehensive resource for
exploring the impact of somatic mutations in human cancer, collected a
catalog of genes with mutations that were causally implicated in cancer
([162]https://cancer.sanger.ac.uk/cosmic). With these datasets, we
observed that the genes of KDDANet resulting subnetworks mediating the
associations between drugs and cancer were significantly enriched in
COSMIC Cancer Gene Census, and these genes harbored more oncogenic
alterations in tumor samples than randomly selected genes
(Supplementary Note [163]6, Supplementary Fig. [164]4a, b). Moreover,
we found that oncogenic alterations of genes in KDDANet resulting
subnetworks mediating the associations between drugs and cancer were
more correlated with the responses of cancer cell lines under
anti-cancer drug treatment than randomly selected genes (see
Supplementary Note [165]5, Supplementary Fig. [166]4c–f). We provided
two detailed examples to describe the potential values of KDDANet in
revealing novel genes mediating the associations between drugs and
cancer.
Sotalol (DB00489) was normally used to treat life-threatening
ventricular arrhytmias. It has been reported that sotalol was
associated with decreased prostate cancer (176807) risk^[167]74. For
sotalol-prostate cancer association, KDDANet predicted a subnetwork
consisting of 31 genes and 28 links (Fig. [168]4a). By applying MCL
with default parameters, this subnetwork was further decomposed into
three gene modules, M1, M2, and M3. All three known target genes of
sotalol were included in this subnetwork. Meanwhile, this subnetwork
also captured 12 prostate cancer-related genes with a novel NTKG of
PRKACA. The top ten enriched KEGG terms of this subnetwork were showed
in Fig. [169]4b. Among these, PI3K-Akt signaling pathway, MAPK
signaling pathway and FoxO signaling pathway were related to cancer
formation and development. The relationships between EGFR tyrosine
kinase inhibitor resistance, Relaxin signaling pathway, AGE-RAGE
signaling pathway, and prostate cancer have been widely investigated
and reported in the previous works^[170]75–[171]77. In addition,
Autophagy and Focal adhesion were two widely observed processes in
cancer^[172]78,[173]79. As expected, the enriched KEGG signaling
pathways of M1 were closely related to tumorigenesis (Supplementary
Fig. [174]4g). The expression of M1 genes were activated in tumor
samples, such as an oncogene YWHAE (Supplementary Fig. [175]4h, i).
Interestingly, M1 captured that IGF1R, a gene encoding insulin-like
growth factor receptor, harbored SNVs and CNVs in TCGA prostate cancer
samples, and has been reported to be oncogenic genes of prostate
cancer^[176]80. The genes of M2 were not enrich to any KEGG term and
but have lower expression in TCGA tumor samples (Supplementary Fig.
[177]4h). Surprisingly, SPARC, a reported prostate cancer-related
gene^[178]81 with significantly lower expression in TCGA tumor samples
was captured in this module (Supplementary Fig. [179]4i). Moreover, we
found that both SPARC and COL1A1 harbor SNVs and CNVs in TCGA prostate
cancer samples and their expressions were obviously correlated to the
survival of patients (Fig. [180]4c). The M3 genes mainly participated
in ErbB signaling pathways, a biological process involved in prostate
cancer progression^[181]82. The expression of M3 genes were not
obviously changed in TCGA prostate cancer samples (Supplementary Fig.
[182]4h). However, we found that GRB2 was overexpressed in TCGA
prostate cancer samples and obviously correlated with the survival of
patients at the later stage of disease (Supplementary Fig. [183]4i,
[184]c).
Fig. 4. KDDANet provided novel molecular insights on KDDAs related to cancer.
[185]Fig. 4
[186]Open in a new tab
a KDDANet resulting subnetwork mediating sotalol (DB00489)-prostate
cancer (176807) association. b Top ten enriched KEGG terms of
subnetwork genes mediating sotalol-prostate cancer association. c Gene
expression-based survival analysis for CASP2, SPARC, and GRB2 in
patients with prostate cancer, obtained by GEPIA online analysis
([187]http://gepia.cancer-pku.cn/about.html). d KDDANet gene subnetwork
mediating nebularine (DB04440)-lung cancer (211980) association. e Top
ten enriched KEGG terms of subnetwork genes mediating nebularine-lung
cancer association. f Gene expression-based survival analysis for BYSL
and BOP1 in patients with TCGA LUAD lung cancer obtained by GEPIA
online analysis ([188]http://gepia.cancer-pku.cn/about.html) and their
expression levels in lung cancer tumor samples and adjacent normal
tissue samples, ***p value < 0.001, calculated by Wilcox signed rank
test. In the subnetworks, the size of a gene node was proportional to
its network degree; the thickness of a network edge was proportional to
its flow amount.
Another example was the association between nebularine (DB04440)-lung
cancer (211980) that was inferred by ADA targeted by
nebularine^[189]83. For this association, KDDANet predicted 237 genes
connected by 238 links that constituted a subnetwork without apparent
modular structure (Fig. [190]4d). The only known target gene ADA of
nebularine was connected to 119 lung cancer-related genes and 117
predicted novel KDDA genes in this subnetwork. As expected, KEGG
enrichment demonstrated that the genes in this subnetwork were involved
in various cancers (Fig. [191]4e). Interestingly, we found that two
highly connected novel genes BYSL and BOP1 were significantly
overexpressed in TCGA lung cancer tumor samples and their expression
levels were obviously correlated with the survival of lung cancer
patients (Fig. [192]4f). These results indicated that KDDANet not only
captured known genes mediating KDDAs linking drug with cancer, but also
uncovered novel candidates that offered novel biological insights.
KDDANet uncovered the shared genes mediating multiple KDDAs
Comprehensive analysis above fully demonstrated that KDDANet can
uncover true genes mediating individual KDDA. We further asked whether
KDDANet can reveal the shared genes mediating multiple KDDAs. We
answered this from two aspects as follow: (1) Multiple Diseases
associating with One Drug (MDiODr); (2) Multiple Drugs associating with
One Disease (MDrODi). Considering the practical merits for the first
one analysis, we required multiple diseases belongs to the same type of
diseases. To evaluate the capability of KDDANet for revealing the
shared genes mediating multiple KDDAs, we produced meta-subnetworks by
integrating multiple KDDANet resulting subnetworks for 12386 MDiODr
combinations and 773 MDrODi combinations produced in SDrTDi context,
and 12189 MDiODr combinations and 773 MDrODi combinations in SDiTDr
context. The weight of an edge in the meta-subnetwork was defined as
the number of KDDA resulting subnetworks containing this edge divided
by the total number of KDDA resulting subnetworks. The higher weight
value of a link in the meta-subnetwork indicated more conservation and
commonality. Thus, we used the weight value to evaluate the capacity of
KDDANet in unveiling the shared gene interactions mediating multiple
KDDAs. We carried out a permutation test by producing random
meta-subnetworks with the same number of edges for comparing with
random background. As shown in Fig. [193]5a, Supplementary Fig.
[194]5a, b, the weights of KDDANet meta-subnetworks were significantly
higher than random meta-subnetworks across different types of diseases
in SDrTDi context. This indicate that the genes tend to be shared in
KDDANet meta-subnetworks than random one. We also conducted the same
analysis in SDiTDr context and obtained the similar results (Fig.
[195]5b, Supplementary Fig. [196]5c, d). Collectively, these results
indicated that KDDANet can effectively uncover the shared genes
mediating multiple KDDAs.
Fig. 5. KDDANet uncovered the shared genes mediating multiple KDDAs.
[197]Fig. 5
[198]Open in a new tab
a Boxplots demonstrating the distributions of weight values in KDDANet
meta-subnetwork and random meta-subnetwork for MDiODr and MDrODi in
SDrTDi context, ***p value < 0.001, calculated by Wilcox signed rank
test. b Boxplots demonstrating the distributions of weight values of
KDDANet meta-subnetwork and random meta-subnetwork for MDiODr and
MDrODi in SDiTDr context, ***p value < 0.001, calculated by Wilcox
signed rank test. c Shared meta-subnetwork mediating profenamine
(DB00392)-neurological disease associations. d Top ten enriched KEGG
terms of shared meta-subnetwork genes mediating
profenamine-neurological diseases associations. e Shared
meta-subnetwork mediating the associations between GRACILE syndrome
(603358) and 13 drugs. f Top ten enriched KEGG terms of shared
meta-subnetwork genes mediating the associations between GRACILE
syndrome and multiple drugs; In the meta-subnetworks, the size of a
gene node was proportional to its network degree; the thickness of a
network edge was proportional to its weight value.
We presented some examples for describing the capability of KDDANet in
identifying shared genes mediating multiple KDDAs. For MDiODr, we
selected two cases: (1) profenamine (DB00392) and neurological disease
associations and (2) mirtazapine (DB00370) and cancer associations. As
reported in CTD database, profenamine was associated with three
neurological diseases, including Parkinsonian Disorders, Multiple
Sclerosis, and Alzheimer Disease. The shared meta-subnetwork contained
75 edges linking 43 unknown genes with three profenamine’s target genes
and 30 neurological disease-related genes (Fig. [199]5c). The top ten
enriched KEGG terms were demonstrated in Fig. [200]5d. A majority of
enriched KEGG terms, such as Cholinergic synapse and Glutamatergic
synapses, have been reported dysfunction in neurological disorders and
diseases^[201]84,[202]85. Interestingly, we found that GNAI2 and GNA11
were two mostly shared genes linking profenamine with neurological
diseases. These two genes were recently discovered involving in the
pathological pathways of neurological diseases^[203]86,[204]87.
Mirtazapine was associated with nine types of cancers, including
Colorectal Neoplasms, Breast Neoplasms, Neuroblastoma, Glioma, Urinary
Bladder Neoplasms, Stomach Neoplasms, Esophageal Neoplasms, Lung
Neoplasms, and Prostatic Neoplasms. Supplementary Figure [205]5e showed
the shared meta-subnetwork that included 97 edges connecting 21
mirtazapine’s target genes with 33 cancer-related genes and 54 unknown
genes. As expected, KEGG enrichment demonstrated that these genes were
involved in cancer-related signaling pathways and played important
roles in various cancers (Supplementary Fig. [206]5f). Intriguingly,
DRD4 and GRB2 were two mostly shared genes mediating the associations
between mirtazapine and cancers (Supplementary Fig. [207]5e). These two
genes were involved in the oncogenesis of multiple types of
cancers^[208]88,[209]89.
For MDrODi, some interesting cases were also observed. For example,
GRACILE syndrome, a metabolic disease, was associated with 13 drugs as
recorded in CTD database. A shared meta-subnetwork including 66 genes
and 61 edges was obtained for this disease (Fig. [210]5e). This
meta-subnetwork contained 14 drug target genes and 13 GRACILE
syndrome-related genes. The mostly shared gene was ATP5B, and the
mostly significant enriched KEGG term was Oxidative phosphorylation
(Fig. [211]5f). This was expected as GRACILE syndrome was a fatal
inherited disorder caused by a mutation in an oxidative
phosphorylation-related gene, BCS1L^[212]90. It was also not surprising
that the neurological diseases related genes were also enriched in this
meta-subnetwork as patients with GRACILE syndrome had severe
neurological problems^[213]90. Another example was the Keratoconus
(148300), an ophthamological disease, which was associated with three
different drugs, including acetaminophen, valproic acid and
theophylline. Keratoconus and these three drugs shared a
meta-subnetwork consisting of 54 genes and 51 edges (Supplementary Fig.
[214]5g). This meta-subnetwork included 7 Keratoconus-related genes, 11
drug target genes, and 36 unknown genes. It was expected that HDAC2 had
high weights with its partner genes in this meta-subnetwork as it was
involved in notch signaling pathway which is downregulated in
keratoconus^[215]91. Consistent with the associations between collagen
genes and keratoconus^[216]92, a novel collagens coding gene, COL1A1,
was captured in this meta-subnetwork as a highly shared gene. The
enriched KEGG terms of this meta-subnetwork included Hippo signaling
pathway (Supplementary Fig. [217]5h) which has been reported involving
in keratoconus corneas^[218]93. Collectively, these results indicated
that KDDANet can discover the shared genes mediating multiple KDDAs.
The highly shared unknown genes can serve as potential candidate
targets for drug repurposing.
Discussion
To facilitate drug repurposing, various computational tools have been
developed to uncover novel drug-disease associations^[219]94. However,
the potential genes mediating KDDA have been still not fully explored.
Unveiling the hidden genes (missed from experiments) mediating KDDA
become a great challenge for guiding novel target discovery and drug
repurposing. In this work, we developed a novel computational tool,
KDDANet, which integrated minimum cost network flow optimization,
depth-first searching and graph clustering algorithm to reveal hidden
genes and modules mediating KDDA. KDDANet allowed for a global and
systematic exploration of the hidden genes mediating KDDA. We applied
KDDANet to unravel the subnetworks of genes mediating 53124 KDDAs. The
comprehensive and system-level evaluations fully demonstrated the
effective prediction capability and general applicability of KDDANet.
Case studies on both AD and obesity showed that the subnetworks of
genes identified by KDDANet were reliable and useful. Further validated
by integrating analysis of genomic, transcriptomic, pharmacogenomic and
survival data on primary tumors and cancer cell lines highlighted that
KDDANet captured novel candidates from interactome mediating the
associations between drugs and different types of cancer. Based on
these, we concluded that the inferred subnetworks mediating KDDA can
serve as genome-wide molecular landscapes for guiding drug repurposing
and disease treatment. Insights learned from our predictions would also
enable to help researchers to design repurposing drugs to reverse
disease phenotypes via targeting key genes in the subnetwork mediating
KDDA. An important capability of KDDANet method was that it can reveal
the shared genes mediating multiple KDDAs. This provided more valuable
guides for drug repositioning and disease treatment since the shared
genes mediating multiple KDDAs were closely linked to the molecular
basis of drug repurposing. We constructed a user-friendly online web
tool ([220]http://www.kddanet.cn), which allowed users to explore the
subnetwork of genes mediating individual KDDA and the meta-subnetworks
mediating multiple KDDAs (See Supplementary Note [221]7 as well as the
Help section of our website for detailed description of the utility of
an online web version of KDDANet). In summary, we presented a novel
computational tool KDDANet and an online web source to decode the
hidden genes mediating KDDA that had broad utility and application
value in biomedical studies.
The hidden genes mediating KDDA predicted by KDDANet highlighted the
power of integrative approaches to illuminate underexplored molecular
processes mediating KDDA. In the future, the application value of our
KDDANet tool in drug repurposing can be further improved from three
aspects as follow: Firstly, the gene interactome used in KDDANet did
not contain enough interactions between genome elements. In further
studies, we would integrate other non-coding genome elements,
especially long non-coding RNA and microRNA for constructing a
comprehensive interactome. Secondly, integrating the biological
networks from other omics layers, such as epigenomics, might have also
further enhance the accuracy of KDDANet in discovering subnetworks and
key genes mediating KDDA, and help us better to understand KDDA at
multi-omics levels. The intrinsically capability of KDDANet to analyze
large-scale heterogeneous interactome data containing tens of thousands
of nodes and edges make it can well be suited to analyzing the
accumulating data from ‘multi-omics’ technologies and biomedical
research. Finally, KDDANet did not carry out predictions for KDDA when
a drug’s target genes were unknown or when a disease has not been
related to any known gene. For this, we planned to calculate the
similarity scores between drugs and the similarity scores between
diseases, and then integrate drugs without any target gene and diseases
without any related gene to our flow network model by similarity
scores.
Methods
Datasets
Five different types of gene networks, including HumanNet,
HINT + HI2012, iRefIndex, MultiNet, and STRINGv10, were used in our
current study^[222]47,[223]95, in which the nodes were represented by
gene ID and connected by bidirectional edges (Supplementary Data
[224]1). Drugs and their target genes were obtained from DrugBank 5.0.9
database ([225]http://www.drugbank.ca/). In this study, we selected
4861 drugs with at least one known target that was contained in the
gene networks for further analysis. In total, 2196 KDTGs included in
the gene networks were connected to these drugs by 12014 interactions
(Supplementary Data [226]2). KDRGs and classification of diseases were
obtained by manually collecting Human Disease Network (HDN) from the
previous study of HDN^[227]33 and DisGeNET v5.0 database^[228]96. We
focused on 1441 diseases with at least one related gene which was
included in the gene networks for our study. In total, 16,712
associations link these diseases to 1521 genes which existed in the
gene networks (Supplementary Data [229]3). The KDDAs were extracted
from CTD^[230]12 (Supplementary Data [231]4). In this study, 53124
KDDAs were analyzed in which the drug had at least one target gene and
the disease had at least one related gene contained in HumanNet. For
simplicity and consistency, we converted different types of drugs,
diseases, and gene nomenclatures to DrugBank drug ID, OMIM disease ID,
and NCBI Entrez gene ID for subsequent modeling and analysis.
The primary RNA-seq datasets of AD patients, obesity patients, and
normal individuals were downloaded from NCBI GEO Datasets under the
accession number of [232]GSE53697, [233]GSE81965, and [234]GSE63887.
After the SRA files were gathered, the archives were extracted and
saved in FASTQ format using the SRA Toolkit. RNA-seq reads were trimmed
using Trimmomatic software
([235]http://www.usadellab.org/cms/?page=trimmomatic) with the
following parameters “ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3
TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36” (Version 0.36), and were
further quality-filtered using FASTX Toolkit’s fastq_quality_trimmer
command ([236]http://hannonlab.cshl.edu/fastx_toolkit/) (Version
0.0.13) with the minimum quality score 20 and minimum percent of 80%
bases that had a quality score larger than this cutoff value. The
high-quality reads were mapped to the hg38 genome by HISAT2, a fast and
sensitive spliced alignment program for mapping RNA-seq reads, with
-dta paramenter ([237]http://daehwankimlab.github.io/hisat2/). PCR
duplicate reads were removed using Picard tools
([238]http://broadinstitute.github.io/picard/) and only uniquely mapped
reads were kept for further analysis. The expression levels of genes
were calculated by StringTie
([239]http://ccb.jhu.edu/software/stringtie/) (Version v1.3.4d) with -e
-B -G parameters using Release 29 (GRCh38.p12) gene annotations
downloaded from GENCODE data portal
([240]https://www.gencodegenes.org/). To obtain comparable expression
abundance estimation for each gene, reads mapped to hg38 were counted
as Fragments Per Kilobase Of Exon Per Million Fragments Mapped based on
their genome locations. Differential expression analysis of genes was
performed by DESeq2 using the reads count matrix produced from a python
script “prepDE.py” provided in StringTie website
([241]http://ccb.jhu.edu/software/stringtie/).
TCGA cancer genomics datasets were directly downloaded via the UCSC
Xena project data portal ([242]https://xenabrowser.net/datapages/). As
the DNA methylation datasets were quantified as beta value in the DNA
probe level, we mapped Illumina Human Methylation 450 probe ID to gene
name using HumanMethylation450 annotation file. If a gene was mapped by
multiple probes, we considered the averaged signals of these probes as
the methylation level of this gene. We used Wilcox signed rank test for
differential analysis of gene expression and DNA methylation of KDDANet
resulting subnetwork genes mediating the associations between drug and
cancer. For this analysis, we only used the tumor samples which have
adjacent normal tissue samples as control. GDSC1000 cancer cell line
pharmacogenomic datasets were downloaded from GDSC website
([243]https://www.cancerrxgene.org/gdsc1000/GDSC1000_WebResources/Home.
html). Since this website provided only the CpG island methylation
data, we downloaded the beta value matrix of probe-level DNA
methylation from NCBI GEO Dataset under accession number of
[244]GSE68379 and then converted it to gene-level beta value matrix
using the same method as TCGA DNA methylation data. CCLE
pharmacogenomic datasets were directly downloaded from Broad Institute
data portal ([245]https://portals.broadinstitute.org/ccle/data).
Construction of a unified flow network model
For each query drug and all its related diseases, the unified flow
network model in SDrTDi context was built by integrating the query drug
and all its related diseases into the gene network based on known
drug-target relationships and gene-disease associations. As shown in
Fig. [246]1, for the given query drug, we used it as source node (S)
and integrated it into gene network by introducing the directed edges
from it point to its target genes. For each its related disease, we
mapped the disease to gene network by introducing the directed edges
from its related genes point to it. We incorporated a sink node T and
introduced a directed edge pointing from each disease to it. With these
definitions, a unified flow network model was constructed as a complex
heterogeneous graph G = (V, E), where V was the set of vertices and E
was the set of edges. This graph included two types of edges
(bidirectional and directed) and three types of nodes (drugs, genes,
and diseases). Each edge was assigned with a weight and a capacity.
Flow goes from a source node to a sink node through the graph edges.
The assigning scheme of weight and capacity was illustrated as follow:
Weight and capacity assigning scheme for network edges
Edges between gene nodes
Edges between gene nodes were weighted (W[ij]) to reflect the
probability that two genes g[i] and g[j] were functionally linked in
the biological processes. The weight value between g[i] and g[j] was
derived from a Bayesian statistics approach by integrating diverse
functional genomics datasets^[247]47. Briefly, each dataset was
benchmarked for its real capability of reconstructing known cellular
pathways by measuring the likelihood that pairs of genes (linkages)
were functionally connected conditioned on the experimental evidence,
calculated as a log likelihood score LLS^[248]97:
[MATH: LLS=lnF(L<
mo>∣E)/~F(L∣E)F
(L)/~F(L
mrow>) :MATH]
1
Where F(L|E) and ~F(L|E) represented the observed numbers of linkages L
appear in the given experiment E between functionally annotated human
genes interacting within the same pathway and between different
pathways, respectively, F(L) and ~F(L) denoted the total observed
numbers of linkages between all annotated human genes interacting
within the same pathway and between different pathways, respectively.
The weight value (W) of a given linkage between two genes was produced
by combining LLS score of each dataset using the formula:
[MATH: W=LLSbest+∑i=1NLLSiD⋅i<
/mrow>,forallL≥T :MATH]
2
Where LLS[best] represented the highest LLS score for a linkage between
two genes, D determined decay rate of the LLS score for additional
evidence, and i was the order index of LLS scores for a given linkage
between two genes, ranking starting from the second maximum LLS score
with descending order of magnitude for all N remaining LLS scores. T
represented a minimum threshold of LLS score to be considered. The
values of D and T were empirically optimized to maximize overall
performance on known GO annotations measured by AUPRC^[249]98.
Edges between drug and drug’s target gene nodes
Edges between each drug and the drug’s target gene nodes were weighted
(w[Si]) to reflect the normalized reliability of the interaction
between drug and target protein based on experimental and computational
evidence. The weighting scheme was based on the predicted score of drug
and target protein interaction^[250]25. The weight value (W[Si]) was
calculated as:
[MATH:
WSi=Pi∑j∈TPj :MATH]
3
Where T denoted the set of each drug’s targets, [Pi] denotes the
predicted score between drug S and target i, P[j] denoted the predicted
score between drug S and target j.
Edges between disease-related gene and disease nodes
Edges between disease-related gene and disease nodes were weighted
(w[jd]) to reflect the normalized reliability of the linkage between
disease and gene based on experimental and computational evidence. The
weighting scheme was based on the predicted score of gene-disease
association derived from MAXIF algorithm^[251]99. We calculated the
weight value (W[jd]) as follow:
[MATH:
Wjd=Fi∑j∈DFj :MATH]
Where D denoted the set of genes linked to disease d, F[i] denoted the
predicted score between gene i and disease d, P[j] denoted the
predicted score between gene j and disease d.
Edges between disease and sink nodes
For each edge linking each disease d to sink node T, we assigned it a
same weight value W[dT] = 1/N, where N denoted the number of diseases
linking to sink node.
We further defined for each edge in this heterogeneous network a
capacity value that limited the flow quantity. For each edge connecting
the query drug S to its target gene i, we assigned it a capacity C[Si]
equal to W[Si]. For each edge linking the disease-related gene j to
disease d, we assigned it a capacity C[jd] equal to W[jd]. For each
edge linking the disease d to sink node T, we assigned it a capacity
C[dT] equal to W[dT.] For other edges, we assigned them a capacity
C[ij] = 1.
Minimum cost flow optimization algorithm
With the purpose of identifying hidden genes mediating KDDA, we search
for an possible solution that would (1) capture the subset of the query
drug’s target genes which closely modulate all its related diseases by
the disease-related genes without restrict to the prior KDDA genes, (2)
determine hidden genes that were likely to be part of cellular pathways
connecting the query drug’s target genes to all its related diseases
but escaped detection by experiments, (3) give high priority to genes
that lie on paths with highest probability connecting the query drug to
all its related diseases without making constraints on the network
structure. The rationality for proposing this solution including two
aspects: (1) this needed less computational time than that finding the
highest-probability subnetwork connects a query drug to each its
related diseases at a time; (2) this can effectively find the shared
genes mediating multiple KDDAs. Similar to how a fluid flows through
the path of least resistance, we formulated this goal as a minimum cost
flow optimization problem^[252]44,[253]100,[254]101. Cost was defined
as the negative log of the probability of an edge. Thus, minimizing the
cost given preference to the highest-probability paths. Given the
unified flow network, this problem can be expressed as a linear
programming formula that minimizes the total cost of the flow network
while diffusing the most flow from query drug node to hypothetical sink
node. Let W[ij], F[ij], and C[ij] referred to the weight, flow, and
capacity from node i to node j, respectively. The linear programming
formula can be written as follow.
[MATH: Minimize∑
i∈V,j∈V<
/mrow>−logWij<
/mrow>*Fij−γ*∑FSv
:MATH]
5
[MATH: Subjectto∑j<
/mi>∈VFij−∑j
∈VF
ji∀i∈V−S,T
:MATH]
6
[MATH:
FSv−∑i
∈DF
iT=0 :MATH]
7
[MATH: 0≤Fij≤Cij :MATH]
8
Where V denoted a set of nodes included in the flow network; S denoted
the query drug and v denoted its target gene. The parameter gamma (
[MATH: γ :MATH]
) controlled the size and the quality of the optimized subnetwork. The
first component of this formula,
[MATH:
∑i∈V
,j∈V−logWij<
/mrow>*Fij :MATH]
, ensured minimizing the network cost that given priority to obtain
highest-probability gene subnetwork, at the same time, the second
component,
[MATH:
−(γ*∑FSv) :MATH]
, indicated maximizing the total flow across entire network. This
optimization problem can be efficiently solved using primal simplex
method provided in Mixed Integer Linear Programming solver
([255]http://lpsolve.sourceforge.net/). The solution
[MATH:
argminfij<
/mi>>0∑<
/mrow>i∈V,j∈V−logwij<
/mrow>*fij−γ*∑fSv
:MATH]
obtained the highest-probability subnetwork mediating the associations
between query drug and all its related diseases.
Depth-first searching and Markov clustering (MCL)
Once the highest-probability subnetwork mediating the query drug
(disease) with all its related diseases (drugs) was obtained, we
implemented depth-first searching^[256]102 on this subgraph to find the
subnetwork made up of all paths linking the query drug (disease) to
each it’s related disease (drug). All genes in the solution were ranked
by the amount of flow they carry. The more flow that passes through a
protein, the more important it was in mediating KDDA. After obtaining
the subnetwork mediating individual KDDA, MCL
([257]https://micans.org/mcl/) was employed to further discover gene
modules mediating KDDA by using the flow quantities through edges of
subnetwork as weight values.
Application KDDANet to SDiTDr context
To apply KDDANet in SDiTDr context, the query disease, and a set of all
its related drugs were mapped into gene network by disease-related
genes and drug target genes. For constructing flow network model, the
weights and capacities of network edges can be assigned using the
similar method as described in SDrTDi context by substituting query
drug with query disease and using query disease as source node (S),
substituting query drug-related diseases with query disease-related
drugs, substituting query drug’s target genes with query
disease-related genes, substituting disease-related genes with drug’s
target genes, and incorporating a sink node T and introducing a
directed edge pointing from each drug to it. Implementing minimum cost
flow optimization, depth-first searching, and MCL on the unified flow
network was same as SDrTDi context.
Performance evaluation
As the true genes mediating KDDA are poorly understood, there was no
perfect way to assess the prediction results. The predictive
performance of KDDANet was evaluated as follow: (1) Based on the
hypothesis that the larger functional similarity between a gene and
known KDDA genes, the higher probability this gene was positive one
mediating KDDA, we compiled a standard set of positive and negative
KDDA genes for each KDDA resulting subnetwork to unbiasedly evaluate
the performance of KDDANet using the following strategy:
(1) For each gene in the KDDA resulting subnetwork, we first calculated
the mean functional similarity scores of it with KDTGs and KDRGs by our
previous published method using gene ontology (GO), KEGG pathways, and
InterPro annotation as functional terms^[258]103.
(2) We performed a permutation test by randomly producing KDTGs and
KDRGs 1000 times to compute the empirical significance level of
functional similarity. We selected the genes having significant
functional similarities with both KDTGs and KDRGs from KDDA resulting
subnetwork as positive KDDA genes using the criterion that the
similarity score was larger than 95th percentiles of the simulated
background distributions. The other genes were considered as negative
KDDA genes.
Based on this gold standard, ROC and PR curves were produced and the
areas of under curve of ROC and PR (AUROC and AUPRC) were calculated
for gene list ranked by flow amount of each KDDANet resulting
subnetwork. For comparison with PPA, SNPLS, comCHIPER, and DGPsubNet,
we used the default parameters as mentioned in their published papers.
We considered that all genes contained in a co-module mediating KDDA of
each drug and disease pair in the co-module. The genes were ranked by
probability score mediating a KDDA. For PPA and SNPLS, the probability
score of a gene g mediating the association between drug i and a type
of cancer j was defined as averaged prediction score across all cell
lines of this type of cancer^[259]40,[260]41. For comCHIPER, the
probability score of a gene g mediating the association between drug i
and disease j was defined as the sum of the products of posterior
indicator probabilities across all co-modules^[261]42. For DGPsubNet,
we defined the probability score of a gene g mediating the association
between drug i and disease j as the calculated z score z[ij]^[262]43.
The areas of under curve of ROC and PR (AUROC and AUPR) were calculated
using R package of PRROC. KDDANet resulting subnetwork visualization
was carried out using Cytoscape software ([263]https://cytoscape.org/).
KEGG pathway enrichment analysis was performed by clusterProfiler R
package^[264]104. The visualization of results was carried out in R
software. Case studies were conducted in SDrTDi context unless
specifically indicated.
Reporting summary
Further information on research design is available in the [265]Nature
Research Reporting Summary linked to this article.
Supplementary information
[266]Supplementary Information^ (2.5MB, pdf)
[267]Reporting Summary^ (70.3KB, pdf)
[268]Supplementary Data 1^ (17.7MB, xls)
[269]Supplementary Data 2^ (674.5KB, xls)
[270]Supplementary Data 3^ (780.5KB, xls)
[271]Supplementary Data 4^ (1.3MB, xls)
Acknowledgements