Abstract
Background
Identifying novel therapeutic targets is crucial for the successful
development of drugs. However, the cost to experimentally identify
therapeutic targets is huge and only approximately 400 genes are
targets for FDA-approved drugs. As a result, it is essential to
develop powerful computational tools that can identify potential novel
therapeutic targets. Fortunately, the human protein-protein interaction
network (PIN) could be a useful resource to achieve this objective.
Methods
In this study, we developed a deep learning-based computational
framework that extracts low-dimensional representations of
high-dimensional PIN data. Our computational framework uses latent
features and state-of-the-art machine learning techniques to infer
potential drug target genes.
Results
We applied our computational framework to prioritize novel putative
target genes for Alzheimer’s disease and successfully identified key
genes that may serve as novel therapeutic targets (e.g., DLG4, EGFR,
RAC1, SYK, PTK2B, SOCS1). Furthermore, based on these putative targets,
we could infer repositionable candidate-compounds for the disease
(e.g., tamoxifen, bosutinib, and dasatinib).
Conclusions
Our deep learning-based computational framework could be a powerful
tool to efficiently prioritize new therapeutic targets and enhance the
drug repositioning strategy.
Supplementary Information
The online version contains supplementary material available at
(10.1186/s13195-021-00826-3).
Keywords: Network embedding, Deep learning, Machine learning, Systems
biology, Drug discovery, Protein interaction network
Background
Biomedical research, especially for the field of drug discovery, is
currently experiencing a global paradigm shift with artificial
intelligence (AI) technologies and their application to “Big Data” in
the biomedical domain [[43]1–[44]3]. The complex, non-linear,
multi-dimensional nature of big data is accompanied by unique
challenges and opportunities when employed for processing and analysis
to derive actionable insights. In particular, existing statistical
techniques, such as principal component analysis (PCA), are
insufficient for capturing the complex interaction patterns that are
hidden in multiple dimensions across the data spectrum [[45]4]. Thus, a
key challenge for future drug discovery research is the development of
powerful AI-based computational tools that can capture multiple
dimensions of biomedical data and extract “value” in the form of
actionable insights (e.g., insights for selecting and prioritizing
candidate targets and repositionable drugs for those targets) from
large volumes of data.
“Big Data” in the biomedical domain are generally associated with high
dimensionality. Their dimensionality should be reduced to avoid
undesired properties of high-dimensional space, such as the curse of
dimensionality [[46]5]. Dimensionality reduction techniques facilitate
classification, data visualization, and high-dimensional data
compression [[47]6]. However, classical dimensional reduction
techniques (e.g., PCA) are generally linear techniques and thus
insufficient to handle non-linear data [[48]4, [49]6].
With the recent advancement in AI technologies, several dimensionality
reduction techniques have become available for non-linear complex data
[[50]4, [51]6, [52]7]. Among the dimensionality reduction techniques,
the multi-layer neural network-based technique, “deep autoencoder,”
could serve as the most powerful technique for reducing the
dimensionality of non-linear data [[53]4, [54]6]. Deep autoencoders are
composed of multilayer “encoder” and “decoder” networks. The multilayer
“encoder” component transforms high-dimensional data into a
low-dimensional representation, while the multilayer “decoder”
component recovers the original high-dimensional data from the
low-dimensional representation. Weights associated with the links that
connect the layers are optimized by minimizing the discrepancy between
the input and output of the network (i.e., in an ideal condition, the
values of the nodes in the input layer are the same as those in the
output layer). After the optimization steps, the middle hidden layer
yields a low-dimensional representation that preserves as much of the
information in the original data as possible [[55]6]. The values of
nodes in this middle hidden layer can serve as useful features for
classification, regression, and data visualization of high-dimensional
data.
In drug discovery research, identifying novel drug-targets is critical
for the successful development of a therapeutic drug [[56]8–[57]10].
However, the cost of experimentally identifying drug targets is huge
and only approximately 400 genes are used as targets of FDA-approved
drugs [[58]11]. Thus, it is essential to develop a powerful
computational framework that can identify potentially novel
drug-targets.
Drug repositioning is another promising approach for boosting new drug
development. The advantage of drug repositioning is its established
safety (i.e., toxicology studies have already been carried out with a
target drug). Therefore, the development of computational methods to
predict repositionable candidates could be a promising strategy to
reduce the cost and time for drug development.
Different drug repositioning methods have been proposed in prior
studies. These methods can be classified into two major categories:
activity-based drug repositioning and in silico drug repositioning.
Several drugs for non-cancerous diseases have been repurposed as cancer
therapeutics using the former approach [[59]12], and in recent years,
the latter approach has become successful because of advances in
protein-protein interaction databases, protein structural databases,
and in silico network analysis technology. Applications of network
theory to drug repositioning have also been reported. By verifying the
similarity between CDK2 inhibitors and topoisomerase inhibitors, Iorio
et al. [[60]13] reported that Fasudil (a Rho-kinase inhibitor) might be
applicable to several neurodegenerative disorders. Further, Cheng et
al. [[61]14] applied an inference method based on three similarities
(drug-based, target-based, and network-based similarities) to predict
the interactions between drugs and targets and confirmed that five old
drugs could be repositioned.
PIN data could be a useful resource for computational investigations of
potential novel drug-targets; that is because proteins derive their
functions together with their interacting partners and a network of
protein interaction captures downstream relationships between targets
and proteins [[62]8–[63]10, [64]15]. With the recent advancement in
network science, various network metrics are presently available and
have been used to investigate the structure of molecular interaction
networks and their relationship with drug-target genes [[65]8–[66]10,
[67]15, [68]16]. For example, “degree,” which is the number of links to
a protein, is a representative network metric for investigating
molecular interaction networks (e.g., almost all FDA-approved
drug-targets are middle- or low-degree proteins, whereas almost no
therapeutic targets exist among high-degree proteins [[69]10]). This
finding indicates that the key features for identifying potential drug
target genes could be embedded in the complex architecture of the PIN
[[70]10].
Genome-wide PIN data are typical non-linear high-dimensional big-data
in the biomedical domain that are composed of thousands of proteins as
well as more than ten-thousand interactions among them [[71]8, [72]9].
Mathematically, a PIN is represented as an adjacency matrix [[73]17].
The rows and columns of the adjacency matrix for a PIN are labeled by
proteins, and each element of the matrix is a binary value (i.e., the
element in position (i,j) is 1 if protein i interacts with protein j
and 0 otherwise). In the adjacency matrix, each row represents the
interaction pattern of one protein and may be a useful feature for
predicting potential drug target proteins.
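The adjacency-matrix representation described above can be sketched as follows. This is a minimal illustration, not the code used in this study; the function name `build_adjacency_matrix` and the four-gene toy network are hypothetical, and the edges are treated as directed (from a protein to its downstream partner).

```python
def build_adjacency_matrix(genes, interactions):
    """Return a binary adjacency matrix (list of rows) for a directed network.

    Element (i, j) is 1 if gene i has a link to gene j, and 0 otherwise.
    """
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    matrix = [[0] * n for _ in range(n)]
    for source, target in interactions:
        matrix[index[source]][index[target]] = 1
    return matrix


# Toy network with hypothetical edges, for illustration only.
genes = ["A", "B", "C", "D"]
interactions = [("A", "B"), ("A", "C"), ("C", "D")]
adjacency = build_adjacency_matrix(genes, interactions)

# Row i is the interaction pattern (feature vector) of gene i.
feature_vector_for_A = adjacency[genes.index("A")]
```

For a genome-wide PIN, each such row has as many entries as there are proteins in the network, and most of the entries are zero.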
Recently, researchers have developed “network embedding” methods that
apply dimensional reduction techniques to extract low-dimensional
representations of a large network from the high-dimensional adjacency
matrix of the network [[74]17, [75]18]. For example, several
researchers have used singular value decomposition and non-negative
matrix factorization methods to map high-dimensional adjacency matrices
of large-scale networks onto low-dimensional representations [[76]19,
[77]20]. However, the feature vector for a protein is high dimensional
(e.g., several thousand dimensions) and sparse; this is because a
protein interaction network is composed of thousands of proteins and
the vast majority of proteins in the PIN have few interactions
[[78]17].
To address this issue, several researchers have employed network
embedding methods based on deep learning techniques [[79]21, [80]22].
Deep autoencoder-based network embedding methods would be especially
useful for transforming non-linear large-scale networks into
low-dimensional representations. Wang et al. applied a deep
autoencoder-based network embedding method to large-scale social
networks (e.g., arxiv-GrQc, blogcatalog, Flickr, and Youtube) and
successfully mapped these networks onto low-dimensional representations
[[81]21].
Herein, to infer potentially novel target genes, we proposed a
computational framework based on a representative network embedding
method that employs a deep autoencoder to map a genome-wide protein
interaction network onto low-dimensional representations. The framework
builds a classifier based on state-of-the-art machine learning
techniques to predict potentially novel drug-targets using the
resultant low-dimensional representations. We applied the framework to
predict potentially novel drug targets for Alzheimer’s disease. Based
on the list of predicted candidate novel drug targets, we further
inferred potential repositionable drug candidates for Alzheimer’s
disease.
Methods
Overview
The first part of the method consisted of preparing the PIN data and
calculating a 100-dimensional vector representation for each gene by
using a deep autoencoder. To examine the performance of the deep
autoencoder, we compared the 100 features with nine known network
metrics. The second part consisted of building a machine learning model
that predicts whether or not a gene is a putative target of a
therapeutic drug for Alzheimer’s disease. In this step, we used Xgboost
to build the model and SMOTE to mitigate the sample imbalance (i.e.,
only a few genes were known therapeutic targets).
PIN data and drug-target information
The PIN data were obtained from a previous study [[82]23]. This network is composed
of 6,338 genes and 34,814 non-redundant interactions among the genes.
We obtained information for drugs and their target genes from the
DrugBank database [[83]24, [84]25]. Thereafter, we investigated the
“description” field for all the drugs in the DrugBank database and
identified 61 therapeutic drugs for Alzheimer’s disease. The 61 targets
for these drugs were regarded as the established drug targets for
Alzheimer’s disease. Among the 61 targets, 31 were mapped onto the PIN.
Feature extraction from PIN using a deep autoencoder
We built a deep autoencoder with a symmetric layer structure composed
of 7 encoder layers and 7 decoder layers (i.e., encoder layers of sizes
6338-3000-1500-500-250-150-100 and symmetric decoder layers of sizes
100-150-250-500-1500-3000-6338). All layers are fully connected. All
layers except the output layer use the rectified linear unit (ReLU)
[[85]26] as the activation function, while the output layer uses the
sigmoid function so that its outputs lie between 0 and 1 and can be
compared with the binary inputs. We optimized the deep autoencoder
network by using the “adam” [[86]27] optimizer with a learning rate of
1.0×10^−6, 10,000 epochs, a batch size of 10, and default values for
the other parameters. In the optimization step, we minimized the binary
cross-entropy loss between the values of nodes in the input layer and
those in the output layer. We used a representative deep learning
platform, “Keras” [[87]28], with the Tensorflow [[88]29] backend to
implement the deep autoencoder. To perform the deep autoencoder-based
dimensionality reduction analysis of the PIN, we used a Tesla K80 GPU
on the Shirokane 5 supercomputer system [[89]30].
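To make the layer structure and the loss concrete, the following framework-free sketch mirrors the encoder sizes into decoder sizes, counts the trainable parameters of the resulting fully connected network, and evaluates the binary cross-entropy between a binary input row and a reconstruction. It is not the Keras implementation used in this study. The helper names are hypothetical, one bias per output node is assumed, and the clipping constant `eps` is an assumption added for numerical safety.

```python
import math

# Encoder layer sizes as stated in the text (input layer first).
ENCODER_SIZES = [6338, 3000, 1500, 500, 250, 150, 100]


def mirror_decoder(encoder_sizes):
    """Decoder layer sizes for a symmetric autoencoder."""
    return list(reversed(encoder_sizes))


def count_parameters(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def binary_cross_entropy(x, x_hat, eps=1e-7):
    """Mean binary cross-entropy between a binary row x and its reconstruction."""
    total = 0.0
    for target, pred in zip(x, x_hat):
        pred = min(max(pred, eps), 1.0 - eps)  # avoid log(0)
        total -= target * math.log(pred) + (1 - target) * math.log(1.0 - pred)
    return total / len(x)


decoder_sizes = mirror_decoder(ENCODER_SIZES)
n_params = count_parameters(ENCODER_SIZES) + count_parameters(decoder_sizes)

# A reconstruction close to the input gives a lower loss than a poor one.
good = binary_cross_entropy([1, 0, 0, 1], [0.9, 0.1, 0.2, 0.8])
bad = binary_cross_entropy([1, 0, 0, 1], [0.1, 0.9, 0.8, 0.2])
```

Under these assumptions, most of the parameters sit in the first encoder layer and the last decoder layer, which connect the 6338-dimensional input and output to the 3000-node layers.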
Statistical and topological analysis of the PIN
To determine the statistical topological features in the PIN for each
gene, we calculated the following representative network metrics:
indegree, outdegree, betweenness, closeness, PageRank [[90]31], cluster
coefficient [[91]32], nearest neighbor degree (NND) [[92]33], bow-tie
structures [[93]34], and indispensable nodes [[94]35, [95]36].
Indegree: Indegree for a given node represents the number of links to
the node from other nodes (i.e., upstream neighbors of the node).
Outdegree: Outdegree represents the number of links from the given node
to other nodes (i.e., downstream neighbors of the node).
Betweenness: Betweenness for a given node i is the number of shortest
paths between two nodes that pass through node i.
Closeness: The value of closeness for a given node i is the mean length
of the shortest paths between node i and all other nodes in the
network.
PageRank [[96]31]: PageRank for a given node is a metric used to
roughly estimate the importance of the node in the network. The
PageRank score is calculated using the algorithm proposed by Google
[[97]37]. A given node has a higher PageRank if the nodes with a higher
rank have links to the node.
Cluster coefficient [[98]32]: Cluster coefficient of a node i (C_i) is
calculated by using the following equation:
$$C_i = \frac{2 e_i}{k_i (k_i - 1)},$$
where k_i is the degree of node i and e_i is the number of links
connecting the neighbors of node i to one another.
Nearest neighbor degree (NND) [[99]33]: The value of NND for a given
node i is the average degree among nearest neighbor nodes of node i.
Bow-tie structure [[100]34]: Biological networks often possess bow-tie
structures that are composed of three components (i.e., input, core,
and output layers). Yang et al. proposed a bow-tie decomposition method
to classify nodes into three classes: the input layer, the core layer,
and the output layer [[101]34]. In the decomposition analysis, a
strongly connected component composed of the largest number of nodes is
defined as the nodes in the core layer. Nodes in the input layers can
reach the core layer; however, those in the core layer cannot reach the
input layer. Further, the nodes in the core layer can reach the nodes
in the output layers but those in the output layer cannot reach the
core layer. Herein, one-hot vector encoding was employed to represent
the analysis results from bow-tie decomposition. For example, for a
node assigned to the core layer, the value of the “core layer” of the
node is equal to 1 while the value of the “input layer” and the “output
layer” is equal to 0.
Indispensable nodes [[102]35, [103]36]: Liu et al. developed a
controllability analysis method to identify the minimum number of
driver nodes (ND) that must be controlled to modulate the dynamics of
the entire network [[104]36] (i.e., they used the Hopcroft–Karp
“maximum matching” algorithm [[105]38] to identify the minimum set of
driver nodes [[106]36]). Indispensable nodes that are potential key
player nodes and are sensitive to structural changes in a network are
obtained from controllability analysis (i.e., removal of an
indispensable node increases the ND in the network [[107]35]).
Vinayagam et al. reported that indispensable proteins in the human PIN
tend to be targets of mutations associated with human diseases and
human viruses [[108]35]. One-hot vector encoding was also used to
represent the analysis results of indispensable nodes. For example, for
an indispensable node, the value of the binary variable of the node is
equal to 1 while that for a non-indispensable node is equal to 0.
For network analysis, we employed the igraph R package [[109]39].
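Several of the metrics defined above can be illustrated with a short, self-contained sketch. It is not the igraph code used in this study. The function names and the toy edges are hypothetical. For the cluster coefficient and NND, the network is treated as undirected, which is an assumption, and a coefficient of 0 is returned for nodes with fewer than two neighbors, which is a convention chosen here for simplicity.

```python
def degrees(nodes, edges):
    """Indegree and outdegree of every node in a directed edge list."""
    indegree = {n: 0 for n in nodes}
    outdegree = {n: 0 for n in nodes}
    for source, target in edges:
        outdegree[source] += 1
        indegree[target] += 1
    return indegree, outdegree


def undirected_neighbors(nodes, edges):
    """Neighbor sets after ignoring edge direction and self-loops."""
    neighbors = {n: set() for n in nodes}
    for source, target in edges:
        if source != target:
            neighbors[source].add(target)
            neighbors[target].add(source)
    return neighbors


def cluster_coefficient(neighbors, node):
    """C_i = 2 e_i / (k_i (k_i - 1)); 0.0 when k_i < 2 (convention used here)."""
    nbrs = sorted(neighbors[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    e = sum(1 for i, u in enumerate(nbrs) for v in nbrs[i + 1:] if v in neighbors[u])
    return 2.0 * e / (k * (k - 1))


def nearest_neighbor_degree(neighbors, node):
    """Average degree of the neighbors of a node (0.0 for isolated nodes)."""
    nbrs = neighbors[node]
    if not nbrs:
        return 0.0
    return sum(len(neighbors[v]) for v in nbrs) / len(nbrs)


# Toy directed network with hypothetical edges.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
indegree, outdegree = degrees(nodes, edges)
neighbors = undirected_neighbors(nodes, edges)
c_of_C = cluster_coefficient(neighbors, "C")
nnd_of_A = nearest_neighbor_degree(neighbors, "A")
```

In the toy network, node C has three neighbors (A, B, and D), of which only A and B are linked to each other, so e_C = 1 and k_C = 3.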
Oversampling by the SMOTE algorithm
To prepare a class-balanced dataset for building the binary classifier
for drug target prediction, we used a state-of-the-art sampling method,
SMOTE [[110]40]. The SMOTE algorithm synthetically creates more cases
in the minority class. Specifically, the algorithm selects the k
nearest neighbours of a case in the minority class and randomly selects
a point along the line that connects the case to one of these
neighbours. The selected point is used as an additional case in the
minority class. We used the Python module imbalanced-learn [[111]41] to
perform oversampling based on the SMOTE algorithm, with k=2.
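The interpolation step of SMOTE can be sketched in a few lines. This is a simplified illustration and not the imbalanced-learn implementation used in this study. The function name `smote`, the use of Euclidean distance, the random seed, and the toy points are assumptions made for the example.

```python
import random


def smote(minority, n_synthetic, k=2, seed=0):
    """Create synthetic minority cases by interpolating toward near neighbours.

    minority: list of feature vectors (lists of floats) in the minority class.
    """
    if len(minority) < 2:
        raise ValueError("at least two minority cases are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base case (squared Euclidean).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Random point on the segment between the base case and the neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class with hypothetical two-dimensional features.
minority_cases = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
new_cases = smote(minority_cases, n_synthetic=6, k=2)
```

Because every synthetic case lies on a segment between two existing minority cases, the new cases stay inside the region already occupied by the minority class.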
Binary classifier model based on Xgboost
To build a binary classifier for drug target prediction, we used
Xgboost, which is an efficient implementation of the gradient
tree boosting algorithm [[112]42]. The algorithm generates a large
number of weak learners and builds a strong learner that exists as an
ensemble of the weak learners. In the boosting step, the algorithm
continues to update the weak learners by correcting the errors made by
previous learners. Thereafter, the algorithm aggregates the predictions
from the weak learners to make the final prediction by minimizing the
loss with a gradient descent algorithm.
To build the Xgboost algorithm-based binary classifiers, we used the
XGBClassifier and scikit-learn [[113]43] Python modules. The
XGBClassifier has several parameters. Briefly, we employed the
following values for each parameter (please see the manual for the
XGBClassifier module [[114]44] for details): learning_rate = (0.01,
0.1, 0.5), max_depth = (1, 2, 3, 5, 10), n_estimators = (100), gamma =
(0, 0.3), booster = (“gblinear”), objective = (“binary:logistic”),
reg_lambda = (0, 0.1, 1.0), and reg_alpha = (0, 0.1, 1). For the other
parameters, we used the default values. To evaluate the binary classifier
models and optimize the parameters of the models, we performed 5-fold
cross validation.
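The size of the parameter search described above can be made explicit by enumerating the grid. The sketch below only expands the parameter values listed in the text and does not train any model. The helper `expand_grid` is hypothetical, and the assumption that every combination is evaluated in each of the 5 folds is ours.

```python
from itertools import product

# Parameter values as listed in the text.
param_grid = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [1, 2, 3, 5, 10],
    "n_estimators": [100],
    "gamma": [0, 0.3],
    "booster": ["gblinear"],
    "objective": ["binary:logistic"],
    "reg_lambda": [0, 0.1, 1.0],
    "reg_alpha": [0, 0.1, 1],
}


def expand_grid(grid):
    """List every parameter combination in the grid as a dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


combinations = expand_grid(param_grid)

# With 5-fold cross validation, each combination is fitted five times per dataset.
n_fits_per_dataset = len(combinations) * 5
```

Each dictionary in `combinations` could be passed as keyword arguments to a classifier constructor; that step is omitted here to keep the sketch free of third-party dependencies.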
Pathway enrichment analysis
To identify the pathways that are significantly associated with the
putative targets inferred by our computational framework, we used the
WebGestalt web tool [[115]45]. WebGestalt uses over-representation
analysis (ORA) to statistically evaluate overlaps between the gene set
of interest and a pathway [[116]46]. In the analysis, the number of
overlapped genes between the gene set of interest and a pathway is
first counted. Thereafter, a hyper-geometric test is used to determine
whether the pathway is over- or under-represented in the gene set of
interest (for each pathway, the p value and FDR are calculated based on
the overlap). Based on the ORA, we examined the pathways in Reactome,
KEGG, and GO biological processes. The pathways with an FDR<0.05 were
regarded as significant pathways associated with the gene set of
interest.
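The over-representation test can be written out for a single pathway as follows. This is a sketch of the standard hypergeometric upper-tail calculation and not WebGestalt's code. The function names are hypothetical, and the use of the Benjamini-Hochberg procedure for the FDR is an assumption.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn from n_background
    genes, of which n_pathway belong to the pathway (hypergeometric)."""
    upper = min(n_pathway, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_background, n_selected)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example with hypothetical counts: 10 background genes, 5 in the pathway,
# 5 genes of interest, all 5 of which fall in the pathway.
p = ora_p_value(n_background=10, n_pathway=5, n_selected=5, n_overlap=5)
fdr = benjamini_hochberg([0.01, 0.04, 0.03])
```

A pathway would then be called significant when its adjusted value falls below the chosen cutoff (0.05 in this study).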
Results
Network embedding: deep autoencoder-based dimensional reduction of PIN
We obtained the directed human PIN from [[117]23]; this PIN is composed
of 6338 genes and 34,814 interactions (see the “[118]Methods” section
for details). Thereafter, we generated an adjacency matrix for the human
PIN. Elements in the matrix are represented as a binary value (i.e., 1
or 0 in position (i,j) denotes whether or not protein j is a downstream
interacting partner of protein i). The resultant matrix is composed of
6,338 rows and 6,338 columns. Each row in the matrix represents the
interaction pattern of a gene and is used as the features of that gene.
Because there are 6,338 genes in the PIN, the features for each gene
have 6,338 dimensions (i.e., a gene is characterized by
6,338-dimensional features based on the PIN data).
As shown in Fig. [119]1, to map the high dimensionality of the features
(6338 dimensions) for each gene onto low dimensional features, we built
and used a deep autoencoder. The deep autoencoder is composed of 7
encoder layers (6338-3000-1500-500-250-150-100) and symmetric decoder
layers (100-150-250-500-1500-3000-6338) (see Fig. [120]1). In the deep
autoencoder, layers are fully connected and weights of links connecting
layers are optimized by minimizing binary cross-entropy loss between
values of nodes in the input layer and those in the output layer (for
details, see the “[121]Methods” section). Following optimization, for
each gene, we used the optimized deep autoencoder to map the
high-dimensional original features (6,338 dimensions) onto
low-dimensional features (100 dimensions) through the middle layer
(the layer with 100 nodes) in the network. Accordingly, the resultant
features for each gene are 100-dimensional.
Fig. 1.
Computational analysis pipeline for drug target prioritization. (Step
1) Our computational framework employed genome-wide PINs and
information of drug targets obtained from public domain databases.
(Step 2) The framework is based on a deep autoencoder to extract
low-dimensional latent features from high-dimensional PIN. (Step 3) By
using features from step 2 and a target gene list for a specific
disease, we generated 100 datasets to train the 100 classifier models.
By using the 100 datasets and state-of-the-art machine learning
techniques (SMOTE and Xgboost), we built 100 classifier models to infer
potential drug targets. (Step 4) We applied the classifier models to
all unknown drug-target genes in the PIN to prioritize potential drug
target genes
The low-dimensional latent space contains enough information to
represent the original high-dimensional human PIN. However, it is
unclear whether the low-dimensional features in the latent space can
explain the topological and statistical properties captured by
representative network metrics. To examine this issue, we calculated
nine representative network metrics for each gene in the PIN (i.e.,
indegree, outdegree, betweenness, closeness, PageRank, cluster
coefficient, nearest neighbor degree (NND), bow-tie structure, and node
indispensability; see the “[124]Methods” section for details) and
compared the metrics with the 100-dimensional features for the gene
from the network embedding analysis (see Fig. [125]2, and Supplementary
Figure 1 for the original data). As shown in the figure, most of the
100 features were correlated with the representative network metrics.
Interestingly, several features (e.g., dimensions 58, 86, 88, and 89)
did not correlate with any of the nine representative network metrics
(shown with a gray background). These findings indicate that the
low-dimensional features from the network embedding analysis capture
not only the topological and statistical properties described by the
network metrics but also information that cannot be obtained from
analysis using representative network metrics.
Fig. 2.
Relationship between features in low-dimensional latent space by deep
autoencoder and representative network metrics in the PIN. The X-axis
is the latent space dimension and the Y-axis is Spearman’s correlation
coefficient between a given low-dimensional feature and a given network
metric (see Supplementary Figure 1 for the original data). The gray
background dimensions (58, 86, 88, and 89) indicate almost no
correlation to the representative network metrics. Several dimensions
without the box (e.g., dimension 6 and 7) are n.a. because the encoded
numerical values for all genes are zero
Machine learning-based drug target prediction using the features extracted from the PIN
In this study, we treated drug-target prediction as a binary
classification problem. To construct a binary classifier for
drug-target prediction, we generated a training dataset using the
low-dimensional features extracted from PIN and public domain
drug-target information. From the public domain drug-target database,
we obtained known drug-target genes for Alzheimer’s disease. Among the
known targets, we could map 31 onto PIN. These 31 genes were further
regarded as positive cases and the negative cases were selected from
the remaining 6,307 genes. We randomly selected 500 negative cases
(genes) from the 6307 genes 100 times to build 100 datasets composed of
500 negative and 31 positive cases (genes). In the 100 datasets, each
gene had 100 dimensional features that were obtained from deep
autoencoder. Further, we employed the 100 datasets to build 100 binary
classifier models to predict novel candidate targets for Alzheimer’s
disease.
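The repeated random sampling of negative cases can be sketched as below. The gene identifiers are placeholders and the seed is arbitrary; only the counts (31 positive cases, 6,307 remaining genes, 500 negative cases per draw, and 100 draws) are taken from the text. The function name `build_datasets` is hypothetical.

```python
import random


def build_datasets(positives, candidates, n_datasets=100, n_negatives=500, seed=0):
    """Build labelled datasets that share the positives but differ in negatives.

    Each dataset is a list of (gene, label) pairs with label 1 for known
    targets and label 0 for randomly drawn genes treated as negative cases.
    """
    rng = random.Random(seed)
    datasets = []
    for _ in range(n_datasets):
        negatives = rng.sample(candidates, n_negatives)  # without replacement
        labelled = [(g, 1) for g in positives] + [(g, 0) for g in negatives]
        datasets.append(labelled)
    return datasets


# Placeholder identifiers standing in for the real gene symbols.
positive_genes = [f"POS{i}" for i in range(31)]
remaining_genes = [f"GENE{i}" for i in range(6307)]
datasets = build_datasets(positive_genes, remaining_genes)
```

Drawing a different set of 500 negative cases for each dataset means that each of the 100 classifiers sees a different subset of the unlabelled genes, so that averaging their outputs reduces the dependence on any single draw.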
The 100 datasets are class-imbalanced (i.e., each contains 31 positive
and 500 negative cases), and classification using class-imbalanced data
is biased toward the majority class. This problem can be mitigated by
oversampling, which is often used to produce class-balanced training
datasets from class-imbalanced data. To generate class-balanced
training datasets for the binary classifiers, we used a
state-of-the-art sampling method, SMOTE (Synthetic Minority
Oversampling TEchnique) [[128]40], which synthetically creates new
cases in the minority class (in this study, the “positive” class) (see
the “[129]Methods” section for details).
By using the class-balanced training datasets from SMOTE, we trained
binary classifiers for drug target prediction. The binary classifier
models are based on the Xgboost algorithm, which is an efficient
implementation of the gradient boosting algorithm [[130]42]. The
trained binary classifier models calculate two class probabilities for
each gene based on its 100-dimensional features (i.e., the probability
of “positive” and that of “negative”). Accordingly, a gene with a
higher class probability of “positive” is more likely to be a member of
the “positive” class.
To optimize the binary classifiers based on Xgboost for drug target
prediction, we performed a grid search with 5-fold cross validation.
Notably, to avoid data leakage, we split the data for cross validation
before applying SMOTE-based oversampling to generate the class-balanced
training datasets. To evaluate the predictive performance of each
parameter combination, we calculated the area under the receiver
operating characteristic curve (AUC ROC). The mean value of AUC ROC for
the 100 binary classifiers with the optimal parameters was 0.661. This
result indicates that the 100 binary classifiers tend to assign a
higher class probability of “positive” to known drug-target genes of
Alzheimer’s disease than to other genes. Therefore, unknown drug-target
genes with a high probability of “positive” could serve as novel
drug-targets for Alzheimer’s disease.
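The AUC ROC used as the evaluation measure can be computed directly from its probabilistic definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below is a plain implementation of that definition, not the scikit-learn routine; the function name `roc_auc` and the toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive cases, 0 for negative cases.
    scores: predicted probability of the positive class.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy held-out fold with hypothetical labels and scores.
auc_perfect = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
auc_partial = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

Under this definition, a value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a mean of 0.661 means that a known target is scored above a randomly chosen negative case in roughly two out of three pairs.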
Further, to infer the putative therapeutic targets for Alzheimer’s
disease, we used the mean value of the class probability of “positive”
from the 100 binary classifiers to prioritize the 6,307 genes (see
Table [131]1 and Supplementary Table 1 for details); i.e., the unknown
targets with a higher mean class probability of “positive” (e.g., DLG4
in Table [132]1 and Supplementary Table 1) are more likely to be
potential novel drug targets. A total of 187 unknown
drug-target genes had a mean value greater than 0.75 for a class
probability of “positive” (see Supplementary Table 1). These 187 genes
were thus regarded as putative novel target genes for Alzheimer’s
disease.
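The prioritization step amounts to averaging, for each gene, the “positive” class probabilities returned by the individual classifiers and then applying the cutoff. A minimal sketch follows; the function name `prioritize`, the gene labels, and the probability values are hypothetical, and only three classifiers are shown instead of 100.

```python
def prioritize(probabilities_by_gene, threshold=0.75):
    """Rank genes by mean positive-class probability and apply a cutoff.

    probabilities_by_gene: dict mapping a gene to the list of positive-class
    probabilities assigned by the individual classifiers.
    """
    mean_probability = {
        gene: sum(values) / len(values)
        for gene, values in probabilities_by_gene.items()
    }
    ranked = sorted(mean_probability.items(), key=lambda item: item[1], reverse=True)
    putative = [gene for gene, mean in ranked if mean > threshold]
    return ranked, putative


# Hypothetical outputs of three classifiers for three genes.
toy_probabilities = {
    "GENE_X": [0.99, 0.97, 0.98],
    "GENE_Y": [0.80, 0.60, 0.70],
    "GENE_Z": [0.90, 0.85, 0.80],
}
ranked, putative_targets = prioritize(toy_probabilities)
```

Here GENE_Y has a mean of 0.70 and is therefore excluded by the 0.75 cutoff even though one classifier scored it at 0.80.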
Table 1.
Top 20 genes with the highest mean probability value for the “positive
(drug target)” class
Gene Mean probability
DLG4 0.99859
PLCG1 0.99775
EGFR 0.99758
SYK 0.99752
PTK2B 0.99617
RAC1 0.99585
CAV1 0.99579
DLG1 0.99512
PIK3R1 0.99500
PRKCA 0.99292
KIT 0.99224
JAK1 0.99154
PTPN6 0.98968
CRKL 0.98918
SHC1 0.98840
NCK1 0.98760
ZAP70 0.98750
PTPN11 0.98630
DLG3 0.98551
PTK2 0.98537
DLG2 0.98471
IL2RB 0.98328
JAK2 0.98299
GRB2 0.98278
Pathway enrichment analysis of putative target genes
To deduce the potential target pathways for Alzheimer’s disease, we
determined the significant pathways that are associated with the 187
putative targets inferred using our computational framework (see
Figs. [134]3, [135]4, and [136]5). The 187 putative targets were
significantly associated with the pathways that control Alzheimer’s
disease mechanisms (e.g., cytokine-related signaling pathways and EGF
receptor signaling pathway), especially those associated with
inflammatory mechanisms and the immune system. The innate immune system
is a key component of Alzheimer’s disease pathology [[137]47]. In fact,
continuous amyloid-β formation and deposition chronically activate the
immune system, causing disruption of the microglial clearance systems
[[138]47]. Accordingly, the progression of Alzheimer’s disease could be
suppressed by modulating these pathways, especially the immune system
and inflammation-related pathways, by targeting these putative target
genes.
Fig. 3.
Pathway enrichment analysis using GO biological database for the 187
putative targets from our computational pipeline for Alzheimer’s
disease. The names of the pathways are shown on the vertical axis, and
the bars on the horizontal axis represent the − log10(p value) of the
corresponding pathway. Dashed lines in orange, magenta, and red
indicate p value <0.05, 0.01, and 0.001, respectively
Fig. 4.
Pathway enrichment analysis using the KEGG database for 187 putative
targets. The legend for this figure is the same as that for Fig. [143]3
Fig. 5.
Pathway enrichment analysis using the Reactome pathway for 187 putative
targets. The legend for this figure is the same as that for Fig. [146]3
Inference of repositionable drug candidates
Networks connecting drugs, targets, and diseases could serve as useful
resources for investigating novel indications for FDA-approved drugs,
i.e., if target gene P is a putative target for disease A and is a
known target gene of drug R for disease B, disease A may be a novel
target disease for drug R (see Fig. [147]6). Thus, to infer the
putative repositionable drugs and their potential target disease, we
further examined the list of 187 predicted putative target genes (genes
with a class probability of target class >0.75 in Supplementary Table
1) from our computational framework and drug-target information across
different diseases. If at least one target of a known drug was included
among the 187 putative targets, the drug was regarded as a potential
repositionable drug. As shown in Supplementary Table 2, we inferred 244
candidate repositionable drugs for Alzheimer’s disease. For each
candidate repositionable drug, we calculated the number of overlapping
genes between the known targets of the drug and the 187 putative
targets. Thereafter, we ranked the candidate repositionable drugs based
on the number of overlapping genes. Among the predicted repositionable
drug candidates, the top-ranked candidates may be effective for the
target disease. Table [148]2 lists the 20 highest ranked candidate
compounds.
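The overlap-based ranking of candidate drugs can be sketched as a set intersection followed by a sort. The function name `rank_repositionable_drugs`, the drug names, and the target lists below are hypothetical placeholders and are not taken from DrugBank; the alphabetical tie-break is also an assumption.

```python
def rank_repositionable_drugs(drug_targets, putative_targets):
    """Rank drugs by the number of known targets among the putative targets.

    drug_targets: dict mapping a drug name to its list of known target genes.
    Drugs without any overlapping target are not reported.
    """
    putative = set(putative_targets)
    hits = []
    for drug, targets in drug_targets.items():
        overlap = sorted(set(targets) & putative)
        if overlap:
            hits.append((drug, overlap))
    # Most overlaps first; ties are broken alphabetically by drug name.
    hits.sort(key=lambda item: (-len(item[1]), item[0]))
    return hits


# Hypothetical drug-target associations and putative targets.
toy_drug_targets = {
    "drug_1": ["G1", "G2", "G3"],
    "drug_2": ["G2", "G9"],
    "drug_3": ["G7", "G8"],
}
toy_putative = ["G1", "G2", "G3", "G4"]
candidates = rank_repositionable_drugs(toy_drug_targets, toy_putative)
```

In this toy example, drug_3 is not reported because none of its known targets appears among the putative targets.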
Fig. 6.
A method to infer potential repositionable drugs based on the putative
targets derived from our computational pipeline. Step 1: We obtained
the drug-target-disease network from the DrugBank database. Step 2: We
mapped the associations between the putative target genes and their
target diseases to infer the potential repositionable drugs for a given
disease
Table 2.
Top 20 candidate repositioning drugs for Alzheimer’s disease
Drug | Overlaps between known targets and predicted targets | No. of overlaps
Regorafenib | RET; FLT1; KDR; KIT; PDGFRA; PDGFRB; FGFR1; TEK; NTRK1; EPHA2; ABL1 | 11
Tamoxifen | ESR1; ESR2; PRKCA; PRKCB; PRKCD; PRKCE; PRKCG; PRKCQ; PRKCZ; ESRRG | 10
Ponatinib | ABL1; KIT; RET; TEK; FGFR1; LCK; SRC; LYN; KDR; PDGFRA | 10
Dasatinib | ABL1; SRC; FYN; LCK; KIT; PDGFRB; EPHA2; BTK; FGR; LYN | 10
Imatinib | PDGFRB; ABL1; KIT; RET; NTRK1; CSF1R; PDGFRA | 7
Brigatinib | EGFR; ABL1; IGF1R; INSR; MET; ERBB2 | 6
Sorafenib | PDGFRB; KIT; KDR; FGFR1; RET; FLT1 | 6
Sunitinib | PDGFRB; FLT1; KDR; KIT; CSF1R; PDGFRA | 6
Nintedanib | FLT1; KDR; FGFR1; LCK; LYN; SRC | 6
Pazopanib | FLT1; KDR; PDGFRA; PDGFRB; KIT | 5
Midostaurin | PRKCA; KDR; KIT; PDGFRA; PDGFRB | 5
Resveratrol | ITGA5; ITGB3; SNCA; ESR1; AKT1 | 5
Diethylstilbestrol | ESR1; ESRRG; ESR2; ESRRA | 4
Tofacitinib | TYK2; JAK2; JAK1; JAK3 | 4
Lenvatinib | FLT1; KDR; FGFR1; KIT | 4
Foreskin fibroblast (neonatal) | FLT1; CSF2RA; PDGFRB; TGFB1 | 4
Baricitinib | JAK1; JAK2; PTK2B; JAK3 | 4
Foreskin keratinocyte (neonatal) | EGFR; CSF2RA; PDGFRA; TGFB1 | 4
Bosutinib | ABL1; LYN; SRC | 3
Estradiol valerate | ESR1; ESR2; ESRRG | 3
Discussion
Putative targets from our computational framework
Among the 187 putative targets from our analysis (see Supplementary
Table 1), we investigated the top ranked genes and found that several
of these genes play an important role in the mechanism of Alzheimer’s
disease.
For example, the first-ranked putative target, DLG4, encodes PSD95, a
key protein for synaptic plasticity that is downregulated in aged
patients as well as in patients with Alzheimer’s disease. Recently,
Bustos et al. demonstrated that epigenetic editing of DLG4/PSD95
improves cognition in a mouse model of Alzheimer’s disease [48]. Thus,
epigenetic editing of DLG4 may serve as a novel therapy for rescuing
the cognitive impairment induced by Alzheimer’s disease.
EGFR is the third-ranked putative target and is frequently upregulated
in certain cancers. By employing an amyloid-β-expressing fruit fly
model, Wang et al. demonstrated that the upregulation of EGFR causes
memory impairment [49]. Furthermore, they administered several EGFR
inhibitors (e.g., erlotinib and gefitinib) to transgenic fly and mouse
models of Alzheimer’s disease and found that the inhibitors prevented
memory loss in both animal models. Based on these findings, they
suggested that EGFR may be a therapeutic target for the treatment of
amyloid-β-induced memory impairment.
RAC1, the sixth-ranked putative target, is a small signaling GTPase
that controls different cellular processes, including cell growth,
cellular plasticity, and inflammatory responses. Inhibition of RAC1
downregulates amyloid precursor protein (APP) and amyloid-β through
regulation of the APP gene in hippocampal primary neurons [50]. RAC1
inhibitors can prevent cell death caused by amyloid-β42 in primary
neurons of the hippocampus and those of the entorhinal cortex [51].
Furthermore, based on an analysis of the protein-domain interaction
network and experiments using Drosophila genetic models, Kikuchi et al.
demonstrated that RAC1 is a hub gene in the network and thus causes
age-related alterations in behavior and neuronal degeneration [52]. The
RAC1 gene could therefore be a potential therapeutic target for
preventing amyloid-β-induced neuronal cell death in Alzheimer’s
disease.
Spleen tyrosine kinase (SYK), the fourth-ranked putative target, could
modulate the accumulation of amyloid-β and the hyperphosphorylation of
Tau protein, both of which are associated with Alzheimer’s disease
[53]. Nilvadipine, an antagonist of the L-type calcium channel (LCC),
inhibits the accumulation of amyloid-β; however, this effect is not
caused by LCC inhibition but by other mechanisms. Paris et al.
demonstrated that the downregulation of SYK exerts an effect similar to
that of an enantiomer of nilvadipine ((-)-nilvadipine) on the clearance
of amyloid-β and the reduction of Tau hyperphosphorylation [53].
Schweig et al. demonstrated that in mice overexpressing amyloid-β, SYK
activation occurred in the microglia. Further, neurite degeneration was
found to increase because of the association between amyloid-β plaques
and aging [54]. These researchers also demonstrated that in mice
overexpressing Tau, SYK was activated in the microglia while misfolded
and hyperphosphorylated Tau accumulated in the hippocampus and cortex.
Schweig et al. further demonstrated that SYK inhibition induces Tau
reduction in an autophagic manner [55]. Moreover, they demonstrated
that SYK acts as an upstream target in the mTOR pathway and that its
inhibition induces Tau degradation by decreasing the activation of the
mTOR pathway.
The fifth-ranked putative target, PTK2B, is a key gene in the mediation
of synaptic dysfunction induced by amyloid-β in Alzheimer’s disease
[56]. Salazar et al. demonstrated that in a transgenic mouse model of
Alzheimer’s disease, PTK2B deletion improves deficits in memory and
learning functions as well as synaptic loss [56].
Although SOCS1 is the 78th-ranked putative target, it modulates
cytokine responses by suppressing JAK/STAT signaling to control
inflammation in the central nervous system (CNS) [57]. Thus, SOCS1 may
be a key therapeutic modulator in Alzheimer’s disease.
Genome-wide association studies (GWAS) and sequencing technologies have
identified over 20 genes that modify Alzheimer’s disease risk. We
obtained 29 genes listed in [58] and compared them with our 187 genes.
PTK2B and INPP5D were shared between the two gene sets. As mentioned
above, PTK2B is the fifth-ranked candidate gene, whereas INPP5D was the
68th-ranked putative target among our 187 genes. INPP5D (Inositol
Polyphosphate-5-Phosphatase D) is selectively expressed in brain
microglia and is likely a crucial player in Alzheimer’s disease
pathophysiology. Tsai et al. reported that INPP5D expression was
upregulated in late-onset Alzheimer’s disease and positively correlated
with amyloid plaque density [59].
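The comparison between a reference gene set and the ranked list of
putative targets amounts to a set intersection with a rank lookup. The
following minimal Python sketch illustrates the idea; the function name
and the placeholder gene identifiers are hypothetical and do not
reproduce the actual gene lists used in this study.

```python
from typing import Iterable, List, Tuple


def overlap_with_ranks(ranked_genes: List[str],
                       reference_genes: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (gene, rank) pairs for reference genes found in the ranked
    list, ordered by rank (rank 1 = highest-scoring putative target)."""
    rank_of = {gene: i for i, gene in enumerate(ranked_genes, start=1)}
    hits = [(gene, rank_of[gene])
            for gene in set(reference_genes) if gene in rank_of]
    return sorted(hits, key=lambda pair: pair[1])


# Hypothetical toy input: a ranked list of four putative targets and a
# reference set of three risk genes.
ranked = ["G_a", "G_b", "G_c", "G_d"]
reference = {"G_c", "G_x", "G_a"}

shared = overlap_with_ranks(ranked, reference)
for gene, rank in shared:
    print(gene, rank)
```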
Collectively, these findings indicate that our computational framework
could successfully identify key genes that may be novel target
candidates for Alzheimer’s disease.
Promising repositionable drugs for Alzheimer’s disease
In our computational drug repositioning analysis, our method predicted
that tamoxifen (the second-ranked candidate, see Table 2), an
FDA-approved estrogen receptor modulator for the treatment of patients
with hormone-receptor-positive breast cancer, could be a potential
repositionable drug for Alzheimer’s disease. As described by Wise [60],
estrogen therapy could protect neuronal cells from cell death by
modulating the expression of key genes that inhibit the apoptotic cell
death pathway. Based on a nationwide cohort study in Taiwan, Sun et al.
reported that patients with long-term use of tamoxifen exhibited a
reduced risk of dementia [61].
Our method also predicted that bosutinib (the nineteenth-ranked
candidate), an FDA-approved tyrosine kinase inhibitor (TKI; a Bcr-Abl
kinase inhibitor) for the treatment of Philadelphia chromosome-positive
(Ph+) chronic myelogenous leukemia, may be a repositionable drug for
Alzheimer’s disease (see Table 2). Lonskaya et al. reported that
bosutinib combined with nilotinib systematically modulates the immune
system in the CNS by inhibiting the non-receptor tyrosine kinase, Abl,
to remove amyloid and decrease neuroinflammation [62]. Such findings
indicate that TKIs, especially bosutinib, could be potential
repositionable drugs for the treatment of early-stage Alzheimer’s
disease.
Among the predicted repositionable candidates, 19 are immunosuppressive
agents. These 19 candidates may include promising repositionable drugs
for Alzheimer’s disease because inflammation plays an important role in
the mechanisms of the disease. Among the 19 candidates, dasatinib (the
fourth-ranked compound) may be the most promising. Recently, Zhang et
al. reported that senolytic therapy (a combination of dasatinib and
quercetin) could reduce the production of proinflammatory cytokines and
alleviate cognitive deficits in Alzheimer’s disease mouse models via
the selective removal of senescent oligodendrocyte progenitor cells
[63, 64]. Furthermore, the combined therapy of dasatinib and quercetin
is now registered in a clinical trial (ClinicalTrials.gov Identifier:
NCT04063124).
One limitation of our method is that the process of identifying the
putative target genes depends on the drug target gene database (i.e.,
DrugBank in this research). This means that the known target genes may
be biased because DrugBank contains existing therapeutic drugs as well
as compounds that may have failed clinical trials. However, this
limitation could be overcome by adding new drug-target relationships,
such as those of tau-targeting compounds.
Conclusions
In this study, we developed a deep autoencoder-based computational
framework and applied it to prioritize putative target genes for
Alzheimer’s disease. The method identified key genes (e.g., DLG4, EGFR,
RAC1, SYK, PTK2B, SOCS1) associated with the disease mechanisms.
Furthermore, by using the putative targets, we successfully inferred
promising repositionable candidate-compounds (e.g., tamoxifen,
bosutinib, dasatinib) for Alzheimer’s disease. Our method could be a
powerful tool for inferring potential repositionable drugs, especially
those that could be used to treat Alzheimer’s disease. Notably, our
computational framework can be easily applied to the investigation of
novel potential therapeutic targets and repositioning compounds for any
disease. Accordingly, we anticipate that our method will be used by
large pharmaceutical companies that house large volumes of their own
non-public data.
Supplementary Information
13195_2021_826_MOESM1_ESM.pdf (101.8KB, pdf)
Additional file 1 The original data of Fig. 2. Rows and columns
represent the names of features in the low-dimensional latent space and
names of the network metrics, respectively. The numeric value in a cell
represents Spearman’s correlation coefficient between a given
low-dimensional feature and a given network metric (e.g., the
correlation coefficient between the feature “Dimension 1” and the
network metric “outdegree” is 0.67). Darker red (blue) indicates a
higher (lower) correlation coefficient. Dimensions that are zero for
all genes are denoted as n.a.
13195_2021_826_MOESM2_ESM.xlsx (17.8KB, xlsx)
Additional file 2 A list of potential therapeutic targets for
Alzheimer’s disease.
13195_2021_826_MOESM3_ESM.xlsx (39.6KB, xlsx)
Additional file 3 A list of all candidate repositionable compounds for
Alzheimer’s disease.
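For readers who wish to build a table like the one described for
Additional file 1, Spearman’s correlation between one latent dimension
and one network metric could be computed roughly as follows. This is a
dependency-free sketch under stated assumptions: the function names are
hypothetical, ties receive average ranks, and a dimension that is
constant across all genes (e.g., all zeros) returns None, which
corresponds to the "n.a." entries. It is not the code used to produce
the supplementary file.

```python
from math import sqrt
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Spearman's rho as the Pearson correlation of the rank vectors.
    Returns None when it is undefined (empty or constant input)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n == 0:
        return None
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return None  # e.g., a latent dimension that is zero for all genes
    return cov / sqrt(var_x * var_y)


# Hypothetical toy vectors: one latent dimension and one network metric.
latent_dim = [0.1, 0.4, 0.2, 0.9]
outdegree = [3, 10, 5, 40]
print(spearman(latent_dim, outdegree))
print(spearman([0, 0, 0, 0], outdegree))
```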
Acknowledgements