Abstract
Understanding transcriptional responses to chemical perturbations is
central to drug discovery, but exhaustive experimental screening of
disease-compound combinations is unfeasible. To overcome this
limitation, here we introduce PRnet, a perturbation-conditioned deep
generative model that predicts transcriptional responses to novel
chemical perturbations that have never experimentally perturbed at bulk
and single-cell levels. Evaluations indicate that PRnet outperforms
alternative methods in predicting responses across novel compounds,
pathways, and cell lines. PRnet enables gene-level response
interpretation and in-silico drug screening for diseases based on gene
signatures. PRnet further identifies and experimentally validates novel
compound candidates against small cell lung cancer and colorectal
cancer. Lastly, PRnet generates a large-scale integration atlas of
perturbation profiles, covering 88 cell lines, 52 tissues, and various
compound libraries. PRnet provides a robust and scalable candidate
recommendation workflow and successfully recommends drug candidates for
233 diseases. Overall, PRnet is an effective and valuable tool for
gene-based therapeutics screening.
Subject terms: Virtual screening, Computational models, High-throughput
screening, Virtual drug screening
__________________________________________________________________
Understanding transcriptional responses to chemical perturbations is
crucial for drug discovery. Here, authors present PRnet, a deep
generative model that predicts gene responses to novel chemical
perturbations, enabling in-silico drug screening and the identification
of candidate compounds for various diseases.
Introduction
Transcriptional responses to chemical perturbations reveal fundamental
insights into biological functioning and play an integral role in both
disease understanding and drug discovery. Bulk and single-cell
RNA-sequencing (scRNA-seq) experiments facilitate the high-throughput
screen (HTS) of chemical perturbations at the omics level. Recent HTS
studies^[44]1,[45]2 have experimental profiled thousands of independent
perturbations exposing cells or cell lines to compounds. These
transcriptional responses to chemical perturbations revealed coherent
interpretable gene-level programs representing individual and cellular
processes and quantified them in response to chemical perturbation.
Although encouraging progress has been made, experimentally screening
chemical perturbations remains a time-consuming and expensive process
with a low discovery rate of new therapies^[46]3. It is unfeasible to
conduct an exhaustive exploration of the vast, novel chemical
perturbation space by experimentally screening disease and compound
combinations.
In the past decades, deep learning-based methods have emerged as
important tools for modeling transcriptional responses to
perturbations. Many approaches have been proposed recently for modeling
HTS perturbation responses. CPA^[47]4 utilized an auto-encoder-based
model to map chemical induce transcriptomic effect into a latent space
to reconstruct perturbation response. Biolord^[48]5, scGen^[49]6, and
scVIDR^[50]7 performed counterfactual prediction via deep
encoder-decoder/generator-based generative frameworks, which take a
control cell and unseen labels as input to predict the gene expression
of the unseen cellular states. These recently published cell
perturbation modeling tools precisely simulated chemical perturbations
and predicted chemical perturbations in gene expression of unseen cell
types, but few predict responses to novel chemicals. Following
CPA^[51]4, chemCPA^[52]8 introduced a new encoder-decoder architecture
that incorporated the compounds’ structure to predict the
perturbational effects of unseen drugs. Deep generative frameworks with
encoder-decoder and its variants effectively predicted single-cell gene
expression perturbations. CellOT and CINEMA-OT^[53]9,[54]10 leveraged
optimal transport to match paired unperturbed-perturbed observations.
Optimal-transport-based matched the observations experimentally
perturbed but were incapable of modeling novel perturbations, such as
novel compounds or novel cell types. Linear regression-based
methods^[55]11,[56]12 estimate the impact of perturbations on gene
expression by linearly combining the effects of genetic perturbation.
However, this linear combination approach leads to limitations in
accurately modeling the nonlinear chemical perturbations across diverse
cell types and compound combinations. GEARS and
CellOracle^[57]13,[58]14 leveraged a knowledge graph of gene-gene
relationships to predict genetic perturbation outcomes. However,
graph-based models relied on accurate prior knowledge leading to the
lack of scalability. Given that most diseases are associated with
characteristic gene expression profiles, Connectivity Map (CMap)^[59]15
proposed a concept that connected genes, drugs, and diseases by virtue
of common gene-expression signatures, leading to projects such as
CMap^[60]15, L1000^[61]1, etc. Inspired by this concept, DLEPS^[62]16,
OCTAD^[63]17 and other studies (such as^[64]18–[65]20, etc.) utilized
gene signature matching methods to screen candidates by finding drugs
that reverse the disease signature. DLEPS^[66]16 predicted chemically
induced changes in transcriptional profiles directly from molecular
structure, and OCTAD^[67]17 virtually screened compounds by matching
the cancer-specific expression signature to compound-induced gene
expression profiles. Although DLEPS and OCTAD screened candidates
effectively based on bulk HTSs, they failed to predict
cell-type-specific transcriptional response to novel perturbations and
model cellular heterogeneity, which is highly relevant to treatments.
Consequently, perturbations response models are required to address the
limited exploration power to novel perturbations of existing
experimental and computational methods, which are also needed for
predicting the response to unseen perturbations and discovering
promising therapeutic drug candidates. Deep generative models,
including Generative Adversarial Network (GAN^[68]21), Variational
Auto-Encoder (VAE^[69]22), Denoising Diffusion Probabilistic Model
(Diffusion^[70]23), Normalization Flow (NF^[71]24), Generative
Pre-Trained Transformer (GPT^[72]25), and so on, learn the probability
density of observable samples and generate new samples. Deep generative
models have greatly improved diverse areas (for example, natural
language processing^[73]25,[74]26, computer vision^[75]27,[76]28,
chemicals^[77]29, and so on), suggesting the potential for applications
in drug discovery.
Here we present PRnet, which is a flexible and scalable
perturbation-conditioned deep generative model for predicting
transcriptional responses to novel chemical perturbations that were
never experimentally perturbed at bulk and single-cell levels. PRnet is
a new encoder-decoder architecture based generative model which
comprises three components, including the Perturb-adapter, the
Perturb-encoder, and the Perturb-decoder. PRnet adapts novel compounds
and diseases in various perturbation scenarios by taking compound
structures and unperturbed transcriptional profiles as input to predict
transcriptional responses. The Perturb-adapter uses simplified
molecular-input line-entry system^[78]30 (SMILES) chemical encoding as
input, enabling generalization to unseen compounds without prior
knowledge and annotation. The learnable latent space of PRnet
facilitates gene-level response interpretation and capturing
heterogeneity. PRnet was trained with close to one hundred million bulk
HTS observations perturbed by 175,549 compounds and tens of millions
single cell HTS observations perturbed by 188 compounds. Crucially, the
model operates as a data-driven model, allowing for effective
generalization to novel perturbations. The evaluation indicated that
PRnet outperformed alternative approaches in predicting changes and
expression in transcription response to novel compounds, pathways, and
cell lines in bulk and single-cell HTS data. To further validate the
effectiveness, PRnet has been utilized to identify novel bioactive
compounds against small cell lung cancer (SCLC) and seek novel natural
compounds against colorectal cancer (CRC). Experimental validation
demonstrated the activity of novel candidate compounds against SCLC and
CRC cell lines within the appropriate predicted ranges of
concentration.
This model’s flexibility and scalability make it a valuable tool to
screen candidates for various diseases. We, therefore, leveraged PRnet
to in silico screen various compound libraries and generated a virtual
large integration atlas of perturbation profiles, covering 88 cell
lines and 52 tissues, as well as compound libraries comprising 935
FDA-approved drugs, 4158 active compounds, 30,456 natural compounds,
and 29,670 drug-like compounds. PRnet also provided a robust and
scalable candidate recommendation workflow for diseases according to
the reference changes in gene sets. Given disease-specific or sensitive
compound gene signatures, gene set enrichment analysis (GSEA) is
employed to assess the potential efficacy of compounds against these
diseases. PRnet successfully recommended 577 drug candidate lists from
577 studies for 233 different diseases based on the profiles atlas. In
three cases of metabolic disorders, including non-alcoholic
steatohepatitis (NASH), polycystic ovary syndrome (PCOS) patients, and
inflammatory bowel disease (IBD), drugs recommended by PRnet have been
supported by previous literature with human or animal studies. PRnet
enables effective prediction of transcriptional responses to novel
complex chemical perturbations and screening large-scale compound
libraries for specific diseases, thus being a valuable tool for
gene-based therapeutics screening.
Results
PRnet: perturbation-conditioned deep generative model for transcriptional
response prediction
In this paper, we formulate the transcriptional response prediction as
a distribution generation problem conditioned on perturbations. Cells
inherently recognize and respond to chemical stimuli, with their
responses influenced by both external stimuli and their intracellular
state. In the single cell or bulk HTS, transcriptional responses to
chemical perturbations are affected by multiple conditions, such as
compound structures, compound dosages, and covariates, such as cell
types, and cell lines. Given an n-dimensional unperturbed
transcriptional profile (a single cell or bulk RNA-seq observation) and
a chemical perturbation imposed by a compound with a specific structure
and dosage, PRnet aims to predict the distribution of the perturbed
transcriptional profile. Modeling perturbation patterns enables
capturing the novel chemical perturbation effect on gene-level programs
and quantifying them in various perturbation scenarios (Fig. [79]1a).
Fig. 1. Overview of the methodological approach.
[80]Fig. 1
[81]Open in a new tab
a Problem formulation: Given unperturbed transcriptional profiles
(single cell or bulk) and applied perturbations (structures and the
dosages of the compounds), predict transcriptional responses. Red
arrows indicate changes in transcriptional profiles. b Model
architecture: PRnet is a perturbation-conditioned deep generative model
for transcriptional response prediction with three components,
including the Perturb-adapter, the Perturb-encoder, and the
Perturb-decoder. Crucially, the model operates as a data-driven model,
allowing for effective generalization to novel perturbations. c
Screening candidates with PRnet: The training compound library
comprises a bulk high-throughput screening library (175,549 bioactive
compounds), and a single-cell high-throughput screening library (188
active compounds). The screening library comprises four compound
libraries, including 935 FDA-approved drugs, 4158 active compounds,
30,455 natural compounds, and 29,670 in-house druglike compounds,
respectively. For in-silico screening, PRnet initially predicts the
average transcriptional profile, fold-change in the gene, and gene rank
after perturbing the specific cell line with screening compounds. Given
the gene signature for a particular disease or sensitive drug, the
enrichment scores of screening compounds were computed for compound
ranking. In vitro validation experiments were conducted using MTT
assays on colorectal cancer and small cell lung cancer cell lines to
validate the activity of compound candidates. After in vitro
validation, PRnet is utilized to recommend drug candidates for 233
different diseases. d The large-scale integration atlas of perturbation
profiles, including the L1000 dataset, the Sci-plex3 dataset, the
FDA-approved drugs dataset, the anti-cancer compounds dataset, the
natural compounds dataset, the drug-like compounds dataset, and the
Gtex dataset. Some icons were created in BioRender. Qi, X. (2024)
[82]biorender.com/l45e810.
PRnet (as detailed in “Methods”) is a flexible and scalable
perturbation-conditioned deep generative model designed to predict
transcriptional responses to novel complex chemical perturbations at
bulk and single-cell levels. The design of PRnet comprises three
components, including the Perturb-adapter, the Perturb-encoder, and the
Perturb-decoder (Fig. [83]1b). In preprocessing, each perturbed profile
is assigned an unperturbed transcriptional profile of the same cell
line. Initially, given a chemical perturbation imposed by the structure
of compounds represented by Simplified Molecular Input Line Entry
System^[84]30 (SMILES) strings and the dosages of the corresponding
compounds, PRnet leverages RDKit^[85]31 to capture the functional
topology information of the structures, generate the Functional-Class
Fingerprints (FCFP) of compounds. These FCFPs were scaled by their
dosages and then summed to generate rescaled Functional-Class
Fingerprints (rFCFP) embedding. Without loss of generality, for the
i-th compound perturbation, the Perturb-adapter encodes the fingerprint
rFCFP[i] to an additive latent embedding
[MATH:
zip
:MATH]
which allows generalization to novel compounds and compound
combinations. Then, the Perturb-encoder maps the chemical perturbation
effect on heterogeneous unperturbed states
[MATH:
xiu
:MATH]
into the interpretable latent space
[MATH:
zil
:MATH]
. At last, the Perturb-decoder estimates the distribution of the
transcriptional response
[MATH: N(x
i∣μ
mrow>i,σ<
/mrow>i2) :MATH]
within the context of the chemical perturbation effect on the
unperturbed state
[MATH:
zil
:MATH]
, the applied perturbation
[MATH:
zip
:MATH]
and a noise
[MATH:
zin
:MATH]
. PRnet encodes the chemical effect on the unperturbed state to a
learnable latent space, estimates the distribution, and performs
conditioned sampling to generate a transcriptional response with
biological and chemical contexts. Sampling generates a specific
transcriptional profile
[MATH: x^i :MATH]
that provides gene-level up- and down-regulation information. For bulk
HTS data, the predicted transcriptional responses of 978 landmark genes
[MATH: x^i :MATH]
are transformed into 12,328 genes by linear transformation. For
single-cell HTS data, 5000 highly variable genes (HVGs) of
transcriptional profile are selected. SMILES^[86]30 is widely used for
representing chemical structures due to its simplicity and efficiency
in encoding complex molecules as strings. By taking SMILES of the
compound as the input, the Perturb-adapter has sufficient flexibility
to screen large-scale compound libraries without any prior knowledge.
Driven by data, PRnet automatically identifies the heterogeneity in
latent space corresponding to compound, dosage contexts, and cell-type
specific contexts. This allows the model to directly generalize to
novel perturbation scenarios involving novel compounds, pathways, cell
types, and cell lines that have not been previously perturbed.
With the ability to predict transcriptional responses to novel
perturbations, PRnet enables efficient screening of candidates for
complex diseases (Fig. [87]1c). Inspired by the assumption embodied in
the CMap^[88]15 concept that gene signatures are used as indicators
reflecting the underlying mechanisms of diseases, PRnet predicted new
therapeutic candidates by finding drugs that reverse the disease
signature. There are two steps to applying PRnet to downstream tasks.
In step 1, for in silico screening, PRnet predicts the transcriptional
profile of specific cell lines perturbed by a user-defined compound
library with multiple gradient concentrations. In step 2, we calculate
the average transcriptional profile and fold-change in gene expression
of each compound, and rank genes according to their fold-change values.
Then, given the query gene signature for a particular disease or known
sensitive compounds, gene set enrichment analysis (GSEA^[89]32) was
employed to evaluate compound efficacy with their enrichment scores.
Finally, compounds are ranked based on these enrichment scores.
Large-scale high-throughput screening data are initially fitted to the
model, facilitating its adaptability to diverse compound libraries and
diseases.
PRnet was trained on two compound libraries and screened four compound
libraries (Fig. [90]1c). The training compound libraries contain a bulk
high-throughput screening library consisting of over 883,269
transcriptional profiles of 175,549 bioactive compounds^[91]1, and a
single-cell high-throughput screening library consisting of 290,888
transcriptional profiles of 188 active compounds^[92]2. Being well
trained in HTS observations, PRnet enables in silico high-throughput
screening of novel compound libraries for various cell lines. PRnet has
further been applied to screen active and drug-like compounds for SCLC
and natural compounds for CRC. In vitro validation experiments with MTT
assays confirmed the efficacy of the candidate compounds against SCLC
and CRC cell lines. Lastly, PRnet screened four compound libraries and
generated a large-scale integration atlas of perturbation profiles
(Fig. [93]1d), including (1) 82 cell lines perturbed by 935
FDA-approved drugs, (2) 88 cell lines perturbed by 4158 active
compounds, (3) 14 CRC cell lines perturbed by 30,456 natural compounds,
(4) 6 SCLC cell lines perturbed by 29,670 drug-like compounds and (5)
54 tissues perturbed by 935 FDA-approved drugs. Based on the
large-scale integration atlas of perturbation profiles, PRnet is
capable of a variety of downstream applications. PRnet has been
utilized to recommend drugs for 233 different diseases. PRnet
successfully predicted drug candidates for these diseases and
demonstrates the potential in drug discovery.
PRnet robustly predicted the response of novel perturbation and learned
interpretable latent embeddings
To evaluate the performance of PRnet for predicting responses to unseen
perturbations, all datasets were strictly split by perturbation
attributes (compound_split, cell_line_split, and pathway_split) into 3
subsets: train, validation, and test (Supplementary Fig. [94]1a). The
held-out test sets were used to simulate datasets of novel
perturbations. Three train-test data split strategies were employed to
assess the performance of out-of-distribution perturbation scenarios,
including (1) Random Split: randomly divides compounds and cell lines,
(2) Unseen Compounds: testing compounds not seen perturbed during
training, and (3) Unseen cell lines: testing cell lines not seen
perturbed during training. Five-fold cross-validation was applied in
each split strategy, and the average performance over five folds was
computed as the overall metric for comparison. Two high-throughput
screening data of different resolutions were used to test model
performance, consisting of a bulk HTS dataset (from the L1000
project^[95]1) and a single cell HTS dataset (from the sci-Plex3
assay^[96]2). All models were trained and compared separately on two
HTS datasets.
We employed bulk high-throughput screening data from the L1000
project^[97]1 to fit the model, in which 978 genes (hereinafter called
landmark genes) were selected to represent the diversity of biological
pathways and processes in human cells. We first preprocessed these data
(as detailed in “Methods”) and obtained 836,352 paired bulk RNA-seq
observations (represented by the expression levels of 978 landmark
genes), covering 82 cell lines and their perturbation data perturbated
by 175,549 compounds. To quantitatively evaluate the compound-induced
gene expression changes, we compared the Pearson correlation between
the true and predicted post-perturbation of the average logarithm of
fold-change in gene expression (log(FC)) for the hold-out test set with
alternative approaches. The “Pearson of log(FC) in compounds” metric
evaluated the Pearson correlation between the true and predicted mean
log(FC) perturbed by the same compound in the test set. We demonstrated
the performance of the PRnet on the bulk HTS data in Fig. [98]2a, where
a higher value indicates a better performance. PRnet consistently
demonstrated the best performance across all three split strategies,
particularly well-fitted in unseen compound prediction scenarios with
an average Pearson Correlation (PCC) of 0.8. PRnet significantly
outperformed in predicting unseen cell line log(FC) with an increase in
PCC over 0.3 compared to other approaches. Some hold-out predicted
cases in predicting transcriptional responses of unseen compounds and
cell lines were illustrated in Supplementary Fig. [99]1b–d. In more
challenging scenarios, we evaluated the performances of predicting
cell-line-specific compound-induced changes in genes by the “Pearson of
log(FC) in cov_compounds” metric. The “Pearson of log(FC) in
cov_compounds” metric is the Pearson correlation between the true and
predicted mean log(FC) perturbed by the same compound within the same
cell line in the test set. PRnet achieved the best performance in the
“Pearson of log(FC) in cov_compounds” metric across three scenarios. In
particular, PRnet exhibited more than two times better performance in
unseen cell line predictions compared to other methods and demonstrated
improvements of 0.16 in unseen compound predictions, demonstrating the
generalization of PRnet to novel perturbations (Fig. [100]2b).
Fig. 2. PRnet outperforms alternative approaches in predicting
post-perturbation change in gene expression.
[101]Fig. 2
[102]Open in a new tab
a Performance of out-of-distribution perturbation scenarios in three
train-test data split strategies (1) Random split, (2) Unseen compounds
split, and (3) Unseen cell lines split. The “Pearson of log(FC) in
compounds” metric evaluates the mean log(FC) perturbed by the same
compounds in the test set, which are presented as mean values ± SD.
Error bars indicate the standard error (SD) for each method, and the
points (n = 5) refer to the 5-fold used for validation. b The “Pearson
of log(FC) in cov_compounds” metric evaluated mean the (log(FC)) of
compounds within the same cell line in the test set. Results are
presented as mean values ± SD (n = 5). c T-SNE representations of
latent embeddings learned by PRnet. Post-perturbation transcriptional
profiles are from 82 cell lines from 20 different organs/tissues. d The
“R2 in compounds" metric evaluates the R2 score of the average gene
expression perturbed by the same compounds in the test set. Results are
presented as mean values ± SD (n = 5). e The “R2 in cov_compounds”
metric evaluates the R2 score of the average gene expression perturbed
by the same compounds within the same cell line. Results are presented
as mean values ± SD (n = 5). f T-SNE representation of latent
embeddings from massive single-cell screens of 188 drugs across 3
cancer cell lines. g T-SNE representation of latent embedding of
transcriptional profiles from MCF7 cell line perturbed with AG-14361.
The pseudo-dose trajectory is displayed with a black line. [103]Source
data are provided as a Source Data file.
To better characterize the heterogeneous gene-level change under
certain perturbations, it is desirable to identify the set of cells or
cell lines and isolate the precise variations enriched in the data from
the corresponding cell lines or cells. After training, PRnet learned
interpretable embeddings in the latent space in the context of the base
unperturbed state and the applied perturbation. The low-dimensional
(t-SNE^[104]33) representation (Fig. [105]2c) illustrates the latent
embeddings of post-perturbation transcriptional profiles learned by
PRnet. In the latent space, embeddings from the same cell line tend to
form a cluster together. Each cancer cell lines form specific
gene-level responses for corresponding perturbations. In a way, PRnet
captures strong cell line-specific transcriptional profile variations
under different conditions. Interestingly, we observed that the
embedding learned by PRnet also represents cell line similarity in
response to various perturbations. Figure [106]2c illustrates the t-SNE
representation of the latent embeddings for all cell lines, where cell
lines originating from the same organ exhibit a similarity preference
in the latent space, resulting in close spatial locations, such as cell
lines from the colon, breast, and lung. Supplementary Fig. [107]6a–c
demonstrates in detail the low-dimensional (t-SNE) representation of
the latent embeddings for cell lines from colorectal cancer, breast
cancer, and lung cancer, respectively. To quantitatively evaluate the
similarity of perturbations among cell types, we computed the
normalized cosine similarity among the mean embeddings in the latent
space for all cell lines. The resulting cosine similarity heatmap
(Supplementary Fig. [108]6g) shows that most cell lines from the same
tissue exhibit higher similarity in the latent space compared to those
from different tissues.
Human cancer cell lines have facilitated drug discoveries in cancer
biology, but they are neither clonal nor genetically stable. This
instability can generate variability in drug sensitivity, as shown by
Ben-David et al.^[109]34. The genomic evolution of cancer cell lines
leads to a high degree of variation across cell line strains. To
explore the impact of inter- and intra-heterogeneity of cell lines on
drug responses, we conducted additional experiments focusing on A549
and MCF7 cell strains, as described by Ben-David et al.^[110]34 in
Supplementary Note [111]2. These experiments were designed to explore
inter- and intra-cellular heterogeneity, similar patterns of the same
MOA drug across cell strains, and screening candidate compounds for
heterogeneous cells. These results indicated that drug response was
highly similar to genetic or gene expression which aligned with
findings in the original study. These findings underscore the utility
of our model in capturing both inter- and intra-heterogeneity,
enhancing its application in screening for specific cancer strains.
PRnet adapted to model profiles of HTS at different resolutions. The
sci-Plex assay^[112]2 screened 188 compounds in 3 cancer cell lines at
single-cell resolution, measuring millions of cells. The screened cell
lines A549 (lung adenocarcinoma), K562 (chronic myeloid leukemia), and
MCF7 (breast adenocarcinoma) were treated to each of these 188
compounds at four dosages (10 nM, 100 nM, 1 μM, 10 μM). To
quantitatively evaluate the performance of PRnet in single-cell HTS
data, we followed commonly used metric^[113]4–[114]6 to compare the R^2
score between the true and predicted post-perturbation gene expression
for the hold-out test set with alternative approaches (Fig. [115]2d,
e). The “R^2 in compound” metric evaluated the R^2 score of the
average post-perturbation gene expression of the same compound in the
test set. We also compared the cell-type-specific performance “R^2 in
cov_compound” metric, which evaluated the R^2 score of the average
post-perturbation of the same compound within each specific cell line.
PRnet outperformed alternative models in the “R^2 in the compound” and
“R^2 in cov_compound” metrics in unseen compounds (R^2 in the
compound: 0.969) and unseen pathways scenarios (R^2 in the compound:
0.97) than the other methods. The low-dimensional (t-SNE)
representation illustrates latent embeddings learned by PRnet from
massive single-cell screens of 188 compounds across 3 cancer cell lines
(Fig. [116]2f). The latent embeddings automatically cluster the
concordance of cells to their cell types. The results demonstrated that
PRnet not only captures the heterogeneous responses of large-scale HTS
but also resolves similar responses of homogeneous cells to various
perturbations. The t-SNE embedding (Fig. [117]2g) in the latent space
of MCF7 cells treated with AG-14361 produces a pseudo-dose trajectory,
indicating that AG-14361 induced heterogeneous responses. Several other
pseudo-dose trajectory examples were illustrated in Supplementary
Fig. [118]6d–f.
To validate the clinical relevance of PRnet, we incorporated scRNA-seq
data from a cohort of pediatric acute myeloid leukemia (AML)
patients^[119]35. We trained PRnet to predict transcriptional responses
to chemotherapy treatments and validated its potential for clinical
application. PRnet showed robustness in predicting transcriptional
responses for pediatric AML patients post-chemotherapy, demonstrating
PRnet’s potential to assist in clinical applications. For detailed
experimental results, please see Supplementary Note [120]1 and
Supplementary Fig. [121]8.
PRnet captures gene programs in various perturbation scenarios
We reason that PRnet can be a valuable tool to analyze and capture
functional transcriptional experiments that aim to uncover gene
programs or effects on various conditions by introducing perturbation
to the system. To investigate this in greater detail, we first
collected and tested the hold-out perturbation data of compound
vorinostat. Vorinostat (suberoylanilide hydroxamic acid) is the first
FDA-approved HDAC inhibitor for the treatment of cutaneous
manifestations of cutaneous T-cell lymphoma (CTCL) and is also
currently being studied as monotherapy and in combination therapy for
other types of cancers^[122]36,[123]37. Results demonstrated a
comprehensive ability of PRnet to capture gene-level fold changes in
all of the 71 cell lines (Fig. [124]3a). By comparing the
transcriptional responses of cell lines from different organs, we
observed that PRnet captured cell-type-specific responses. For
instance, cell lines from muscle exhibited relatively weaker responses
and smaller fold changes compared to those from the lung.
Figure [125]3a shows that PRnet correctly captures both the right trend
and the magnitude of perturbation of the top up- and down-regulation
genes across 71 cell lines from 16 tissues/organs. Taking gene FAM57A
and TP53 as examples, PRnet made accurate predictions in cases of both
up- and down-regulation in all cell lines after perturbation. In
addition, PRnet even correctly predicted the fold change values in the
expression of FAM57A across all cell lines. Figure [126]3b shows a
detailed comparison between the predicted and actual distributions of
fold changes in gene expression for three representative cell lines
from different organs (HT29: colorectal adenocarcinoma, A549: lung
adenocarcinoma and MCF7: breast adenocarcinoma,) treated with
vorinostat. It can be observed that PRnet aligns consistently with the
distribution of predicted and true observations and accurately predicts
the up- and down-regulation trends of the top 5 genes with high log(FC)
values. We employed the KEGG pathway Gene Set Enrichment Analysis
(GSEA) on the average predicted post-perturbations gene rank of
Vorinostat across all cell lines. The GSEA results (Fig. [127]3c, d and
Supplementary Fig. [128]3a) reveal that Vorinostat is enriched in
pathways related to fundamental cellular processes related to tumor
suppressor mechanisms. The GSEA results indicated that Vorinostat
suppresses pathways such as Cell cycle, DNA replication, and
Spliceosome, and activates pathways including Autophagy - animal,
Lysosome, Phagosome, and so on, which are all associated with autophagy
and apoptosis in tumor cells^[129]38.
Fig. 3. PRnet captures gene programs in various perturbation scenarios.
[130]Fig. 3
[131]Open in a new tab
a The heatmap illustrates the average log(FC) of the gene expression
profiles predicted by PRnet and the ground truth for 71 cell lines
under the drug Vorinostat. The horizontal axis represents genes, while
the vertical axis represents cell lines, with the color orange
indicating upregulation and the color green indicating downregulation.
b The violin plots illustrate the distribution of log(FC) for the top 5
up-and down-regulated genes in multiple perturbations of cell lines,
including A549, HT29, and MCF7, from three cancer types treated with
the drug Vorinostat. c The GSEA results of average post-perturbation
changes in expression of 71 cell lines perturbed by Vorinostat. GSEA
was performed for KEGG pathway enrichment. Enriched terms were
identified by adjusted p-values (p.adjust) < 0.05 (Benjamini-Hochberg
method). d The category net plot visualizes the functional enrichment
result of Vorinostat with suppressed pathways colored in blue and
activated pathways colored in red. e The box plots illustrate the
log(FC) of the top 20 up-and down-regulated genes in cell lines HT29
with the drug bortezomib (n = 865 observations), MG-132 (n = 892
observations), and wortmannin (n = 207 observations), respectively. The
middle line in the box plot, median; box boundary, IQR; whiskers,
1.5 × IQR; minimum and maximum, not indicated in the box plot; gray
dots, points beyond the minimum or maximum whisker. f The box plots
illustrate the expression of the 10 different expression genes of cell
lines A549 (n = 450 cells), K562 (n = 430 cells), and MCF7 (n = 988
cells) perturbed by the drug GSK-LSD1, respectively. The middle line in
the box plot, median; box boundary, IQR; whiskers, 1.5 × IQR; minimum
and maximum, not indicated in the box plot; gray dots, points beyond
the minimum or maximum whisker. [132]Source data are provided as a
Source Data file.
To demonstrate the generalization of PRnet, we also analyzed
perturbation observations of other hold-out compounds on a gene-by-gene
basis. We collected and tested some case perturbation data of HT29 with
the most observations. Figure [133]3e illustrates the log(FC) of the
top 20 up- and down-regulated genes in multiple perturbations of HT29
cell lines treated with the hold-out compounds bortezomib, MG-132, and
wortmannin, respectively. These results suggested that PRnet is able to
capture regulated gene level information that is consistent with
evidence from corresponding compounds in inferring different cancer
transcriptional profile conditions that can be missed by perturbations
analysis. The predicted distribution of fold changes in gene expression
after perturbation closely aligned with the actual observed
distribution, indicating the accuracy of PRnet in capturing the
perturbation effects. More predicted gene-level perturbation responses
of cell lines of breast cancer exhibited similar performance
(Supplementary Fig. [134]2a). Besides, Fig. [135]3f demonstrates the
ability of PRnet to predict cell-type-specific gene-level perturbation
transitions in single-cell HTS observations. In the case of predicting
the response of perturbing A549, K562, and MCF7 cell lines with
GSK-LSD1, PRnet correctly captured both the right trend in gene
expression and the magnitude of response across all 10 differentially
expressed genes (Fig. [136]3f). Similar performance was observed for
several other examples across perturbation conditions (Supplementary
Fig. [137]4a–c). The ability to capture changes in gene-level programs
under different compound conditions and resolutions indicates the
robustness, generalization, and precise performance of PRnet in
predicting perturbation responses.
PRnet identified active compounds against small-cell lung cancer
Having been trained to simulate experimental measurements of
high-throughput screening, PRnet was applied to identify potential
novel compound candidates for the treatment of small cell lung cancer
(SCLC). SCLC is an extremely aggressive lung cancer characterized by
small cells with limited cytoplasm forming clusters or
spheroids^[138]39. Despite an initial positive response to conventional
chemotherapy and radiation, SCLC often recurs rapidly, with less than
5% of patients surviving five years. Currently, highly effective
treatment options for this disease remain unresolved, making drug
development efforts for this cancer a high priority.
Given several novel cell lines of SCLC, namely NCI-H69, NCI-H526,
NCI-H446, NCI-H209, and NCI-H196, and DMS114, we first employed PRnet
to predict the transcriptional response of SCLC cell lines to sensitive
compounds. Then, we in silico screened two user-defined compound
libraries to identify potential compound candidates against SCLC
(Fig. [139]4a), namely an active compound library (4158 compounds) from
Selleckchem, and an in-house druglike compound library (29,670
compounds). Through in silico screening, PRnet predicted the
transcriptional responses of each compound across eight concentration
gradients perturbing 6 cell lines, with each scenario repeated three
times for computational robustness and calculated gene rank of changes
in the average post-perturbation expression of each compound. After
that, the predicted up/downregulated genes of sensitive compounds on
their cell lines were used as the GSEA gene signature input to
calculate the enrichment scores. We then performed GSEA to calculate
the enrichment scores of compounds in libraries and ranked them
according to their scores (Supplementary Datas [140]1, [141]2).
Ultimately, three compounds ((+)-Fangchinoline, (+)-JQ-1, and
SEL120-34A HCl) at the top rank (rank ≤ 3 for enrichment score of
sensitive compounds) were chosen as the candidate set (Fig. [142]4b).
Among them, it has been proved that small cell lung cancer (SCLC) cells
are exquisitely sensitive to growth inhibition by the BET inhibitor
(+)-JQ-1 (CAS No. : 1268524-70-4)^[143]40, and (+)-Fangchinoline and
SEL120-34A HCl were assessed experimentally. We also explored the
suitable activity concentrations of candidate compounds by calculating
the enrichment scores of perturbed transcriptional profiles at
concentration gradients. As shown in Fig. [144]4c, d, concentrations in
the range of 1–10 μmol/L might be the proper inhibitory concentration
for these candidate compounds.
Fig. 4. The PRnet-identified candidates against small cell lung cancer and
colorectal cancer.
[145]Fig. 4
[146]Open in a new tab
a PRnet predicted the perturbed transcriptional profiles of 6 SCLC cell
lines and 14 CRC cell lines. For in silico screening, PRnet first
predicted the transcriptional profiles of SCLC cell lines perturbed by
a multi-concentration gradient of 4158 active compounds, as well as
29,670 in-house druglike compounds. PRnet also predicted the
transcriptional profiles of 14 CRC cell lines perturbed by 30,456
natural compounds. Then, post-perturbation fold-changes and average
fold-changes of compounds across cell lines are computed. The gene
ranking is performed based on the fold-change values. Given the
predicted gene signature of sensitive compounds, the model computes the
enrichment scores for up- and down-regulated gene sets of screening
compounds. Finally, compounds are ranked based on the enrichment
scores. For in vitro testing, MTT assays were performed for the
evaluation of cell viability. b The heatmap illustrates the enrichment
score of candidates for sensitive compounds. c, d The line charts plot
the enrichment score of SCLC cells exposed to (+)-Fangchinoline and
SEL120-34A HCl. e, f The cell survival curve of SCLC cells exposed to
(+)-Fangchinoline and SEL120-34A HCl (n = 3 replicates). IC[50]
(half-maximal inhibitory concentration) are presented as mean values
± SD. g, h The cell survival curve of CRC cells exposed to
7-Methoxyrosmanol and Mulberrofuran Q (n = 3 replicates). IC[50] are
presented as mean values ± SD. [147]Source data are provided as a
Source Data file. Some icons were created in BioRender. Qi, X. (2024)
[148]biorender.com/c85y849.
We utilized MTT assays to examine the activities of compound candidates
against SCLC cells. Six human SCLC cell lines (NCI-H69, NCI-H526,
NCI-H446, NCI-H209, and NCI-H196, and DMS114) were chosen for
experiments. Results revealed the activity of SEL120-34A HCl (CAS No. :
1609452-30-3) and (+)-Fangchinoline (CAS No. : 436-77-1) against small
cell lung cancer (SCLC) cell lines. SEL120-34A HCl and
(+)-Fangchinoline showed significant inhibitory effects on the
proliferation of SCLC cells, and exhibited an IC[50] (half-maximal
inhibitory concentration) of less than 10 μmol/L, indicating their
inhibitory effect on SCLC cell viability (Fig. [149]4e, f).
(+)-Fangchinoline and SEL120-34A HCl moderately inhibited the viability
of SCLC cell lines. Detailed antiviability activities of two candidates
are shown in Supplementary Table [150]3. These findings suggested the
potential therapeutic efficacy of SEL120-34A HCl and (+)-Fangchinoline
in the context of SCLC treatment, highlighting their promising role as
active compounds against this aggressive lung cancer subtype.
PRnet found natural compounds against colorectal cancer
Colorectal cancer (CRC) ranks as the third most common cancer and the
second leading cause of cancer death worldwide^[151]41. Advances in
molecularly targeted therapy and immunotherapy over the past decades
have significantly improved patient survival^[152]41. However, some CRC
patients, initially responsive to these treatments, quickly develop
insensitivity. The emergence of drug resistance in cancer treatment
significantly reduces patient outcomes. We aim to explore more
potential novel treatment options for colorectal cancer to mitigate the
impact of drug resistance through in silico screening of a large-scale
library of natural compounds (30,456 compounds^[153]42).
We extended the application of PRnet to screening novel natural
compounds for the treatment of CRC. We first leveraged PRnet to predict
the post-perturbation transcriptional profiles of sensitive compounds
in 14 CRC cell lines (HT29, LOVO, MDST8, RKO, HT115, SW948, SNU1040,
SNUC4, SNUC5, SW480, SW620, HCT116, CL34, and NCIH508). Then, we used
the predicted up/downregulated genes of sensitive compounds on their
cell lines as the GSEA gene signature. After that, we also in silico
screened a large-scale natural compounds library containing 30,456
natural ingredients. PRnet predicted the transcriptional responses of
each compound across eight concentration gradients perturbing 14 cell
lines with three repeats and calculated the post-perturbation average
gene rank of compounds. Two natural compounds (7-Methoxyrosmanol and
Mulberrofuran Q) at the top rank (rank ≤ 5 for enrichment score of
sensitive compounds) were chosen as the candidate set (see
Supplementary Tables. [154]1, 2).
We utilized MTT assays to examine the activities of candidate compounds
against CRC cells. Six human CRC cell lines, namely HT29, SW480,
Caco-2, SW620, HCT116, and Colo205, were chosen for the experiments.
The results revealed that 7-Methoxyrosmanol (CAS No. : 113085-62-4) and
Mulberrofuran Q (CAS No. : 101383-35-1) inhibited the viability of CRC
cell lines (Fig. [155]4e, f). 7-Methoxyrosmanol and Mulberrofuran Q
moderately inhibited the viability of CRC cell lines. Detailed
antiviability activities of two candidates are shown in Supplementary
Table [156]3. These findings showed the activity of 7-Methoxyrosmanol
and Mulberrofuran Q against CRC cell lines, suggesting their potential
efficacy in CRC treatment.
PRnet generated a large-scale integration atlas of perturbation profiles
With the ability to characterize specific gene-level perturbation
responses and identify anti-cancer compounds, PRnet was applied to in
silico screen novel compound libraries and cell lines and generate a
large-scale integration atlas of perturbation profiles across various
scenarios (Fig. [157]5a). PRnet was trained with two datasets: (1) the
L1000 dataset, a bulk high-throughput screening library consisting of
883,269 transcriptional profiles from 82 cell lines perturbed by
175,549 biologically active compounds, and (2) the Sci-plex3 dataset, a
single-cell high-throughput screening library consisting of 290,888
transcriptional profiles from 3 cell lines perturbed by 188 active
compounds. The L1000 dataset^[158]1 screened cell lines derived from
over 20 diverse tissues and exposed to compounds targeting multiple
genes and pathways (Supplementary Fig. [159]5a–c). The Sci-plex3
dataset screened three cancer cell lines treated by 188 compounds
targeting a diverse range of targets and molecular pathways, covering
various mechanisms of action (Supplementary Fig. [160]5d–f). After
training, PRnet was applied to screen various perturbation scenarios to
generate a large-scale integration atlas of perturbation profiles.
Through virtual screening, PRnet has predicted over 25 million
post-perturbation expression profile atlas of perturbation profiles
which consists of five parts: (1) the FDA-approved drugs dataset: a
bulk virtual high-throughput screening library containing 1,891,330
transcriptional profiles from 82 cell lines perturbed by 935
FDA-approved drugs, (2) the anti-cancer compounds dataset: a bulk
virtual high-throughput screening library containing 8,781,784
transcriptional profiles from 88 cell lines perturbed by 4158 active
compounds, (3) the natural compounds dataset: a bulk virtual
high-throughput screening library containing 10,233,230 transcriptional
profiles from 14 colorectal cancer cell lines perturbed by 30,456
natural compounds, (4) the bioactive compounds dataset: a bulk virtual
high-throughput screening library containing 4,272,486 transcriptional
profiles from 6 small cell lung cancer cell lines perturbed by 29,670
druglike compounds, and (5) the Gtex dataset: a bulk virtual
high-throughput screening library containing 1,245,510 transcriptional
profiles from 54 tissues perturbed by 935 FDA-approved drugs. Details
of all cell lines are provided in Supplementary Data [161]3. PRnet
offered a broad perspective by providing insights into how
perturbations impact the transcriptional landscape on a large scale and
extended its utility to diverse screening contexts. The large-scale
integrated atlas of perturbation profiles can be applied to a variety
of downstream application scenarios. For example, the FDA-approved drug
dataset can be used for drug repositioning to recommend drugs for
specific diseases based on gene signatures (see PRnet provided a robust
and scalable drug 568 recommendation workflow based on the profiles
atlas). The anti-cancer compounds dataset, the natural compounds
dataset, and the bioactive compounds dataset are valuable for screening
new anti-cancer compounds (see PRnet identified active compounds
against small cell 430 lung cancer and PRnet found natural compounds
against colorectal 482 cancer). In addition, the Gtex dataset can be
useful to analyze the toxicity of compounds in different tissues. PRnet
imports gene-level functionality in perturbations of different
compounds and empowers flexibility by utilizing user-defined compounds’
structures and transcription profiles to estimate the gene expression
matrix. These profiles can be compared against various perturbation
conditions(dosages, structures of compounds) using PRnet, to evaluate
the impact of using different compounds on specific gene-expression
profiles from single cell and bulk data. These diverse profiles
provided potential solutions for drug discovery, disease treatment, and
toxicity analysis.
Fig. 5. PRnet provided a robust and scalable recommend workflow on a large
integration atlas of perturbation profile.
[162]Fig. 5
[163]Open in a new tab
a The large-scale integration atlas of perturbation profiles. b PRnet
provided a recommended workflow for specific diseases. In step 1, given
the structures of the screening library, PRnet predicts the
post-perturbed transcriptional profiles of all compounds across
multiple concentration gradients in 82 cell lines. In step 2, the
transcriptional profiles of 978 landmark genes are transformed into
12,328 genes through linear transformation. Subsequently, the
perturbation fold-change and the average fold-change across cell lines
for the transformed expression profile are computed. Gene ranking is
then performed based on the fold-change values. In step 3, given the
gene signature for a particular disease, the model computes the
enrichment scores for up- and down-regulated gene sets of screening
compounds. Finally, compounds are ranked based on the enrichment
scores. c–e Scatter plots of the enrichment score of compounds against
NASH, Crohn’s disease, and PCOS. The computationally predicted
candidates are highlighted with their names in the upper right corner.
The color gradient shows the density of the dots. [164]Source data are
provided as a Source Data file. Some icons were created in BioRender.
Qi, X. (2024) [165]biorender.com/z56i777.
PRnet provided a robust and scalable drug recommendation workflow based on
the profiles atlas
PRnet provides a comprehensive drug recommendation workflow based on
the large-scale integration atlas of perturbation profiles
(Fig. [166]5b). In step 1, given the compounds’ structure of the
screening library, PRnet predicts the post-perturbed transcriptional
profiles of all compounds across multiple concentration gradients
across cell lines. In step 2, the transcriptional profiles of 978
landmark genes are transformed into 12,328 genes through linear
transformation. Subsequently, the perturbation fold-change and the
average fold-change across cell lines for the transformed expression
profile are computed. The gene ranking is then performed based on the
fold-change values. In step 3, given the gene signature for a specific
disease, PRnet computes the enrichment scores for up- and
down-regulated gene sets of screening compounds. Finally, compounds in
the screening library are ranked based on these enrichment scores.
These downstream applications demonstrated the versatility and utility
of PRnet in addressing diverse challenges in the field of drug
discovery and perturbation analysis.
We leveraged the drug recommendation workflow to identify drug
candidates for the treatment of 233 different diseases. For the
user-defined compound library, we here chose the FDA-approved drug
dataset, which comprises 935 FDA-approved drugs. All gene signatures of
233 diseases versus normal were collected from the CREEDS project
(CRowd Extracted Expression of Differential Signatures^[167]43). The
up/downregulated genes of a specific disease were used as the GSEA gene
signature input to calculate enrichment scores (see “Methods”) of
drugs. Finally, we obtained 577 candidate lists from 577 studies for
233 unique diseases (Supplementary Data [168]4). Taking three metabolic
disorder diseases, including nonalcoholic steatohepatitis (NASH),
inflammatory bowel disease (IBD), and polycystic ovary syndrome
patients (PCOS) as examples, we provided enrichment scores of each drug
for the three diseases and the recommended drug candidates for them.
The enrichment scores are plotted in Fig. [169]5c–e, drugs positioned
at the upper right corner (rank≤ 10 for enrichment score) were chosen
as the candidate set for literature verification. Supplementary
Table [170]4 lists the Top 10 enrichment scores for compounds against
NASH, Crohn’s disease, and PCOS.
Nonalcoholic steatohepatitis (NASH) is liver inflammation and damage
caused by a buildup of fat in the liver without standard treatment and
any well-established drug targets. After calculation, we ultimately
selected a set of candidate drugs from the upper right corner (rank≤ 6
for enrichment score) as candidate treatments for NASH (Fig. [171]5c).
Literature verification revealed that Mirabegron (rank 1), Vidofludimus
(rank 5), and Rifaximin (rank 6) have literature support for use in the
treatment of NASH. A study on high-fat diet rats^[172]44 suggested that
mirabegron may have a protective effect against NASH as it improves
liver enzymes, lipids, serum HbA1c, fasting insulin and glucose,
insulin resistance index, and serum adiponectin levels, as well as
ameliorates hepatic histopathologic changes in NASH-induced rats. A
study on obesity mice^[173]45 suggested vidofludimus reduced hepatic
steatosis and inflammation in obesity mice and has been repurposed to a
therapeutic potential in the treatment of NASH by targeting FXR based
on the newly established relationships among drugs, targets, and
diseases. Rifaximin therapy appeared to be effective and safe in
modifying NASH through reduction of serum endotoxin and improvement of
insulin resistance, proinflammatory cytokines, CK-18, and NAFLD-liver
fat score. Patients with biopsy-proven NASH with Rifaximin therapy
showed that Rifaximin appeared to be effective and safe in modifying
NASH^[174]46,[175]47. We also performed KEGG pathway enrichment
analysis of predicted up/down-regulated genes of Mirabegron,
Vidofludimus, and Rifaximin (Supplementary Fig. [176]7). The KEGG
pathway enrichment analysis of downregulated genes for both Mirabegron
and Rifaximin suggested they may downregulate pathways associated with
NASH.
Crohn’s disease and ulcerative colitis are idiopathic inflammatory
bowel disorders (IBD). Crohn’s disease is a relapsing systemic
inflammatory disease that primarily affects the gastrointestinal tract
with extraintestinal manifestations and associated immune
disorders^[177]48. After ranking the drugs, two drugs positioned at the
upper right corner (rank≤ 9 for enrichment score) were chosen as the
candidate set (Fig. [178]5d). The literature verification revealed that
Escin (rank 3) and Ozanimod (rank 9) have literature support for use in
the treatment of Crohn’s disease. Escin is the main bioactive
ingredient of Semen aesculi, which improves intestinal barrier
dysfunction of IBD via Akt/NF-κB signaling pathway^[179]49. Ozanimod is
a once-daily sphingosine 1-phosphate receptor modulator for the
treatment of inflammatory bowel disease and was approved by the FDA.
Phase 2 clinical trial had proved that Ozanimod can be a novel oral
small molecule therapy for the treatment of Crohn’s
disease^[180]50,[181]51. The KEGG pathway enrichment analysis results
of predicted up/down-regulated genes of Escin and Ozanimod are
illustrated in Supplementary Fig. [182]7. The predicted upregulated
pathways (Supplementary Fig. [183]7) suggested Escin may upregulate the
PI3K/Akt signaling pathway which regulates a broad cascade of target
proteins including nuclear factor kappa B (NF-κB) and glycogen synthase
kinase-3β (GSK-3β)^[184]52.
Polycystic ovary syndrome (PCOS) is a heterogeneous endocrine disorder
in which the major endocrine disruption is excessive androgen secretion
or activity, and a large proportion of women also have abnormal insulin
activity. We also chose positively scored drugs as candidates for PCOS
(rank≤ 9 for enrichment score, Fig. [185]5e). The literature
verification revealed that Enzalutamide (rank 3), Linagliptin (rank 8),
and Topiramate (rank 9) have literature support for use in the
treatment of PCOS. Enzalutamide has been suggested for use as an
antiandrogen to treat hirsutism and hyperandrogenism in women with
polycystic ovary syndrome^[186]53,[187]54 investigated the effect of
linagliptin and/or I3C on experimentally-induced PCOS in female rats,
and the results suggested that linagliptin/I3C combination might
represent a beneficial therapeutic modality for amelioration of PCOS.
PCOS has completed phase 3 trials for Topiramate, and
phentermine-topiramate extended-release (PHEN/TPM) resulted in the most
loss of weight and total body fat^[188]55. The predicted regulated
pathways (Supplementary Fig. [189]7) of Enzalutamide showed that
Enzalutamide may upregulate the PI3K/Akt signaling pathway and MAPK
signaling pathway, which are related to Androgen signaling^[190]56. The
predicted downregulated pathways suggested Topiramate may downregulate
the Lipid and atherosclerosis and the Fat digestion and absorption
pathway, thereby contributing to weight loss. All three literature
verification results illustrated that PRnet recommended reliable and
effective drugs for different diseases, and hence is a valuable tool
for drug discovery.
Discussion
Recent advancements in high-throughput screening have profiled
thousands of independent chemical perturbations, providing crucial
insights into the fundamental responses of biological systems to
perturbations. However, screening all possible disease and compound
combinations is experimentally unfeasible. To address the limited
exploration power of existing experimental methods, we developed PRnet,
which supports the prediction of transcriptional responses to novel
chemical perturbations that were never experimentally studied at bulk
and single-cell levels. PRnet serves as a valuable tool that
facilitates gene-level response interpretation in various novel
perturbation scenarios and effectively recommends candidates for
various diseases.
PRnet has the unique ability to screen various novel compound libraries
for specific diseases and infer their post-perturbed transcriptional
response. Given the structures of compounds in libraries and
unperturbed transcriptional profile, PRnet encodes the perturbation and
unperturbed state to an interpretable latent space as condition
contexts to generate transcriptional responses. Trained on extensive
data, PRnet is able to adapt to complex chemical perturbations of
generalization to novel compounds and cell lines. To further validate
the effectiveness of PRnet, we identified novel compounds candidates
against SCLC and natural compounds against CRC. Experimental
validations confirmed the activity of candidate compounds, showcased
the capabilities of PRnet to guide the design of new screens, and
reduced the time and costs of experiments. Lastly, PRnet generated a
large-scale integration atlas of perturbation profiles and demonstrates
its capability to recommend drug candidates for 233 complex diseases
based on reference changes in gene sets. The flexibility and
scalability of PRnet make it a valuable tool for guiding the design of
screening strategies for gene-based therapeutics.
Although PRnet is effective in predicting drug candidates, there is
still room for improvement. Due to the scalability of the SMILES
format^[191]30 and the RDKit^[192]31, PRnet is able to encode unseen
complex compounds as embeddings. SMILES is widely used for representing
chemical structures due to its simplicity and efficiency in encoding
complex molecules as strings. However, SMILES may not fully capture
complex molecular features such as 3D geometry, conformational
flexibility, or dynamic behavior, which can be important for
understanding molecular interactions. Alternative encoding methods such
as MOL/SDF files and graph-based representations are more flexible
representations in capturing the 3D geometry structure of compounds.
MOL/SDF formats represent chemical structures with explicit atom
coordinates and bond connectivity. Graph-based representations use
graph theory to represent molecules where atoms are nodes and bonds are
edges. These methods can provide a more intuitive and flexible
representation. However, they are more complex or require pre-training
to obtain better representations. In the scenario of large-scale in
silico screening, we chose SMILES to encode chemical structures for its
simplicity and scalability. Alternative encoding methods, such as
MOL/SDF files and graph-based representations, will be considered
in the future work.
The reverse signature paradigm to connect disease and drug established
by Lamb et al in CMap^[193]15 has demonstrated its effectiveness in
previous studies^[194]16–[195]20. However, this may not hold true for
all diseases, and there are instances where transcriptional changes do
not correlate directly with drug sensitivity. Jie Cheng et al.^[196]57
found the reverse signature paradigm may perform poor accuracy when
some disease signatures are of low quality. Rasool Bhat^[197]58
discusses how cancer cells exhibit phenotypic plasticity under drug
treatment, leading to drug resistance which often involves complex
regulatory networks that are not fully captured by changes in gene
expression alone. The reverse signature paradigm is an effective
matching algorithm in many diseases. But there are also cases that not
fit this algorithm, more comprehensive omics data, such as genomics,
epigenetics, and proteomics should be considered in the future work to
better characterize disease states and facilitate drug discovery.
In addition, in the process of compound screening, we mainly focus on
the impact at the gene level, lacking consideration for the effects on
phenotypes, such as the Area Under the Curve (AUC) or half-maximal
inhibitory concentration (IC[50]) derived from the experimental
dose-response curve. To comprehensively assess compound impacts, future
research directions could involve incorporating more phenotype data for
a holistic assessment of the relationship between genes and phenotypes.
Moreover, expanding the scope of perturbation scenarios and
incorporating extensive biological knowledge will be considered to
enhance the model’s predictive capabilities in the future. Furthermore,
we expect PRnet to extend its utility beyond the prediction of chemical
perturbations to encompass various perturbation experiments, including
genetic perturbations and other forms, thereby contributing to the
advancement of drug discovery.
Methods
The PRnet algorithm
In this work, we formulate transcriptional response prediction as a
distribution generation problem conditioned on perturbations. Given a
dataset
[MATH:
D=(xi,
Pi)i=1N
msubsup> :MATH]
, where
[MATH:
xi∈Rn
:MATH]
denotes as the n-dimensional gene expression and P[i] is the attribute
set of perturbation, PRnet was trained to learn a function f that
predicts transcriptional responses
[MATH: x^i=f(xiu<
/msubsup>,Pi
) :MATH]
to novel perturbations. Here,
[MATH:
xiu
:MATH]
represents the unperturbed gene expression,
[MATH: xi^ :MATH]
is the predicted perturbed gene expression, and P[i] = (s[i], d[i]) is
the attribute set of perturbation, which includes the chemical
structures s[i] and dosages of the compounds d[i]. As a generation
model, the Gaussian negative log-likelihood loss is chosen as the loss
function. For HTS RNA-seq datasets
[MATH:
D=(xi,
Pi)i=1N
msubsup>=(
xi,(si,di))i=
1N :MATH]
, the gene expression x[i], compounds s[i], and dosages d[i] attributes
are usually considered, in which
s[i] = (s[i,1], s[i,2], . . . , s[i,M]) describes the Canonical SMILES
format of M compounds of perturbation i and
d[i ]= (d[i,1], d[i,2], . . . , d[i,M]) are the dosages of compounds.
When d[i][,][j] = 0, the compound j was not applied in perturbation i.
If M = 0, there is no perturbation performed, and in this case,
perturbation i is an unperturbed state
[MATH:
xiu
:MATH]
, otherwise is in a perturbed state. In addition, the covariates vector
c[i] contains discrete covariates such as cell types, cell lines, or
species, depending on the available data. While c[i] does not directly
serve as input to the function f, it relates to the intermediate
variable vector
[MATH:
xiu
:MATH]
of the function f. Each perturbed profile x[i] is assigned an
unperturbed transcriptional profile of the same cell line
[MATH:
xiu
:MATH]
by random selection in the dataset. The goal of PRnet is to learn a
function f that predicts transcriptional responses
[MATH: x^i=f(xiu<
/msubsup>,Pi
) :MATH]
to novel perturbations. Given an unperturbed gene expression
[MATH:
xiu
:MATH]
, PRnet maps the unperturbed state to a distribution
[MATH: N(x
i∣μi
mrow>,σi<
/mrow>2) :MATH]
, and the sample to a perturbed state
[MATH: x^i :MATH]
.
PRnet is a perturbation-conditioned generative model aimed at
predicting gene expression profiles under different perturbations. The
design of PRnet consists of three components: (1) the Perturb-adapter,
a scalable adapter
[MATH:
fpert<
/mi>:Z→Rk
:MATH]
encodes complex perturbations (s[i], d[i]) into a k-dimensional
perturbation embedding
[MATH:
zip
:MATH]
, (2) the Perturb-encoder
[MATH:
fenc:Rk+
n→Re
:MATH]
encodes the combined learnable perturbation embedding
[MATH:
zip
:MATH]
and unperturbed gene expression
[MATH:
xiu
:MATH]
to a latent space to get the embedding
[MATH:
zil
:MATH]
, and (3) the Perturb-decoder
[MATH:
fdec:Re+
k+m→R2n :MATH]
decodes combines perturbation embedding
[MATH:
zip
:MATH]
, latent state
[MATH:
zil
:MATH]
and noise
[MATH:
zin
:MATH]
into Gaussian likelihood distribution
[MATH: N(x
i∣μi
mrow>,σi<
/mrow>2) :MATH]
of perturbed state and generates perturbed gene expression
[MATH: x^i :MATH]
by sampling from
[MATH: N(x
i∣μi
mrow>,σi<
/mrow>2) :MATH]
. We elaborate on the three components in the following.
The Perturb-adapter
The Perturb-adapter encodes the i-th perturbation P[i] as a fixed-size
embedding
[MATH:
zip
∈Rk :MATH]
(k = 64). Given the i-th perturbation P[i], the j-th compound of P[i]
was first represented as the canonical SMILES format s[i,j], and was
converted to a fixed size fingerprint embedding FCFP[i,j] (FCFP4
fingerprints: Functional-Class Fingerprints with radius = 2 which focus
on the functional topological pharmacophore) by RDKit^[198]31 encoder
[MATH: H:Z→Rh
:MATH]
(h = 1024). Then, the rescaled Functional-Class Fingerprints rFCFP[i]
embedding of P[i] was calculated by the weighted sum of all fingerprint
embeddings of all compounds applied by P[i]:
[MATH:
rFCFPi=∑jϕ(di,
mo>j)×H
(si
,j)=<
mrow>∑jlog10(di,j+1)<
mo>×H(s<
mrow>i,j), :MATH]
1
where ϕ is a log scale function using the dosage of j-th compound. At
last, the final perturbation embedding
[MATH:
zip
:MATH]
was generated by the Perturb-adapter
[MATH:
Eθ<
mi>p :MATH]
:
[MATH:
zip
=fpert(si,
di)<
/mrow>=Eθp(G(H(si),log10(d
i+1)
)), :MATH]
2
where θ[p] are the parameters of the Perturb-adapter, which is a
2-layer feedforward neural network.
Thanks to the scalability of the SMILES^[199]30 format and the
RDKit^[200]31, which can encode any compound to a latent embedding, the
design of the Perturb-adapter is scalable to in silico screen novel
compounds and cell line perturbations.
The Perturb-encoder and the Perturb-decoder
Inspired by the Variational Auto-Encoder (VAE^[201]22) framework, PRnet
uses the encoder and decoder framework to estimate the Gaussian
distribution of each gene parameterized with the mean (μ[i]) and the
variance (
[MATH:
σi2
:MATH]
). Due to the technical limitation, only unpaired perturbed and
unperturbed transcriptional profiles were observed. To match the
expression profiles before and after perturbation, for scRNA-seq
datasets, each perturbed cell was assigned with an unperturbed cell
from the same cell type by random selection in the dataset, and for
bulk RNA-seq datasets, the same cell line observation was assigned for
the perturbed state. Given the perturbation embedding
[MATH:
zip
:MATH]
generated by the Perturb-apaptor, the Perturb-encoder
[MATH:
Eθ<
mi>e :MATH]
mapped the
[MATH:
zip
:MATH]
and the unperturbed gene expression
[MATH:
xiu
:MATH]
to a latent embedding
[MATH:
zil
∈Re :MATH]
(e = 64). Then, the perturb-decoder
[MATH:
Eθ<
mi>d :MATH]
accepted
[MATH:
zip
:MATH]
,
[MATH:
zil
:MATH]
, and the noise
[MATH:
zin
:MATH]
to estimate the Gaussian likelihood distribution
[MATH: N(x
i∣μi
mrow>,σi<
/mrow>2) :MATH]
of the perturbed profile:
[MATH:
zil
=Eθe<
mo>(xi<
mrow>u,zip)
, :MATH]
3
[MATH: [μ^i,σ^i2]
=Eθd(zip
mrow>,zil,
zin
mrow>), :MATH]
4
where both
[MATH:
Eθ<
mi>e :MATH]
and
[MATH:
Eθ<
mi>d :MATH]
are 2-layer feedforward neural networks, and θ[e]and θ[d] are the
parameters of
[MATH:
Eθ<
mi>e :MATH]
and
[MATH:
Eθ<
mi>d :MATH]
, respectively. In the design of PRnet, the perturbation embedding
[MATH:
zip
:MATH]
was used as the input for both the perturb-encoder and the
perturb-decoder. Multiplexing
[MATH:
zip
:MATH]
in the perturb-decoder combined the chemical and biology context helped
assemble a more precise output. And the noise
[MATH:
zin
:MATH]
increases the robustness of the model. Once the estimated Gaussian
likelihood distribution
[MATH: N(x
i∣μi
mrow>,σi<
/mrow>2) :MATH]
was generated, PRnet sampled the estimated perturbed gene expression
[MATH: x^i :MATH]
from it:
[MATH:
μi=μ^i, :MATH]
5
[MATH:
σi2
=φ(σ^i2)
, :MATH]
6
[MATH: x^i~N(μ
i,σi2
), :MATH]
7
where φ(.) denotes the non-linear softplus function. μ[i] and
[MATH:
σi2
:MATH]
are the mean and the variance for each gene, which are n-dimensional
vectors. And
[MATH: x^i :MATH]
is the estimated n-dimensional transcriptional responses to
perturbations.
Training and testing
The training objective of PRnet is to minimize the Gaussian negative
log-likelihood loss defined as:
[MATH: L(θ
p,θ
mrow>e,θd)=∑i
12logmaxσi<
mrow>2,eps+<
mrow>(μi−xi
)2maxσi<
mrow>2,eps, :MATH]
8
Where eps is used to clamp
[MATH:
σi2
:MATH]
for stability, and is set to 1e-6 by default. By minimizing the loss of
all cells/samples, PRnet is able to learn the most proper parameters.
The training datasets (a bulk HTS dataset from the L1000 project^[202]1
and a single cell HTS dataset from the sci-Plex3 assay^[203]2) were
split by dividing the attribute set of perturb P[i] =
(s[i], d[i], c[i]).
To train and test PRnet, all datasets were split into three subsets:
train, valid, and test in a ratio of 6:2:2, according to different
split strategies. A total of four split strategies were designed:
* random_split: randomly divides compounds and cell lines
* compound_split: groups the datasets according to s[i] for each
profile, and then splits groups of the dataset
* cell_line_split: groups the datasets according to c[i] for each
profile, and then splits groups of the dataset
* pathway_split: groups the datasets according to c[i] from the same
pathway for each profile, and then splits groups of the dataset
For bulk RNA-seq datasets, random_split, compounds_split, and
cell_line_split were applied, and for scRNA-seq datasets, random_split,
compound_split, and pathway_split were applied. Strict data split can
prevent data leakage and bring better generalization performance. PRnet
is trained using 5-fold cross-validation for each split category.
The table below (see Table [204]1) outlines the values for the
hyperparameters involved in PRnet training. The same set of
hyperparameters is used across all splits and datasets.
Table 1.
Hyperparameters set of PRnet
Module Hyperparameter Default value hyperparameters set
PRnet batch size 512 256, 512, 1024
learning rate 1e-3 1e-3, 1e-5, 3e-3
weight decay 1e-8 0,1e-3
scheduler factor 0.5 0.5
scheduler patience 5 5
early stopping patience 20 20
num of epochs 100 100
gradient descent Adam Adam,RMSprop
loss function GaussNLLLoss^1 GaussNLLLoss, NBLoss^2, MSE^3, KLLoss^4
Perturbe-adapter drug dimension(h) 1024 1024
input size h 1024, 2048
hidden size 128 64, 128, 256
output size(k) 64 64, 128
Perturbe-encoder input size 5000, 978 5000(scRNA-seq), 978(bulk
RNA-seq)
hidden size 128 128, 256
output size(e) 64 64, 128
Perturbe-decoder input size(e + k + m) 138 138, 266
hidden size 128 128, 256
output size n × 2 10000, 1956
[205]Open in a new tab
^1 GaussNLLLoss is the Gaussian negative log-likelihood loss.
^2 NBLoss is the negative binomial negative log-likelihood loss.
^3 MSE is the mean squared error loss.
^4 KLLoss is the Kullback-Leibler divergence loss.
Model evaluation
Four metrics were used to quantitatively evaluate the performance of
PRnet:
1. fold − change: describes the ratio changes between the gene
expression of unperturbed and perturbed state:
[MATH:
fci=xi+1xiu, :MATH]
9
[MATH: fc^i=x^i+1x^iu. :MATH]
10
Where fc[i] and
[MATH: fc^
mrow>i :MATH]
are the ground truth and predicted perturbed fold change in gene
expression x[i], respectively. A higher value of fold − change
indicates a better performance.
2. R^2: is the mean coefficient of determination score between the
predicted and true perturbed gene expression of all cells:
[MATH:
R2=1O∑i=1
O1−∑j=1
j=n
(x^ij−xij)2
∑j=1
j=n
(x^ij−x¯ij)
2, :MATH]
11
where O denotes the number of cells in test datasets,
[MATH: x^i :MATH]
, x[i] and
[MATH: x¯i :MATH]
are the predicted, true perturbed gene expression and mean of the
ground truth perturbed gene expression, respectively. A higher
value of R^2 indicates a better performance.
3. Pearson of log(FC) in compounds: evaluates the Pearson correlation
between the true and predicted post-perturbation of the average
logarithm of the fold-change in gene expression (log(FC)) perturbed
by the same compound:
[MATH: log(f<
mrow>cj)
mrow>¯=1n∑i
=1nlo
g(fci,j)<
mo>, :MATH]
12
[MATH: log(fc^j)¯=1n∑i
=1nlo
g(fc^i,j), :MATH]
13
[MATH:
PCCj=PCC(<
mover
accent="true">log(f<
mrow>cj)
mrow>¯,log(fc^j)¯), :MATH]
14
[MATH: Pearsonoflog(FC)incompounds=1m∑j=1
mPCC<
/mi>j, :MATH]
15
where fc[i,j] and
[MATH: fc^
mrow>i,j :MATH]
denote as the ground truth and predicted post-perturbation fold
change in gene expression x[i] perturbed by compound j. The set
[MATH:
(fci,j<
mo>)i=1<
mrow>n :MATH]
represents all perturbed fold-changes for compound j.
[MATH: log(fcj
)¯ :MATH]
and
[MATH: log(<
mover
accent="true">fc^
mrow>j)<
mo accent="true">¯ :MATH]
are the ground truth and predicted post-perturbation average
logarithm of logarithm of the fold-change in gene expression for
compound j, respectively. PCC[j] is the Pearson correlation
coefficient of the mean log(FC) perturbed by compound j. The
average of the Pearson correlations for all m compounds in the test
set is referred to as the “Pearson of log(FC) in compounds”. A
higher value of “Pearson of log(FC) in compounds” indicates a
better performance.
4. Pearson of log(FC) in cov_compounds: evaluates the Pearson
correlation between the true and predicted post-perturbation of the
average logarithm of the fold-change in gene expression (log(FC))
perturbed by the same compound within the same cell line:
[MATH: log(FCjc
)¯=1n∑i=1
nlog(
fci,j
mi>c)
, :MATH]
16
[MATH: log(FCjc^)¯=1n∑i=1
nlog(fci,jc^), :MATH]
17
[MATH:
PCCjc=PC
C(log(FCjc
)¯,log(FCjc^)¯), :MATH]
18
[MATH: Pearsonoflog(FC)incov_compounds=1z∑c,j=
mo>1zPCCjc, :MATH]
19
where
[MATH:
fci,jc :MATH]
and
[MATH: fci,jc^ :MATH]
denote the ground truth and predicted post-perturbation fold change
in gene expression x[i] perturbed by compound j in covariate c,
respectively. The covariate c represents cell line or cell type of
x[i].
[MATH:
(fc
i,jc<
/mi>)i=1n
:MATH]
represents all perturbed fold-changes for compound j in covariate
c.
[MATH: log(FCjc)¯ :MATH]
and
[MATH: log(FCjc^)¯ :MATH]
are the ground truth and predicted post-perturbation average
logarithm of logarithm of the fold-change in gene expression for
compound j in covariate c, respectively.
[MATH:
PCCjc :MATH]
is the Pearson correlation coefficient of the mean log(FC)
perturbed by compound j in covariate c. The average of the Pearson
correlations for all z “cov_compounds” conditions in the test set
is referred to as the “Pearson of log(FC) in cov_compounds”. A
higher value of “Pearson of log(FC) in cov_compounds” indicates a
better performance.
5. R^2 in compounds: is the mean coefficient of determination score
between the predicted and true post-perturbation gene expression of
the same compound:
[MATH: xt¯=1p∑i=1
pxi,t, :MATH]
20
[MATH: xt^¯=1p∑i=1
pxi,<
/mo>t^, :MATH]
21
[MATH:
Rt2
=R2(xt¯,xt^¯), :MATH]
22
[MATH: R2incompounds=1T∑j=1
TR
t2, :MATH]
23
where x[i,t] and
[MATH: xi,t^ :MATH]
denote as the ground truth and predicted post-perturbation gene
expression x[i] perturbed by compound t. The set
[MATH:
(x<
mrow>i,t)<
/mrow>i=1p
:MATH]
represents all post-perturbation gene expression for compound t.
[MATH: xt
mrow>¯ :MATH]
and
[MATH: xt
mrow>^¯ :MATH]
are the ground truth and predicted post-perturbation average gene
expression for compound t, respectively.
[MATH:
Rt2<
/mn> :MATH]
is the coefficient of determination score of the mean
post-perturbation gene expression of compounds t. The average of
the coefficient of determination score for all T compounds in the
test set is referred to as the “R^2 in compounds”. A higher value
of “R^2 in compounds” indicates a better performance.
6. R^2 in cov_compounds: is the mean coefficient of determination
score between the predicted and true post-perturbation gene
expression of the same compound within each specific cell line:
[MATH: xtc¯=1p∑i=1
px
i,tc<
mo>, :MATH]
24
[MATH: xtc^¯=1p∑i=1
pxi,tc^
mo>, :MATH]
25
[MATH:
Rt2
=R2(xtc¯,xtc^¯), :MATH]
26
[MATH:
R2in cov_compounds=1<
mi>Q∑j=1
QR<
/mrow>2tc, :MATH]
27
where
[MATH:
xi,tc :MATH]
and
[MATH: xi,tc^ :MATH]
denote as the ground truth and predicted post-perturbation gene
expression x[i] perturbed by compound t in covariate c,
respectively. The covariate c represents the cell line or cell type
of x[i]. The set
[MATH:
(xi,tc)i=1p :MATH]
represents all post-perturbation gene expression for compound t in
covariate c.
[MATH: xtc¯ :MATH]
and
[MATH: xtc^<
/mrow>¯ :MATH]
are the ground truth and predicted post-perturbation average gene
expression for compound t in covariate c, respectively.
[MATH:
R2
tc :MATH]
is the coefficient of determination score of the mean
post-perturbation gene expression of compounds t in covariate c.
The average of the coefficient of determination score for all Q
“cov_compounds” conditions in the test set is referred to as the
“R^2 in cov_compounds”. A higher value of “R^2 in cov_compounds”
indicates a better performance.
We use the following baseline models to compare model performance:
1. Linear model: This model uses a linear regression model to learn
weights between all genes and perturbations. The linear model uses
MSELoss to learn perturbation effects applied to the control
sample/cell. Let θ[l] represent the weight matrix of the linear
model, the perturbed gene expression would be:
[MATH: x^i=Eθl(zip,x
iu<
mo>), :MATH]
28
[MATH: Li<
mo>(Eθl)
mrow>=MSELos<
/mi>s(x^i,xi
). :MATH]
29
2. MLP model: MLP model utilizes Multilayer Perceptron (MLP) to fit
the effect of perturbation to genes. The input is passed to an MLP
model(Input Layer, Linear with output dimension 128, BatchNorm1d,
LeakyReLU, Linear with output dimension as input size, and ReLU).
Let θ[MLP] represent the parameters of the linear model, the
perturbed gene expression would be:
[MATH: x^i=MLPEθMLP(zip,xiu
), :MATH]
30
[MATH: Li<
mo>(EθMLP)=MSEL<
/mi>oss(x^i,xi
). :MATH]
31
The pseudo-dose trajectory
To calculate the pseudo-dose trajectory, we take the mean t-SNE
embedding of cells with the same dosage as a point on the pseudo-dose
trajectory:
[MATH: zd¯=1nd
mfrac>∑i=1
ndzi,d
, :MATH]
32
where z[i,d] is the t-SNE latent embedding of cell i at dose d, n[d] is
the number of cells at dose d, and
[MATH: zd¯ :MATH]
is the mean t-SNE embedding for dose d, representing a point on the
pseudo-dose trajectory. The sequence of these mean latent embeddings
[MATH: zd¯ :MATH]
forms the pseudo-dose trajectory.
Screening candidates based on gene signatures
Inspired by the CMap connectivity score^[206]15 and L1000
project^[207]1, PRnet employed a similar reverse signature paradigm to
screen candidates based on gene signatures. PRnet provided a workflow
for screening candidates for complex diseases according to the
reference changes in gene sets.
Step 1: Given the SMILES of the screening compounds library and
unperturbed transcriptional profiles of specific cell lines, PRnet
predicts the perturbed transcriptional profiles of all compounds across
multiple concentration gradients. This prediction is performed with
three repeats to ensure computational robustness.
Step 2: The transcriptional profiles of 978 landmark genes are
transformed into profiles of 12,328 genes using linear transformation
vectors derived from the L1000 project^[208]1. The fold-change of
transcriptional profiles is then calculated, which includes the average
fold-change of each compound across cell lines. Genes are ranked based
on their fold-change values. Gene expression signatures representing
the disease states or sensitive compound responses are generated, known
as query signatures. Query signatures include an up-regulated gene set
and a down-regulated gene set which are the reversed signatures of the
disease. Gene expression signatures of screening compounds are also
generated, which are the ranked gene lists.
Step 3: The Kolmogorov-Smirnov test is employed to calculate enrichment
scores, which measure the connectivity of screening compound profiles
with the query signatures. The score to reverse disease up- and
down-regulated features are calculated separately and then summed
together:
[MATH:
a=maxm
=1~pmp−R(m<
/mrow>)n, :MATH]
33
[MATH:
b=−max
m=1~qR(m<
mo>−1)n−m−1<
mrow>q, :MATH]
34
[MATH:
Score=a−b,whena×b<0,0,whena×b>0, :MATH]
35
where p is the number of genes in the upregulated gene set of query
signature, q is the number of genes in the downregulated gene set of
query signature, n is the number of genes in the computed
transcriptional profile, and R(m) is the rank of a specific gene in the
rank list. The enrichment score is commonly used in the field to
evaluate the distribution of the predicted gene rank in the reference
gene signature. The gene signature is an up- and down-regulated gene
set. The up- and down-regulated gene set of the specific disease is
identified from the reference study. The up- and down-regulated gene
sets of the sensitive compounds to cell lines are computed by PRnet.
Note that the up- and down-regulated gene sets represent the reverse
gene signatures of a specific disease, indicating the up- and
down-regulate gene sets that the compounds are expected to enrich.
Finally, compounds are ranked based on these enrichment scores, and the
top-ranked compounds are recommended for diseases.
Other computational tools
Morgan fingerprints were calculated by the RDKit^[209]31 in Python
(2023.3.2). The atlas of perturbation profiles was organized into
anndata format through the Scanpy package^[210]59 in Python (1.9.1).
T-SNE^[211]33 clustering was performed by the Scanpy package in Python
(1.9.1). Figures related to computation were plotted by
matplotlib^[212]60 (3.5.2), seaborn^[213]61 in Python (0.12.0), and the
ggplot^[214]62 package in R (3.5.1). The GSEA and KEGG pathway
enrichment analysis were performed by the Clusterprofiler
package^[215]63 in R (4.6.2), and plot by the gseaplot, enrichplot and
cnetplot. A P-value < 0.05 was defined as the cutoff criterion.
Experimental figures were plotted by GraphPad Prism. Some figures were
plotted by Chiplot ([216]https://www.chiplot.online/). And some icons
in figures are created in BioRender ([217]https://www.biorender.com/).
The deep learning model was constructed by Pytorch^[218]64 (1.12.1).
Cell culture
All the cell lines used in this work were purchased from the American
Type Culture Collection (ATCC). HT29, SW480, SW620, and HCT116 cells
were cultured in DMEM (Gibco) medium supplemented with 10% fetal bovine
serum, 100 U/mL penicillin, and 100 U/mL streptomycin. Colo205,
NCI-H196, NCI-H209, NCI-H446, NCI-H526, and NCI-H69 cells were cultured
in RPMI-1640 (Gibco) medium supplemented with 10% fetal bovine serum,
100 U/mL penicillin, and 100 µ/mL streptomycin. DMS114 cells were
cultured in Waymouth’s MB 752/1 (Gibco) medium supplemented with 10%
fetal bovine serum, 100 µ/mLpenicillin, and 100 µ/mL streptomycin. All
the cell lines were incubated at 37 ^∘C in a humidified 5% CO[2]
atmosphere. All cells were negative for mycoplasma, and these cell
lines are not among those commonly misidentified by the International
Cell Line Authentication Committee (ICLAC).
Cell viability assays
MTT (3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide)
assay was performed for the evaluation of cell viability. The cells
were seeded into 96-well plates at a density of 2000–20000 cells per
well in full medium. Overnight, test compounds with indicated
concentrations were added for 72 h. 20 μL of MTT (5 mg ⋅ mL^−1 in
saline, Sigma) was then added per well for 2 h, and 50 μL SDS (20%,
dissolved in H[2]O containing 1% HCl) was added per well, followed by
incubating overnight. The absorbance at 570 nm was measured using a
multiscan spectrum reader (BMG lab tech). The cell survival rate was
calculated subsequently.
Atlas of perturbation profile
Data preprocessing
The HTS RNA-seq datasets used for this study all underwent the same
preprocessing. For single-cell RNA-seq data, the gene expression of
each cell was normalized by total counts over all genes with
log-transformation, and 5000 highly variable genes (HVGs) were selected
using Scanpy. For bulk RNA-seq data, 978 different genes of level 3
were normalized with log transformation for training and evaluation.
These data were split into training, validation, and test sets by
dividing various perturbation conditions (such as compounds, cell type,
and pathway) with a ratio of 6:2:2. And 5-fold cross-validation was
applied for training.
The L1000 dataset
The L1000 project^[219]2 contains more than 1 million bulk RNA-seq
observations with 978 landmark genes of level 3. We performed data
cleaning using the following criteria: deleted insufficient compound
conditions(observations < 5); removed invalid compound SMILES which
were not successfully parsed by RDKit; assigned each perturbed
observation an unperturbed observation and removed the unpaired
observations. Finally, we obtained 175,549 cov_compounds_dose_name
condition (unperturbed-perturbed pair), which contains 82 cell lines
and 17,202 compounds.
The sci-Plex3 dataset
The sci-Plex3 study^[220]2 was subseted to 290,888 cells. We performed
data cleaning using the following criteria: randomly subsampled the
dataset to half size; selected 5000 highly variable genes (HVGs) using
Scanpy; removed invalid molecule SMILES which could not successfully be
parsed using RDKit^[221]31; assigned each perturbed observation an
unperturbed observation and remove the unpaired observations. Finally,
we obtained 2244 cov_drug_dose_name condition (unperturbed-perturbed
pair), which contains three cancer cell lines (A549, MCF7, K562), which
were treated with 188 different compounds in 4 dosages (10, 100, 1000,
and 10000 nM) and vehicle for unperturbed cells.
The SCLC dataset
The unperturbed profiles of small cell lung cancer collected from CCLE
contain 5 cell lines, including NCI-H69, NCI-H526, NCI-H446, NCI-H209,
and NCI-H196. We performed data preprocessing using the following
criteria: selected 978 landmark genes of unperturbed profiles; and
filled in the valid gene expression with the mean expression value of
all genes in the current cell line. After the virtual screening, we
obtained 4272486 transcriptional profiles.
The compounds libraries
Six compound libraries were used in this study. The training compound
libraries are collected from the L1000^[222]1 and the sci-Plex3
datasets^[223]2. in silico screening, an FDA-approved library
(TargetMol, n = 935), an active compound library (Selleckchem,
n = 4158), a natural compound library (Herb, n = 30,456), and an
in-house drug-like compound library(n = 29,670) were used to screen the
positive chemicals. The sensitive compounds of cell lines are collected
from Genomics of Drug Sensitivity in Cancer (GDSC)
([224]https://www.cancerrxgene.org/) with z-score≥ 2.0.
The gene signature of diseases
The gene signature of diseases was collected from CRowd Extracted
Expression of Differential Signatures (CREEDS)^[225]43. The CREEDS
project annotated 839 diseases versus normal signatures from Gene
Expression Omnibus (GEO). We collected the gene signature of the
disease by removing the study with the intersected set of up- and
down-regulated genes and 12,328 genes and set less than 1 gene and
finally obtained 577 studies for 233 unique diseases. The gene
signatures of diseases were downloaded from [226]CREEDS.
The scRNA-seq data of pediatric AML patients cohort
We collected scRNA-seq data of paired pre- and post-chemotherapy whole
bone marrow samples from 13 pediatric AML patients who achieved disease
remission following chemotherapy, as described by Zhang et al.^[227]35.
These patients were treated with either a low-dose chemotherapy (LDC)
regimen or a standard-dose chemotherapy (SDC) regimen. The LDC regimen
consisted of cytarabine (10 mg/m^2) and mitoxantrone or Idarubicin,
administered concurrently with G-CSF (5 μg/kg). The SDC regimen
consisted of cytarabine (100 mg/m^2), daunorubicin, and etoposide. We
compiled the expression profiles of all 224,217 cells from the 13
patients and performed normalization and log transformation. Each cell
was annotated with patient information, cell type, and chemotherapy
regimen based on the information provided in the study. The LDC regimen
was annotated with cytarabine and mitoxantrone, while the SDC regimen
was annotated with cytarabine, daunorubicin, and etoposide. Through
highly variable gene analysis, we retained 33,538 highly variable genes
for training. The scRNA-seq data of patients were downloaded from the
Genome Sequence Archive for Humans at the BIG data center, Beijing
Institute of Genomics, Chinese Academy of Sciences, and China National
Center for Bioinformation under accession number ([228]OMIX005223).
The data of A549 and MCF7 cell strains
We collected expression profiles for 27 MCF7 strains and 23 A549
strains from Ben-David et al.^[229]34. Expression profiles for 978
L1000 landmark genes were used for each strain. We included 35
compounds from the chemical screen in the study^[230]34, retaining 33
after removing those that RDKit could not generate to FCFP
fingerprints. We also collected the drug response to 35 active
compounds of MCF7 strains in the chemical screen, including decreasing
(active), decreasing (weakly active), and inactive.
Statistics and reproducibility
All samples discussed in this manuscript (cell viability assays and
perturbation profile atlas generation) were measured and analyzed as
technical triplicates. No data were excluded from the analyses.
Reporting summary
Further information on research design is available in the [231]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[232]Supplementary Information^ (7.2MB, pdf)
[233]Peer Review File^ (4.3MB, pdf)
[234]41467_2024_53457_MOESM3_ESM.pdf^ (30KB, pdf)
Description of Additional Supplementary Files
[235]Supplementary Data 1^ (22.4KB, xlsx)
[236]Supplementary Data 2^ (23.1KB, xlsx)
[237]Supplementary Data 3^ (14.5KB, csv)
[238]Supplementary Data 4^ (990.3KB, csv)
[239]Reporting Summary^ (76.6KB, pdf)
Source data
[240]Source Data^ (2.7MB, zip)
Acknowledgements