Abstract

   Understanding transcriptional responses to chemical perturbations is
   central to drug discovery, but exhaustive experimental screening of
   disease-compound combinations is unfeasible. To overcome this
   limitation, here we introduce PRnet, a perturbation-conditioned deep
   generative model that predicts transcriptional responses to novel
   chemical perturbations that have never experimentally perturbed at bulk
   and single-cell levels. Evaluations indicate that PRnet outperforms
   alternative methods in predicting responses across novel compounds,
   pathways, and cell lines. PRnet enables gene-level response
   interpretation and in-silico drug screening for diseases based on gene
   signatures. PRnet further identifies and experimentally validates novel
   compound candidates against small cell lung cancer and colorectal
   cancer. Lastly, PRnet generates a large-scale integration atlas of
   perturbation profiles, covering 88 cell lines, 52 tissues, and various
   compound libraries. PRnet provides a robust and scalable candidate
   recommendation workflow and successfully recommends drug candidates for
   233 diseases. Overall, PRnet is an effective and valuable tool for
   gene-based therapeutics screening.

   Subject terms: Virtual screening, Computational models, High-throughput
   screening, Virtual drug screening
     __________________________________________________________________

   Understanding transcriptional responses to chemical perturbations is
   crucial for drug discovery. Here, authors present PRnet, a deep
   generative model that predicts gene responses to novel chemical
   perturbations, enabling in-silico drug screening and the identification
   of candidate compounds for various diseases.

Introduction

   Transcriptional responses to chemical perturbations reveal fundamental
   insights into biological functioning and play an integral role in both
   disease understanding and drug discovery. Bulk and single-cell
   RNA-sequencing (scRNA-seq) experiments facilitate the high-throughput
   screen (HTS) of chemical perturbations at the omics level. Recent HTS
   studies^[44]1,[45]2 have experimental profiled thousands of independent
   perturbations exposing cells or cell lines to compounds. These
   transcriptional responses to chemical perturbations revealed coherent
   interpretable gene-level programs representing individual and cellular
   processes and quantified them in response to chemical perturbation.
   Although encouraging progress has been made, experimentally screening
   chemical perturbations remains a time-consuming and expensive process
   with a low discovery rate of new therapies^[46]3. It is unfeasible to
   conduct an exhaustive exploration of the vast, novel chemical
   perturbation space by experimentally screening disease and compound
   combinations.

   In the past decades, deep learning-based methods have emerged as
   important tools for modeling transcriptional responses to
   perturbations. Many approaches have been proposed recently for modeling
   HTS perturbation responses. CPA^[47]4 utilized an auto-encoder-based
   model to map chemical induce transcriptomic effect into a latent space
   to reconstruct perturbation response. Biolord^[48]5, scGen^[49]6, and
   scVIDR^[50]7 performed counterfactual prediction via deep
   encoder-decoder/generator-based generative frameworks, which take a
   control cell and unseen labels as input to predict the gene expression
   of the unseen cellular states. These recently published cell
   perturbation modeling tools precisely simulated chemical perturbations
   and predicted chemical perturbations in gene expression of unseen cell
   types, but few predict responses to novel chemicals. Following
   CPA^[51]4, chemCPA^[52]8 introduced a new encoder-decoder architecture
   that incorporated the compounds’ structure to predict the
   perturbational effects of unseen drugs. Deep generative frameworks with
   encoder-decoder and its variants effectively predicted single-cell gene
   expression perturbations. CellOT and CINEMA-OT^[53]9,[54]10 leveraged
   optimal transport to match paired unperturbed-perturbed observations.
   Optimal-transport-based matched the observations experimentally
   perturbed but were incapable of modeling novel perturbations, such as
   novel compounds or novel cell types. Linear regression-based
   methods^[55]11,[56]12 estimate the impact of perturbations on gene
   expression by linearly combining the effects of genetic perturbation.
   However, this linear combination approach leads to limitations in
   accurately modeling the nonlinear chemical perturbations across diverse
   cell types and compound combinations. GEARS and
   CellOracle^[57]13,[58]14 leveraged a knowledge graph of gene-gene
   relationships to predict genetic perturbation outcomes. However,
   graph-based models relied on accurate prior knowledge leading to the
   lack of scalability. Given that most diseases are associated with
   characteristic gene expression profiles, Connectivity Map (CMap)^[59]15
   proposed a concept that connected genes, drugs, and diseases by virtue
   of common gene-expression signatures, leading to projects such as
   CMap^[60]15, L1000^[61]1, etc. Inspired by this concept, DLEPS^[62]16,
   OCTAD^[63]17 and other studies (such as^[64]18–[65]20, etc.) utilized
   gene signature matching methods to screen candidates by finding drugs
   that reverse the disease signature. DLEPS^[66]16 predicted chemically
   induced changes in transcriptional profiles directly from molecular
   structure, and OCTAD^[67]17 virtually screened compounds by matching
   the cancer-specific expression signature to compound-induced gene
   expression profiles. Although DLEPS and OCTAD screened candidates
   effectively based on bulk HTSs, they failed to predict
   cell-type-specific transcriptional response to novel perturbations and
   model cellular heterogeneity, which is highly relevant to treatments.
   Consequently, perturbations response models are required to address the
   limited exploration power to novel perturbations of existing
   experimental and computational methods, which are also needed for
   predicting the response to unseen perturbations and discovering
   promising therapeutic drug candidates. Deep generative models,
   including Generative Adversarial Network (GAN^[68]21), Variational
   Auto-Encoder (VAE^[69]22), Denoising Diffusion Probabilistic Model
   (Diffusion^[70]23), Normalization Flow (NF^[71]24), Generative
   Pre-Trained Transformer (GPT^[72]25), and so on, learn the probability
   density of observable samples and generate new samples. Deep generative
   models have greatly improved diverse areas (for example, natural
   language processing^[73]25,[74]26, computer vision^[75]27,[76]28,
   chemicals^[77]29, and so on), suggesting the potential for applications
   in drug discovery.

   Here we present PRnet, which is a flexible and scalable
   perturbation-conditioned deep generative model for predicting
   transcriptional responses to novel chemical perturbations that were
   never experimentally perturbed at bulk and single-cell levels. PRnet is
   a new encoder-decoder architecture based generative model which
   comprises three components, including the Perturb-adapter, the
   Perturb-encoder, and the Perturb-decoder. PRnet adapts novel compounds
   and diseases in various perturbation scenarios by taking compound
   structures and unperturbed transcriptional profiles as input to predict
   transcriptional responses. The Perturb-adapter uses simplified
   molecular-input line-entry system^[78]30 (SMILES) chemical encoding as
   input, enabling generalization to unseen compounds without prior
   knowledge and annotation. The learnable latent space of PRnet
   facilitates gene-level response interpretation and capturing
   heterogeneity. PRnet was trained with close to one hundred million bulk
   HTS observations perturbed by 175,549 compounds and tens of millions
   single cell HTS observations perturbed by 188 compounds. Crucially, the
   model operates as a data-driven model, allowing for effective
   generalization to novel perturbations. The evaluation indicated that
   PRnet outperformed alternative approaches in predicting changes and
   expression in transcription response to novel compounds, pathways, and
   cell lines in bulk and single-cell HTS data. To further validate the
   effectiveness, PRnet has been utilized to identify novel bioactive
   compounds against small cell lung cancer (SCLC) and seek novel natural
   compounds against colorectal cancer (CRC). Experimental validation
   demonstrated the activity of novel candidate compounds against SCLC and
   CRC cell lines within the appropriate predicted ranges of
   concentration.

   This model’s flexibility and scalability make it a valuable tool to
   screen candidates for various diseases. We, therefore, leveraged PRnet
   to in silico screen various compound libraries and generated a virtual
   large integration atlas of perturbation profiles, covering 88 cell
   lines and 52 tissues, as well as compound libraries comprising 935
   FDA-approved drugs, 4158 active compounds, 30,456 natural compounds,
   and 29,670 drug-like compounds. PRnet also provided a robust and
   scalable candidate recommendation workflow for diseases according to
   the reference changes in gene sets. Given disease-specific or sensitive
   compound gene signatures, gene set enrichment analysis (GSEA) is
   employed to assess the potential efficacy of compounds against these
   diseases. PRnet successfully recommended 577 drug candidate lists from
   577 studies for 233 different diseases based on the profiles atlas. In
   three cases of metabolic disorders, including non-alcoholic
   steatohepatitis (NASH), polycystic ovary syndrome (PCOS) patients, and
   inflammatory bowel disease (IBD), drugs recommended by PRnet have been
   supported by previous literature with human or animal studies. PRnet
   enables effective prediction of transcriptional responses to novel
   complex chemical perturbations and screening large-scale compound
   libraries for specific diseases, thus being a valuable tool for
   gene-based therapeutics screening.

Results

PRnet: perturbation-conditioned deep generative model for transcriptional
response prediction

   In this paper, we formulate the transcriptional response prediction as
   a distribution generation problem conditioned on perturbations. Cells
   inherently recognize and respond to chemical stimuli, with their
   responses influenced by both external stimuli and their intracellular
   state. In the single cell or bulk HTS, transcriptional responses to
   chemical perturbations are affected by multiple conditions, such as
   compound structures, compound dosages, and covariates, such as cell
   types, and cell lines. Given an n-dimensional unperturbed
   transcriptional profile (a single cell or bulk RNA-seq observation) and
   a chemical perturbation imposed by a compound with a specific structure
   and dosage, PRnet aims to predict the distribution of the perturbed
   transcriptional profile. Modeling perturbation patterns enables
   capturing the novel chemical perturbation effect on gene-level programs
   and quantifying them in various perturbation scenarios (Fig. [79]1a).

Fig. 1. Overview of the methodological approach.

   [80]Fig. 1
   [81]Open in a new tab

   a Problem formulation: Given unperturbed transcriptional profiles
   (single cell or bulk) and applied perturbations (structures and the
   dosages of the compounds), predict transcriptional responses. Red
   arrows indicate changes in transcriptional profiles. b Model
   architecture: PRnet is a perturbation-conditioned deep generative model
   for transcriptional response prediction with three components,
   including the Perturb-adapter, the Perturb-encoder, and the
   Perturb-decoder. Crucially, the model operates as a data-driven model,
   allowing for effective generalization to novel perturbations. c
   Screening candidates with PRnet: The training compound library
   comprises a bulk high-throughput screening library (175,549 bioactive
   compounds), and a single-cell high-throughput screening library (188
   active compounds). The screening library comprises four compound
   libraries, including 935 FDA-approved drugs, 4158 active compounds,
   30,455 natural compounds, and 29,670 in-house druglike compounds,
   respectively. For in-silico screening, PRnet initially predicts the
   average transcriptional profile, fold-change in the gene, and gene rank
   after perturbing the specific cell line with screening compounds. Given
   the gene signature for a particular disease or sensitive drug, the
   enrichment scores of screening compounds were computed for compound
   ranking. In vitro validation experiments were conducted using MTT
   assays on colorectal cancer and small cell lung cancer cell lines to
   validate the activity of compound candidates. After in vitro
   validation, PRnet is utilized to recommend drug candidates for 233
   different diseases. d The large-scale integration atlas of perturbation
   profiles, including the L1000 dataset, the Sci-plex3 dataset, the
   FDA-approved drugs dataset, the anti-cancer compounds dataset, the
   natural compounds dataset, the drug-like compounds dataset, and the
   Gtex dataset. Some icons were created in BioRender. Qi, X. (2024)
   [82]biorender.com/l45e810.

   PRnet (as detailed in “Methods”) is a flexible and scalable
   perturbation-conditioned deep generative model designed to predict
   transcriptional responses to novel complex chemical perturbations at
   bulk and single-cell levels. The design of PRnet comprises three
   components, including the Perturb-adapter, the Perturb-encoder, and the
   Perturb-decoder (Fig. [83]1b). In preprocessing, each perturbed profile
   is assigned an unperturbed transcriptional profile of the same cell
   line. Initially, given a chemical perturbation imposed by the structure
   of compounds represented by Simplified Molecular Input Line Entry
   System^[84]30 (SMILES) strings and the dosages of the corresponding
   compounds, PRnet leverages RDKit^[85]31 to capture the functional
   topology information of the structures, generate the Functional-Class
   Fingerprints (FCFP) of compounds. These FCFPs were scaled by their
   dosages and then summed to generate rescaled Functional-Class
   Fingerprints (rFCFP) embedding. Without loss of generality, for the
   i-th compound perturbation, the Perturb-adapter encodes the fingerprint
   rFCFP[i] to an additive latent embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   which allows generalization to novel compounds and compound
   combinations. Then, the Perturb-encoder maps the chemical perturbation
   effect on heterogeneous unperturbed states
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   into the interpretable latent space
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup> :MATH]
   . At last, the Perturb-decoder estimates the distribution of the
   transcriptional response
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mo>∣</mo><msub><mrow><mi>μ</mi></
   mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi><
   /mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow><m
   o>)</mo></mrow> :MATH]
   within the context of the chemical perturbation effect on the
   unperturbed state
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup> :MATH]
   , the applied perturbation
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   and a noise
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>n</mi>
   </mrow></msubsup> :MATH]
   . PRnet encodes the chemical effect on the unperturbed state to a
   learnable latent space, estimates the distribution, and performs
   conditioned sampling to generate a transcriptional response with
   biological and chemical contexts. Sampling generates a specific
   transcriptional profile
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub> :MATH]
   that provides gene-level up- and down-regulation information. For bulk
   HTS data, the predicted transcriptional responses of 978 landmark genes
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub> :MATH]
   are transformed into 12,328 genes by linear transformation. For
   single-cell HTS data, 5000 highly variable genes (HVGs) of
   transcriptional profile are selected. SMILES^[86]30 is widely used for
   representing chemical structures due to its simplicity and efficiency
   in encoding complex molecules as strings. By taking SMILES of the
   compound as the input, the Perturb-adapter has sufficient flexibility
   to screen large-scale compound libraries without any prior knowledge.
   Driven by data, PRnet automatically identifies the heterogeneity in
   latent space corresponding to compound, dosage contexts, and cell-type
   specific contexts. This allows the model to directly generalize to
   novel perturbation scenarios involving novel compounds, pathways, cell
   types, and cell lines that have not been previously perturbed.

   With the ability to predict transcriptional responses to novel
   perturbations, PRnet enables efficient screening of candidates for
   complex diseases (Fig. [87]1c). Inspired by the assumption embodied in
   the CMap^[88]15 concept that gene signatures are used as indicators
   reflecting the underlying mechanisms of diseases, PRnet predicted new
   therapeutic candidates by finding drugs that reverse the disease
   signature. There are two steps to applying PRnet to downstream tasks.
   In step 1, for in silico screening, PRnet predicts the transcriptional
   profile of specific cell lines perturbed by a user-defined compound
   library with multiple gradient concentrations. In step 2, we calculate
   the average transcriptional profile and fold-change in gene expression
   of each compound, and rank genes according to their fold-change values.
   Then, given the query gene signature for a particular disease or known
   sensitive compounds, gene set enrichment analysis (GSEA^[89]32) was
   employed to evaluate compound efficacy with their enrichment scores.
   Finally, compounds are ranked based on these enrichment scores.
   Large-scale high-throughput screening data are initially fitted to the
   model, facilitating its adaptability to diverse compound libraries and
   diseases.

   PRnet was trained on two compound libraries and screened four compound
   libraries (Fig. [90]1c). The training compound libraries contain a bulk
   high-throughput screening library consisting of over 883,269
   transcriptional profiles of 175,549 bioactive compounds^[91]1, and a
   single-cell high-throughput screening library consisting of 290,888
   transcriptional profiles of 188 active compounds^[92]2. Being well
   trained in HTS observations, PRnet enables in silico high-throughput
   screening of novel compound libraries for various cell lines. PRnet has
   further been applied to screen active and drug-like compounds for SCLC
   and natural compounds for CRC. In vitro validation experiments with MTT
   assays confirmed the efficacy of the candidate compounds against SCLC
   and CRC cell lines. Lastly, PRnet screened four compound libraries and
   generated a large-scale integration atlas of perturbation profiles
   (Fig. [93]1d), including (1) 82 cell lines perturbed by 935
   FDA-approved drugs, (2) 88 cell lines perturbed by 4158 active
   compounds, (3) 14 CRC cell lines perturbed by 30,456 natural compounds,
   (4) 6 SCLC cell lines perturbed by 29,670 drug-like compounds and (5)
   54 tissues perturbed by 935 FDA-approved drugs. Based on the
   large-scale integration atlas of perturbation profiles, PRnet is
   capable of a variety of downstream applications. PRnet has been
   utilized to recommend drugs for 233 different diseases. PRnet
   successfully predicted drug candidates for these diseases and
   demonstrates the potential in drug discovery.

PRnet robustly predicted the response of novel perturbation and learned
interpretable latent embeddings

   To evaluate the performance of PRnet for predicting responses to unseen
   perturbations, all datasets were strictly split by perturbation
   attributes (compound_split, cell_line_split, and pathway_split) into 3
   subsets: train, validation, and test (Supplementary Fig. [94]1a). The
   held-out test sets were used to simulate datasets of novel
   perturbations. Three train-test data split strategies were employed to
   assess the performance of out-of-distribution perturbation scenarios,
   including (1) Random Split: randomly divides compounds and cell lines,
   (2) Unseen Compounds: testing compounds not seen perturbed during
   training, and (3) Unseen cell lines: testing cell lines not seen
   perturbed during training. Five-fold cross-validation was applied in
   each split strategy, and the average performance over five folds was
   computed as the overall metric for comparison. Two high-throughput
   screening data of different resolutions were used to test model
   performance, consisting of a bulk HTS dataset (from the L1000
   project^[95]1) and a single cell HTS dataset (from the sci-Plex3
   assay^[96]2). All models were trained and compared separately on two
   HTS datasets.

   We employed bulk high-throughput screening data from the L1000
   project^[97]1 to fit the model, in which 978 genes (hereinafter called
   landmark genes) were selected to represent the diversity of biological
   pathways and processes in human cells. We first preprocessed these data
   (as detailed in “Methods”) and obtained 836,352 paired bulk RNA-seq
   observations (represented by the expression levels of 978 landmark
   genes), covering 82 cell lines and their perturbation data perturbated
   by 175,549 compounds. To quantitatively evaluate the compound-induced
   gene expression changes, we compared the Pearson correlation between
   the true and predicted post-perturbation of the average logarithm of
   fold-change in gene expression (log(FC)) for the hold-out test set with
   alternative approaches. The “Pearson of log(FC) in compounds” metric
   evaluated the Pearson correlation between the true and predicted mean
   log(FC) perturbed by the same compound in the test set. We demonstrated
   the performance of the PRnet on the bulk HTS data in Fig. [98]2a, where
   a higher value indicates a better performance. PRnet consistently
   demonstrated the best performance across all three split strategies,
   particularly well-fitted in unseen compound prediction scenarios with
   an average Pearson Correlation (PCC) of 0.8. PRnet significantly
   outperformed in predicting unseen cell line log(FC) with an increase in
   PCC over 0.3 compared to other approaches. Some hold-out predicted
   cases in predicting transcriptional responses of unseen compounds and
   cell lines were illustrated in Supplementary Fig. [99]1b–d. In more
   challenging scenarios, we evaluated the performances of predicting
   cell-line-specific compound-induced changes in genes by the “Pearson of
   log(FC) in cov_compounds” metric. The “Pearson of log(FC) in
   cov_compounds”﻿ metric is the Pearson correlation between the true and
   predicted mean log(FC) perturbed by the same compound within the same
   cell line in the test set. PRnet achieved the best performance in the
   “Pearson of log(FC) in cov_compounds” metric across three scenarios. In
   particular, PRnet exhibited more than two times better performance in
   unseen cell line predictions compared to other methods and demonstrated
   improvements of 0.16 in unseen compound predictions, demonstrating the
   generalization of PRnet to novel perturbations (Fig. [100]2b).

Fig. 2. PRnet outperforms alternative approaches in predicting
post-perturbation change in gene expression.

   [101]Fig. 2
   [102]Open in a new tab

   a Performance of out-of-distribution perturbation scenarios in three
   train-test data split strategies (1) Random split, (2) Unseen compounds
   split, and (3) Unseen cell lines split. The “Pearson of log(FC) in
   compounds” metric evaluates the mean log(FC) perturbed by the same
   compounds in the test set, which are presented as mean values  ± SD.
   Error bars indicate the standard error (SD) for each method, and the
   points (n = 5) refer to the 5-fold used for validation. b The “Pearson
   of log(FC) in cov_compounds” metric evaluated mean the (log(FC)) of
   compounds within the same cell line in the test set. Results are
   presented as mean values  ± SD (n = 5). c T-SNE representations of
   latent embeddings learned by PRnet. Post-perturbation transcriptional
   profiles are from 82 cell lines from 20 different organs/tissues. d The
   “R2 in compounds" metric evaluates the R2 score of the average gene
   expression perturbed by the same compounds in the test set. Results are
   presented as mean values ± SD (n = 5). e The “R2 in cov_compounds”
   metric evaluates the R2 score of the average gene expression perturbed
   by the same compounds within the same cell line. Results are presented
   as mean values  ± SD (n = 5). f T-SNE representation of latent
   embeddings from massive single-cell screens of 188 drugs across 3
   cancer cell lines. g T-SNE representation of latent embedding of
   transcriptional profiles from MCF7 cell line perturbed with AG-14361.
   The pseudo-dose trajectory is displayed with a black line. [103]Source
   data are provided as a Source Data file.

   To better characterize the heterogeneous gene-level change under
   certain perturbations, it is desirable to identify the set of cells or
   cell lines and isolate the precise variations enriched in the data from
   the corresponding cell lines or cells. After training, PRnet learned
   interpretable embeddings in the latent space in the context of the base
   unperturbed state and the applied perturbation. The low-dimensional
   (t-SNE^[104]33) representation (Fig. [105]2c) illustrates the latent
   embeddings of post-perturbation transcriptional profiles learned by
   PRnet. In the latent space, embeddings from the same cell line tend to
   form a cluster together. Each cancer cell lines form specific
   gene-level responses for corresponding perturbations. In a way, PRnet
   captures strong cell line-specific transcriptional profile variations
   under different conditions. Interestingly, we observed that the
   embedding learned by PRnet also represents cell line similarity in
   response to various perturbations. Figure [106]2c illustrates the t-SNE
   representation of the latent embeddings for all cell lines, where cell
   lines originating from the same organ exhibit a similarity preference
   in the latent space, resulting in close spatial locations, such as cell
   lines from the colon, breast, and lung. Supplementary Fig. [107]6a–c
   demonstrates in detail the low-dimensional (t-SNE) representation of
   the latent embeddings for cell lines from colorectal cancer, breast
   cancer, and lung cancer, respectively. To quantitatively evaluate the
   similarity of perturbations among cell types, we computed the
   normalized cosine similarity among the mean embeddings in the latent
   space for all cell lines. The resulting cosine similarity heatmap
   (Supplementary Fig. [108]6g) shows that most cell lines from the same
   tissue exhibit higher similarity in the latent space compared to those
   from different tissues.

   Human cancer cell lines have facilitated drug discoveries in cancer
   biology, but they are neither clonal nor genetically stable. This
   instability can generate variability in drug sensitivity, as shown by
   Ben-David et al.^[109]34. The genomic evolution of cancer cell lines
   leads to a high degree of variation across cell line strains. To
   explore the impact of inter- and intra-heterogeneity of cell lines on
   drug responses, we conducted additional experiments focusing on A549
   and MCF7 cell strains, as described by Ben-David et al.^[110]34 in
   Supplementary Note [111]2. These experiments were designed to explore
   inter- and intra-cellular heterogeneity, similar patterns of the same
   MOA drug across cell strains, and screening candidate compounds for
   heterogeneous cells. These results indicated that drug response was
   highly similar to genetic or gene expression which aligned with
   findings in the original study. These findings underscore the utility
   of our model in capturing both inter- and intra-heterogeneity,
   enhancing its application in screening for specific cancer strains.

   PRnet adapted to model profiles of HTS at different resolutions. The
   sci-Plex assay^[112]2 screened 188 compounds in 3 cancer cell lines at
   single-cell resolution, measuring millions of cells. The screened cell
   lines A549 (lung adenocarcinoma), K562 (chronic myeloid leukemia), and
   MCF7 (breast adenocarcinoma) were treated to each of these 188
   compounds at four dosages (10 nM, 100 nM, 1 μM, 10 μM). To
   quantitatively evaluate the performance of PRnet in single-cell HTS
   data, we followed commonly used metric^[113]4–[114]6 to compare the R^2
   score between the true and predicted post-perturbation gene expression
   for the hold-out test set with alternative approaches (Fig. [115]2d,
   e). The “R^2 in compound”﻿ metric evaluated the R^2 score of the
   average post-perturbation gene expression of the same compound in the
   test set. We also compared the cell-type-specific performance “R^2 in
   cov_compound”﻿ metric, which evaluated the R^2 score of the average
   post-perturbation of the same compound within each specific cell line.
   PRnet outperformed alternative models in the “R^2 in the compound”﻿ and
   “R^2 in cov_compound”﻿ metrics in unseen compounds (R^2 in the
   compound: 0.969) and unseen pathways scenarios (R^2 in the compound:
   0.97) than the other methods. The low-dimensional (t-SNE)
   representation illustrates latent embeddings learned by PRnet from
   massive single-cell screens of 188 compounds across 3 cancer cell lines
   (Fig. [116]2f). The latent embeddings automatically cluster the
   concordance of cells to their cell types. The results demonstrated that
   PRnet not only captures the heterogeneous responses of large-scale HTS
   but also resolves similar responses of homogeneous cells to various
   perturbations. The t-SNE embedding (Fig. [117]2g) in the latent space
   of MCF7 cells treated with AG-14361 produces a pseudo-dose trajectory,
   indicating that AG-14361 induced heterogeneous responses. Several other
   pseudo-dose trajectory examples were illustrated in Supplementary
   Fig. [118]6d–f.

   To validate the clinical relevance of PRnet, we incorporated scRNA-seq
   data from a cohort of pediatric acute myeloid leukemia (AML)
   patients^[119]35. We trained PRnet to predict transcriptional responses
   to chemotherapy treatments and validated its potential for clinical
   application. PRnet showed robustness in predicting transcriptional
   responses for pediatric AML patients post-chemotherapy, demonstrating
   PRnet’s potential to assist in clinical applications. For detailed
   experimental results, please see Supplementary Note [120]1 and
   Supplementary Fig. [121]8.

PRnet captures gene programs in various perturbation scenarios

   We reason that PRnet can be a valuable tool to analyze and capture
   functional transcriptional experiments that aim to uncover gene
   programs or effects on various conditions by introducing perturbation
   to the system. To investigate this in greater detail, we first
   collected and tested the hold-out perturbation data of compound
   vorinostat. Vorinostat (suberoylanilide hydroxamic acid) is the first
   FDA-approved HDAC inhibitor for the treatment of cutaneous
   manifestations of cutaneous T-cell lymphoma (CTCL) and is also
   currently being studied as monotherapy and in combination therapy for
   other types of cancers^[122]36,[123]37. Results demonstrated a
   comprehensive ability of PRnet to capture gene-level fold changes in
   all of the 71 cell lines (Fig. [124]3a). By comparing the
   transcriptional responses of cell lines from different organs, we
   observed that PRnet captured cell-type-specific responses. For
   instance, cell lines from muscle exhibited relatively weaker responses
   and smaller fold changes compared to those from the lung.
   Figure [125]3a shows that PRnet correctly captures both the right trend
   and the magnitude of perturbation of the top up- and down-regulation
   genes across 71 cell lines from 16 tissues/organs. Taking gene FAM57A
   and TP53 as examples, PRnet made accurate predictions in cases of both
   up- and down-regulation in all cell lines after perturbation. In
   addition, PRnet even correctly predicted the fold change values in the
   expression of FAM57A across all cell lines. Figure [126]3b shows a
   detailed comparison between the predicted and actual distributions of
   fold changes in gene expression for three representative cell lines
   from different organs (HT29: colorectal adenocarcinoma, A549: lung
   adenocarcinoma and MCF7: breast adenocarcinoma,) treated with
   vorinostat. It can be observed that PRnet aligns consistently with the
   distribution of predicted and true observations and accurately predicts
   the up- and down-regulation trends of the top 5 genes with high log(FC)
   values. We employed the KEGG pathway Gene Set Enrichment Analysis
   (GSEA) on the average predicted post-perturbations gene rank of
   Vorinostat across all cell lines. The GSEA results (Fig. [127]3c, d and
   Supplementary Fig. [128]3a) reveal that Vorinostat is enriched in
   pathways related to fundamental cellular processes related to tumor
   suppressor mechanisms. The GSEA results indicated that Vorinostat
   suppresses pathways such as Cell cycle, DNA replication, and
   Spliceosome, and activates pathways including Autophagy - animal,
   Lysosome, Phagosome, and so on, which are all associated with autophagy
   and apoptosis in tumor cells^[129]38.

Fig. 3. PRnet captures gene programs in various perturbation scenarios.

   [130]Fig. 3
   [131]Open in a new tab

   a The heatmap illustrates the average log(FC) of the gene expression
   profiles predicted by PRnet and the ground truth for 71 cell lines
   under the drug Vorinostat. The horizontal axis represents genes, while
   the vertical axis represents cell lines, with the color orange
   indicating upregulation and the color green indicating downregulation.
   b The violin plots illustrate the distribution of log(FC) for the top 5
   up-and down-regulated genes in multiple perturbations of cell lines,
   including A549, HT29, and MCF7, from three cancer types treated with
   the drug Vorinostat. c The GSEA results of average post-perturbation
   changes in expression of 71 cell lines perturbed by Vorinostat. GSEA
   was performed for KEGG pathway enrichment. Enriched terms were
   identified by adjusted p-values (p.adjust)  < 0.05 (Benjamini-Hochberg
   method). d The category net plot visualizes the functional enrichment
   result of Vorinostat with suppressed pathways colored in blue and
   activated pathways colored in red. e The box plots illustrate the
   log(FC) of the top 20 up-and down-regulated genes in cell lines HT29
   with the drug bortezomib (n = 865 observations), MG-132 (n = 892
   observations), and wortmannin (n = 207 observations), respectively. The
   middle line in the box plot, median; box boundary, IQR; whiskers,
   1.5 × IQR; minimum and maximum, not indicated in the box plot; gray
   dots, points beyond the minimum or maximum whisker. f The box plots
   illustrate the expression of the 10 different expression genes of cell
   lines A549 (n = 450 cells), K562 (n = 430 cells), and MCF7 (n = 988
   cells) perturbed by the drug GSK-LSD1, respectively. The middle line in
   the box plot, median; box boundary, IQR; whiskers, 1.5 × IQR; minimum
   and maximum, not indicated in the box plot; gray dots, points beyond
   the minimum or maximum whisker. [132]Source data are provided as a
   Source Data file.

   To demonstrate the generalization of PRnet, we also analyzed
   perturbation observations of other hold-out compounds on a gene-by-gene
   basis. We collected and tested some case perturbation data of HT29 with
   the most observations. Figure [133]3e illustrates the log(FC) of the
   top 20 up- and down-regulated genes in multiple perturbations of HT29
   cell lines treated with the hold-out compounds bortezomib, MG-132, and
   wortmannin, respectively. These results suggested that PRnet is able to
   capture regulated gene level information that is consistent with
   evidence from corresponding compounds in inferring different cancer
   transcriptional profile conditions that can be missed by perturbations
   analysis. The predicted distribution of fold changes in gene expression
   after perturbation closely aligned with the actual observed
   distribution, indicating the accuracy of PRnet in capturing the
   perturbation effects. More predicted gene-level perturbation responses
   of cell lines of breast cancer exhibited similar performance
   (Supplementary Fig. [134]2a). Besides, Fig. [135]3f demonstrates the
   ability of PRnet to predict cell-type-specific gene-level perturbation
   transitions in single-cell HTS observations. In the case of predicting
   the response of perturbing A549, K562, and MCF7 cell lines with
   GSK-LSD1, PRnet correctly captured both the right trend in gene
   expression and the magnitude of response across all 10 differentially
   expressed genes (Fig. [136]3f). Similar performance was observed for
   several other examples across perturbation conditions (Supplementary
   Fig. [137]4a–c). The ability to capture changes in gene-level programs
   under different compound conditions and resolutions indicates the
   robustness, generalization, and precise performance of PRnet in
   predicting perturbation responses.

PRnet identified active compounds against small-cell lung cancer

   Having been trained to simulate experimental measurements of
   high-throughput screening, PRnet was applied to identify potential
   novel compound candidates for the treatment of small cell lung cancer
   (SCLC). SCLC is an extremely aggressive lung cancer characterized by
   small cells with limited cytoplasm forming clusters or
   spheroids^[138]39. Despite an initial positive response to conventional
   chemotherapy and radiation, SCLC often recurs rapidly, with less than
   5% of patients surviving five years. Currently, highly effective
   treatment options for this disease remain unresolved, making drug
   development efforts for this cancer a high priority.

   Given several novel cell lines of SCLC, namely NCI-H69, NCI-H526,
   NCI-H446, NCI-H209, and NCI-H196, and DMS114, we first employed PRnet
   to predict the transcriptional response of SCLC cell lines to sensitive
   compounds. Then, we in silico screened two user-defined compound
   libraries to identify potential compound candidates against SCLC
   (Fig. [139]4a), namely an active compound library (4158 compounds) from
   Selleckchem, and an in-house druglike compound library (29,670
   compounds). Through in silico screening, PRnet predicted the
   transcriptional responses of each compound across eight concentration
   gradients perturbing 6 cell lines, with each scenario repeated three
   times for computational robustness and calculated gene rank of changes
   in the average post-perturbation expression of each compound. After
   that, the predicted up/downregulated genes of sensitive compounds on
   their cell lines were used as the GSEA gene signature input to
   calculate the enrichment scores. We then performed GSEA to calculate
   the enrichment scores of compounds in libraries and ranked them
   according to their scores (Supplementary Datas [140]1, [141]2).
   Ultimately, three compounds ((+)-Fangchinoline, (+)-JQ-1, and
   SEL120-34A HCl) at the top rank (rank ≤ 3 for enrichment score of
   sensitive compounds) were chosen as the candidate set (Fig. [142]4b).
   Among them, it has been proved that small cell lung cancer (SCLC) cells
   are exquisitely sensitive to growth inhibition by the BET inhibitor
   (+)-JQ-1 (CAS No. : 1268524-70-4)^[143]40, and (+)-Fangchinoline and
   SEL120-34A HCl were assessed experimentally. We also explored the
   suitable activity concentrations of candidate compounds by calculating
   the enrichment scores of perturbed transcriptional profiles at
   concentration gradients. As shown in Fig. [144]4c, d, concentrations in
   the range of 1–10 μmol/L might be the proper inhibitory concentration
   for these candidate compounds.

Fig. 4. The PRnet-identified candidates against small cell lung cancer and
colorectal cancer.

   [145]Fig. 4
   [146]Open in a new tab

   a PRnet predicted the perturbed transcriptional profiles of 6 SCLC cell
   lines and 14 CRC cell lines. For in silico screening, PRnet first
   predicted the transcriptional profiles of SCLC cell lines perturbed by
   a multi-concentration gradient of 4158 active compounds, as well as
   29,670 in-house druglike compounds. PRnet also predicted the
   transcriptional profiles of 14 CRC cell lines perturbed by 30,456
   natural compounds. Then, post-perturbation fold-changes and average
   fold-changes of compounds across cell lines are computed. The gene
   ranking is performed based on the fold-change values. Given the
   predicted gene signature of sensitive compounds, the model computes the
   enrichment scores for up- and down-regulated gene sets of screening
   compounds. Finally, compounds are ranked based on the enrichment
   scores. For in vitro testing, MTT assays were performed for the
   evaluation of cell viability. b The heatmap illustrates the enrichment
   score of candidates for sensitive compounds. c, d The line charts plot
   the enrichment score of SCLC cells exposed to (+)-Fangchinoline and
   SEL120-34A HCl. e, f The cell survival curve of SCLC cells exposed to
   (+)-Fangchinoline and SEL120-34A HCl (n = 3 replicates). IC[50]
   (half-maximal inhibitory concentration) are presented as mean values
    ± SD. g, h The cell survival curve of CRC cells exposed to
   7-Methoxyrosmanol and Mulberrofuran Q (n = 3 replicates). IC[50] are
   presented as mean values  ± SD. [147]Source data are provided as a
   Source Data file. Some icons were created in BioRender. Qi, X. (2024)
   [148]biorender.com/c85y849.

   We utilized MTT assays to examine the activities of compound candidates
   against SCLC cells. Six human SCLC cell lines (NCI-H69, NCI-H526,
   NCI-H446, NCI-H209, and NCI-H196, and DMS114) were chosen for
   experiments. Results revealed the activity of SEL120-34A HCl (CAS No. :
   1609452-30-3) and (+)-Fangchinoline (CAS No. : 436-77-1) against small
   cell lung cancer (SCLC) cell lines. SEL120-34A HCl and
   (+)-Fangchinoline showed significant inhibitory effects on the
   proliferation of SCLC cells, and exhibited an IC[50] (half-maximal
   inhibitory concentration) of less than 10 μmol/L, indicating their
   inhibitory effect on SCLC cell viability (Fig. [149]4e, f).
   (+)-Fangchinoline and SEL120-34A HCl moderately inhibited the viability
   of SCLC cell lines. Detailed antiviability activities of two candidates
   are shown in Supplementary Table [150]3. These findings suggested the
   potential therapeutic efficacy of SEL120-34A HCl and (+)-Fangchinoline
   in the context of SCLC treatment, highlighting their promising role as
   active compounds against this aggressive lung cancer subtype.

PRnet found natural compounds against colorectal cancer

   Colorectal cancer (CRC) ranks as the third most common cancer and the
   second leading cause of cancer death worldwide^[151]41. Advances in
   molecularly targeted therapy and immunotherapy over the past decades
   have significantly improved patient survival^[152]41. However, some CRC
   patients, initially responsive to these treatments, quickly develop
   insensitivity. The emergence of drug resistance in cancer treatment
   significantly reduces patient outcomes. We aim to explore more
   potential novel treatment options for colorectal cancer to mitigate the
   impact of drug resistance through in silico screening of a large-scale
   library of natural compounds (30,456 compounds^[153]42).

   We extended the application of PRnet to screening novel natural
   compounds for the treatment of CRC. We first leveraged PRnet to predict
   the post-perturbation transcriptional profiles of sensitive compounds
   in 14 CRC cell lines (HT29, LOVO, MDST8, RKO, HT115, SW948, SNU1040,
   SNUC4, SNUC5, SW480, SW620, HCT116, CL34, and NCIH508). Then, we used
   the predicted up/downregulated genes of sensitive compounds on their
   cell lines as the GSEA gene signature. After that, we also in silico
   screened a large-scale natural compounds library containing 30,456
   natural ingredients. PRnet predicted the transcriptional responses of
   each compound across eight concentration gradients perturbing 14 cell
   lines with three repeats and calculated the post-perturbation average
   gene rank of compounds. Two natural compounds (7-Methoxyrosmanol and
   Mulberrofuran Q) at the top rank (rank ≤ 5 for enrichment score of
   sensitive compounds) were chosen as the candidate set (see
   Supplementary Tables. [154]1, 2).

   We utilized MTT assays to examine the activities of candidate compounds
   against CRC cells. Six human CRC cell lines, namely HT29, SW480,
   Caco-2, SW620, HCT116, and Colo205, were chosen for the experiments.
   The results revealed that 7-Methoxyrosmanol (CAS No. : 113085-62-4) and
   Mulberrofuran Q (CAS No. : 101383-35-1) inhibited the viability of CRC
   cell lines (Fig. [155]4e, f). 7-Methoxyrosmanol and Mulberrofuran Q
   moderately inhibited the viability of CRC cell lines. Detailed
   antiviability activities of two candidates are shown in Supplementary
   Table [156]3. These findings showed the activity of 7-Methoxyrosmanol
   and Mulberrofuran Q against CRC cell lines, suggesting their potential
   efficacy in CRC treatment.

PRnet generated a large-scale integration atlas of perturbation profiles

   With the ability to characterize specific gene-level perturbation
   responses and identify anti-cancer compounds, PRnet was applied to in
   silico screen novel compound libraries and cell lines and generate a
   large-scale integration atlas of perturbation profiles across various
   scenarios (Fig. [157]5a). PRnet was trained with two datasets: (1) the
   L1000 dataset, a bulk high-throughput screening library consisting of
   883,269 transcriptional profiles from 82 cell lines perturbed by
   175,549 biologically active compounds, and (2) the Sci-plex3 dataset, a
   single-cell high-throughput screening library consisting of 290,888
   transcriptional profiles from 3 cell lines perturbed by 188 active
   compounds. The L1000 dataset^[158]1 screened cell lines derived from
   over 20 diverse tissues and exposed to compounds targeting multiple
   genes and pathways (Supplementary Fig. [159]5a–c). The Sci-plex3
   dataset screened three cancer cell lines treated by 188 compounds
   targeting a diverse range of targets and molecular pathways, covering
   various mechanisms of action (Supplementary Fig. [160]5d–f). After
   training, PRnet was applied to screen various perturbation scenarios to
   generate a large-scale integration atlas of perturbation profiles.
   Through virtual screening, PRnet has predicted over 25 million
   post-perturbation expression profile atlas of perturbation profiles
   which consists of five parts: (1) the FDA-approved drugs dataset: a
   bulk virtual high-throughput screening library containing 1,891,330
   transcriptional profiles from 82 cell lines perturbed by 935
   FDA-approved drugs, (2) the anti-cancer compounds dataset: a bulk
   virtual high-throughput screening library containing 8,781,784
   transcriptional profiles from 88 cell lines perturbed by 4158 active
   compounds, (3) the natural compounds dataset: a bulk virtual
   high-throughput screening library containing 10,233,230 transcriptional
   profiles from 14 colorectal cancer cell lines perturbed by 30,456
   natural compounds, (4) the bioactive compounds dataset: a bulk virtual
   high-throughput screening library containing 4,272,486 transcriptional
   profiles from 6 small cell lung cancer cell lines perturbed by 29,670
   druglike compounds, and (5) the Gtex dataset: a bulk virtual
   high-throughput screening library containing 1,245,510 transcriptional
   profiles from 54 tissues perturbed by 935 FDA-approved drugs. Details
   of all cell lines are provided in Supplementary Data [161]3. PRnet
   offered a broad perspective by providing insights into how
   perturbations impact the transcriptional landscape on a large scale and
   extended its utility to diverse screening contexts. The large-scale
   integrated atlas of perturbation profiles can be applied to a variety
   of downstream application scenarios. For example, the FDA-approved drug
   dataset can be used for drug repositioning to recommend drugs for
   specific diseases based on gene signatures (see PRnet provided a robust
   and scalable drug 568 recommendation workflow based on the profiles
   atlas). The anti-cancer compounds dataset, the natural compounds
   dataset, and the bioactive compounds dataset are valuable for screening
   new anti-cancer compounds (see PRnet identified active compounds
   against small cell 430 lung cancer and PRnet found natural compounds
   against colorectal 482 cancer). In addition, the Gtex dataset can be
   useful to analyze the toxicity of compounds in different tissues. PRnet
   imports gene-level functionality in perturbations of different
   compounds and empowers flexibility by utilizing user-defined compounds’
   structures and transcription profiles to estimate the gene expression
   matrix. These profiles can be compared against various perturbation
   conditions(dosages, structures of compounds) using PRnet, to evaluate
   the impact of using different compounds on specific gene-expression
   profiles from single cell and bulk data. These diverse profiles
   provided potential solutions for drug discovery, disease treatment, and
   toxicity analysis.

Fig. 5. PRnet provided a robust and scalable recommend workflow on a large
integration atlas of perturbation profile.

   [162]Fig. 5
   [163]Open in a new tab

   a The large-scale integration atlas of perturbation profiles. b PRnet
   provided a recommended workflow for specific diseases. In step 1, given
   the structures of the screening library, PRnet predicts the
   post-perturbed transcriptional profiles of all compounds across
   multiple concentration gradients in 82 cell lines. In step 2, the
   transcriptional profiles of 978 landmark genes are transformed into
   12,328 genes through linear transformation. Subsequently, the
   perturbation fold-change and the average fold-change across cell lines
   for the transformed expression profile are computed. Gene ranking is
   then performed based on the fold-change values. In step 3, given the
   gene signature for a particular disease, the model computes the
   enrichment scores for up- and down-regulated gene sets of screening
   compounds. Finally, compounds are ranked based on the enrichment
   scores. c–e Scatter plots of the enrichment score of compounds against
   NASH, Crohn’s disease, and PCOS. The computationally predicted
   candidates are highlighted with their names in the upper right corner.
   The color gradient shows the density of the dots. [164]Source data are
   provided as a Source Data file. Some icons were created in BioRender.
   Qi, X. (2024) [165]biorender.com/z56i777.

PRnet provided a robust and scalable drug recommendation workflow based on
the profiles atlas

   PRnet provides a comprehensive drug recommendation workflow based on
   the large-scale integration atlas of perturbation profiles
   (Fig. [166]5b). In step 1, given the compounds’ structure of the
   screening library, PRnet predicts the post-perturbed transcriptional
   profiles of all compounds across multiple concentration gradients
   across cell lines. In step 2, the transcriptional profiles of 978
   landmark genes are transformed into 12,328 genes through linear
   transformation. Subsequently, the perturbation fold-change and the
   average fold-change across cell lines for the transformed expression
   profile are computed. The gene ranking is then performed based on the
   fold-change values. In step 3, given the gene signature for a specific
   disease, PRnet computes the enrichment scores for up- and
   down-regulated gene sets of screening compounds. Finally, compounds in
   the screening library are ranked based on these enrichment scores.
   These downstream applications demonstrated the versatility and utility
   of PRnet in addressing diverse challenges in the field of drug
   discovery and perturbation analysis.

   We leveraged the drug recommendation workflow to identify drug
   candidates for the treatment of 233 different diseases. For the
   user-defined compound library, we here chose the FDA-approved drug
   dataset, which comprises 935 FDA-approved drugs. All gene signatures of
   233 diseases versus normal were collected from the CREEDS project
   (CRowd Extracted Expression of Differential Signatures^[167]43). The
   up/downregulated genes of a specific disease were used as the GSEA gene
   signature input to calculate enrichment scores (see “Methods”) of
   drugs. Finally, we obtained 577 candidate lists from 577 studies for
   233 unique diseases (Supplementary Data [168]4). Taking three metabolic
   disorder diseases, including nonalcoholic steatohepatitis (NASH),
   inflammatory bowel disease (IBD), and polycystic ovary syndrome
   patients (PCOS) as examples, we provided enrichment scores of each drug
   for the three diseases and the recommended drug candidates for them.
   The enrichment scores are plotted in Fig. [169]5c–e, drugs positioned
   at the upper right corner (rank≤ 10 for enrichment score) were chosen
   as the candidate set for literature verification. Supplementary
   Table [170]4 lists the Top 10 enrichment scores for compounds against
   NASH, Crohn’s disease, and PCOS.

   Nonalcoholic steatohepatitis (NASH) is liver inflammation and damage
   caused by a buildup of fat in the liver without standard treatment and
   any well-established drug targets. After calculation, we ultimately
   selected a set of candidate drugs from the upper right corner (rank≤ 6
   for enrichment score) as candidate treatments for NASH (Fig. [171]5c).
   Literature verification revealed that Mirabegron (rank 1), Vidofludimus
   (rank 5), and Rifaximin (rank 6) have literature support for use in the
   treatment of NASH. A study on high-fat diet rats^[172]44 suggested that
   mirabegron may have a protective effect against NASH as it improves
   liver enzymes, lipids, serum HbA1c, fasting insulin and glucose,
   insulin resistance index, and serum adiponectin levels, as well as
   ameliorates hepatic histopathologic changes in NASH-induced rats. A
   study on obesity mice^[173]45 suggested vidofludimus reduced hepatic
   steatosis and inflammation in obesity mice and has been repurposed to a
   therapeutic potential in the treatment of NASH by targeting FXR based
   on the newly established relationships among drugs, targets, and
   diseases. Rifaximin therapy appeared to be effective and safe in
   modifying NASH through reduction of serum endotoxin and improvement of
   insulin resistance, proinflammatory cytokines, CK-18, and NAFLD-liver
   fat score. Patients with biopsy-proven NASH with Rifaximin therapy
   showed that Rifaximin appeared to be effective and safe in modifying
   NASH^[174]46,[175]47. We also performed KEGG pathway enrichment
   analysis of predicted up/down-regulated genes of Mirabegron,
   Vidofludimus, and Rifaximin (Supplementary Fig. [176]7). The KEGG
   pathway enrichment analysis of downregulated genes for both Mirabegron
   and Rifaximin suggested they may downregulate pathways associated with
   NASH.

   Crohn’s disease and ulcerative colitis are idiopathic inflammatory
   bowel disorders (IBD). Crohn’s disease is a relapsing systemic
   inflammatory disease that primarily affects the gastrointestinal tract
   with extraintestinal manifestations and associated immune
   disorders^[177]48. After ranking the drugs, two drugs positioned at the
   upper right corner (rank≤ 9 for enrichment score) were chosen as the
   candidate set (Fig. [178]5d). The literature verification revealed that
   Escin (rank 3) and Ozanimod (rank 9) have literature support for use in
   the treatment of Crohn’s disease. Escin is the main bioactive
   ingredient of Semen aesculi, which improves intestinal barrier
   dysfunction of IBD via Akt/NF-κB signaling pathway^[179]49. Ozanimod is
   a once-daily sphingosine 1-phosphate receptor modulator for the
   treatment of inflammatory bowel disease and was approved by the FDA.
   Phase 2 clinical trial had proved that Ozanimod can be a novel oral
   small molecule therapy for the treatment of Crohn’s
   disease^[180]50,[181]51. The KEGG pathway enrichment analysis results
   of predicted up/down-regulated genes of Escin and Ozanimod are
   illustrated in Supplementary Fig. [182]7. The predicted upregulated
   pathways (Supplementary Fig. [183]7) suggested Escin may upregulate the
   PI3K/Akt signaling pathway which regulates a broad cascade of target
   proteins including nuclear factor kappa B (NF-κB) and glycogen synthase
   kinase-3β (GSK-3β)^[184]52.

   Polycystic ovary syndrome (PCOS) is a heterogeneous endocrine disorder
   in which the major endocrine disruption is excessive androgen secretion
   or activity, and a large proportion of women also have abnormal insulin
   activity. We also chose positively scored drugs as candidates for PCOS
   (rank≤ 9 for enrichment score, Fig. [185]5e). The literature
   verification revealed that Enzalutamide (rank 3), Linagliptin (rank 8),
   and Topiramate (rank 9) have literature support for use in the
   treatment of PCOS. Enzalutamide has been suggested for use as an
   antiandrogen to treat hirsutism and hyperandrogenism in women with
   polycystic ovary syndrome^[186]53,[187]54 investigated the effect of
   linagliptin and/or I3C on experimentally-induced PCOS in female rats,
   and the results suggested that linagliptin/I3C combination might
   represent a beneficial therapeutic modality for amelioration of PCOS.
   PCOS has completed phase 3 trials for Topiramate, and
   phentermine-topiramate extended-release (PHEN/TPM) resulted in the most
   loss of weight and total body fat^[188]55. The predicted regulated
   pathways (Supplementary Fig. [189]7) of Enzalutamide showed that
   Enzalutamide may upregulate the PI3K/Akt signaling pathway and MAPK
   signaling pathway, which are related to Androgen signaling^[190]56. The
   predicted downregulated pathways suggested Topiramate may downregulate
   the Lipid and atherosclerosis and the Fat digestion and absorption
   pathway, thereby contributing to weight loss. All three literature
   verification results illustrated that PRnet recommended reliable and
   effective drugs for different diseases, and hence is a valuable tool
   for drug discovery.

Discussion

   Recent advancements in high-throughput screening have profiled
   thousands of independent chemical perturbations, providing crucial
   insights into the fundamental responses of biological systems to
   perturbations. However, screening all possible disease and compound
   combinations is experimentally unfeasible. To address the limited
   exploration power of existing experimental methods, we developed PRnet,
   which supports the prediction of transcriptional responses to novel
   chemical perturbations that were never experimentally studied at bulk
   and single-cell levels. PRnet serves as a valuable tool that
   facilitates gene-level response interpretation in various novel
   perturbation scenarios and effectively recommends candidates for
   various diseases.

   PRnet has the unique ability to screen various novel compound libraries
   for specific diseases and infer their post-perturbed transcriptional
   response. Given the structures of compounds in libraries and
   unperturbed transcriptional profile, PRnet encodes the perturbation and
   unperturbed state to an interpretable latent space as condition
   contexts to generate transcriptional responses. Trained on extensive
   data, PRnet is able to adapt to complex chemical perturbations of
   generalization to novel compounds and cell lines. To further validate
   the effectiveness of PRnet, we identified novel compounds candidates
   against SCLC and natural compounds against CRC. Experimental
   validations confirmed the activity of candidate compounds, showcased
   the capabilities of PRnet to guide the design of new screens, and
   reduced the time and costs of experiments. Lastly, PRnet generated a
   large-scale integration atlas of perturbation profiles and demonstrates
   its capability to recommend drug candidates for 233 complex diseases
   based on reference changes in gene sets. The flexibility and
   scalability of PRnet make it a valuable tool for guiding the design of
   screening strategies for gene-based therapeutics.

   Although PRnet is effective in predicting drug candidates, there is
   still room for improvement. Due to the scalability of the SMILES
   format^[191]30 and the RDKit^[192]31, PRnet is able to encode unseen
   complex compounds as embeddings. SMILES is widely used for representing
   chemical structures due to its simplicity and efficiency in encoding
   complex molecules as strings. However, SMILES may not fully capture
   complex molecular features such as 3D geometry, conformational
   flexibility, or dynamic behavior, which can be important for
   understanding molecular interactions. Alternative encoding methods such
   as MOL/SDF files and graph-based representations are more flexible
   representations in capturing the 3D geometry structure of compounds.
   MOL/SDF formats represent chemical structures with explicit atom
   coordinates and bond connectivity. Graph-based representations use
   graph theory to represent molecules where atoms are nodes and bonds are
   edges. These methods can provide a more intuitive and flexible
   representation. However, they are more complex or require pre-training
   to obtain better representations. In the scenario of large-scale in
   silico screening, we chose SMILES to encode chemical structures for its
   simplicity and scalability. Alternative encoding methods, such as
   MOL/SDF files and graph-based representations, will be considered
   in the future work.

   The reverse signature paradigm to connect disease and drug established
   by Lamb et al in CMap^[193]15 has demonstrated its effectiveness in
   previous studies^[194]16–[195]20. However, this may not hold true for
   all diseases, and there are instances where transcriptional changes do
   not correlate directly with drug sensitivity. Jie Cheng et al.^[196]57
   found the reverse signature paradigm may perform poor accuracy when
   some disease signatures are of low quality. Rasool Bhat^[197]58
   discusses how cancer cells exhibit phenotypic plasticity under drug
   treatment, leading to drug resistance which often involves complex
   regulatory networks that are not fully captured by changes in gene
   expression alone. The reverse signature paradigm is an effective
   matching algorithm in many diseases. But there are also cases that not
   fit this algorithm, more comprehensive omics data, such as genomics,
   epigenetics, and proteomics should be considered in the future work to
   better characterize disease states and facilitate drug discovery.

   In addition, in the process of compound screening, we mainly focus on
   the impact at the gene level, lacking consideration for the effects on
   phenotypes, such as the Area Under the Curve (AUC) or half-maximal
   inhibitory concentration (IC[50]) derived from the experimental
   dose-response curve. To comprehensively assess compound impacts, future
   research directions could involve incorporating more phenotype data for
   a holistic assessment of the relationship between genes and phenotypes.
   Moreover, expanding the scope of perturbation scenarios and
   incorporating extensive biological knowledge will be considered to
   enhance the model’s predictive capabilities in the future. Furthermore,
   we expect PRnet to extend its utility beyond the prediction of chemical
   perturbations to encompass various perturbation experiments, including
   genetic perturbations and other forms, thereby contributing to the
   advancement of drug discovery.

Methods

The PRnet algorithm

   In this work, we formulate transcriptional response prediction as a
   distribution generation problem conditioned on perturbations. Given a
   dataset
   [MATH:
   <mi>D</mi><mo>=</mo><msubsup><mrow><mrow><mo>(</mo><mrow><msub><mrow><m
   i>x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>
   P</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</mo></mrow></mr
   ow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi></mrow></
   msubsup> :MATH]
   , where
   [MATH:
   <msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>∈</mo><m
   sup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>n</mi></mrow></msup>
   :MATH]
   denotes as the n-dimensional gene expression and P[i] is the attribute
   set of perturbation, PRnet was trained to learn a function f that
   predicts transcriptional responses
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>=</mo><mi>f</mi><mrow><mo>(</mo><mrow><msubsup
   ><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi></mrow><
   /msubsup><mo>,</mo><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow>
   </msub></mrow><mo>)</mo></mrow> :MATH]
   to novel perturbations. Here,
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   represents the unperturbed gene expression,
   [MATH: <mover
   accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow
   ></msub></mrow><mo>^</mo></mover> :MATH]
   is the predicted perturbed gene expression, and P[i] = (s[i], d[i]) is
   the attribute set of perturbation, which includes the chemical
   structures s[i] and dosages of the compounds d[i]. As a generation
   model, the Gaussian negative log-likelihood loss is chosen as the loss
   function. For HTS RNA-seq datasets
   [MATH:
   <mi>D</mi><mo>=</mo><msubsup><mrow><mrow><mo>(</mo><mrow><msub><mrow><m
   i>x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>
   P</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</mo></mrow></mr
   ow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi></mrow></
   msubsup><mo>=</mo><msubsup><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi>
   x</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mrow><mo>(</mo><m
   row><msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</m
   o><msub><mrow><mi>d</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo
   >)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mrow><mi>i</mi><mo>=</mo>
   <mn>1</mn></mrow><mrow><mi>N</mi></mrow></msubsup> :MATH]
   , the gene expression x[i], compounds s[i], and dosages d[i] attributes
   are usually considered, in which
   s[i] = (s[i,1], s[i,2], . . . , s[i,M]) describes the Canonical SMILES
   format of M compounds of perturbation i and
   d[i ]= (d[i,1], d[i,2], . . . , d[i,M]) are the dosages of compounds.
   When d[i][,][j] = 0, the compound j was not applied in perturbation i.
   If M = 0, there is no perturbation performed, and in this case,
   perturbation i is an unperturbed state
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   , otherwise is in a perturbed state. In addition, the covariates vector
   c[i] contains discrete covariates such as cell types, cell lines, or
   species, depending on the available data. While c[i] does not directly
   serve as input to the function f, it relates to the intermediate
   variable vector
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   of the function f. Each perturbed profile x[i] is assigned an
   unperturbed transcriptional profile of the same cell line
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   by random selection in the dataset. The goal of PRnet is to learn a
   function f that predicts transcriptional responses
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>=</mo><mi>f</mi><mrow><mo>(</mo><mrow><msubsup
   ><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi></mrow><
   /msubsup><mo>,</mo><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow>
   </msub></mrow><mo>)</mo></mrow> :MATH]
   to novel perturbations. Given an unperturbed gene expression
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   , PRnet maps the unperturbed state to a distribution
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>∣</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><
   /mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>)</mo></mrow> :MATH]
   , and the sample to a perturbed state
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub> :MATH]
   .

   PRnet is a perturbation-conditioned generative model aimed at
   predicting gene expression profiles under different perturbations. The
   design of PRnet consists of three components: (1) the Perturb-adapter,
   a scalable adapter
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mi>p</mi><mi>e</mi><mi>r</mi><mi>t<
   /mi></mrow></msub><mo>:</mo><mi
   mathvariant="double-struck">Z</mi><mo>→</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>k</mi></mrow></msup>
   :MATH]
   encodes complex perturbations (s[i], d[i]) into a k-dimensional
   perturbation embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   , (2) the Perturb-encoder
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mi>e</mi><mi>n</mi><mi>c</mi></mrow
   ></msub><mo>:</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>k</mi><mo>+</mo><mi>
   n</mi></mrow></msup><mo>→</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>e</mi></mrow></msup>
   :MATH]
   encodes the combined learnable perturbation embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   and unperturbed gene expression
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   to a latent space to get the embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup> :MATH]
   , and (3) the Perturb-decoder
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mi>d</mi><mi>e</mi><mi>c</mi></mrow
   ></msub><mo>:</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>e</mi><mo>+</mo><mi>
   k</mi><mo>+</mo><mi>m</mi></mrow></msup><mo>→</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mn>2</mn><mi>n</mi></mr
   ow></msup> :MATH]
   decodes combines perturbation embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   , latent state
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup> :MATH]
   and noise
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>n</mi>
   </mrow></msubsup> :MATH]
   into Gaussian likelihood distribution
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>∣</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><
   /mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>)</mo></mrow> :MATH]
   of perturbed state and generates perturbed gene expression
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub> :MATH]
   by sampling from
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>∣</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><
   /mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>)</mo></mrow> :MATH]
   . We elaborate on the three components in the following.

The Perturb-adapter

   The Perturb-adapter encodes the i-th perturbation P[i] as a fixed-size
   embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup><mo>∈</mo><msup><mrow><mi>R</mi></mrow><mrow><mi>k</mi
   ></mrow></msup> :MATH]
   (k = 64). Given the i-th perturbation P[i], the j-th compound of P[i]
   was first represented as the canonical SMILES format s[i,j], and was
   converted to a fixed size fingerprint embedding FCFP[i,j] (FCFP4
   fingerprints: Functional-Class Fingerprints with radius = 2 which focus
   on the functional topological pharmacophore) by RDKit^[198]31 encoder
   [MATH: <mi>H</mi><mo>:</mo><mi
   mathvariant="double-struck">Z</mi><mo>→</mo><msup><mrow><mi
   mathvariant="double-struck">R</mi></mrow><mrow><mi>h</mi></mrow></msup>
   :MATH]
   (h = 1024). Then, the rescaled Functional-Class Fingerprints rFCFP[i]
   embedding of P[i] was calculated by the weighted sum of all fingerprint
   embeddings of all compounds applied by P[i]:
   [MATH:
   <mi>r</mi><mi>F</mi><mi>C</mi><mi>F</mi><msub><mrow><mi>P</mi></mrow><m
   row><mi>i</mi></mrow></msub><mo>=</mo><msub><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi></mrow></msub><mi>ϕ</mi><mr
   ow><mo>(</mo><mrow><msub><mrow><mi>d</mi></mrow><mrow><mi>i</mi><mo>,</
   mo><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>×</mo><mi>H</mi>
   <mrow><mo>(</mo><mrow><msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi><mo>
   ,</mo><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>=</mo><msub><
   mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi></mrow></msub><mi>l</mi><mi
   >o</mi><msub><mrow><mi>g</mi></mrow><mrow><mn>10</mn></mrow></msub><mro
   w><mo>(</mo><mrow><msub><mrow><mi>d</mi></mrow><mrow><mi>i</mi><mo>,</m
   o><mi>j</mi></mrow></msub><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow><
   mo>×</mo><mi>H</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>s</mi></mrow><
   mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub></mrow><mo>)</mo></mro
   w><mo>,</mo> :MATH]
   1

   where ϕ is a log scale function using the dosage of j-th compound. At
   last, the final perturbation embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   was generated by the Perturb-adapter
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>p</mi></mrow></msub></mrow></msub> :MATH]
   :
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup><mo>=</mo><msub><mrow><mi>f</mi></mrow><mrow><mi>p</mi
   ><mi>e</mi><mi>r</mi><mi>t</mi></mrow></msub><mrow><mo>(</mo><mrow><msu
   b><mrow><mi>s</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><msub>
   <mrow><mi>d</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</mo><
   /mrow><mo>=</mo><msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</m
   i></mrow><mrow><mi>p</mi></mrow></msub></mrow></msub><mrow><mo>(</mo><m
   row><mi>G</mi><mrow><mo>(</mo><mrow><mi>H</mi><mrow><mo>(</mo><mrow><ms
   ub><mrow><mi>s</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</m
   o></mrow><mo>,</mo><msub><mrow><mi>log</mi></mrow><mrow><mn>10</mn></mr
   ow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>d</mi></mrow><mrow><mi>
   i</mi></mrow></msub><mo>+</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow>
   <mo>)</mo></mrow></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
   2

   where θ[p] are the parameters of the Perturb-adapter, which is a
   2-layer feedforward neural network.

   Thanks to the scalability of the SMILES^[199]30 format and the
   RDKit^[200]31, which can encode any compound to a latent embedding, the
   design of the Perturb-adapter is scalable to in silico screen novel
   compounds and cell line perturbations.

The Perturb-encoder and the Perturb-decoder

   Inspired by the Variational Auto-Encoder (VAE^[201]22) framework, PRnet
   uses the encoder and decoder framework to estimate the Gaussian
   distribution of each gene parameterized with the mean (μ[i]) and the
   variance (
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   ). Due to the technical limitation, only unpaired perturbed and
   unperturbed transcriptional profiles were observed. To match the
   expression profiles before and after perturbation, for scRNA-seq
   datasets, each perturbed cell was assigned with an unperturbed cell
   from the same cell type by random selection in the dataset, and for
   bulk RNA-seq datasets, the same cell line observation was assigned for
   the perturbed state. Given the perturbation embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   generated by the Perturb-apaptor, the Perturb-encoder
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>e</mi></mrow></msub></mrow></msub> :MATH]
   mapped the
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   and the unperturbed gene expression
   [MATH:
   <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi>
   </mrow></msubsup> :MATH]
   to a latent embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup><mo>∈</mo><msup><mrow><mi>R</mi></mrow><mrow><mi>e</mi
   ></mrow></msup> :MATH]
   (e = 64). Then, the perturb-decoder
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>d</mi></mrow></msub></mrow></msub> :MATH]
   accepted
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   ,
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup> :MATH]
   , and the noise
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>n</mi>
   </mrow></msubsup> :MATH]
   to estimate the Gaussian likelihood distribution
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>∣</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><
   /mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>)</mo></mrow> :MATH]
   of the perturbed profile:
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>l</mi>
   </mrow></msubsup><mo>=</mo><msub><mrow><mi>E</mi></mrow><mrow><msub><mr
   ow><mi>θ</mi></mrow><mrow><mi>e</mi></mrow></msub></mrow></msub><mrow><
   mo>(</mo><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><
   mrow><mi>u</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi>z</mi></mr
   ow><mrow><mi>i</mi></mrow><mrow><mi>p</mi></mrow></msubsup></mrow><mo>)
   </mo></mrow><mo>,</mo> :MATH]
   3
   [MATH: <mrow><mo>[</mo><mrow><msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>,</mo><msubsup><mrow><mover
   accent="true"><mrow><mi>σ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>]</mo></mrow>
   <mo>=</mo><msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mr
   ow><mrow><mi>d</mi></mrow></msub></mrow></msub><mrow><mo>(</mo><mrow><m
   subsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi></
   mrow></msubsup><mo>,</mo><msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</m
   i></mrow><mrow><mi>l</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi>
   z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>n</mi></mrow></msubsup></
   mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
   4

   where both
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>e</mi></mrow></msub></mrow></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>d</mi></mrow></msub></mrow></msub> :MATH]
   are 2-layer feedforward neural networks, and θ[e]and θ[d] are the
   parameters of
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>e</mi></mrow></msub></mrow></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>d</mi></mrow></msub></mrow></msub> :MATH]
   , respectively. In the design of PRnet, the perturbation embedding
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   was used as the input for both the perturb-encoder and the
   perturb-decoder. Multiplexing
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>p</mi>
   </mrow></msubsup> :MATH]
   in the perturb-decoder combined the chemical and biology context helped
   assemble a more precise output. And the noise
   [MATH:
   <msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>n</mi>
   </mrow></msubsup> :MATH]
   increases the robustness of the model. Once the estimated Gaussian
   likelihood distribution
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mspace
   width="0.25em"></mspace><mo>∣</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi><
   /mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>)</mo></mrow> :MATH]
   was generated, PRnet sampled the estimated perturbed gene expression
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub> :MATH]
   from it:
   [MATH:
   <msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><m
   sub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>,</mo> :MATH]
   5
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mo>=</mo><mi>φ</mi><mrow><mo>(</mo><mrow><msubsup><mr
   ow><mover
   accent="true"><mrow><mi>σ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mo>)</mo></mrow>
   <mo>,</mo> :MATH]
   6
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>~</mo><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>μ</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><msubsup><mrow><mi>σ</mi
   ></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow>
   <mo>)</mo></mrow><mo>,</mo> :MATH]
   7

   where φ(.) denotes the non-linear softplus function. μ[i] and
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   are the mean and the variance for each gene, which are n-dimensional
   vectors. And
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub> :MATH]
   is the estimated n-dimensional transcriptional responses to
   perturbations.

Training and testing

   The training objective of PRnet is to minimize the Gaussian negative
   log-likelihood loss defined as:
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">L</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>θ</mi>
   </mrow><mrow><mi>p</mi></mrow></msub><mo>,</mo><msub><mrow><mi>θ</mi></
   mrow><mrow><mi>e</mi></mrow></msub><mo>,</mo><msub><mrow><mi>θ</mi></mr
   ow><mrow><mi>d</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>=</mo><msu
   b><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi></mrow></msub><mfrac><mrow>
   <mn>1</mn></mrow><mrow><mn>2</mn></mrow></mfrac><mfenced close=")"
   open="("><mrow><mi>log</mi><mfenced close=")"
   open="("><mrow><mi>max</mi><mfenced close=")"
   open="("><mrow><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><
   mrow><mn>2</mn></mrow></msubsup><mo>,</mo><mi>e</mi><mi>p</mi><mi>s</mi
   ></mrow></mfenced></mrow></mfenced><mo>+</mo><mfrac><mrow><msup><mrow><
   mrow><mo>(</mo><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>i</mi></mro
   w></msub><mo>−</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow>
   </msub></mrow><mo>)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></m
   row><mrow><mi>max</mi><mfenced close=")"
   open="("><mrow><msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><
   mrow><mn>2</mn></mrow></msubsup><mo>,</mo><mi>e</mi><mi>p</mi><mi>s</mi
   ></mrow></mfenced></mrow></mfrac></mrow></mfenced><mo>,</mo> :MATH]
   8

   Where eps is used to clamp
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   for stability, and is set to 1e-6 by default. By minimizing the loss of
   all cells/samples, PRnet is able to learn the most proper parameters.

   The training datasets (a bulk HTS dataset from the L1000 project^[202]1
   and a single cell HTS dataset from the sci-Plex3 assay^[203]2) were
   split by dividing the attribute set of perturb P[i] =
   (s[i], d[i], c[i]).

   To train and test PRnet, all datasets were split into three subsets:
   train, valid, and test in a ratio of 6:2:2, according to different
   split strategies. A total of four split strategies were designed:
     * random_split: randomly divides compounds and cell lines
     * compound_split: groups the datasets according to s[i] for each
       profile, and then splits groups of the dataset
     * cell_line_split: groups the datasets according to c[i] for each
       profile, and then splits groups of the dataset
     * pathway_split: groups the datasets according to c[i] from the same
       pathway for each profile, and then splits groups of the dataset

   For bulk RNA-seq datasets, random_split, compounds_split, and
   cell_line_split were applied, and for scRNA-seq datasets, random_split,
   compound_split, and pathway_split were applied. Strict data split can
   prevent data leakage and bring better generalization performance. PRnet
   is trained using 5-fold cross-validation for each split category.

   The table below (see Table [204]1) outlines the values for the
   hyperparameters involved in PRnet training. The same set of
   hyperparameters is used across all splits and datasets.

Table 1.

   Hyperparameters set of PRnet
   Module Hyperparameter Default value hyperparameters set
   PRnet batch size 512 256, 512, 1024
   learning rate 1e-3 1e-3, 1e-5, 3e-3
   weight decay 1e-8 0,1e-3
   scheduler factor 0.5 0.5
   scheduler patience 5 5
   early stopping patience 20 20
   num of epochs 100 100
   gradient descent Adam Adam,RMSprop
   loss function GaussNLLLoss^1 GaussNLLLoss, NBLoss^2, MSE^3, KLLoss^4
   Perturbe-adapter drug dimension(h) 1024 1024
   input size h 1024, 2048
   hidden size 128 64, 128, 256
   output size(k) 64 64, 128
   Perturbe-encoder input size 5000, 978 5000(scRNA-seq), 978(bulk
   RNA-seq)
   hidden size 128 128, 256
   output size(e) 64 64, 128
   Perturbe-decoder input size(e + k + m) 138 138, 266
   hidden size 128 128, 256
   output size n × 2 10000, 1956
   [205]Open in a new tab

   ^1 GaussNLLLoss is the Gaussian negative log-likelihood loss.

   ^2 NBLoss is the negative binomial negative log-likelihood loss.

   ^3 MSE is the mean squared error loss.

   ^4 KLLoss is the Kullback-Leibler divergence loss.

Model evaluation

   Four metrics were used to quantitatively evaluate the performance of
   PRnet:
    1. fold − change: describes the ratio changes between the gene
       expression of unperturbed and perturbed state:

   [MATH:
   <mi>f</mi><msub><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>=</mo><mfrac><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mro
   w></msub><mo>+</mo><mn>1</mn></mrow><mrow><msubsup><mrow><mi>x</mi></mr
   ow><mrow><mi>i</mi></mrow><mrow><mi>u</mi></mrow></msubsup></mrow></mfr
   ac><mo>,</mo> :MATH]
       9

   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></mrow
   ><mrow><mi>i</mi></mrow></msub><mo>=</mo><mfrac><mrow><msub><mrow><move
   r
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>+</mo><mn>1</mn></mrow><mrow><msubsup><mrow><m
   over
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow><mrow><mi>u</mi></mrow></msubsup></mrow></mfrac><mo>.</mo
   > :MATH]
       10
       Where fc[i] and
       [MATH: <msub><mrow><mover
       accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></
       mrow><mrow><mi>i</mi></mrow></msub> :MATH]
       are the ground truth and predicted perturbed fold change in gene
       expression x[i], respectively. A higher value of fold − change
       indicates a better performance.
    2. R^2: is the mean coefficient of determination score between the
       predicted and true perturbed gene expression of all cells:

   [MATH:
   <msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>=</mo><m
   frac><mrow><mn>1</mn></mrow><mrow><mi>O</mi></mrow></mfrac><msubsup><mr
   ow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>O</mi></mrow></msubsup><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mfrac><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>j</mi><mo>=</mo><mi>n</mi></mrow></msubsup><msup><mrow><mrow>
   <mo>(</mo><mrow><msubsup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow><mrow><mi>j</mi></mrow></msubsup><mo>−</mo><msubsup><mrow
   ><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>j</mi></mrow></msubs
   up></mrow><mo>)</mo></mrow></mrow><mrow><mn>2</mn></mrow></msup></mrow>
   <mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>j</mi><mo>=</mo><mi>n</mi></mrow></msubsup><msup><mrow><mrow>
   <mo>(</mo><mrow><msubsup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow><mrow><mi>j</mi></mrow></msubsup><mo>−</mo><msubsup><mrow
   ><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mi
   >i</mi></mrow><mrow><mi>j</mi></mrow></msubsup></mrow><mo>)</mo></mrow>
   </mrow><mrow><mn>2</mn></mrow></msup></mrow></mfrac></mrow></mfenced><m
   o>,</mo> :MATH]
       11
       where O denotes the number of cells in test datasets,
       [MATH: <msub><mrow><mover
       accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow
       ><mi>i</mi></mrow></msub> :MATH]
       , x[i] and
       [MATH: <msub><mrow><mover
       accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow
       ><mi>i</mi></mrow></msub> :MATH]
       are the predicted, true perturbed gene expression and mean of the
       ground truth perturbed gene expression, respectively. A higher
       value of R^2 indicates a better performance.
    3. Pearson of log(FC) in compounds: evaluates the Pearson correlation
       between the true and predicted post-perturbation of the average
       logarithm of the fold-change in gene expression (log(FC)) perturbed
       by the same compound:

   [MATH: <mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mi>f</mi><msub><
   mrow><mi>c</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow><mo>)</mo></
   mrow></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>n</mi></mrow></mfrac><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>i
   </mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup><mi>lo
   g</mi><mrow><mo>(</mo><mrow><mi>f</mi><msub><mrow><mi>c</mi></mrow><mro
   w><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow><
   mo>,</mo> :MATH]
       12

   [MATH: <mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><msub><mrow><move
   r
   accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></mrow
   ><mrow><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>n</mi></mrow></mfrac><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>i
   </mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup><mi>lo
   g</mi><mrow><mo>(</mo><mrow><msub><mrow><mover
   accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></mrow
   ><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub></mrow><mo>)</mo></m
   row><mo>,</mo> :MATH]
       13

   [MATH:
   <mi>P</mi><mi>C</mi><msub><mrow><mi>C</mi></mrow><mrow><mi>j</mi></mrow
   ></msub><mo>=</mo><mi>P</mi><mi>C</mi><mi>C</mi><mrow><mo>(</mo><mrow><
   mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mi>f</mi><msub><
   mrow><mi>c</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow><mo>)</mo></
   mrow></mrow><mo accent="true">¯</mo></mover><mo>,</mo><mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><msub><mrow><move
   r
   accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></mrow
   ><mrow><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mo
   accent="true">¯</mo></mover></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
       14

   [MATH: <mi mathvariant="normal">Pearson</mi><mspace
   width="0.25em"></mspace><mi mathvariant="normal">of</mi><mspace
   width="0.25em"></mspace><mi>log</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">FC</mi></mrow><mo>)</mo></mrow><mspace
   width="0.25em"></mspace><mi mathvariant="normal">in</mi><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">compounds</mi><mo>=</mo><mfrac><mrow><mn>1</mn></m
   row><mrow><mi>m</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>m</mi></mrow></msubsup><mi>P</mi><mi>C</mi><msub><mrow><mi>C<
   /mi></mrow><mrow><mi>j</mi></mrow></msub><mo>,</mo> :MATH]
       15
       where fc[i,j] and
       [MATH: <msub><mrow><mover
       accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></
       mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub> :MATH]
       denote as the ground truth and predicted post-perturbation fold
       change in gene expression x[i] perturbed by compound j. The set
       [MATH:
       <msubsup><mrow><mrow><mo>(</mo><mrow><mi>f</mi><msub><mrow><mi>c</m
       i></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub></mrow><
       mo>)</mo></mrow></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><
       mrow><mi>n</mi></mrow></msubsup> :MATH]
       represents all perturbed fold-changes for compound j.
       [MATH: <mover
       accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mi>f</mi><ms
       ub><mrow><mi>c</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow><mo>
       )</mo></mrow></mrow><mo accent="true">¯</mo></mover> :MATH]
       and
       [MATH: <mover
       accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><msub><mrow><
       mover
       accent="true"><mrow><mi>f</mi><mi>c</mi></mrow><mo>^</mo></mover></
       mrow><mrow><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
       mo accent="true">¯</mo></mover> :MATH]
       are the ground truth and predicted post-perturbation average
       logarithm of logarithm of the fold-change in gene expression for
       compound j, respectively. PCC[j] is the Pearson correlation
       coefficient of the mean log(FC) perturbed by compound j. The
       average of the Pearson correlations for all m compounds in the test
       set is referred to as the “Pearson of log(FC) in compounds”. A
       higher value of “Pearson of log(FC) in compounds” indicates a
       better performance.
    4. Pearson of log(FC) in cov_compounds: evaluates the Pearson
       correlation between the true and predicted post-perturbation of the
       average logarithm of the fold-change in gene expression (log(FC))
       perturbed by the same compound within the same cell line:

   [MATH: <mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mi>F</mi><msubsu
   p><mrow><mi>C</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi>c</mi></mrow>
   </msubsup></mrow><mo>)</mo></mrow></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>n</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>n</mi></mrow></msubsup><mi>log</mi><mrow><mo>(</mo><mrow><mi>
   f</mi><msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</
   mi></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo>)</mo></mrow><mo>
   ,</mo> :MATH]
       16

   [MATH: <mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><mi>F</mi><msubsup><mrow><mi>C</mi></mrow><mrow><mi
   >j</mi></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</mo></mover
   ></mrow><mo>)</mo></mrow></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>n</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>n</mi></mrow></msubsup><mi>log</mi><mrow><mo>(</mo><mrow><mov
   er
   accent="true"><mrow><mi>f</mi><msubsup><mrow><mi>c</mi></mrow><mrow><mi
   >i</mi><mo>,</mo><mi>j</mi></mrow><mrow><mi>c</mi></mrow></msubsup></mr
   ow><mo>^</mo></mover></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
       17

   [MATH:
   <mi>P</mi><mi>C</mi><msubsup><mrow><mi>C</mi></mrow><mrow><mi>j</mi></m
   row><mrow><mi>c</mi></mrow></msubsup><mo>=</mo><mi>P</mi><mi>C</mi><mi>
   C</mi><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mi>F</mi><msubsu
   p><mrow><mi>C</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi>c</mi></mrow>
   </msubsup></mrow><mo>)</mo></mrow></mrow><mo
   accent="true">¯</mo></mover><mo>,</mo><mover
   accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><mi>F</mi><msubsup><mrow><mi>C</mi></mrow><mrow><mi
   >j</mi></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</mo></mover
   ></mrow><mo>)</mo></mrow></mrow><mo
   accent="true">¯</mo></mover></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
       18

   [MATH: <mi mathvariant="normal">Pearson</mi><mspace
   width="0.25em"></mspace><mi mathvariant="normal">of</mi><mspace
   width="0.25em"></mspace><mi>log</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">FC</mi></mrow><mo>)</mo></mrow><mspace
   width="0.25em"></mspace><mi mathvariant="normal">in</mi><mspace
   width="0.25em"></mspace><mi mathvariant="normal">cov</mi><mo>_</mo><mi
   mathvariant="normal">compounds</mi><mo>=</mo><mfrac><mrow><mn>1</mn></m
   row><mrow><mi>z</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>c</mi><mo>,</mo><mi>j</mi><mo>=</
   mo><mn>1</mn></mrow><mrow><mi>z</mi></mrow></msubsup><mi>P</mi><mi>C</m
   i><msubsup><mrow><mi>C</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi>c</m
   i></mrow></msubsup><mo>,</mo> :MATH]
       19
       where
       [MATH:
       <mi>f</mi><msubsup><mrow><mi>c</mi></mrow><mrow><mi>i</mi><mo>,</mo
       ><mi>j</mi></mrow><mrow><mi>c</mi></mrow></msubsup> :MATH]
       and
       [MATH: <mover
       accent="true"><mrow><mi>f</mi><msubsup><mrow><mi>c</mi></mrow><mrow
       ><mi>i</mi><mo>,</mo><mi>j</mi></mrow><mrow><mi>c</mi></mrow></msub
       sup></mrow><mo>^</mo></mover> :MATH]
       denote the ground truth and predicted post-perturbation fold change
       in gene expression x[i] perturbed by compound j in covariate c,
       respectively. The covariate c represents cell line or cell type of
       x[i].
       [MATH:
       <msubsup><mrow><mrow><mo>(</mo><mrow><mi>f</mi><msubsup><mrow><mi>c
       </mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow><mrow><mi>c<
       /mi></mrow></msubsup></mrow><mo>)</mo></mrow></mrow><mrow><mi>i</mi
       ><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></msubsup>
       :MATH]
       represents all perturbed fold-changes for compound j in covariate
       c.
       [MATH: <mover
       accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mi>F</mi><ms
       ubsup><mrow><mi>C</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi>c</mi
       ></mrow></msubsup></mrow><mo>)</mo></mrow></mrow><mo
       accent="true">¯</mo></mover> :MATH]
       and
       [MATH: <mover
       accent="true"><mrow><mi>log</mi><mrow><mo>(</mo><mrow><mover
       accent="true"><mrow><mi>F</mi><msubsup><mrow><mi>C</mi></mrow><mrow
       ><mi>j</mi></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</mo
       ></mover></mrow><mo>)</mo></mrow></mrow><mo
       accent="true">¯</mo></mover> :MATH]
       are the ground truth and predicted post-perturbation average
       logarithm of logarithm of the fold-change in gene expression for
       compound j in covariate c, respectively.
       [MATH:
       <mi>P</mi><mi>C</mi><msubsup><mrow><mi>C</mi></mrow><mrow><mi>j</mi
       ></mrow><mrow><mi>c</mi></mrow></msubsup> :MATH]
       is the Pearson correlation coefficient of the mean log(FC)
       perturbed by compound j in covariate c. The average of the Pearson
       correlations for all z “cov_compounds” conditions in the test set
       is referred to as the “Pearson of log(FC) in cov_compounds”. A
       higher value of “Pearson of log(FC) in cov_compounds” indicates a
       better performance.
    5. R^2 in compounds: is the mean coefficient of determination score
       between the predicted and true post-perturbation gene expression of
       the same compound:

   [MATH: <mover
   accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>t</mi></mrow
   ></msub></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>p</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>p</mi></mrow></msubsup><msub><mrow><mi>x</mi></mrow><mrow><mi
   >i</mi><mo>,</mo><mi>t</mi></mrow></msub><mo>,</mo> :MATH]
       20

   [MATH: <mover accent="true"><mrow><mover
   accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>t</mi></mrow
   ></msub></mrow><mo>^</mo></mover></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>p</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>p</mi></mrow></msubsup><mover
   accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi><mo>,<
   /mo><mi>t</mi></mrow></msub></mrow><mo>^</mo></mover><mo>,</mo> :MATH]
       21

   [MATH:
   <msubsup><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mo>=</mo><msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn
   ></mrow></msup><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>t</mi></mrow
   ></msub></mrow><mo accent="true">¯</mo></mover><mo>,</mo><mover
   accent="true"><mrow><mover
   accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>t</mi></mrow
   ></msub></mrow><mo>^</mo></mover></mrow><mo
   accent="true">¯</mo></mover></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
       22

   [MATH: <msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow></msup><mi
   mathvariant="normal">in</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">compounds</mi><mo>=</mo><mfrac><mrow><mn>1</mn></m
   row><mrow><mi>T</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>T</mi></mrow></msubsup><msubsup><mrow><mi>R</mi></mrow><mrow>
   <mi>t</mi></mrow><mrow><mn>2</mn></mrow></msubsup><mo>,</mo> :MATH]
       23
       where x[i,t] and
       [MATH: <mover
       accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi><m
       o>,</mo><mi>t</mi></mrow></msub></mrow><mo>^</mo></mover> :MATH]
       denote as the ground truth and predicted post-perturbation gene
       expression x[i] perturbed by compound t. The set
       [MATH:
       <msubsup><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi>x</mi></mrow><
       mrow><mi>i</mi><mo>,</mo><mi>t</mi></mrow></msub></mrow><mo>)</mo><
       /mrow></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>p
       </mi></mrow></msubsup> :MATH]
       represents all post-perturbation gene expression for compound t.
       [MATH: <mover
       accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>t</mi></
       mrow></msub></mrow><mo accent="true">¯</mo></mover> :MATH]
       and
       [MATH: <mover accent="true"><mrow><mover
       accent="true"><mrow><msub><mrow><mi>x</mi></mrow><mrow><mi>t</mi></
       mrow></msub></mrow><mo>^</mo></mover></mrow><mo
       accent="true">¯</mo></mover> :MATH]
       are the ground truth and predicted post-perturbation average gene
       expression for compound t, respectively.
       [MATH:
       <msubsup><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow><mrow><mn>2<
       /mn></mrow></msubsup> :MATH]
       is the coefficient of determination score of the mean
       post-perturbation gene expression of compounds t. The average of
       the coefficient of determination score for all T compounds in the
       test set is referred to as the “R^2 in compounds”. A higher value
       of “R^2 in compounds” indicates a better performance.
    6. R^2 in cov_compounds: is the mean coefficient of determination
       score between the predicted and true post-perturbation gene
       expression of the same compound within each specific cell line:

   [MATH: <mover
   accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>t</mi></m
   row><mrow><mi>c</mi></mrow></msubsup></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>p</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>p</mi></mrow></msubsup><msubsup><mrow><mi>x</mi></mrow><mrow>
   <mi>i</mi><mo>,</mo><mi>t</mi></mrow><mrow><mi>c</mi></mrow></msubsup><
   mo>,</mo> :MATH]
       24

   [MATH: <mover accent="true"><mrow><mover
   accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>t</mi></m
   row><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</mo></mover></mrow><m
   o
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><mi>p</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>p</mi></mrow></msubsup><mover
   accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi><mo
   >,</mo><mi>t</mi></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</
   mo></mover><mo>,</mo> :MATH]
       25

   [MATH:
   <msubsup><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup><mo>=</mo><msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn
   ></mrow></msup><mrow><mo>(</mo><mrow><mover
   accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>t</mi></m
   row><mrow><mi>c</mi></mrow></msubsup></mrow><mo
   accent="true">¯</mo></mover><mo>,</mo><mover accent="true"><mrow><mover
   accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>t</mi></m
   row><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</mo></mover></mrow><m
   o accent="true">¯</mo></mover></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
       26

   [MATH:
   <msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow></msup><mspace
   width="0.25em"></mspace><mspace
   width="0.25em"></mspace><mstyle><mtext>in cov</mtext></mstyle><mo>_</mo
   ><mstyle><mtext>compounds</mtext></mstyle><mspace
   width="0.25em"></mspace><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><
   mi>Q</mi></mrow></mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>Q</mi></mrow></msubsup><msubsup><mrow><msup><mrow><mi>R</mi><
   /mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><mi>t</mi></mrow><mrow
   ><mi>c</mi></mrow></msubsup><mo>,</mo> :MATH]
       27
       where
       [MATH:
       <msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>t</mi
       ></mrow><mrow><mi>c</mi></mrow></msubsup> :MATH]
       and
       [MATH: <mover
       accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>i</mi
       ><mo>,</mo><mi>t</mi></mrow><mrow><mi>c</mi></mrow></msubsup></mrow
       ><mo>^</mo></mover> :MATH]
       denote as the ground truth and predicted post-perturbation gene
       expression x[i] perturbed by compound t in covariate c,
       respectively. The covariate c represents the cell line or cell type
       of x[i]. The set
       [MATH:
       <msubsup><mrow><mrow><mo>(</mo><mrow><msubsup><mrow><mi>x</mi></mro
       w><mrow><mi>i</mi><mo>,</mo><mi>t</mi></mrow><mrow><mi>c</mi></mrow
       ></msubsup></mrow><mo>)</mo></mrow></mrow><mrow><mi>i</mi><mo>=</mo
       ><mn>1</mn></mrow><mrow><mi>p</mi></mrow></msubsup> :MATH]
       represents all post-perturbation gene expression for compound t in
       covariate c.
       [MATH: <mover
       accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>t</mi
       ></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo
       accent="true">¯</mo></mover> :MATH]
       and
       [MATH: <mover accent="true"><mrow><mover
       accent="true"><mrow><msubsup><mrow><mi>x</mi></mrow><mrow><mi>t</mi
       ></mrow><mrow><mi>c</mi></mrow></msubsup></mrow><mo>^</mo></mover><
       /mrow><mo accent="true">¯</mo></mover> :MATH]
       are the ground truth and predicted post-perturbation average gene
       expression for compound t in covariate c, respectively.
       [MATH:
       <msubsup><mrow><msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow>
       </msup></mrow><mrow><mi>t</mi></mrow><mrow><mi>c</mi></mrow></msubs
       up> :MATH]
       is the coefficient of determination score of the mean
       post-perturbation gene expression of compounds t in covariate c.
       The average of the coefficient of determination score for all Q
       “cov_compounds” conditions in the test set is referred to as the
       “R^2 in cov_compounds”. A higher value of “R^2 in cov_compounds”
       indicates a better performance.

   We use the following baseline models to compare model performance:
    1. Linear model: This model uses a linear regression model to learn
       weights between all genes and perturbations. The linear model uses
       MSELoss to learn perturbation effects applied to the control
       sample/cell. Let θ[l] represent the weight matrix of the linear
       model, the perturbed gene expression would be:

   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>=</mo><msub><mrow><mi>E</mi></mrow><mrow><msub
   ><mrow><mi>θ</mi></mrow><mrow><mi>l</mi></mrow></msub></mrow></msub><mr
   ow><mo>(</mo><mrow><msubsup><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mr
   ow><mrow><mi>p</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mi>x</mi>
   </mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi></mrow></msubsup></mrow><
   mo>)</mo></mrow><mo>,</mo> :MATH]
       28

   [MATH: <msub><mrow><mi class="MJX-tex-caligraphic"
   mathvariant="script">L</mi></mrow><mrow><mi>i</mi></mrow></msub><mrow><
   mo>(</mo><mrow><msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi
   ></mrow><mrow><mi>l</mi></mrow></msub></mrow></msub></mrow><mo>)</mo></
   mrow><mo>=</mo><mi>M</mi><mi>S</mi><mi>E</mi><mi>L</mi><mi>o</mi><mi>s<
   /mi><mi>s</mi><mrow><mo>(</mo><mrow><msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow><mo>)</mo></mrow><mo>.</mo> :MATH]
       29
    2. MLP model: MLP model utilizes Multilayer Perceptron (MLP) to fit
       the effect of perturbation to genes. The input is passed to an MLP
       model(Input Layer, Linear with output dimension 128, BatchNorm1d,
       LeakyReLU, Linear with output dimension as input size, and ReLU).
       Let θ[MLP] represent the parameters of the linear model, the
       perturbed gene expression would be:

   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>=</mo><mi>M</mi><mi>L</mi><msub><mrow><mi>P</m
   i></mrow><mrow><msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi
   ></mrow><mrow><mi>M</mi><mi>L</mi><mi>P</mi></mrow></msub></mrow></msub
   ></mrow></msub><mrow><mo>(</mo><mrow><msubsup><mrow><mi>z</mi></mrow><m
   row><mi>i</mi></mrow><mrow><mi>p</mi></mrow></msubsup><mo>,</mo><msubsu
   p><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi>u</mi></mrow>
   </msubsup></mrow><mo>)</mo></mrow><mo>,</mo> :MATH]
       30

   [MATH: <msub><mrow><mi class="MJX-tex-caligraphic"
   mathvariant="script">L</mi></mrow><mrow><mi>i</mi></mrow></msub><mrow><
   mo>(</mo><mrow><msub><mrow><mi>E</mi></mrow><mrow><msub><mrow><mi>θ</mi
   ></mrow><mrow><mi>M</mi><mi>L</mi><mi>P</mi></mrow></msub></mrow></msub
   ></mrow><mo>)</mo></mrow><mo>=</mo><mi>M</mi><mi>S</mi><mi>E</mi><mi>L<
   /mi><mi>o</mi><mi>s</mi><mi>s</mi><mrow><mo>(</mo><mrow><msub><mrow><mo
   ver
   accent="true"><mrow><mi>x</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow><mo>)</mo></mrow><mo>.</mo> :MATH]
       31

The pseudo-dose trajectory

   To calculate the pseudo-dose trajectory, we take the mean t-SNE
   embedding of cells with the same dosage as a point on the pseudo-dose
   trajectory:
   [MATH: <mover
   accent="true"><mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>d</mi></mrow
   ></msub></mrow><mo
   accent="true">¯</mo></mover><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mr
   ow><msub><mrow><mi>n</mi></mrow><mrow><mi>d</mi></mrow></msub></mrow></
   mfrac><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><msub><mrow><mi>n</mi></mrow><mrow><mi>d</mi></mrow></msub></mrow
   ></msubsup><msub><mrow><mi>z</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi>d
   </mi></mrow></msub><mo>,</mo> :MATH]
   32

   where z[i,d] is the t-SNE latent embedding of cell i at dose d, n[d] is
   the number of cells at dose d, and
   [MATH: <mover
   accent="true"><mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>d</mi></mrow
   ></msub></mrow><mo accent="true">¯</mo></mover> :MATH]
   is the mean t-SNE embedding for dose d, representing a point on the
   pseudo-dose trajectory. The sequence of these mean latent embeddings
   [MATH: <mover
   accent="true"><mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>d</mi></mrow
   ></msub></mrow><mo accent="true">¯</mo></mover> :MATH]
   forms the pseudo-dose trajectory.

Screening candidates based on gene signatures

   Inspired by the CMap connectivity score^[206]15 and L1000
   project^[207]1, PRnet employed a similar reverse signature paradigm to
   screen candidates based on gene signatures. PRnet provided a workflow
   for screening candidates for complex diseases according to the
   reference changes in gene sets.

   Step 1: Given the SMILES of the screening compounds library and
   unperturbed transcriptional profiles of specific cell lines, PRnet
   predicts the perturbed transcriptional profiles of all compounds across
   multiple concentration gradients. This prediction is performed with
   three repeats to ensure computational robustness.

   Step 2: The transcriptional profiles of 978 landmark genes are
   transformed into profiles of 12,328 genes using linear transformation
   vectors derived from the L1000 project^[208]1. The fold-change of
   transcriptional profiles is then calculated, which includes the average
   fold-change of each compound across cell lines. Genes are ranked based
   on their fold-change values. Gene expression signatures representing
   the disease states or sensitive compound responses are generated, known
   as query signatures. Query signatures include an up-regulated gene set
   and a down-regulated gene set which are the reversed signatures of the
   disease. Gene expression signatures of screening compounds are also
   generated, which are the ranked gene lists.

   Step 3: The Kolmogorov-Smirnov test is employed to calculate enrichment
   scores, which measure the connectivity of screening compound profiles
   with the query signatures. The score to reverse disease up- and
   down-regulated features are calculated separately and then summed
   together:
   [MATH:
   <mi>a</mi><mo>=</mo><msub><mrow><mi>max</mi></mrow><mrow><mi>m</mi><mo>
   =</mo><mn>1</mn><mo>~</mo><mi>p</mi></mrow></msub><mfenced close="]"
   open="["><mrow><mfrac><mrow><mi>m</mi></mrow><mrow><mi>p</mi></mrow></m
   frac><mo>−</mo><mfrac><mrow><mi>R</mi><mrow><mo>(</mo><mrow><mi>m</mi><
   /mrow><mo>)</mo></mrow></mrow><mrow><mi>n</mi></mrow></mfrac></mrow></m
   fenced><mo>,</mo> :MATH]
   33
   [MATH:
   <mi>b</mi><mo>=</mo><mo>−</mo><msub><mrow><mi>max</mi></mrow><mrow><mi>
   m</mi><mo>=</mo><mn>1</mn><mo>~</mo><mi>q</mi></mrow></msub><mfenced
   close="]"
   open="["><mrow><mfrac><mrow><mi>R</mi><mrow><mo>(</mo><mrow><mi>m</mi><
   mo>−</mo><mn>1</mn></mrow><mo>)</mo></mrow></mrow><mrow><mi>n</mi></mro
   w></mfrac><mo>−</mo><mfrac><mrow><mi>m</mi><mo>−</mo><mn>1</mn></mrow><
   mrow><mi>q</mi></mrow></mfrac></mrow></mfenced><mo>,</mo> :MATH]
   34
   [MATH:
   <mi>S</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</mi><mo>=</mo><mfenced
   open="{"><mrow><mtable><mtr><mtd
   columnalign="right"><mi>a</mi><mo>−</mo><mi>b</mi></mtd><mtd
   columnalign="center"><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mo>,</mo></mtd><mtd
   columnalign="left"><mi>w</mi><mi>h</mi><mi>e</mi><mi>n</mi><mspace
   width="1.0em"></mspace><mi>a</mi><mo>×</mo><mi>b</mi><mspace
   width="0.25em"></mspace><mo><</mo><mspace
   width="0.25em"></mspace><mn>0</mn><mo>,</mo></mtd></mtr><mtr><mtd
   columnalign="right"><mspace width="1.0em"></mspace><mspace
   width="0.25em"></mspace><mspace
   width="0.25em"></mspace><mn>0</mn></mtd><mtd
   columnalign="center"><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mo>,</mo></mtd><mtd
   columnalign="left"><mi>w</mi><mi>h</mi><mi>e</mi><mi>n</mi><mspace
   width="1.0em"></mspace><mi>a</mi><mo>×</mo><mi>b</mi><mspace
   width="0.25em"></mspace><mo>></mo><mspace
   width="0.25em"></mspace><mn>0</mn><mo>,</mo></mtd></mtr></mtable></mrow
   ></mfenced> :MATH]
   35

   where p is the number of genes in the upregulated gene set of query
   signature, q is the number of genes in the downregulated gene set of
   query signature, n is the number of genes in the computed
   transcriptional profile, and R(m) is the rank of a specific gene in the
   rank list. The enrichment score is commonly used in the field to
   evaluate the distribution of the predicted gene rank in the reference
   gene signature. The gene signature is an up- and down-regulated gene
   set. The up- and down-regulated gene set of the specific disease is
   identified from the reference study. The up- and down-regulated gene
   sets of the sensitive compounds to cell lines are computed by PRnet.
   Note that the up- and down-regulated gene sets represent the reverse
   gene signatures of a specific disease, indicating the up- and
   down-regulate gene sets that the compounds are expected to enrich.
   Finally, compounds are ranked based on these enrichment scores, and the
   top-ranked compounds are recommended for diseases.

Other computational tools

   Morgan fingerprints were calculated by the RDKit^[209]31 in Python
   (2023.3.2). The atlas of perturbation profiles was organized into
   anndata format through the Scanpy package^[210]59 in Python (1.9.1).
   T-SNE^[211]33 clustering was performed by the Scanpy package in Python
   (1.9.1). Figures related to computation were plotted by
   matplotlib^[212]60 (3.5.2), seaborn^[213]61 in Python (0.12.0), and the
   ggplot^[214]62 package in R (3.5.1). The GSEA and KEGG pathway
   enrichment analysis were performed by the Clusterprofiler
   package^[215]63 in R (4.6.2), and plot by the gseaplot, enrichplot and
   cnetplot. A P-value  < 0.05 was defined as the cutoff criterion.
   Experimental figures were plotted by GraphPad Prism. Some figures were
   plotted by Chiplot ([216]https://www.chiplot.online/). And some icons
   in figures are created in BioRender ([217]https://www.biorender.com/).
   The deep learning model was constructed by Pytorch^[218]64 (1.12.1).

Cell culture

   All the cell lines used in this work were purchased from the American
   Type Culture Collection (ATCC). HT29, SW480, SW620, and HCT116 cells
   were cultured in DMEM (Gibco) medium supplemented with 10% fetal bovine
   serum, 100 U/mL penicillin, and 100 U/mL streptomycin. Colo205,
   NCI-H196, NCI-H209, NCI-H446, NCI-H526, and NCI-H69 cells were cultured
   in RPMI-1640 (Gibco) medium supplemented with 10% fetal bovine serum,
   100 U/mL penicillin, and 100 µ/mL streptomycin. DMS114 cells were
   cultured in Waymouth’s MB 752/1 (Gibco) medium supplemented with 10%
   fetal bovine serum, 100 µ/mLpenicillin, and 100 µ/mL streptomycin. All
   the cell lines were incubated at 37 ^∘C in a humidified 5% CO[2]
   atmosphere. All cells were negative for mycoplasma, and these cell
   lines are not among those commonly misidentified by the International
   Cell Line Authentication Committee (ICLAC).

Cell viability assays

   MTT (3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide)
   assay was performed for the evaluation of cell viability. The cells
   were seeded into 96-well plates at a density of 2000–20000 cells per
   well in full medium. Overnight, test compounds with indicated
   concentrations were added for 72 h. 20 μL of MTT (5 mg ⋅ mL^−1 in
   saline, Sigma) was then added per well for 2 h, and 50 μL SDS (20%,
   dissolved in H[2]O containing 1% HCl) was added per well, followed by
   incubating overnight. The absorbance at 570 nm was measured using a
   multiscan spectrum reader (BMG lab tech). The cell survival rate was
   calculated subsequently.

Atlas of perturbation profile

Data preprocessing

   The HTS RNA-seq datasets used for this study all underwent the same
   preprocessing. For single-cell RNA-seq data, the gene expression of
   each cell was normalized by total counts over all genes with
   log-transformation, and 5000 highly variable genes (HVGs) were selected
   using Scanpy. For bulk RNA-seq data, 978 different genes of level 3
   were normalized with log transformation for training and evaluation.
   These data were split into training, validation, and test sets by
   dividing various perturbation conditions (such as compounds, cell type,
   and pathway) with a ratio of 6:2:2. And 5-fold cross-validation was
   applied for training.

The L1000 dataset

   The L1000 project^[219]2 contains more than 1 million bulk RNA-seq
   observations with 978 landmark genes of level 3. We performed data
   cleaning using the following criteria: deleted insufficient compound
   conditions(observations  < 5); removed invalid compound SMILES which
   were not successfully parsed by RDKit; assigned each perturbed
   observation an unperturbed observation and removed the unpaired
   observations. Finally, we obtained 175,549 cov_compounds_dose_name
   condition (unperturbed-perturbed pair), which contains 82 cell lines
   and 17,202 compounds.

The sci-Plex3 dataset

   The sci-Plex3 study^[220]2 was subseted to 290,888 cells. We performed
   data cleaning using the following criteria: randomly subsampled the
   dataset to half size; selected 5000 highly variable genes (HVGs) using
   Scanpy; removed invalid molecule SMILES which could not successfully be
   parsed using RDKit^[221]31; assigned each perturbed observation an
   unperturbed observation and remove the unpaired observations. Finally,
   we obtained 2244 cov_drug_dose_name condition (unperturbed-perturbed
   pair), which contains three cancer cell lines (A549, MCF7, K562), which
   were treated with 188 different compounds in 4 dosages (10, 100, 1000,
   and 10000 nM) and vehicle for unperturbed cells.

The SCLC dataset

   The unperturbed profiles of small cell lung cancer collected from CCLE
   contain 5 cell lines, including NCI-H69, NCI-H526, NCI-H446, NCI-H209,
   and NCI-H196. We performed data preprocessing using the following
   criteria: selected 978 landmark genes of unperturbed profiles; and
   filled in the valid gene expression with the mean expression value of
   all genes in the current cell line. After the virtual screening, we
   obtained 4272486 transcriptional profiles.

The compounds libraries

   Six compound libraries were used in this study. The training compound
   libraries are collected from the L1000^[222]1 and the sci-Plex3
   datasets^[223]2. in silico screening, an FDA-approved library
   (TargetMol, n = 935), an active compound library (Selleckchem,
   n = 4158), a natural compound library (Herb, n = 30,456), and an
   in-house drug-like compound library(n = 29,670) were used to screen the
   positive chemicals. The sensitive compounds of cell lines are collected
   from Genomics of Drug Sensitivity in Cancer (GDSC)
   ([224]https://www.cancerrxgene.org/) with z-score≥ 2.0.

The gene signature of diseases

   The gene signature of diseases was collected from CRowd Extracted
   Expression of Differential Signatures (CREEDS)^[225]43. The CREEDS
   project annotated 839 diseases versus normal signatures from Gene
   Expression Omnibus (GEO). We collected the gene signature of the
   disease by removing the study with the intersected set of up- and
   down-regulated genes and 12,328 genes and set less than 1 gene and
   finally obtained 577 studies for 233 unique diseases. The gene
   signatures of diseases were downloaded from [226]CREEDS.

The scRNA-seq data of pediatric AML patients cohort

   We collected scRNA-seq data of paired pre- and post-chemotherapy whole
   bone marrow samples from 13 pediatric AML patients who achieved disease
   remission following chemotherapy, as described by Zhang et al.^[227]35.
   These patients were treated with either a low-dose chemotherapy (LDC)
   regimen or a standard-dose chemotherapy (SDC) regimen. The LDC regimen
   consisted of cytarabine (10 mg/m^2) and mitoxantrone or Idarubicin,
   administered concurrently with G-CSF (5 μg/kg). The SDC regimen
   consisted of cytarabine (100 mg/m^2), daunorubicin, and etoposide. We
   compiled the expression profiles of all 224,217 cells from the 13
   patients and performed normalization and log transformation. Each cell
   was annotated with patient information, cell type, and chemotherapy
   regimen based on the information provided in the study. The LDC regimen
   was annotated with cytarabine and mitoxantrone, while the SDC regimen
   was annotated with cytarabine, daunorubicin, and etoposide. Through
   highly variable gene analysis, we retained 33,538 highly variable genes
   for training. The scRNA-seq data of patients were downloaded from the
   Genome Sequence Archive for Humans at the BIG data center, Beijing
   Institute of Genomics, Chinese Academy of Sciences, and China National
   Center for Bioinformation under accession number ([228]OMIX005223).

The data of A549 and MCF7 cell strains

   We collected expression profiles for 27 MCF7 strains and 23 A549
   strains from Ben-David et al.^[229]34. Expression profiles for 978
   L1000 landmark genes were used for each strain. We included 35
   compounds from the chemical screen in the study^[230]34, retaining 33
   after removing those that RDKit could not generate to FCFP
   fingerprints. We also collected the drug response to 35 active
   compounds of MCF7 strains in the chemical screen, including decreasing
   (active), decreasing (weakly active), and inactive.

Statistics and reproducibility

   All samples discussed in this manuscript (cell viability assays and
   perturbation profile atlas generation) were measured and analyzed as
   technical triplicates. No data were excluded from the analyses.

Reporting summary

   Further information on research design is available in the [231]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [232]Supplementary Information^ (7.2MB, pdf)
   [233]Peer Review File^ (4.3MB, pdf)
   [234]41467_2024_53457_MOESM3_ESM.pdf^ (30KB, pdf)

   Description of Additional Supplementary Files
   [235]Supplementary Data 1^ (22.4KB, xlsx)
   [236]Supplementary Data 2^ (23.1KB, xlsx)
   [237]Supplementary Data 3^ (14.5KB, csv)
   [238]Supplementary Data 4^ (990.3KB, csv)
   [239]Reporting Summary^ (76.6KB, pdf)

Source data

   [240]Source Data^ (2.7MB, zip)

Acknowledgements