Graphical abstract

   graphic file with name fx1.jpg
   [43]Open in a new tab

Highlights

     * •
       We propose a gray box framework with Boolean networks and a black
       box optimizer
     * •
       GREY is a learned optimizer specialized in Boolean network
       optimization
     * •
       Gray box framework can predict drug responses and reveal underlying
       mechanisms

Motivation

   The challenge of predicting cellular responses to perturbations amid
   the complex non-linearities of molecular interactions has spurred the
   development of machine learning-based models. However, interpreting
   these models in terms of molecular regulatory dynamics remains
   difficult. Conversely, logical network models like Boolean networks
   offer interpretability but struggle with large-scale networks due to
   high-dimensional search spaces. To overcome these hurdles, we introduce
   a scalable derivative-free optimizer, trained via meta-reinforcement
   learning, for Boolean network models. This approach enables prediction
   of anti-cancer drug responses in cancer cell lines while offering
   valuable insights into their molecular regulatory mechanisms.
     __________________________________________________________________

   Kim et al. present a gray box framework that combines a white box
   logical model with a black box optimizer, addressing challenges in
   interpreting molecular regulatory dynamics. The gray box framework
   successfully predicts anti-cancer drug responses of cancer cells, while
   shedding light on the underlying molecular regulatory mechanisms.

Introduction

   Controlling cell fate with molecular regulatory perturbations is an
   important challenge in various areas of biology. However, the control
   effect of a perturbation on specific molecules is often
   counterintuitive due to the complexity of the cellular network arising
   from non-linear molecular regulatory interactions.[44]^1 Molecular
   regulatory scale network structures, such as feedback loops and
   crosstalk, cause perturbation responses of the network to be highly
   dynamic processes. Therefore, predicting cellular response to a
   perturbation from static information alone, such as mutation profiles
   incorporated into the cellular network, obscures the underlying
   mechanism.[45]^2 In contrast, dynamic information regarding network
   state changes over time after a perturbation reveals the key regulatory
   mechanisms of the network response and aids the precision of response
   predictions. In this view, dynamic models of molecular regulation
   capable of simulating cellular responses to perturbations provide
   critical insight.[46]^3^,[47]^4^,[48]^5

   Various machine learning (ML) models, including deep learning, have
   shown high accuracy in predicting non-linear dynamic processes of
   cellular networks, such as responses to molecular regulatory
   perturbations.[49]^6^,[50]^7^,[51]^8^,[52]^9^,[53]^10 However, ML-based
   models cannot provide biological rationale for their predictions since
   interpreting the internal workings of the model is difficult. To
   rectify this “black box” nature of ML models, some interpretable
   methods such as random forest and linear regression have been applied
   to predict cellular responses to
   perturbations,[54]^11^,[55]^12^,[56]^13^,[57]^14^,[58]^15^,[59]^16
   which provide biological features highly correlated to the predictions.
   More recently, explainable artificial intelligence[60]^17^,[61]^18 and
   deep neural networks with architectures that mimic the relationships of
   biological entities, such as hierarchies of biological pathways,[62]^19
   subsystems,[63]^20^,[64]^21 or molecular interactions,[65]^22 have been
   utilized. These methods suggest biological features related to the
   decision of the models and identify their relationships. However, due
   to the lack of dynamic information provided by these suggestions from
   the ML-based models, a dynamic model of molecular regulation is needed
   to simulate and predict perturbation responses.

   In systems biology, a cellular network is considered as a non-linear
   dynamical system of interacting cellular components, such as genes and
   transcription factors.[66]^23^,[67]^24 Thus, the network can be
   abstracted to a system model containing molecular components and their
   regulatory parameters represented by nodes and links. Continuous
   dynamic models, such as ordinary differential equation network models,
   can be constructed by integrating multiple sources of biological data
   and a priori knowledge, and therefore can successfully recapitulate
   complex dynamic processes such as cellular responses to
   perturbations.[68]^25^,[69]^26 Reconstructing and simulating a dynamic
   network model reproduces molecular regulatory mechanisms, which not
   only predicts the perturbation responses but also enables
   interpretation of molecular regulation dynamics such as perturbation
   signal flows related with those responses. This approach is called
   “white box” modeling due to the interpretable nature of the predicted
   responses driven by the system. Despite this, estimating every kinetic
   parameter of the continuous dynamic model is only feasible for small
   and well-investigated networks. As the network size increases,
   accurately estimating parameters becomes more difficult due to the
   increase in the dimensionality of the search space.

   Discrete logic-based models, such as Boolean network models, can
   partially overcome these problems by abstracting the network dynamics
   into simple but essential regulatory functions such as logical
   equations. These functions can be represented by network parameters or
   edge weights. The influence of neighboring nodes on a specific node in
   the network can be determined by considering the states of these
   neighboring nodes and the weights of the directed edges connecting them
   to the specific node. This influence is quantified by a weighted sum,
   calculated by multiplying the state value of each neighboring node by
   the corresponding weight of the edge connecting it to the specific
   node. If this weighted sum is greater than 0, the specific node becomes
   activated, influenced by the states of its neighboring nodes.
   Conversely, if the sum is less than 0, the node is deactivated, again
   due to the influence of the states of its neighboring nodes. Several
   studies have been made to estimate regulatory functions of these
   discrete models based on information theory[70]^27^,[71]^28^,[72]^29 or
   meta-heuristic optimization[73]^30^,[74]^31 due to the gradient-free
   characteristics of discrete model optimization, but such approaches
   still have scalability problems with high computational complexity.
   Moreover, the growing demand for discrete dynamic models of in silico
   drug discovery for precision medicine requires more efficient
   optimization algorithms, due to the complexity of estimating network
   parameters for each unique in silico model that predicts diverse drug
   response dynamics of patient samples. Therefore, a more sophisticated
   optimization algorithm is needed to efficiently estimate regulatory
   functions and scale up to larger networks.

   The learning to optimize (L2O)[75]^32 framework is a methodology that
   utilizes ML to train an optimization algorithm. Whereas traditional ML
   trains a particular model, L2O trains an optimizer that can be
   subsequently applied to optimize several related models. By using L2O
   to train a specialized optimizer for a specific type of optimization
   problem, it has the potential to outperform classical optimization
   algorithms in terms of computational complexity.[76]^33^,[77]^34 This
   is because the trained optimizer tries to learn the characteristics of
   the search space for the class of problems at hand, which enables it to
   efficiently explore the space of a particular one of those problems. In
   this study, we present a “gray box framework” that optimizes discrete
   dynamic models of molecular regulatory networks using an optimizer
   trained for discrete models with the L2O approach. This gray box
   framework combines the interpretability of a white box model with the
   scalability of a black box optimizer, allowing it to be both
   interpretable and able to scale up to larger networks ([78]Figure 1).
   We trained the black box optimizer, which we call GREY for
   gradient-free optimizer trained by meta-reinforcement learning to yield
   edge weights of Boolean networks, using synthetic optimization problems
   that imitate white box models consisting of Boolean networks.

Figure 1.

   [79]Figure 1
   [80]Open in a new tab

   A schematic representation of the gray box framework

   In the gray box framework, the white box model provides topological
   information of intracellular networks to the black box model, and the
   black box model optimizes the edge weights of intracellular networks
   from biological data. More specifically, the black box optimizer is
   trained on synthetic environments generated by random networks with
   perturbation fold change simulations. By blending these modeling
   frameworks, the gray box model can predict precise drug responses that
   are biologically interpretable.

   GREY can learn a derivative-free exploration strategy to find an
   optimal solution in the high-dimensional discontinuous search space of
   discrete dynamical models such as Boolean networks. To validate the
   optimization ability of GREY for real biological perturbation data, we
   applied it to drug response prediction. We constructed a drug response
   core network (DRCN) structure for each individual drug extracted from a
   pan-cancer core network (PCCN) integrated from public databases. The
   DRCN optimized by GREY can successfully predict the drug responses of
   cancer cell lines by utilizing the area under the curve (AUC) values
   derived from the viability curve data in the database Genomics of Drug
   Sensitivity in Cancer 2 (GDSC2):
   [81]https://www.cancerrxgene.org/.[82]^35 GREY was able to optimize the
   logical parameters of the DRCN more efficiently than a state-of-the-art
   derivative-free optimization algorithm. Further mechanism analysis
   showed that the optimized network reveals key network motifs related
   with perturbation response dynamics, which were utilized to infer drug
   response biomarkers that match basket trial marker genes reported from
   previous experiments.

Results

A gray box framework that optimizes a white box logical model using a black
box optimizer

   We developed a gray box framework by combining a black box optimizer
   with a white box modeling approach that generates biologically
   interpretable predictions. The framework constructs a Boolean network
   to model drug response at the cellular level, and optimizes the edge
   weights of the network to match existing biological data, such as
   cellular responses to drug perturbation. The gray box framework can be
   broken into three parts ([83]Figure 2): (1) an extraction module for
   the DRCN structure, (2) a preprocessing module for the cell line
   profiles and drug response data, and (3) a gray box module that
   optimizes the logical relationships within the DRCN.

Figure 2.

   [84]Figure 2
   [85]Open in a new tab

   Overall procedure for constructing the gray box framework

   The extraction module (green dotted box) constructs the pan-cancer core
   network (PCCN) based on pan-cancer related networks gathered from
   cancer-related genes and network topology data and extracts drug
   response core networks (DRCNs). The preprocessing module (blue dotted
   box) gathers mutation profiles and drug response data of cell lines,
   and projects them onto the DRCN, generating DRCNs specific to each
   drug. The gray box module (black dotted box) optimizes edge weights
   using GREY based on the DRCN and preprocessed data. The ensemble of
   optimized DRCNs is utilized by the drug response dynamic network
   simulator to predict drug responses and generate mechanistic
   interpretations.

   The DRCN is constructed to distill existing knowledge of regulatory
   interactions between genes and proteins related to drug responses in
   cancer into a minimal network. It begins with a pan-cancer related
   network that isolates genes contained in cancer-related biological
   networks, frequently mutated genes in cancer, and target genes of
   anti-cancer drugs. The links from a structural network database of
   intracellular signaling among genes and proteins, Omnipath,[86]^36 that
   have both a direction (identified source and target nodes) and a sign
   (whether the source node activates or inhibits the target node) are
   chosen to connect the selected genes into the pan-cancer related
   network. The PCCN was then extracted by removing irrelevant inter-nodes
   in the pan-cancer related network that are not contained in a path
   between cancer-related genes. Next, we extracted the DRCN from the
   constructed PCCN. The DRCN is required to keep the network size as
   small as possible to minimize the computational requirements of the
   Boolean network simulations, while retaining all drug response-related
   genes and signaling pathways. To accomplish these requirements, a
   minimum number of genes were selected that were significantly related
   to the drug response of cancer cell lines in GDSC2. Then, a minimum
   number of links and intermediate nodes were selected by Edmonds’
   algorithm, and utilized to connect the selected genes to the DRCN.
   Finally, phenotypic “proliferation” and “apoptosis” nodes, and their
   links from the CancerGeneNet,[87]^37 were added to the extracted DRCN
   as phenotypic markers. As a result, the constructed DRCNs incorporate
   both drug response-related genes and proteins as nodes, recapitulating
   the fundamental molecular regulatory dynamics of the selected drug
   responses (see [88]STAR Methods).

   The ability for the proposed gray box framework to predict drug
   responses was evaluated across three different DRCNs to ensure
   robustness across a variety of network structures. We constructed three
   different DRCNs of US Food and Drug Administration-approved targeted
   cancer drugs from Molecular Analysis for Therapy Choice (MATCH) basket
   trials run by the National Cancer Institute (NCI-MATCH).[89]^38 We
   selected afatinib (an inhibitor of EGFR and ERBB2), trametinib (an
   inhibitor of MAP2K1 and MAP2K2, also known as MEK1 and MEK2,
   respectively), and palbociclib (an inhibitor of CDK4 and CDK6). Each of
   these drugs inhibits well-known and unique cancer-related machinery,
   and together they engage a large fraction of the nodes covering the
   PCCN structure (see [90]STAR Methods). The afatinib DRCN comprised 57
   nodes, whereas the trametinib and palbociclib DRCNs comprised 66 nodes
   and 192 nodes, respectively.

   While the constructed three different DRCNs that GREY optimizes
   recapitulate essential regulatory network structures related with drug
   responses, each drug response of each individual cancer cell line can
   be simulated by applying the mutational profiles of each cell line from
   the GDSC2 database to the DRCN of each of the three drugs. To apply
   mutations from a cell line to the DRCN of a given drug, the
   corresponding nodes in the DRCN whose mutations correspond to the
   mutation profiles of that cell line in GDSC2, are set to a specific
   invariant state. The invariant node state is set to 1 if the mutation
   is either activation, amplification, or gain of function and set to −1
   if the mutation is suppression, deletion, or loss of function.
   Mutations that are in nodes outside of the DRCN are treated as a
   constant input signal that activates or inactivates any downstream
   nodes in the DRCN. The mutation profiles of each cell line were
   obtained from the Catalogue of Somatic Mutations in Cancer
   (COSMIC)[91]^39 database and the effect of each mutation was determined
   from their oncogenic or tumor suppressive q values (see [92]STAR
   Methods).[93]^40 In this way, we mapped numerous mutation profiles of
   cell lines to a small number of nodes in our DRCNs. For example, 716
   cell lines were represented identically by the afatinib DRCN composed
   of 57 nodes. Having incorporated mutation profiles and drug responses,
   the edge weights of the resulting DRCNs are then ready to be optimized
   by GREY.

Construction and optimization procedure of GREY

   We developed GREY to optimize edge weights of DRCNs for the mutation
   profiles and the drug response data from GDSC2. We formalized DRCNs
   using weighted sum Boolean logic (see [94]STAR Methods), and the edge
   weights of the DRCN were assigned to each link. In weighted sum Boolean
   logic, the activation of each node is determined by a weighted sum of
   values from its incoming connections, where the weights represent the
   regulatory influence of each link. Due to the non-continuous fitness
   space that results from the discrete states of the nodes, and the
   time-consuming attractor simulation processes of Boolean networks, we
   needed a derivative-free and sample-efficient optimizer. Thus, we
   trained GREY, which is an optimizer specialized for optimizing the edge
   weights of Boolean networks, through L2O, which employs multi-agent
   meta-reinforcement learning[95]^41^,[96]^42^,[97]^43 for its
   implementation. An agent in reinforcement learning is an entity that
   learns to make optimal decisions by interacting with an environment and
   receiving feedback in the form of rewards. In multi-agent reinforcement
   learning, multiple agents must learn how to cooperate with each other
   to achieve their collective objectives.

   In the current setting, the environment is an optimization problem
   (optimizing the edge weights of Boolean networks), and there is one
   agent for each edge weight. The agents explore the search space of the
   optimization problem (all possible edge weight values) to find good
   solutions that have high fitness. The edge weight values of each
   solution are utilized to simulate a Boolean network. Fitness is then
   defined as the Pearson’s correlation coefficient between the simulated
   fold change of the node values and the ground truth extracted from a
   database or biological experiment. The reward for agents is
   proportional to the difference between its current fitness and its
   previous best fitness, and the agents are trained to maximize reward.
   In traditional reinforcement learning, an agent learns an exploration
   strategy to maximize its reward within a specific environment. For our
   problem, each drug corresponds to a single DRCN structure. These
   different DRCNs represent different environments for the optimizer. To
   accommodate multiple environments in the form of multiple different
   DRCN structures, we adopted meta-reinforcement learning, which allows
   agents to learn adaptable exploration strategies by training across a
   variety of different environments.

   In a given environment, GREY receives the network structure of the
   environment with edge weights and reward as input, and produces new
   edge weights as output. Then, fitness is computed using these new edge
   weights, followed by a reward. The entire sequence from entering input
   to GREY to obtaining a reward constitutes a single step. The new edge
   weights and the reward obtained from the preceding step are used as
   input for the subsequent step. In the initial step, edge weights and
   reward used as input are initialized to random values and zero,
   respectively. An episode is a sequence of interactions between GREY and
   an environment, i.e., a sequence of steps, and GREY is trained from the
   cumulative experiences gained throughout the episode.

   GREY was trained using 100,000 synthetic environments with random
   networks composed of 20–40 nodes, random edge weights, and a generated
   ground truth dataset (see [98]STAR Methods; [99]Figures 3A and
   [100]S1). The random network was formalized by weighted sum Boolean
   logic, and the ground truth dataset comprises the fold change of the
   average activity for nodes between the first attractors attained from
   random initial states and the second attractors reached from the first
   attractor after fixing some nodes to 1 or –1. The resulting GREY
   trained on randomly generated synthetic environments can be applied to
   optimize the edge weights of biological networks with any given
   structure and corresponding data ([101]Figure 3B).

Figure 3.

   [102]Figure 3
   [103]Open in a new tab

   Overall architecture of GREY and training procedure

   (A) An illustration of the training procedure.
   [MATH: <mrow><mover accent="true"><mi>A</mi><mo>ˆ</mo></mover></mrow>
   :MATH]
   is the true edge weights of the environment, and
   [MATH: <mrow><msub><mi>A</mi><mi>t</mi></msub></mrow> :MATH]
   is the new edge weights at step
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   .
   [MATH: <mrow><msub><mi>f</mi><mi>t</mi></msub></mrow> :MATH]
   is the fitness at step
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   , and
   [MATH: <mrow><msub><mi>R</mi><mi>t</mi></msub></mrow> :MATH]
   is the observed improvement (OI) of the fitness (PCC, Pearson’s
   correlation coefficient) as a reward at step
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   .
   [MATH: <mrow><msub><mi>m</mi><mi>t</mi></msub></mrow> :MATH]
   includes hidden states. After generating an episode by repeating 80
   steps, the trajectory is stored in the buffer. Neural network
   parameters of GREY (
   [MATH: <mrow><msub><mi>θ</mi><mi>G</mi></msub></mrow> :MATH]
   ) are updated whenever 64 episodes are saved, and then the buffer is
   cleared. Supervised pre-training to predict the true edge weights is
   performed until the initial 8,000 episodes are used and, from then on,
   reinforcement learning is performed with the DPPO algorithm to maximize
   the cumulative discounted rewards.

   (B) Trained GREY can optimize the edge weights of the white box Boolean
   network model regardless of network structure. During each step of
   optimization, trained GREY iteratively receives network structure with
   preceding edge weights and its reward as input, and then attempts to
   find new edge weights with better fitness and returns those as output.

   (C) Inputs are processed using graph neural networks and node, edge,
   and global features are generated.

   (D) For each edge, a reorganized feature is formed by concatenating its
   edge feature, the node features of sender and receiver nodes, and the
   global feature.

   (E) LSTMs are assigned to each edge. While the parameters of these
   LSTMs are shared, they maintain independent hidden states.

   The architecture of GREY consists of three layers: the input, the
   submodule, and the main layer ([104]Figure S1A). The input and the
   submodule layers serve an auxiliary role of providing information to
   the main layer. The input layer is composed of several graph neural
   networks (GNNs),[105]^44 which are a type of neural network designed to
   process and learn from data structured as graphs by capturing
   relationships between nodes. GNNs in the input layer produce node,
   edge, and global (graph-level) features by aggregating information from
   neighboring nodes, edges, and the entire graph, and they represent
   underlying dependencies within the graph ([106]Figure 3C). Those
   features are then reorganized for each edge by concatenating the edge
   feature with the node features of sender and receiver nodes, as well as
   the global features ([107]Figure 3D). These reorganized features are
   then passed to the submodule and the main layers and processed by long
   short-term memory (LSTM) ([108]Figure 3E). In the submodule and the
   main layers, one LSTM is assigned to each edge, and LSTMs in the main
   layer each act as one agent. Each agent produces a new weight for its
   corresponding edge, and the new weights from all agents collectively
   determine fitness, since fitness depends on the dynamics of the network
   as a whole. The submodule layer has a context estimator, which helps
   the agent adapt quickly to the environment by estimating information
   about the environment, and a global state estimator, which helps the
   agent assigned to each edge to cooperate by estimating the behavior of
   the other agents.

Application of GREY to synthetic and biological environments

   We assessed the optimization performance of GREY against unseen
   environments. In each application, GREY produces five new candidate
   edge weight sets for each step (i.e., generating five outputs at each
   step) and randomly replaces elements (e.g., 50%) from each candidate
   edge weight set by the edge weights with the best fitness from all
   previous steps. Then, we evaluated the fitness of each candidate edge
   weight set (i.e., five evaluations per step) and used the candidate
   with the best fitness as an input for the next step. First, we checked
   that GREY can perform optimization on an example synthetic environment
   not used in training. As more evaluations were performed, GREY found
   edge weights with higher fitness ([109]Figure 4A). Random agents, which
   are untrained GREY, showed no meaningful behavior. Thus, we confirmed
   that GREY can perform optimization properly.

Figure 4.

   [110]Figure 4
   [111]Open in a new tab

   Performance of GREY on synthetic environments and DRCNs

   For synthetic environments and DRCNs, we used 128 and 1,000 random
   initial states, respectively. Initial states are sampled for every step
   during optimization. During each step, 5 sets of edge weights are
   generated and evaluated, and the set of edge weights with the highest
   fitness becomes an input for the next step.

   (A) GREY optimization for a synthetic environment (50 nodes, 94 links)
   that was not used in training. The larger line plot shows the best
   fitness (y axis) across evaluations (x axis), while the inset plot
   shows fitness at each evaluation. Results of five random agents are
   shown for comparison. Shaded areas have 95% confidence intervals over
   three independent runs.

   (B and C) Comparison of the results of CMA-ES on various scales.
   Initial standard deviation (
   [MATH: <mrow><msub><mi>σ</mi><mn>0</mn></msub></mrow> :MATH]
   ) of CMA-ES was chosen among (0.1, 0.3, 0.5, 1.0, 2.0) by selecting
   that which exhibited the highest average fitness. The number of
   evaluations per step (population size) of CMA-ES are (
   [MATH: <mrow><mn>4</mn><mo linebreak="goodbreak"
   linebreakstyle="after">+</mo><mn>3</mn><mspace
   width="0.25em"></mspace><mi>ln</mi><mrow><mo>(</mo><mrow><mi>n</mi><mi>
   e</mi><mi>t</mi><mi>w</mi><mi>o</mi><mi>r</mi><mi>k</mi><mspace
   width="0.25em"></mspace><mi>s</mi><mi>i</mi><mi>z</mi><mi>e</mi></mrow>
   <mo>)</mo></mrow></mrow> :MATH]
   ), which is the default setting, in (B) and 5 in (C). The results
   depict the average maximum fitness after 100 evaluations. Error bars
   are 95% confidence intervals over 20 random networks (∗p < 0.05,
   ∗∗p < 0.01, ∗∗∗p < 0.001, Student’s t test, two-sided).

   (D) Optimization performance for DRCNs of afatinib, trametinib, and
   palbociclib. We randomly selected 300 cell lines for each drug. Shaded
   areas are 95% confidence intervals over three independent runs.

   (E) Median and median absolute deviations of F1 scores for 21 drugs in
   four methods.

   Next, we compared the optimization performance of GREY to covariance
   matrix adaptation evolution strategy (CMA-ES),[112]^45 which is
   considered the state-of-the-art derivative-free optimization method for
   challenging settings, such as non-convex and non-continuous problems.
   During each optimization step, evaluating fitness incurs the greatest
   computational cost for both GREY and CMA-ES, because of the
   time-consuming attractor simulation processes of Boolean networks.
   Thus, we compared the performance in terms of sample efficiency, aiming
   for higher fitness with fewer evaluations. GREY achieved statistically
   significant higher fitness for all scales of networks (20–200 links
   with 20 networks for each size interval) that were not used during
   training ([113]Figure 4B). Furthermore, GREY outperformed CMA-ES when
   the number of evaluations per iteration step was the same
   ([114]Figure 4C).

   We then applied GREY to optimizing the DRCNs of three drugs, afatinib,
   trametinib, and palbociclib, with respect to their drug response data
   from GDSC2 ([115]Figure 4D). When optimizing the edge weights of the
   DRCN for a certain drug, the fitness of the solution is the accuracy of
   DRCN drug-response predictions. More specifically, the DRCN predicts
   drug responses by calculating the difference between the average
   activity of phenotype nodes between the prior attractor and the new
   attractor. The prior attractor is obtained from random initial states
   with the given mutation profile for each cell line, and the new
   attractor is obtained from the prior attractor while maintaining drug
   target nodes at −1. The fitness is then defined as the Pearson’s
   correlation coefficient between the predicted drug responses from DRCN
   and the ground truth, as before. However, unlike with the simulated
   datasets, the ground truth drug response of each node is not known.
   Instead, only the activity change of the phenotype nodes are measured
   and compared with the ground truth to determine fitness. The ground
   truth of the drug response is the measured area under the viability
   curve (AUC) value collected from GDSC2 (see [116]STAR Methods).

   Compared with CMA-ES, GREY consistently achieved better fitness when
   optimizing DRCNs for all three drugs. For the trametinib DRCN, GREY
   reached a fitness of 0.3 after an average of 403 evaluations, which is
   approximately 3 times more sample-efficient than an average of 1,102
   evaluations of CMA-ES. GREY successfully optimized the palbociclib DRCN
   consisting of 655 edge weights, despite the search space size being
   significantly larger than that of the synthetic environments used in
   the training of GREY. These results suggest that GREY is sample
   efficient and highly scalable.

   We then compared prediction performance of optimized DRCNs over 21
   drugs that have various modes of action ([117]Table S1) with ML models
   SRMF[118]^46 and DeepCDR,[119]^10 and a logical prediction model
   LOBICO.[120]^47 Each of the 21 drugs represents a corresponding group
   of drugs targeting the specific pathway according to the labels in the
   GDSC2 database. For each drug, 300 cell lines were randomly selected to
   form a training set and the remainder formed a test set. All comparison
   models were trained using the same set of training data with default
   settings. Drug-response values were binarized according to the rule
   defined by Knijnenburg et al.[121]^47 Predicted drug-response values
   ([122]Figure S2) from DRCNs are binarized using an optimal threshold,
   determined based on the highest F1 scores observed across the training
   set cell lines. F1 score ([123]Equation 2) was calculated for each
   drug. Medians and median absolute deviations of all drugs are
   summarized in [124]Figure 4E.
   [MATH: <mrow><mtext>Precision</mtext><mo
   linebreak="badbreak">=</mo><mfrac><mrow><mtext>True</mtext><mspace
   width="0.25em"></mspace><mtext>positive</mtext></mrow><mrow><mtext>True
   </mtext><mspace width="0.25em"></mspace><mtext>positive</mtext><mo
   linebreak="badbreak">+</mo><mtext>False</mtext><mspace
   width="0.25em"></mspace><mtext>positive</mtext></mrow></mfrac><mo>,</mo
   ><mtext>Recall</mtext><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mtext>True</mtext><mspace
   width="0.25em"></mspace><mtext>positive</mtext></mrow><mrow><mtext>True
   </mtext><mspace width="0.25em"></mspace><mtext>positive</mtext><mo
   linebreak="badbreak">+</mo><mtext>False</mtext><mspace
   width="0.25em"></mspace><mtext>negative</mtext></mrow></mfrac></mrow>
   :MATH]
   (Equation 1)
   [MATH: <mrow><mi>F</mi><mn>1</mn><mspace
   width="0.25em"></mspace><mtext>score</mtext><mo
   linebreak="badbreak">=</mo><mn>2</mn><mo
   linebreak="goodbreak">×</mo><mfrac><mrow><mtext>Recall</mtext><mo
   linebreak="badbreak">×</mo><mtext>Precision</mtext></mrow><mrow><mtext>
   Recall</mtext><mo
   linebreak="badbreak">+</mo><mtext>Precision</mtext></mrow></mfrac></mro
   w> :MATH]
   (Equation 2)

   The prediction performances from the optimized DRCNs using GREY (GREY
   opt.) are lower than that of the ML models such as SRMF or DeepCDR.
   However, GREY opt. showed better performance than the logical
   prediction model LOBICO, and it has further strengths in terms of
   interpretability. For example, a marker predicted for afatinib by
   LOBICO is an ERBB2 mutation,[125]^47 which is a direct target of
   afatinib, but it does not explain the underlying dynamics of how the
   mutation affects the phenotype. On the other hand, our model can
   suggest molecular regulatory dynamics by simulating the optimized DRCN
   with sensitive and resistant cell lines, as we describe in the
   following sections.

Successful prediction of drug responses and molecular regulatory dynamics
with the optimized DRCNs

   Interpretable drug responses depend on accurate inference of both the
   phenotypic response and the underlying molecular regulatory dynamics.
   We explored whether the DRCNs optimized through GREY can predict not
   only the phenotype-level drug responses but also the changes
   represented by fold changes in gene expression. To ensure the
   applicability of our gray box framework throughout the pan-cancer, we
   selected two cell lines, one sensitive and one resistant to afatinib
   treatments, respectively, for each of three frequently occurring cancer
   types in humans: breast, colon, and stomach cancer. Each of the six
   selected cell lines has been carefully chosen to serve as a
   representative model, demonstrating either a sensitive or resistant
   response to the selected drug, afatinib, based on the AUCs of the GDSC2
   database ([126]Figure 5A). However, within each cancer type, the
   sensitive and resistant cell lines share identical genetic markers for
   drug response predictions, as suggested by the MATCH basket trials
   conducted by the National Cancer Institute (NCI-MATCH). Therefore,
   classifying drug response differences between sensitive and resistant
   cell lines requires interpretation of the underlying molecular
   dynamics, which remain obfuscated by classical marker-based methods
   (see [127]STAR Methods).

Figure 5.

   [128]Figure 5
   [129]Open in a new tab

   In-house experimental validation of simulated changes in gene
   expression

   (A) Relative afatinib responses of six selected cell lines from the
   GDSC2 drug responses database. Three afatinib-sensitive (in green) and
   three afatinib-resistant (in red) cell lines were well separated with
   respect to the area under the curve (AUC) of cell viability after
   various dosages of afatinib treatments given in the GDSC2 database. Two
   breast cancer cell lines used for further analysis are colored with
   higher saturations.

   (B) Simulated afatinib responses of six cell lines. Note that the first
   two graphs show a decrease in the average activity of the proliferation
   node (y axis < 0) and the third graph shows an increase in average
   activity of the proliferation node (y axis > 0) (∗p < 0.05, ∗∗p < 0.01,
   ∗∗∗p < 0.001, Student’s t test, two-sided).

   (C) In-house afatinib response data of six cell lines, quantified by
   relative confluence. Representative microscopic images for one pair of
   cell lines are shown (∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001, Student’s t
   test, two-sided).

   (D) The pathway enrichment analysis results of DEGs before and after
   each drug treatment. The most enriched Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway of DEGs in afatinib-treated BT-474 is
   “decreasing cell cycle,” The most enriched pathway in afatinib-treated
   BT-20 is “increasing DNA replication.”

   (E) Min-max normalized log2-fold change values of each gene of the
   in-house experimental data compared with min-max normalized average
   activity changes of each afatinib DRCN node computed by the drug
   treatment simulation. Parameter sets are optimized by a random agent
   and GREY, each repeated three times. Min-max normalized fold changes in
   the iterative simulation results of phenotype nodes and STAT1
   (downstream node of EGFR) computed by the optimized parameter sets
   found by GREY and a random agent are compared with experimental data in
   histograms.

   With the selected six cell lines, we firstly confirmed that
   GREY-optimized DRCNs are capable of reproducing the expected phenotypes
   in response to drug perturbations. To do so, we compared our in-house
   experimental data of drug responses in six cell lines with the drug
   responses simulated by optimized DRCNs. We mapped the genetic profile
   of each cell line onto the basal node activities of the afatinib DRCN,
   and GREY optimized the logical parameters to reproduce the viability of
   the cell lines in a robust manner. Using the optimized afatinib
   response dynamic network simulator, we simulated the proliferation
   response of each cell line, calculated as the average change of the
   proliferation node from 34 repeats ([130]Figure 5B). Average activities
   of the proliferation nodes after afatinib treatment in two of the
   sensitive cell lines (BT-474 and SW48) were significantly decreased
   compared with those of two of the resistant cell lines (BT-20 and
   SW620). For the pair of stomach cancer cell lines, the average activity
   of the proliferation nodes did not decrease following afatinib
   treatment in the simulations. However, the average increase of activity
   in the sensitive stomach cancer cell line (NCI-N87) was smaller than
   that of the resistant stomach cancer cell line (MKN45). Thus, all of
   the predicted drug responses were consistent with the relative afatinib
   responses from the GDSC2 database. We assessed the effect of afatinib
   on proliferation of the cell lines in culture. The experimental results
   of afatinib responses were consistent with the response data from the
   GDSC2 database and with the simulated predictions of drug responses
   through DRCNs ([131]Figure 5C). Detailed in-house experimental data of
   drug treatment in cancer cells are presented in [132]Figure S3.

   After confirming that the optimized model reproduces the expected
   phenotypic responses, we compared the model predictions of each gene in
   response to the drug perturbations with both existing databases and
   in-house biological data. To determine plausible gene sets that
   distinguish between sensitive and resistant pairs of cell lines at the
   cellular level, we first performed a pathway enrichment analysis with
   the web-based gene set analysis toolkit (WebGestalt)[133]^48 based on
   the Kyoto Encyclopedia of Genes and Genomes
   pathways.[134]^49^,[135]^50^,[136]^51 We selected BT-474 and BT-20 cell
   lines, because experimentally these cells had the most significantly
   different proliferation responses to afatinib. We calculated the
   log2-fold changes of differentially expressed genes (DEGs) before and
   after drug treatments in both cell lines by RNA sequencing data and
   analyzed the pathways with altered expressions ([137]Figure 5D). The
   pathway with the highest enrichment score of DEGs in the
   afatinib-treated BT-474 was “decreasing cell cycle” (hsa04110), which
   is consistent with the response of this sensitive cell line. The
   pathway with the highest enrichment score of DEGs in the
   afatinib-treated BT-20 was “increasing DNA replication” (hsa03030),
   which is known to be related to drug resistance
   mechanisms,[138]^52^,[139]^53 but did not meet the threshold for false
   discovery rate, indicating that there were no significant changes in
   gene expression induced by afatinib in these cells.

   We then preprocessed the expression fold change data of BT-20 and
   BT-474 cells to enable comparison with simulated afatinib treatment in
   GREY-optimized DRCNs. Although the expression fold changes in both
   BT-20 and BT-474 cell lines display a sufficiently large number of
   preprocessed DEGs exceeding threshold values to ensure quality control,
   the two cell lines exhibit a different number of significant DEGs.
   BT-20 cells had fewer genes with a base mean count over 100 (5,190
   genes) than the BT-474 cells (15,583 genes), of which only 21 of 55
   nodes were matched with the afatinib GREY-optimized DRCN (excluding the
   two phenotypic nodes). In the case of BT-474, 45 of 55 nodes in the
   afatinib DRCN were matched with the measured genes with a base mean
   count over 100. Both in-house expression data and DRCN simulation data
   are calculated in the form of log2-fold changes and min-max normalized.
   The log2-fold change values of each gene of the in-house experimental
   data and the average activity changes of each afatinib DRCN node
   computed by the drug treatment simulation were min-max normalized and
   compared. As the afatinib DRCN is an essential regulatory network of
   intricately interconnected genes and proteins that collectively
   influence drug response, it is imperative to validate the correlation
   between changes in gene expression levels and the resulting shifts in
   the activities of proteins within this network.[140]^54 Consequently,
   the activity changes of proteins represented by nodes in the afatinib
   DRCN can be compared and validated with the RNA expression fold changes
   measured by our in-house experiments.

   The insights into drug response differences between sensitive and
   resistance cell lines based on the in-house data were then compared
   with the predictions of GREY-optimized DRCNs. The optimized afatinib
   DCRN was simulated, and the average change in node activity due to drug
   perturbation was compared with the gene expression fold changes of the
   experimental data. The computed average activity change of each node
   was an average value of three simulations from three parameter sets
   optimized by GREY or by a random agent, which had the same structure as
   GREY but with randomly initialized weights. To check the consistency of
   fold changes in each node computed by the optimized parameter sets from
   GREY, we compared the results of 30 iterative afatinib treatment
   simulations, each with ten repetitions and different initial states for
   each of the three parameter sets, against the corresponding 30
   simulation results computed by three parameter sets optimized by the
   random agent. Genes with moderate fold changes in experimental data and
   nodes with inconsistent fold changes, which depended on the parameter
   sets or the initial state sets, were filtered according to log2-fold
   changes and fold change variance over 30 simulations, respectively. As
   a result, the fold change tendencies of 14 nodes simulated by the
   parameters of GREY-optimized afatinib DRCN and mutational profiles of
   BT-474 projected on the network were in accord with afatinib treatment
   experiments. Whereas in the random agent-optimized parameters, only 5
   of 42 results matched the experiments. In BT-20, 7 of the simulated
   node fold changes optimized by GREY showed the same tendency as the
   experimental data, while only 1 node fold change optimized by the
   random agent was well in accord with the experiments ([141]Figure 5E).
   The correlation between the simulated node fold changes optimized by
   GREY and the experimental data is further validated with a statistical
   test and compared with the data of randomly shuffled node labels
   ([142]Figure S4).

   The major difference between the networks with parameters optimized by
   GREY and the random agent was that the former depends both on network
   topology and logic, whereas the latter depends on only network
   topology. Although random agent-optimized parameters were able to
   correctly predict some of the node fold changes, the simulated network
   converged into a single attractor. Note that the iterative fold change
   results of phenotypic nodes and STAT1 (downstream node of EGFR)
   simulated by GREY-optimized afatinib DRCNs were more spread out,
   indicating various attractors, while those simulated by random
   agent-optimized afatinib DRCNs were more concentrated in a few rescaled
   fold change intervals, indicating a single attractor (histograms in
   [143]Figure 5E). The node states of the converged attractor with random
   agent-optimized afatinib DRCNs were predictable by topologically
   estimating the effect of the afatinib perturbation with the network
   diffusion algorithm suggested by Hofree et al.[144]^55 The result
   infers that the dynamics of the network are simple, lacking any complex
   feedback or crosstalk dynamics, and hence inadequate for reproducing
   the behavior of a biological network. Moreover, the single attractor
   converged to by each of the optimized afatinib DRCNs using the random
   agent differed depending on the parameter sets. This implied that with
   additional optimized parameter sets, the simulated network dynamics
   would not converge into a single distribution. As a result, increasing
   the number of parameters sets cannot guarantee prediction performance
   improvement.

   In stark contrast, the networks with the parameters optimized by GREY
   converged into a higher number of attractors, corresponding to varied
   regulatory dynamics, which exhibited similar phenotypic predictions.
   These results suggest that the networks optimized by GREY comprise not
   only topological information, but also dynamic properties including
   feedbacks and crosstalk, which are known to be important features to
   understand in order to overcome drug resistance mechanisms in cancer
   therapy.[145]^56 Moreover, even though average responses of GREY and
   the random agent-optimized DRCNs are similar, GREY-optimized networks
   can be said to more properly recapitulate the biological nature of
   perturbation responses of regulatory networks. Biology regulatory
   networks exhibit stochasticity in their gene expressions, molecular
   activities, and probability of chemical reactions, which tend to
   produce a range of distinct
   attractors.[146]^57^,[147]^58^,[148]^59^,[149]^60 The multiple
   attractors exhibited by GREY-optimized DRCNs are hence more realistic.
   Moreover, the distributions of attractors from the network optimized by
   different parameter sets from GREY were more highly clustered than
   those of the random agents. This illustrates that the prediction
   performance of the drug-response networks optimized by GREY will not be
   derailed by a larger number of parameter sets. Thus, the networks
   optimized by GREY are more suitable than the networks optimized by the
   random agent for the purpose of cancer drug-response predictions.

Interpreting regulatory mechanisms of drug resistance with the gray box
framework

   After establishing the ability of GREY to infer regulatory dynamics, we
   explored if those cellular-level insights could be interpreted to
   reveal underlying mechanisms of drug resistance. We analyzed the
   simulated responses of BT-20 and BT-474 cells to discover network
   motifs behind the differences in afatinib response. We calculated the
   difference between the average activity fold change of afatinib-treated
   BT-20 and that of BT-474 by subtracting fold change values of the
   BT-474 afatinib core network from that of the BT-20 afatinib core
   network ([150]Figure 6A). Note that some nodes had significantly
   different fold change values. With an assumption that the nodes with
   significant difference in fold change between the sensitive and the
   resistant cell lines cause the difference in afatinib responses, we
   selected the five nodes with the largest positive difference (PTPN6,
   AR, PDPK1, IGF1R, and SGK1, in decreasing order of absolute values) and
   five with the largest negative difference (MAP3K3, LCK, NFKBIA, PPARA,
   and STK3, in decreasing order of absolute values) in fold changes of
   simulated node activities.

Figure 6.

   [151]Figure 6
   [152]Open in a new tab

   Mechanistic analysis of afatinib treatment to identify the most
   influential nodes for predicting and controlling drug resistance

   (A) Afatinib DRCN. The color of each node represents the predicted
   relative activity changes of resistant and sensitive cell line models,
   and the color of the arrows denotes whether the signal flow increases
   (red) or decreases (blue) after afatinib treatment.

   (B) Activity difference sub-network for afatinib response from BT-20
   and BT-474 cells. This sub-network was extracted from the afatinib
   response network in (A).

   (C) Identification of context-dependent flows of afatinib treatment
   effects toward the proliferation node. The propagated mutational
   profile of BT-20 on the activity difference sub-network contained two
   nodes with activating mutations (EGFR and PTPN6), and three inputs from
   mutations outside this sub-network that activated (NGKB1, SYK, and AR),
   and one that is inactivated (PPARA). The propagated mutational profile
   of BT-474 on the activity difference sub-network had three nodes with
   activating mutations (EGFR, PPARA, and NFKB1), two nodes activated by
   input from mutations outside this sub-network (STAT1 and ERBB2), and
   one node inactivated by external mutation (CDK1). Signal flow and node
   activity changes in activity difference sub-network of resistant cells
   and sensitive cells upon perturbation of EGFR and ERBB2 (afatinib
   treatment) are illustrated.

   (D) The balance between mutational effects on two essential negative
   feedback loops is primarily responsible for determining the resistance
   to afatinib treatment.

   To explore the interactions between these nodes in the context of drug
   resistance, we extracted an activity difference sub-network from the
   afatinib DRCN. This activity difference sub-network was formed by
   connecting the selected ten nodes with links and inter-nodes by
   PathLinker,[153]^61^,[154]^62 which explored up to 50 paths from the
   drug target nodes (EGFR, ERBB2) to the phenotypic node (proliferation)
   through the selected 10 nodes ([155]Figure 6B). We found that the ERBB2
   cascade and AR-AKT1 negative feedback are activated in BT-20, whereas
   the EGFR cascade and NFKBIA-NFKB1 negative feedback are activated in
   BT-474 after afatinib treatment.

   To interpret the mechanism underlying the different afatinib responses
   with respect to EGFR and ERBB2 inhibition signals using the activity
   difference sub-network, mutational effects and afatinib perturbation
   signal flows were analyzed ([156]Figure 6C). Signal flows from a node
   are defined by adding the incoming signal flow to the product of the
   node activity change with the sign (positive or negative) of the
   outgoing links.[157]^63 The effect of a mutation on a node inside the
   activity difference sub-network was treated as fixing that node to a
   state, such that incoming links were deleted and a constant signal
   flows out from that node. The color of the arrows denotes whether the
   signal flow increases (red) or decreases (blue) after afatinib
   treatment. The effect of a mutation from outside the activity
   difference sub-network was incorporated by propagating the change in
   signal flow caused by a mutation through exclusive simple paths
   computed by the maximum matching algorithm. The propagated mutational
   profiles of two cell lines were incorporated into the sub-network as
   the indirect influence of mutation on the sub-network nodes (see
   [158]STAR Methods).

   The activity-difference sub-network reveals how drug responses
   differently affect each cell line. In BT-20, the decreasing signal
   flows from the afatinib-inhibited EGFR node through the
   STAT1-PPARA-NFKBIA cascade and the ESR1-AR cascade were offset by the
   increasing signal flows into the NFKB1 node and the AR node,
   respectively. The NFKB1 node and AR node were indirectly influenced by
   mutations in nodes outside the afatinib sub-network, but within the
   PCCN. The increasing signal from the afatinib-inhibited ERBB2 node
   flows through the CDK1-AR cascade and is further enhanced by the
   increasing signal flow at the indirectly activated node, AR, increasing
   the proliferation node activity. The increasing signal flowing into SYK
   from the mutations outside the afatinib DRCN also inactivated LCK,
   which dampened the ability for afatinib-inhibited EGFR to decrease the
   flow into STAT1 and NFKBIA. The net outcome of the inhibitory effect of
   afatinib on EGFR and ERBB2 is predicted to be an increase in
   proliferation node activity in BT-20.

   In BT-474, the decreasing signal flow from the afatinib-inhibited EGFR
   node into the STAT1 node is blocked by the PPARA mutation, but the
   afatinib-inhibited EGFR decreases AR node activity by reducing the
   signal flow through the ESR1-AR cascade. The increasing signal from the
   afatinib-inhibited ERBB2 node is offset by decreasing the signal flow
   from the CDK1 node, which is indirectly inhibited by mutations. As a
   result, the activity of the AR node is decreased after afatinib
   treatment, resulting in decreased activity of the proliferation node.

   Together, the signal flow analyses of the activity difference
   sub-networks predicted a reason why BT-20 and BT-474 have different
   responses to afatinib treatment: the signal flows in the two cell lines
   differently affect the NFKB1-NFKBIA and the AR-AKT1 negative feedback
   regulations ([159]Figure 6D). The underlying reason for the resistance
   to afatinib treatment for BT-20 can be summarized as the following:
   the mutational profile activates AR-AKT negative feedback regulation
   and inactivates NFKB1-NFKBIA negative feedback regulation.

   The genetic marker for the NCI-MATCH afatinib basket trial was mutated
   or amplified EGFR.[160]^64 Both BT-20 and BT-474 cells have
   gain-of-function mutations in EGFR. Thus, using only that single
   genetic marker, one would expect the same response to afatinib for
   these two cell lines. On the other hand, our framework not only
   predicted the different responses of these two cell lines but also
   revealed their underlying mechanisms. The collection of genes involved
   in the two relevant feedbacks, NFKB1-NFKBIA and AR-AKT1, may constitute
   a new type of dynamic drug-response biomarker. A biomarker composed of
   a group of genes and their holistic dynamics, which respond
   differentially to drug treatment, may predict drug responsiveness more
   effectively than the classical mutation biomarkers of afatinib. Owing
   to the interpretable and simulatable afatinib DRCN, we can properly
   predict the drug response, distinguishing resistant cell lines that
   need additional treatment. Moreover, by interpreting the resistance
   mechanisms, such as feedback and crosstalk, we can identify efficient
   combinatorial drug targets to compensate for those resistances.
   Additional validations for suggesting dynamic drug-response biomarkers
   are in [161]Figure S5. Detailed signal flow analyses of
   trametinib-treated SW620 and SW48 cell lines are discussed in
   [162]Figure S6.

Discussion

   Predicting cellular response to a perturbation is of great interest in
   systems biology and ML.[163]^4^,[164]^9^,[165]^20^,[166]^46 Systems
   biology uses a white box approach based on mechanistic modeling of
   intracellular molecular regulatory networks, and can provide
   interpretable insights into regulatory dynamics, but is difficult to
   scale up to larger networks. ML uses a black box approach by modeling
   input-output relationships, and can offer high prediction performance,
   but it is difficult to interpret its underlying rationale. There are
   several works[167]^14^,[168]^15^,[169]^20^,[170]^22^,[171]^47^,[172]^65
   that have attempted to merge these approaches to make interpretable
   predictions, but they fall short of achieving detailed insight into
   molecular regulatory dynamics about cellular responses after a
   perturbation. Therefore, to provide interpretations of molecular
   regulation dynamics, we constructed a gray box framework by combining
   the white box and the black box L2O approaches to model molecular
   regulatory dynamics. We devised a black box optimizer, GREY, to tackle
   the high computational complexity of parameter optimization for DRCNs.
   GREY learned efficient derivative-free search rules by identifying the
   relationship between the structure of a network and its dynamical
   response. To accomplish this, we adopted a GNN[173]^44 to represent
   attributes of the network components (i.e., nodes and edges), and used
   meta-reinforcement learning to learn different dynamic responses to
   perturbations for each network. Remarkably, GREY optimized the edge
   weights of the DRCN for trametinib three times more efficiently than
   the state-of-the-art derivative-free optimization method, CMA-ES, with
   respect to sample efficiency. Moreover, GREY optimized the DRCN for
   palbociclib to a fitness value of 0.5, which is difficult to achieve by
   inferring the regulatory logic of a network of such scale (∼600 edges)
   using any other optimization method. We applied GREY to optimize the
   DRCNs of 21 drugs. Constructing a DRCN for each drug involves numerous
   steps that influence the number of nodes and edges, and the optimal
   settings may vary for each drug. However, we applied the same settings
   to construct all DRCNs to allow for a systematic comparison of the
   prediction performance of optimized DRCNs. Consequently, certain drugs
   inevitably have a suboptimal DRCN, influencing our results.
   Investigating the impact of drug-specific settings on the prediction
   performance of DRCNs would be an intriguing avenue for further
   research.

   To demonstrate the efficacy of the gray box framework on recapitulating
   cellular dynamics, we applied it to the drug-response predictions in
   cancer networks. Our results have shown the broad applicability of the
   framework with a wide range of cancer types and drug targets. As the
   mechanistic interpretation of DRCNs is not detached from drug-response
   prediction performance, a question whether the molecular mechanism
   interpretation of afatinib drug responses is perfectly correct or
   cannot arise. However, since the Boolean networks with optimized
   weights not only fully reproduce the quantitative responses for various
   cell lines of several drugs, but also reproduce the activity changes
   after those drug treatments, we can assume that the regulatory logic
   recapitulated by the weighted Boolean logic of the optimized DRCNs
   accurately models the underlying molecular regulatory mechanism, at
   least for the scope of drug response. Moreover, the optimized DRCNs not
   only correctly predicted drug responses of BT-20 and BT-474 cell lines,
   but also suggested a biomarker based on regulatory dynamics that was
   directly related with drug-response mechanisms and thus more accurately
   predict the resistance. This result is well in accord with the research
   of Choi et al.,[174]^4 where classifying cancer cell lines with similar
   changes in expression levels induced by drug treatment exhibits more
   similar drug responses than classifying cancer cell lines with similar
   mutation profiles. Therefore, we believe that the interpretability of
   the model can provide insight into biological mechanisms, such as drug
   resistance, which can subsequently help to overcome these hurdles in
   real biological applications.

   Cancer is a patient-specific as well as tissue-specific disease, but
   there are recurrent regulatory architectures common to all cancer
   types.[175]^66^,[176]^67 Based on these recurrent architectures, we
   hypothesized the existence of a PCCN and developed a method for
   reconstructing such a core regulatory network from large datasets.
   Feltes et al.[177]^68 also tried to curate core regulatory networks
   relevant to all cancer types.[178]^69^,[179]^70 However, these
   approaches might contain indirect interactions as curated links depend
   on the data. Due to the risk that few incorrect links in the network
   structure can vastly change the dynamics of the model, we instead
   focused on using a well-known network structure database such as
   Omnipath and cross-checked every link in the PCCN with other network
   databases: Pathway Commons, DoRothEA, and Cui’s human signaling
   network.[180]^71^,[181]^72^,[182]^73 Each drug has a different mode of
   action at the cellular level and thereby affects a different part of
   the intracellular network. Thus, we extracted the core structure
   responsible for each drug and optimized the parameters separately.

   The gray box framework can be applicable to readily accessible
   perturbation data, while enabling deep interpretation of molecular
   regulatory dynamics. Single-cell sequencing of drug-induced cellular
   state transitions provides detailed dynamic information of drug
   responses. Using these data, a number of studies predict and interpret
   the response dynamics of molecular regulatory
   perturbations.[183]^74^,[184]^75 However, aside from the high financial
   costs, producing such data is a difficult and time-consuming task that
   requires experimentally well-prepared samples of cells undergoing
   specific dynamics and properly selected pre-processing and analyzing
   tools.[185]^76 Under these limitations, the most commonly accessible
   data of molecular regulatory perturbations are drug responses of cancer
   cell lines in terms of cell viabilities. It is notable that discrete
   DRCNs optimized by GREY have successfully interpreted the
   drug-resistance mechanisms of afatinib in terms of molecular regulatory
   dynamics without explicit molecular regulatory experimental data, but
   only with cell viability. Owing to the vastness of public drug-response
   viability data in various diseases caused by malfunctioning cell
   networks, the suggested framework can also be applicable to controlling
   fates of pluripotent stem cells[186]^77^,[187]^78 or macrophage
   differentiation.[188]^79 Moreover, prediction performance and
   interpretation of molecular regulatory mechanisms of the developed gray
   box framework can be expected to further improve if data on molecular
   regulatory dynamics are also explicitly incorporated. Currently, GREY
   is trained with scale-free random networks for the specific task of
   optimizing biological networks, although many biological networks are
   not scale free.[189]^80 Thus, there is room for improving the
   optimization performance of GREY by training it with more biologically
   plausible random networks. Furthermore, in the future, GREY could be
   trained with other types of random networks with different structural
   characteristics to optimize various discrete dynamic models.

Limitations of the study

   The DRCN formulates an "essential structure" connecting biological
   molecules tied to drug responses in the PCCN, aiming to minimize
   computational burdens of the optimizer by limiting number of links
   extracted between molecules in the PCCN. While this method efficiently
   identifies core network structures among molecules associated with drug
   responses, it overlooks long-distance bypass regulations involving
   multiple connections. Such pathways, despite potentially lower chances
   of affecting drug response dynamics, are critical but might be missed
   by our methodology. Dynamics influenced by these bypasses tend to be
   embedded through optimized parameters in shorter pathways among the
   selected molecules, omitting a few significant long-distance
   interactions. We utilized GREY to optimize DRCNs for 21 drugs. However,
   the prediction performance for certain drugs did not reach
   significance. Although constructing each DRCN involves varying optimal
   settings, we employed uniform settings for systematic
   performance comparison, potentially leading to suboptimal DRCNs for
   some drugs. In addition, GREY does not incorporate regularization in
   its optimization procedure for DRCNs. Incorporating regularization in
   future studies could enhance performance.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Chemicals, peptides, and recombinant proteins
     __________________________________________________________________

   Trametinib APExBIO Cat#A3018
   Afatinib Selleckchem Cat#S1011
   Palbociclib Selleckchem Cat#S4482
     __________________________________________________________________

   Critical commercial assays
     __________________________________________________________________

   RNA-spin kit INTRON Cat#17211
   QuantSeq 3′ mRNA-Seq Library Prep Kit Lexogen Cat#015.24
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   Gene expression data This paper [190]GSE184731
   Harmonizome Rouillard et al.[191]^81 SCR_016176
   Kyoto Encyclopedia of Genes and Genomes Kanehisa et al.[192]^49
   SCR_012773
   Genomics of Drug Sensitivity in Cancer 2 Iorio et al.[193]^35
   SCR_011956
   Catalog Of Somatic Mutations In Cancer Tate et al.[194]^39 SCR_002260
   Pathway Commons Cerami et al.[195]^72 SCR_002103
     __________________________________________________________________

   Experimental models: Cell lines
     __________________________________________________________________

   SW48 ATCC CCL-231
   SW620 ATCC CCL-227
   BT20 Korean Cell Line Bank(KCLB) KCLB #: 60061
   NCI-N87 Korean Cell Line Bank(KCLB) KCLB #: 60113
   MKN45 Korean Cell Line Bank(KCLB) KCLB #: 80103
   BT474 Korean Cell Line Bank(KCLB) KCLB #: 60062
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   GRAY This paper [196]https://github.com/hanyh0807/GREY
   [197]https://doi.org/10.5281/zenodo.10989715
   Tensorflow version 1.15.5 Google [198]https://www.tensorflow.org/
   Graph_nets python package Peter et al.[199]^39^,[200]^44
   [201]https://github.com/deepmind/graph_nets/tree/master
   SRMF Wang et al.[202]^46 [203]https://github.com/linwang1982/SRMF
   DeepCDR Liu et al.[204]^10 [205]https://github.com/kimmo1019/DeepCDR
   LOBICO Knijnenburg et al.[206]^47
   [207]https://github.com/TianxiaoNYU/LOBICO-Matlab
   MATLAB R2019b MathWorks [208]https://kr.mathworks.com/
   pycma v3.1.0 Nikolaus et al.[209]^45
   [210]https://github.com/CMA-ES/pycma
   STAR Dobin et al.[211]^82 [212]https://github.com/alexdobin/STAR
   DESeq2 Love et al.[213]^83
   [214]http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html
   Signal flow analysis Lee et al.[215]^63
   [216]https://github.com/dwgoon/sfa
   GDSCTools Cokelaer et al.[217]^84
   [218]http://github.com/CancerRxGene/gdsctools
   Hierarchical HotNet Reyna et al.[219]^85
   [220]https://github.com/raphael-group/hierarchical-hotnet
   The PathLinker app Gil et al.[221]^62
   [222]https://apps.cytoscape.org/apps/pathlinker
   [223]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the lead contact, Kwang-Hyun Cho
   (ckh@kaist.ac.kr)

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       Gene expression data have been deposited at GEO and are publicly
       available as of the date of publication. Accession number is listed
       in the [224]key resources table.
     * •
       All original code has been deposited at GitHub and is publicly
       available as of the date of publication. The GitHub repository and
       DOI are listed in the [225]key resource table.
     * •
       Any additional information required to reanalyze the data reported
       in this paper is available from the [226]lead contact upon request.

Experimental model and study participant details

Setup for in-house experimental validations

   The relevant cell cultures, SW48 (Colon), SW620 (Colon), BT20 (Breast),
   NCI-N87 (Breast), MKN45 (Stomach), and BT474 (Stomach), were obtained
   from Korean Cell Line Bank and the American Type Culture Collection
   (ATCC). All cell lines were cultured in Dulbecco’s modified Eagle’s
   medium (WelGENE Inc., Gyeongsan, Republic of Korea) with 10% fetal
   bovine serum (FBS, WelGENE Inc.) and antibiotics (100 U/ml of
   penicillin, 100 μg/mL streptomycin, and 0.25 μg/mL of Fungizone) (Life
   Technologies Corp., Carlsbad, CA) at 37°C in a humidified atmosphere
   containing 5% CO2.

   To obtain reagents, trametinib (an inhibitor of MAP2K1 and MAP2K2, also
   known as MEK1 and MEK2, respectively) was purchased from APExBIO
   (Boston, MA, USA). afatinib (an inhibitor of EGFR and ERBB2) and
   palbociclib (an inhibitor of CDK4 and CDK6) were obtained from
   Selleckchem (Houston, TX, USA).

   To conduct cell growth assays, cells were seeded into a 96-well plate
   at a density of 3–7×103 cells/well in growth medium, incubated for 24 h
   and then treated with the indicated drugs. Following incubation of the
   plates for 72–120 h, relative cell viability was measured. After
   seeding, cells were imaged using IncuCyte ZOOM. To assess cell growth,
   the average area of each cell was determined at each time point using
   the IncuCyte ZOOM analysis software. Images were captured at 3 h
   intervals from three separate regions per well with a 20× objective.

   To conduct RNA isolation and bulk RNA-seq analysis, total RNA was
   extracted using an RNA-spin kit (INTRON), and RNA quality was assessed
   by Agilent 2100 bioanalyzer using the RNA 6000 Nano Chip (Agilent
   Technologies). RNA quantification was performed using ND-2000
   Spectrophotometer (Thermo, Inc.), and 500 ng of RNA per sample were
   used for library construction. The library construction was performed
   using the QuantSeq 3′ mRNA-Seq Library Prep Kit (Lexogen, Inc.), and
   the library was single-end sequenced on an Illumina NextSeq 500
   (Illumina, Inc.). The reads were aligned using STAR,[227]^82 and the
   read counts were estimated by featureCounts.

   Differentially expressed gene (DEG) analysis was performed by using
   significance analysis of microarrays method implemented in the DESeq2
   package. We calculated the differentially expressed genes in log2-fold
   changes of expression levels measured in both sensitive and resistant
   cell lines for each type of cancer (colon, breast and stomach) using
   the DESeq2[228]^83 R package. The cutoff values were set with the
   ‘adjusted p-value’ under 0.05 and the base mean over 100.

Method details

Constructing DRCNs

   To efficiently elucidate the construction of the Pan-Cancer Core
   Network (PCCN) and the Drug Response Core Network (DRCN), we integrated
   a comprehensive list of cancer-related genes from multiple
   high-throughput data sources. These included oncogenic genes and
   compound targets from ten major signaling pathways identified in The
   Cancer Genome Atlas[229]^66 (TCGA), such as RTK/RAS, Nrf2, PI3K, TGFβ,
   Wnt, Myc, p53, Cell cycle, Notch and Hippo. Additionally, we
   incorporated frequently amplified or deleted mutations,[230]^71 the
   NCI-MATCH basket trial markers,[231]^38 notably co-occurring or
   mutually exclusive mutations,[232]^86 and genes from ‘Pathways in
   cancer’ (hsa05200) of the KEGG pathway.[233]^49^,[234]^50^,[235]^51
   Through these integrations, we identified a total of 2,822 pan-cancer
   genes.

   Using the PathLinker algorithm,[236]^61^,[237]^62 these pan-cancer
   genes were each used as a source node to connect to all other genes,
   employing interconnecting pathways derived from the Omnipath network
   structure databse.[238]^36 The connected pathways, comprising 3,615
   nodes after exclusion of paths with overlapping sources and targets,
   represented comprehensive interconnections among the pan-cancer related
   genes. To refine this network and extract the core structure, we pruned
   leaf nodes and redundant inter-nodes, smoothing the network using a
   network propagation algorithm,[239]^87 and eventually extracted a core
   structure with 2,574 nodes and 5,992 links.

   For the construction of the DRCN, we applied hierarchical
   HotNet[240]^85 and Edmonds’ algorithm[241]^88 to extract a core network
   that captures the essential genes and pathways involved in drug
   responses. This involved gathering genomic features such as mutations,
   copy number alterations, and expression levels of genes from various
   databases including the Cell Model Passport,[242]^89 OncoKB, and
   COSMIC. These features were used to compute z-scores in analysis of
   variance (ANOVA) of the GDSCTools[243]^84 package, which then served as
   vertices in the hierarchical HotNet algorithm, identifying the most
   statistically significant clusters of genes. We connected these
   clusters using Edmonds’ algorithm with link weights derived from
   inverse values of joint similarity scores calculated through HotNet.
   This process culminated in a DRCN that minimized network size while
   retaining critical drug response-related genes and pathways, further
   enriched with phenotypic markers such as ’proliferation’ and
   ‘apoptosis’ nodes from the CancerGeneNet.[244]^37

   To extract a core structure from the pan-cancer related network, we
   took the following four steps: first, the nodes that only have incoming
   links or outgoing links were pruned. Second, pan-cancer related genes
   were smoothed across the network by the network propagation algorithm
   shown in the [245]Equation 3 suggested by Vanunu et al.[246]^90
   [MATH: <mrow><msup><mi>P</mi><mrow><mi>t</mi><mo
   linebreak="badbreak">+</mo><mn>1</mn></mrow></msup><mo
   linebreak="badbreak">=</mo><mrow><mo>(</mo><mrow><mn>1</mn><mo>−</mo><m
   i>r</mi></mrow><mo>)</mo></mrow><mi>W</mi><msup><mi>P</mi><mi>t</mi></m
   sup><mo
   linebreak="goodbreak">+</mo><mi>r</mi><msup><mi>P</mi><mn>0</mn></msup>
   </mrow> :MATH]
   (Equation 3)

   Where
   [MATH: <mrow><msup><mi>P</mi><mi>t</mi></msup></mrow> :MATH]
   is a probability vector at time step
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   in which the
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   -th element holds the probability of a random walker that will be
   placed at that time step. The initial probability vector,
   [MATH: <mrow><msup><mi>P</mi><mn>0</mn></msup></mrow> :MATH]
   , is a vector with a probability of 1 for the cancer related genes and
   0 for the other genes, which were normalized by a number of pan-cancer
   related genes, such that the vector sums to 1.
   [MATH: <mrow><mi>W</mi></mrow> :MATH]
   is a column-normalized adjacency matrix of the pan-cancer related
   network.
   [MATH: <mrow><mi>r</mi></mrow> :MATH]
   is a probability of restarting from the pan-cancer related genes at
   every time step, which was set as 0.3 in our extraction. Third, an
   iterative random walker with a limited number of transitions walked
   around the network. The number of iterations was set to 10,000 and the
   maximum number of transitions was set to 30. Fourth, the pan-cancer
   core network was constructed by selecting a sub-network that was
   visited by a larger number of walkers than a specific cutoff value,
   which was set to 12. As a result, the pan-cancer core network was
   constructed with 2,574 nodes and 5,992 links.

Weighted sum Boolean logics

   To simulate drug responses of a given DRCN, we adopted “weighted sum
   Boolean logic” by modifying the simple majority rule suggested by
   Font-Clos et al.[247]^91 A state of a node
   [MATH: <mrow><mi>k</mi></mrow> :MATH]
   in a network with
   [MATH: <mrow><mi>N</mi></mrow> :MATH]
   nodes,
   [MATH: <mrow><msub><mi>s</mi><mi>k</mi></msub></mrow> :MATH]
   , can be either 1 (activated) or −1 (inactivated). The state of a
   network at a discrete time point,
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   , can be defined as a state vector,
   [MATH: <mrow><msubsup><mi>V</mi><mi>s</mi><mi>t</mi></msubsup></mrow>
   :MATH]
   , which is composed of the
   [MATH: <mrow><mi>N</mi></mrow> :MATH]
   states of its nodes, such as
   [MATH: <mrow><msubsup><mi>s</mi><mi>k</mi><mi>t</mi></msubsup></mrow>
   :MATH]
   , at a certain time point. The tendency of each node state without any
   incoming signals can be represented as a basal level vector,
   [MATH: <mrow><msub><mi>V</mi><mi>b</mi></msub></mrow> :MATH]
   , which usually has a value of zero. Relative influences of each
   incoming link to a node can be a value in a specific range, which
   includes integers between −5 and 5 in our work. Positive numbers
   represent activating regulations and negative numbers refer to
   inhibitory regulations. A relation matrix can be defined as a
   [MATH: <mrow><mi>N</mi></mrow> :MATH]
   -by-
   [MATH: <mrow><mi>N</mi></mrow> :MATH]
   connectivity matrix of the network, wherein each element
   [MATH: <mrow><msub><mrow><mo
   stretchy="true">(</mo><msub><mi>M</mi><mi>R</mi></msub><mo
   stretchy="true">)</mo></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub></
   mrow> :MATH]
   represents the relative influence of a node
   [MATH: <mrow><mi>j</mi></mrow> :MATH]
   on a node
   [MATH: <mrow><mi>i</mi></mrow> :MATH]
   . The network state at the next time point,
   [MATH:
   <mrow><msubsup><mi>V</mi><mi>s</mi><mrow><mi>t</mi><mo>+</mo><mn>1</mn>
   </mrow></msubsup></mrow> :MATH]
   , can be decided by the sign of element-wise multiplication of a
   weighted sum vector,
   [MATH: <mrow><msub><mi>V</mi><mi>w</mi></msub></mrow> :MATH]
   , computed by the matrix operation shown in [248]Equation 4. If some
   elements of the weighted sum vector are zero, the next state of the
   corresponding nodes follows the current state of each node. Network
   nodes are synchronously updated at each discrete time point:
   [MATH: <mrow><msubsup><mi>V</mi><mi>s</mi><mrow><mi>t</mi><mo
   linebreak="badbreak">+</mo><mn>1</mn></mrow></msubsup><mo
   linebreak="badbreak">=</mo><mtext>sign</mtext><mrow><mo
   stretchy="true">(</mo><msub><mi>V</mi><mi>w</mi></msub><mo
   stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">=</mo><mtext>sign</mtext><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>M</mi><mi>R</mi></msub><mo
   linebreak="badbreak">×</mo><msubsup><mi>V</mi><mi>s</mi><mi>t</mi></msu
   bsup><mo
   linebreak="badbreak">+</mo><msub><mi>V</mi><mi>b</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   (Equation 4)

   The attractors, converged steady states of the network, can represent
   the biological phenotypes by analyzing average states of each node
   within the attractor states.[249]^92^,[250]^93 Average activity of each
   molecule represented by a node state of an attractor is computed as
   suggested by Kim et al.[251]^94 The average state of each node is
   computed as the average of the node state within a specific attractor
   multiplied by the proportional ratio of initial states which converged
   into that attractor. If an attractor does not converge into a single
   state and cycles a set of states, the average state of each node in
   that attractor is computed as a sum of node states in the cycle divided
   by the period of the cycle. The converging ratio of each attractor is
   approximately measured by repeated enumeration of the network state
   starting from a random initial network state to a particular attractor.

Synthetic environments for training GRAY

   We designed a synthetic environment to train GRAY for Boolean network
   optimization through meta-reinforcement learning. A randomly generated
   scale-free network[252]^95 of 20∼40 nodes, with edge weights varying
   within [-5,5] except 0, were assigned to each environment. We generated
   100,000 synthetic environments. At the start of each episode, one of
   100,000 environments was randomly selected, and its virtual
   perturbation data was generated. The virtual perturbation data contains
   the fold change of the average activity for each node between the first
   attractors that were reached from 128 random initial states, and the
   second attractors that were reached starting from the first attractor
   after fixing several random nodes to 1 or -1 to imitate a drug
   treatment. By varying the fixed nodes to compute the second attractor,
   16 ground truth datasets were generated for each environment. The
   fitness is defined as the average of the Pearson’s correlation
   coefficients (PCC) between the ground truth datasets and the fold
   change values computed by applying the edge weights (
   [MATH: <mrow><msup><mi>A</mi><mi>t</mi></msup></mrow> :MATH]
   , obtained edge weights at step
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   ) obtained from GRAY onto the network. The fixed node sets used to
   compute fold change values for
   [MATH: <mrow><msup><mi>A</mi><mi>t</mi></msup></mrow> :MATH]
   were the same as the ground truth datasets. The reward was given as the
   observed improvement[253]^33 as follows:
   [MATH: <mrow><msup><mi>R</mi><mi>t</mi></msup><mo
   linebreak="badbreak">=</mo><mi>max</mi><mrow><mo
   stretchy="true">{</mo><mrow><msup><mi>f</mi><mi>t</mi></msup><mo>−</mo>
   <munder><mi>max</mi><mrow><mi>i</mi><mo><</mo><mi>t</mi></mrow></munder
   ><mrow><mo stretchy="true">(</mo><msup><mi>f</mi><mi>i</mi></msup><mo
   stretchy="true">)</mo></mrow><mo>,</mo><mn>0</mn></mrow><mo
   stretchy="true">}</mo></mrow></mrow> :MATH]
   (Equation 5)

   Here,
   [MATH: <mrow><msup><mi>f</mi><mi>t</mi></msup></mrow> :MATH]
   is the fitness and
   [MATH: <mrow><msup><mi>R</mi><mi>t</mi></msup></mrow> :MATH]
   is the reward at step
   [MATH: <mrow><mi>t</mi></mrow> :MATH]
   . The goal of GRAY against a given environment is to maximize the
   discounted cumulative sum of reward.

Neural network structure of GRAY

   GRAY consists of three components: the input, the submodule, and the
   main layer ([254]Figure S1A). The input layer has a graph module for
   embedding networks and a message module for communications between
   agents. The graph module consists of two graph neural networks (GNN),
   which is a class of neural networks tailored for processing input
   represented as graphs. The first (
   [MATH: <mrow><mi>G</mi></mrow> :MATH]
   ) is for the main layer and the global state estimator of the submodule
   layer, and the second (
   [MATH: <mrow><msup><mi>G</mi><mi>C</mi></msup></mrow> :MATH]
   ) is for the context estimator of the submodule layer. The message
   module (
   [MATH: <mrow><msup><mi>G</mi><mi>m</mi></msup></mrow> :MATH]
   ) is also a GNN and is used for all modules in both the submodule and
   the main layer.

   In each step, GNNs in the graph module receives eight centrality
   measures for nodes (betweenness,[255]^96 degree,[256]^97
   eigenvector,[257]^98 katz,[258]^99 closeness,[259]^100 load,[260]^101
   harmonic,[261]^102 pagerank[262]^103) and three values for edges
   (weight, sign of each edge in the network of the environment, and edge
   betweenness centrality[263]^96) as input, and updates nodes, edges, and
   the global feature by iteratively aggregating information from
   neighboring nodes, edges, and the entire graph. The weight in the input
   for the edge is from the output of the preceding step. The global
   feature represents a graph-level attribute. The message module
   initially receives a zero-vector input for all nodes and edges at the
   beginning of an episode, and generates nodes, edges, and the global
   feature. From the second step, the message module receives nodes,
   edges, global features, and edge weights generated from the preceding
   step as input, and updates its features. The outputs of the GNNs in the
   graph module and the message module are reorganized for each edge by
   concatenating the global feature, the edge feature, and the features of
   sender and receiver nodes of each edge ([264]Figure 3D). Each
   reorganized feature represents the state of each edge. We use this as
   the feature vector to be fed into the subsequent modules. The
   reorganized feature of the ith edge from each of the three GNNs (
   [MATH:
   <mrow><mi>G</mi><mo>,</mo><msup><mi>G</mi><mi>C</mi></msup><mo>,</mo><m
   sup><mi>G</mi><mi>m</mi></msup></mrow> :MATH]
   ) are as follows:
   [MATH: <mrow><msub><mi>L</mi><mi>i</mi></msub><mo
   linebreak="badbreak">=</mo><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>n</mi><mrow><mi>i</mi><mo>_</mo><
   mi>s</mi></mrow></msub><mo>,</mo><msub><mi>n</mi><mrow><mi>i</mi><mo>_<
   /mo><mi>t</mi></mrow></msub><mo>,</mo><msub><mi>l</mi><mi>i</mi></msub>
   <mo>,</mo><mi>g</mi></mrow><mo
   stretchy="true">)</mo></mrow><mo>,</mo><msubsup><mi>L</mi><mi>i</mi><mi
   >C</mi></msubsup><mo linebreak="badbreak">=</mo><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mi>n</mi><mrow><mi>i</mi><mo>_</m
   o><mi>s</mi></mrow><mi>C</mi></msubsup><mo>,</mo><msubsup><mi>n</mi><mr
   ow><mi>i</mi><mo>_</mo><mi>t</mi></mrow><mi>C</mi></msubsup><mo>,</mo><
   msubsup><mi>l</mi><mi>i</mi><mi>C</mi></msubsup><mo>,</mo><msup><mi>g</
   mi><mi>C</mi></msup></mrow><mo
   stretchy="true">)</mo></mrow><mo>,</mo><msubsup><mi>L</mi><mi>i</mi><mi
   >m</mi></msubsup><mo linebreak="badbreak">=</mo><mrow><mo
   stretchy="true">(</mo><mrow><msubsup><mi>n</mi><mrow><mi>i</mi><mo>_</m
   o><mi>s</mi></mrow><mi>m</mi></msubsup><mo>,</mo><msubsup><mi>n</mi><mr
   ow><mi>i</mi><mo>_</mo><mi>t</mi></mrow><mi>m</mi></msubsup><mo>,</mo><
   msubsup><mi>l</mi><mi>i</mi><mi>m</mi></msubsup><mo>,</mo><msup><mi>g</
   mi><mi>m</mi></msup></mrow><mo stretchy="true">)</mo></mrow></mrow>
   :MATH]
   (Equation 6)
   [MATH: <mrow><mi>L</mi><mo
   linebreak="badbreak">=</mo><msub><mi>L</mi><mrow><mn>1</mn><mo>:</mo><m
   i>k</mi></mrow></msub><mo>,</mo><msup><mi>L</mi><mi>C</mi></msup><mo
   linebreak="goodbreak">=</mo><msubsup><mi>L</mi><mrow><mn>1</mn><mo>:</m
   o><mi>k</mi></mrow><mi>C</mi></msubsup><mo>,</mo><msup><mi>L</mi><mi>m<
   /mi></msup><mo
   linebreak="goodbreak">=</mo><msubsup><mi>L</mi><mrow><mn>1</mn><mo>:</m
   o><mi>k</mi></mrow><mi>m</mi></msubsup></mrow> :MATH]
   (Equation 7)

   Here,
   [MATH:
   <mrow><msub><mi>L</mi><mi>i</mi></msub><mo>,</mo><msubsup><mi>L</mi><mi
   >i</mi><mi>C</mi></msubsup><mo>,</mo><msubsup><mi>L</mi><mi>i</mi><mi>m
   </mi></msubsup></mrow> :MATH]
   are the reorganized features of i^th edge from
   [MATH:
   <mrow><mi>G</mi><mo>,</mo><msup><mi>G</mi><mi>C</mi></msup><mo>,</mo><m
   sup><mi>G</mi><mi>m</mi></msup></mrow> :MATH]
   respectively.
   [MATH:
   <mrow><msub><mi>n</mi><mrow><mi>i</mi><mo>_</mo><mi>s</mi></mrow></msub
   ><mo>,</mo><msub><mi>n</mi><mrow><mi>i</mi><mo>_</mo><mi>t</mi></mrow><
   /msub></mrow> :MATH]
   are the node features of the source and target node of i^th edge,
   respectively.
   [MATH: <mrow><msub><mi>l</mi><mi>i</mi></msub></mrow> :MATH]
   is the edge feature of i^th edge.
   [MATH: <mrow><mi>g</mi></mrow> :MATH]
   is the global feature of a network.
   [MATH:
   <mrow><mi>L</mi><mo>,</mo><msup><mi>L</mi><mi>C</mi></msup><mo>,</mo><m
   sup><mi>L</mi><mi>m</mi></msup></mrow> :MATH]
   are the reorganized features for all edges (1 to
   [MATH: <mrow><mi>k</mi></mrow> :MATH]
   ) in the network from each module,
   [MATH:
   <mrow><mi>G</mi><mo>,</mo><msup><mi>G</mi><mi>C</mi></msup><mo>,</mo><m
   sup><mi>G</mi><mi>m</mi></msup></mrow> :MATH]
   , respectively. Features without any superscript are from
   [MATH: <mrow><mi>G</mi></mrow> :MATH]
   , and features with superscript C and m are from
   [MATH: <mrow><msup><mi>G</mi><mi>C</mi></msup></mrow> :MATH]
   and
   [MATH: <mrow><msup><mi>G</mi><mi>m</mi></msup></mrow> :MATH]
   , respectively. Modules in the submodule and main layer receive the
   reorganized features as an input ([265]Figure S1A).

   A submodule layer consists of a context estimator module and a global
   state estimator module. Both modules are one-layer LSTMs with layer
   normalization.[266]^104 In each module, there is one LSTM for each
   reorganized feature (i.e., one LSTM for each edge). LSTMs in each
   module share their parameters (weights of LSTM) but maintain
   independent hidden states. The context estimator is a module that helps
   the agent to quickly adapt to the selected environment by estimating
   information about the environment.[267]^105 During training, the true
   edge weights of the synthetic network from the selected environment in
   each episode are known. Thus, we can use the true edge weights as a
   target for the context estimator.
   [MATH: <mrow><msup><mi>L</mi><mi>C</mi></msup></mrow> :MATH]
   ,
   [MATH: <mrow><msup><mi>L</mi><mi>m</mi></msup></mrow> :MATH]
   , and the edge weight and the reward from the preceding step are set as
   inputs to the LSTM, and the 11-dimensional logits (edge weights are
   between −5 and 5) are generated through two dense layers with
   exponential linear unit (ELU) activation functions.[268]^106 Supervised
   training is done using cross entropy loss with respect to the true edge
   weight of the network from the selected environment in each episode.
   The global state estimator is a module used to provide information that
   can help learn cooperative policy in order to encourage the agents to
   cooperate and perform efficient exploration in high-dimensional search
   space, inspired by Foerster et al.,[269]^107 Das et al.,[270]^108 and
   Kim et al.[271]^109 Since the output (edge weight) of an agent (main
   layer) for each edge collectively determines the fitness, the output of
   all the agents needs to be well-organized. We trained the communication
   protocol to gather
   [MATH: <mrow><mi>L</mi></mrow> :MATH]
   ,
   [MATH: <mrow><msup><mi>L</mi><mi>m</mi></msup></mrow> :MATH]
   , and the edge weight of the current step into the global state
   estimator module, and then predict the feature (
   [MATH: <mrow><mi>L</mi></mrow> :MATH]
   ) of the next step. Since the input for the subsequent step is
   determined by the outputs of all agents, it is necessary to infer the
   plans of other agents in order to predict the features of the
   subsequent step. Therefore, the global state estimator provides
   inferred information of the outputs of the other agents. The global
   state estimator module predicts the average and standard deviation for
   the features of the subsequent step and is trained to increase the log
   probability for the actual value of that feature through supervised
   training.

   In the main layer (agent),
   [MATH: <mrow><mi>L</mi></mrow> :MATH]
   ,
   [MATH: <mrow><msup><mi>L</mi><mi>m</mi></msup></mrow> :MATH]
   , the edge weight and the reward from the preceding step, and the
   activations of the second-to-last layer of the two submodules are set
   as inputs to a two-layer LSTMs with layer normalization. Similar to the
   submodule layer, there is a two-layer LSTM for each reorganized
   feature. LSTMs in this layer share parameters, while maintaining
   independent hidden states. Output and value are generated through an
   independent dense layer with an ELU activation function. The output is
   11-dimensional logits (to represent −5 to 5) which estimates the next
   weight of an edge assigned to the agent, and the value is an expected
   discounted reward of the current input.

Training procedure

   To train a derivative-free search rule for optimizing the edge weights
   of Boolean networks, GRAY uses meta-reinforcement
   learning.[272]^32^,[273]^42 GRAY is trained on various synthetic
   environments, where the environment involves optimizing edge weights of
   random synthetic networks for ground truth datasets. At the beginning
   of each episode, an environment is sampled out of 100,000 synthetic
   environments and fixed until the end of the episode. Each episode
   consists of 80 steps. Hidden states of all LSTM in GRAY are set to the
   zero vector at the beginning of the episode. The initial state (
   [MATH: <mrow><msub><mi>S</mi><mn>1</mn></msub></mrow> :MATH]
   ) of the episode is a combination of a random parameter vector (
   [MATH: <mrow><msub><mi>A</mi><mn>0</mn></msub></mrow> :MATH]
   ) and zero (
   [MATH: <mrow><msub><mi>R</mi><mn>0</mn></msub></mrow> :MATH]
   ). The sign of each parameter in
   [MATH: <mrow><msub><mi>A</mi><mn>0</mn></msub></mrow> :MATH]
   is kept the same as that of the corresponding link in the network of
   the selected task. All episodes have 80 steps and each iteration of the
   input and output processes of an episode is saved to episode buffer
   when the episode is finished.

   GRAY is trained using 10 workers and one master, where each worker
   spends 8,000 episodes in the supervised pretraining phase and 12,000
   episodes in the reinforcement learning phase. Each worker independently
   collects episodes and computes gradients. The master collects gradients
   from each worker and then averages them. The averaged gradients are
   applied to the neural network of the master, and the updated parameters
   are redistributed to each worker.

   The training procedure for the main layer consists of two phases: a
   supervised pre-training phase based on directly predicting the true
   edge weights of the selected environment and a reinforcement learning
   phase for maximizing the discounted cumulative sum of rewards using the
   distributed proximal policy optimization (DPPO) algorithm.[274]^110 In
   the supervised pretraining phase, we randomly replaced 50% elements of
   the new parameters with the true parameters that have a probability of
   0.05 for each step. Whenever 64 episodes are collected, GRAY is trained
   using Adam on 4 minibatch samples (4 episodes) for one or three epochs,
   and the buffer is cleared. In the supervised pretraining phase, all
   neural networks are trained using supervised learning. However, the
   parameters of G^C and the submodule layer are updated for three epochs
   for every 64 episodes and the remaining parameters (G, G^m, main layer)
   are updated for one epoch for every 64 episodes. In the reinforcement
   learning phase, G^C and submodule layer are trained using supervised
   learning for one epoch for every 64 episodes. Moreover, the remaining
   parameters (G, G^m, main layer) are updated using the DPPO algorithm
   for three epochs for every 64 episodes. Detailed hyperparameters are
   summarized in [275]Table S2.

Selecting six cell lines for in-house experimental validations

   To validate the molecular-scale interpretability of the DRCNs
   constructed by the gray box framework ensuring that the DRCNs can
   effectively capture significant drug response dynamics across a diverse
   range of cancer cell lines, with an affordable number of experimental
   validations, we have designed experiments by selecting specific cell
   lines based on a set of well-defined criteria that allows us to
   represent the broader landscape of drug response dynamics in pan-cancer
   settings.

   Our selection criteria was carefully designed to ensure that the chosen
   cell lines could serve as representative samples for validating the
   predictive capabilities of our DRCNs. These criteria include the
   following four considerations. First, we selected three of the most
   commonly occurring cancer types—colon, breast, and stomach cancers.
   Each of these cancer types has unique gene families associated with
   them, which makes each cancer type having distinctive molecular
   machinery related with drug responses to others. Second, we chose three
   representative U.S. Food and Drug Administration (FDA)-approved
   targeted cancer drugs—afatinib (EGF signaling), trametinib (MEK
   signaling), and palbociclib (CDK signaling). These drugs act on
   distinct signaling pathways that are most commonly altered in various
   cancers, including our three selected types. Third, we selected two
   groups of cell lines corresponding to those ranked in the top 15 for
   sensitivity and resistance, respectively, to each of the three selected
   drugs. These rankings were based on the area under the dose-response
   curve (AUC) values from the GDSC2 database. Fourth, among the selected
   sensitive and resistant cell lines, we chose one cell line from each
   group that shared the same genetic markers as suggested by the
   Molecular Analysis for Therapy Choice (MATCH) basket trials conducted
   by the National Cancer Institute (NCI-MATCH).

   As a result, we have used the mutational drug response markers from the
   EAY131 NCI-MATCH screening trial as follows: EGFR activating mutations
   and ERBB2 activating mutations for afatinib, BRAF fusions, BRAF V600E
   or V600K mutations, GNAQ or GNA11 mutations and NF1 mutations for
   trametinib, CCND1 or CCND2 or CCND3 amplifications, CDK4 or CDK6
   amplifications for Palbociclib. For the six selected cancer cell lines,
   both BT-20 and BT-474 have the same mutation profiles for EGFR and
   ERBB2 (mutated EGFRs and not mutated ERBB2s), while they show different
   afatinib responses (sensitive BT-474 and resistant BT-20). Both NCI-87
   and MKN45 have the same mutation profiles of no EGFR mutations, while
   they show different afatinib responses (sensitive NCI-N87 and resistant
   MKN45). Both SW48 and SW620 have the same mutation profiles of no ERBB2
   mutations, while they show different afatinib responses (sensitive SW48
   and resistant SW620). Both SW48 and SW620 have no mutations on BRAF,
   GNA11, GNAQ and NF1, while they show different Trametinib responses
   (sensitive SW48 and resistant SW620). Both SW48 and SW620 have no
   mutations on CCND1, CDK4 and CDK6, while they show different
   Palbociclib responses (sensitive SW48 and resistant SW620). This
   approach allowed us to address the challenge of classifying drug
   response differences and interpret the resulting molecular dynamics,
   which remain obfuscated by classical marker-based methods. By adhering
   to these selection criteria, we aimed to ensure that our experiments
   could effectively represent and validate the predictive capabilities of
   the DRCNs optimized by the gray box framework. These experiments serve
   as representative examples of how our framework can predict drug
   responses at the molecular level across diverse cancer types and
   targeted drug categories, surpassing the predictions made by
   traditional marker-based methods.

   In summary, we have been able to demonstrate the applicability of the
   gray box framework by strategically selecting representative drugs and
   cell lines for validation. This approach was chosen to avoid the need
   for experimental validation across all 300 cell lines used in our
   training set, as it would not provide additional insights beyond what
   has already been achieved with our carefully designed experiments.

Integrating cell line mutation profiles with functional effect

   The mutational profile of each cell line is curated from the
   preprocessed COSMIC database, which was processed by Béal
   et al.[276]^111 The functional effect of each mutation is decided by
   comparing oncogenic and tumor suppressive q-values suggested by Tokheim
   et al.[277]^40 If the oncogenic q-value of a specific gene is smaller
   than the tumor suppressive q-value, then the functional effect of the
   mutation on that gene is considered oncogenic. Therefore, the
   functional effect of the mutation on the gene is treated as a gain of
   function. In the case that the tumor suppressive q-value is smaller
   than the oncogenic q-value, the functional effect of the mutation on
   the gene is treated as a loss of function. These functional effects of
   mutations are then cross-validated with functionally associated genes
   in Harmonizome.[278]^81

Maximum matching-based mutation data projection

   Several mutated genes of cancer cell lines exist within the pan-cancer
   network but not in the DRCN. Thus, we needed to approximate the effects
   of each mutation on the DRCN by leveraging structural information. Hu
   et al.[279]^112^,[280]^113 defined the structural control configuration
   (SCC) of a network model to approximate the effects of certain drugs on
   the downstream genes. We employed this method to approximate the
   effects of mutations. An SCC of a network model is a maximally matched
   graph within the network, although there exist many different SCCs
   within a large network. SCC consists of elementary paths, which are
   cascades connected by maximum links, and elementary cycles, which are
   cycles connected by maximum links. Additional links indicate those that
   connect elementary paths and elementary cycles. We generated 1,000 SCCs
   until the proportion of new maximum matching links reached zero. Next,
   we extracted all possible simple paths consisting of maximum matching
   links and additional links from mutated genes to other genes within the
   DRCN of each SCC. We utilized the sign information of each link from
   the paths among the mutated genes as well as the genes within the DRCN
   to obtain the effects of each mutation on its downstream genes.

Incorporating mutations and drugs onto the simulation

   During simulations of drug responses of the core network, the effect of
   a mutation on a node inside the core network was reflected by fixing
   the state of the corresponding node to either 1 or -1 for a gain of
   function or a loss of function mutation, respectively. The effects of
   mutations on a node outside of the core network were projected into the
   core network using maximum matching. These effects were then applied
   onto their corresponding elements of the basal level vector
   [MATH: <mrow><msub><mi>V</mi><mi>b</mi></msub></mrow> :MATH]
   , 1 or −1 for a gain of function or a loss of function mutation,
   respectively. To estimate the inhibition effect of drugs on the core
   network, we applied a two-step drug response simulation. First, the
   network converged into attractors with the given mutation profile.
   Second, the network converged again from the prior attractor states to
   new attractors while the drug target nodes were additionally fixed to
   −1. If some of the prior attractor states were cycles, the first state
   of each cycle was treated as the prior attractor state in the second
   step of the drug response simulation. The difference between the
   average activity of the phenotype node in the two states (the new
   attractor and the prior attractor) were compared with the measured area
   under the viability curve (AUC) value collected from GDSC2.

Signal flow analysis of DRCN perturbations

   Signal flows after the DRCN perturbations are analyzed by an algorithm
   suggested by Lee et al.[281]^63 Signal flows from a node are defined by
   adding the incoming signal flow to the product of the node activity
   change with the sign (positive or negative) of the outgoing links.
   Effects of mutations on signal flows are incorporated differently
   depending on whether the mutations were on a node inside the activity
   difference sub-network or not. The effect of a mutation on a node
   inside the activity difference sub-network was treated as pinning that
   node to a state, such that incoming links were deleted and a constant
   signal flows out from that node. The effect of a mutation from outside
   the activity difference sub-network was incorporated by propagating the
   change in signal flow caused by a mutation through exclusive simple
   paths computed by the maximum matching algorithm.

Quantification and statistical analysis

   For comparison between two groups, two-sided Student’s t test was used.
   All pathway enrichment analyses were conducted using WebGestalt[282]^48
   and KEGG.[283]^49^,[284]^50^,[285]^51 ANOVA was conducted using
   GDSCTools.[286]^84 Statistical details can be found in the respective
   figure legends, results, and the [287]STAR Methods section.

Acknowledgments