Abstract

   Prioritizing genes for their role in drug sensitivity, is an important
   step in understanding drugs mechanisms of action and discovering new
   molecular targets for co-treatment. To formalize this problem, we
   consider two sets of genes X and P respectively composing the gene
   signature of cell sensitivity at the drug IC[50] and the genes involved
   in its mechanism of action, as well as a protein interaction network
   (PPIN) containing the products of X and P as nodes. We introduce
   Genetrank, a method to prioritize the genes in X for their likelihood
   to regulate the genes in P. Genetrank uses asymmetric random walks with
   restarts, absorbing states, and a suitable renormalization scheme.
   Using novel so-called saturation indices, we show that the conjunction
   of absorbing states and renormalization yields an exploration of the
   PPIN which is much more progressive than that afforded by random walks
   with restarts only. Using MINT as underlying network, we apply
   Genetrank to a predictive gene signature of cancer cells sensitivity to
   tumor-necrosis-factor-related apoptosis-inducing ligand (TRAIL),
   performed in single-cells. Our ranking provides biological insights on
   drug sensitivity and a gene set considerably enriched in genes
   regulating TRAIL pharmacodynamics when compared to the most significant
   differentially expressed genes obtained from a statistical analysis
   framework alone. We also introduce gene expression radars, a
   visualization tool embedded in MA plots to assess all pairwise
   interactions at a glance on graphical representations of
   transcriptomics data. Genetrank is made available in the Structural
   Bioinformatics Library
   ([34]https://sbl.inria.fr/doc/Genetrank-user-manual.html). It should
   prove useful for mining gene sets in conjunction with a signaling
   pathway, whenever other approaches yield relatively large sets of
   genes.

1 Introduction

1.1 Single-cell differential expression analyses

   Gene differential expression analyses quantify the changes in gene
   expression levels between tested experimental conditions. Although gene
   functions can often be derived from this type of analyses, the
   associations can be confounded with incidental gene induction.
   Therefore interpreting differential expression data with gene set
   enrichment analysis (GSEA) and pathway analysis can be misleading
   without attentive curation, producing an unfounded link between a gene
   differential expression and its functional role in the treatment
   response. Single-cell differential expression analyses have elevated
   the issue, where gene differential expression between experimental
   groups is hindered by gene expression variability between cells.

   Although cell-to-cell variability in gene expression is typically
   overlooked in most analyses, we and others have observed that these
   differences between clonal cells, can impact the overall cell
   population response to a stimulation [[35]1–[36]3]. Originating from
   stochastic processes such as transcription initiation, cell-to-cell
   variability gives rise to an equilibrium of co-existing cellular states
   within an isogenic cell population [[37]4, [38]5]. The different phases
   of the cell cycles are one illustration of cell states that robustly
   proportioned resting cell populations [[39]6], but other functional
   cell states can be phenomenologically evidenced by the fractional
   response to cytotoxic cancer drugs for example (IC[50], Emax>0, [[40]7,
   [41]8]). However, most single-cell technologies are still unable to
   access meaningful cell information within clonal populations once cell
   cycle states signatures are regressed out, and important functional
   cell states remain confounded in gene expression noise [[42]1]. Apart
   from cell cycle genes, one hypothesis for the undetected (or
   unmeasurable) differences in seemingly homogeneous cell populations is
   that they contain cells in a wide variety of possible cell states that
   predisposed them to a number of responses or functions such as cell
   death, impairing pathway enrichment analyses. To recover the molecular
   determinants of clonal cells response to cancer drugs from the measured
   gene expression variability, we recently designed a same-cell
   functional pharmacogenomics approach, named fate-seq, that couples
   prior knowledge on the cell state (predicted drug response) to the
   transcriptomic profile of the same cell [[43]1]. With our same-cell
   approach, we could reveal the molecular factors regulating the efficacy
   of a drug treatment, from differential expression analyses of one
   sample of isogenic cells with no gene induction. Although genes
   differential expressions can now be linked to their functional role in
   drug response using fate-seq, prioritizing genes as best potential
   targets for co-treatment remains a difficult task.

1.2 Diffusion distances

   Indeed, gene prioritization and protein function prediction are
   challenging due to the small-world nature of interaction networks
   [[44]9, [45]10], in particular. To go beyond analysis using the direct
   neighbors of a node or shortest paths, the value of diffusion distances
   has been recognized long ago.

   Based on the correlation between the expression profile and the
   phenotype, as well as (diffusion) distances, various similarity
   measures between genes have been studied [[46]11]. The Diffusion State
   Distance (DSD) was defined as the L1 norm of m walks (RW) of k steps
   [[47]12]. The DSD was shown to be more effective than shortest-path
   distance to transfer functional annotations across nodes in
   protein-protein interaction networks (PPIN). The DSD was further
   extended to exploit annotations (weights) on edges, and to exploit an
   augmented graph incorporating specific interactions [[48]13]. To bias
   the random walk towards certain nodes, a random walk restarting (RWR)
   at those nodes can be applied, as initially used in the context of
   internet surfing [[49]14]. The stationary distribution of the strategy,
   called the page rank, depends on the restart probability vector
   [[50]15]. In [[51]16], the minimum of the page rank probability between
   two nodes is used to qualify the mutual affinity of two proteins in a
   network. RWR were also used to predict drug-target interactions on
   heterogeneous networks [[52]17], and also for layered/multiplex graphs,
   so as to combine complementary pieces of information [[53]18].
   Diffusion distances have also been used recently [[54]19] to understand
   how drugs treat diseases, using both a network of physical interactions
   and a hierarchy of biological functions. However, in the resulting
   multiscale interactome, the diffusion profiles are computed using a
   standard random walk with restart.

   A related topic is the problem of ranking differentially expressed
   genes across two conditions [[55]20]. Following the seminal work on
   gene set enrichment methods [[56]21], the template of gene set
   enrichment analysis (GSEA) consist of three steps, namely computing an
   enrichment score for each gene, estimating its statistical
   significance, and performing a correction for multiple hypothesis
   testing. Setting aside the issue of correlations between genes, such
   methods combine feature selection and clustering (one cluster per cell
   state/condition) [[57]22], but do not address the question of
   connecting two sets of genes using an interaction network.

   Finally, yet another related challenge is pathway enrichment analysis
   [[58]23]. Given a set of experimentally determined genes and a database
   of pathways, the goal here is to find pathways whose genes are
   over-represented in the gene set of interest. When pathways are known
   exhaustively, such analysis are sufficient to screen gene sets. If this
   assumption does not hold, finding intermediates between the gene set
   and the molecules in a pathway becomes mandatory.

   Adding to other prioritization methods [[59]24], Genetrank utilizes
   prior knowledge from functional cell states (transcriptomic profile of
   a predicted drug response), protein-protein interactions, and the
   expected target signaling pathway of a drug of interest.

1.3 Contributions

   We focus on gene prioritization related to a complex phenotype, based
   on expression profiles from single-cell RNA-seq. Formally, let X be a
   set of proteins associated with differentially expressed (DE) genes,
   and P be a set of proteins involved in a signaling pathway. Rather than
   to identify DE genes in X that are involved in the pathway of interest,
   our goal is to prioritize genes in X because they have been previously
   identified to be involved in a pathway of interest. In our case study,
   the DE genes have been identified for their effect on cell sensitivity
   to TRAIL [[60]1]. This dataset comprised DE genes that are the results
   of a drug efficacy screen: it is a gene signature of cell sensitivity
   at the drug IC[50]. Therefore the significantly DE genes in this screen
   are potential drug co-targets; they are not known to be drug targets
   and Genetrank seeks to prioritize them, to suggest the corresponding
   gene products as co-targets in combination treatments with the drug
   used in the screen. Hence, we prioritize genes in X given the knowledge
   of genes in P (which gene products are involved in the drug mechanisms
   of action), using an underlying PPIN, to pick out the genes having a
   higher likelihood to regulate the pathway. Previous work on diffusion
   distances (RW or RWR) has three limitations in this context. First, in
   using hit vectors or stationary distributions, all nodes of the network
   contribute to the comparison of two sources. Instead, we wish to focus
   on nodes in P. Second, RWR use a bias on sources, but in our case, the
   sets X and P may be considered on an equal footing, which commands
   analysis in both directions, i.e. from X to P and from P to X. Third,
   instead of using a single restart rate [[61]18], we study a filtration
   (sequence of nested sets) of genes retrieved, in tandem with so-called
   saturation indices revealing accessibility scales in the PPIN.

   To accommodate this rationale, we present a novel analysis technique
   based on random walks with restarts and absorbing states. Recall that
   in a Markov chain, an absorbing state is a state which is never exited.
   Since stationary distributions are irrelevant in this context [[62]25],
   we resort to hitting probabilities for the nodes on the target set P.
   As we show, doing so yields scores for pairs of genes in S × T and T ×
   TS, from which a ranking of genes in X is defined. As a case study, we
   use a dataset of differentially expressed genes involved in the
   regulation of a cancer drugs pharmacodynamics [[63]1].

2 Material

2.1 Goal: Formal statement

   Consider two sets of genes X and P respectively composing the
   predictive gene signature of sensitivity to a drug and the genes
   involved in its mechanism of action, as well as a protein interaction
   network (PPIN) containing the products of X and P as nodes. We
   introduce Genetrank, a method to prioritize the genes in X for their
   likelihood to regulate the genes in P.

2.2 Biological problem

   The main goal of our approach is to ameliorate the ranking of gene
   signatures from differential expression analyses in order to better
   select the genes whose products are likely to impact the phenotypic
   response of the cells. Using fate-seq, we focus on cancer cells briefly
   treated with the TNF-related Apoptosis Inducing Ligand (TRAIL), so for
   the drug response to be predicted for each cell that is then profiled
   by single-cell RNA sequencing [[64]1]. In such single-cell analyses
   performed with isogenic cells treated together, the stable differences
   in gene expression between cells from the two groups, namely predicted
   sensitive and predicted resistant, are small and otherwise confounded
   in gene expression noise. In addition, the short treatment necessary
   for the cell response prediction does not induce a detectable genomic
   response (Mendeley data Dataset
   [65]https://doi.org/10.17632/m289yp5skd.1 [[66]1, [67]26]) that would
   confound the functional role of the differentially expressed genes
   between the two groups with gene induction, as it is often the case in
   other studies.

2.3 Sets X and P

   The list of differentially expressed genes obtained in this study
   constitutes our set X [[68]1], S2 Table in [69]S1 File. The criteria
   used to define the two groups of single-cells compared in the
   differential expression analyses giving X, is the activation rate of
   caspase-8. Caspase-8 is a protein of the receptor-mediated apoptosis
   pathway, which is part of the TRAIL mechanism of action [[70]8].
   Therefore, in our case study the set P is a set of target genes, whose
   products are proteins of the receptor-mediated apoptosis signaling
   pathway.

   The sets X and P are of size 320 and 49, respectively. Altogether,
   these sets X and P represent a unique dataset to assess the benefit of
   our approach in ranking genes based on their likelihood to impact drug
   response.

2.4 PPIN

   A number of interaction databases coexists, each with specific
   features, in particular a trade-off between exhaustivity and
   confidence. We use MINT due in particular to the compliance with the
   protein naming standards [[71]27].

   The PPIN was constructed from interactions downloaded from the MINT
   website [72]https://mint.bio.uniroma2.it/. Only proteins with the
   species label Homo Sapiens were downloaded. The interacting proteins
   are identified in UniProtKB format. The resulting network is called the
   MINT network in the sequel. This initial graph, containing 11,672
   vertices representing proteins, and 52,839 edges representing
   protein—protein interactions, is edited as follows. First, the PPIN
   being disconnected, we focus on its largest connected component,
   encompassing 11,427 vertices. Second, we remove all self-interactions.
   Third, multiple interactions between the same two proteins are
   collapsed into a single edge. Summarizing, we obtain a graph with
   11,427 vertices and 36,526 edges.

2.5 Sets X and P within the PPIN

   Selected genes in the sets X and P being absent from the largest
   connected component of the PPIN were removed, finally obtaining sets X
   and P of size 227 and 41 (from sets of size 320 and 49 initially).

2.6 Variations on the set X

   In order to evaluate the impact of the size difference of X and P on
   the scores, we analyse the symmetry H(I) of [73]Eq (7) for instances
   involving sets X′ of varying size. Practically, for each s ∈ {|P|, 110,
   220} we pick N[r](= 1000) random subsets of X. We then compare the
   distributions of H(I) obtained.

3 Methods

3.1 Rationale and positioning with respect to previous work

   We consider a set X of experimentally determined genes, and a set P of
   genes belonging to a pathway. We introduce methods using random walks
   on graphs with two sets of vertices as input, referred to as sources
   (S) and targets (T). In order to analyze the (lack of) symmetry between
   paths joining X and P and vice-versa, we use our methods twice:
     * direction S = X ⇝ T = P: starting from nodes in X to reach nodes in
       P;
     * direction S = P ⇝ T = X: starting from nodes in P to reach nodes in
       X.

   To study the relationship between the node sets S and T, our
   modifications of diffusion based distances rely on absorbing states and
   a renormalization scheme. These modifications are actually motivated by
   two structural properties of interaction networks ([74]Fig 1).

Fig 1. Example interaction graph and Markov chain: Two structural properties
motivating absorbing states and the renormalization of hit probabilities.

   [75]Fig 1
   [76]Open in a new tab

   Arcs indicate transitions in the Markov chains—transition probabilities
   are omitted. The set of sources is S = {s[1], s[2], s[3]} and the set
   of targets is T = {t[1], t[2], t[3], t[4]}. When defining absorbing
   states, transitions corresponding to blue arcs are removed. This
   implies that t[2] will not be highlighted because t[1] or t[4] must be
   visited before any random walk starting from any source s ∈ S.
   Furthermore, the normalization aims at reducing the importance of t[3]
   in the study (due to its high proximity to the three sources).

   The first difficulty owes to a notion of subsidiarity amongst targets,
   meaning that if some vertices of T are neighbors (or very close from
   one another) in the graph, targets upstream will artificially modify
   the weights of targets downstream. Indeed, if a target is just after
   another (important) target, then its weight will be large even in the
   absence of direct paths from S ([77]Fig 1, target t[4]). In this
   context, absorbing states make it possible to stop the exploration
   process at such upstream/ancestor nodes, and force direct connexions
   from sources to subsidiary targets.

   The second difficulty owes to the close vicinity of selected targets to
   sources ([78]Fig 1, target t[3]), motivating the introduction of hit
   scores (Def. 3). For example, if a target is a neighbor of some sources
   in the graph, these hit scores decrease the weights of such direct
   paths, stressing the importance of the other non trivial paths. One can
   note that this normalization can be done with other functions, e.g. a
   distance-based function that could be the average of the |T| lengths of
   shortest paths between every source s ∈ S and a given target t.

   We now formally introduce our methods.

3.2 Graphs

   To connect X and P, we consider a PPIN whose vertices are the
   individual molecules, and whose edges represent pairwise interactions.
   Such a network is modeled by a vertex-weighted edge-weighted directed
   graph G = (V, E). The weight of any vertex u ∈ V is denoted w[u] and
   the weight of any edge uv ∈ E is denoted w[uv]. We assume that w[u],
   w[uv] ∈ (0, 1] for every u ∈ V and every uv ∈ E. In the unweighted
   case, we set w[u] = 1 and w[uv] = 1 for every u ∈ V and every uv ∈ E.
   In the undirected case, we have uv ∈ E if and only if vu ∈ E.

   Let n = |V| be the number of vertices and let V = {v[1], …, v[n]}. The
   set of out-neighbors (resp. in-neighbors) of a vertex u ∈ V is denoted
   [MATH:
   <mrow><msubsup><mi>N</mi><mi>G</mi><mo>+</mo></msubsup><mrow><mo>(</mo>
   <mi>u</mi><mo>)</mo></mrow></mrow> :MATH]
   (resp.
   [MATH:
   <mrow><msubsup><mi>N</mi><mi>G</mi><mo>-</mo></msubsup><mrow><mo>(</mo>
   <mi>u</mi><mo>)</mo></mrow></mrow> :MATH]
   ).

   To analyze paths between vertices of X and vertices of P, we formalize
   Markov chains and random walks.

3.3 Random walks and Markov chains with absorbing states and restart

3.3.1 Model

   In the sequel, we consider two sets of vertices S and T from the graph
   G, with S ∩ T = ∅.

   We define a Markov chain for which the set of states is exactly the set
   of vertices V. The transition matrix M is defined as follows for every
   pair (u, v) ∈ V × TV:
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>M</mi><mrow><mo>(</mo><mi>u</mi><mo>,</mo
   ><mi>v</mi><mo>)</mo></mrow><mo>=</mo><mo>{</mo><mtable><mtr><mtd
   columnalign="left"><mstyle displaystyle="true"
   scriptlevel="0"><mfrac><mrow><msub><mi>w</mi><mrow><mi>u</mi><mi>v</mi>
   </mrow></msub><msub><mi>w</mi><mi>v</mi></msub></mrow><msub><mo>∑</mo><
   mrow><msup><mi>v</mi><mo>′</mo></msup><mo>∈</mo><msubsup><mi>N</mi><mi>
   G</mi><mo>+</mo></msubsup><mrow><mo>(</mo><mi>u</mi><mo>)</mo></mrow><m
   sub><mi>w</mi><mrow><mi>u</mi><msup><mi>v</mi><mo>′</mo></msup></mrow><
   /msub><msub><mi>w</mi><msup><mi>v</mi><mo>′</mo></msup></msub></mrow></
   msub></mfrac></mstyle></mtd><mtd columnalign="left"><mrow><mspace
   width="4pt"></mspace><mtext>if</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mi>v</mi><mo>∈</mo><mi>E</mi><mspace
   width="4pt"></mspace><mtext>and</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>∉</mo><mi>T</mi><mo>,</mo></mrow></m
   td></mtr><mtr><mtd columnalign="left"><mn>1</mn></mtd><mtd
   columnalign="left"><mrow><mspace
   width="4pt"></mspace><mtext>if</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>=</mo><mi>v</mi><mspace
   width="4pt"></mspace><mtext>and</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>∈</mo><mi>T</mi><mo>,</mo></mrow></m
   td></mtr><mtr><mtd columnalign="left"><mrow><mn>0</mn><mspace
   width="4pt"></mspace><mtext>otherwise.</mtext></mrow></mtd></mtr></mtab
   le><mo></mo></mrow></mtd></mtr></mtable> :MATH]
   (1)

   Particular cases of this construction are as follows:
     * (Symmetric unweighted case)
       [MATH:
       <mrow><mi>M</mi><mrow><mo>(</mo><mi>u</mi><mo>,</mo><mi>v</mi><mo>)
       </mo></mrow><mo>=</mo><mfrac><mrow><msub><mi>w</mi><mrow><mi>u</mi>
       <mi>v</mi></mrow></msub><msub><mi>w</mi><mi>u</mi></msub></mrow><ms
       ub><mo>∑</mo><mrow><msup><mi>v</mi><mo>′</mo></msup><mo>∈</mo><msub
       sup><mi>N</mi><mi>G</mi><mo>+</mo></msubsup><mrow><mo>(</mo><mi>u</
       mi><mo>)</mo></mrow><msub><mi>w</mi><mrow><mi>u</mi><msup><mi>v</mi
       ><mo>′</mo></msup></mrow></msub><msub><mi>w</mi><msup><mi>v</mi><mo
       >′</mo></msup></msub></mrow></msub></mfrac><mo>=</mo><mfrac><mn>1</
       mn><mrow><msub><mi>d</mi><mi>G</mi></msub><mrow><mo>(</mo><mi>u</mi
       ><mo>)</mo></mrow></mrow></mfrac></mrow> :MATH]
       (first line of [79]Eq (1)).
     * (Symmetric edge-weighted case) [80]Eq (1) also generalizes the
       formulae used in [[81]13].

   Recall that a state is absorbing if once reached by a walk, it is never
   exited. Observe that the set of states T is absorbing in the Markov
   chain defined by transition matrix M. We now define the Markov chain
   with restart from M and from a subset S′ ⊆ S. Intuitively, for each
   vertex u ∈ V\T which is not a target, we add a transition to every
   vertex of S′ Formally, given a real number r ∈ [0, 1), the transition
   matrix
   [MATH:
   <msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r</mi></msubsup
   > :MATH]
   is defined as follows for every pair (u, v) ∈ V^2.
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo>
   </msup><mi>r</mi></msubsup><mrow><mo>(</mo><mi>u</mi><mo>,</mo><mi>v</m
   i><mo>)</mo></mrow><mo>=</mo><mo>{</mo><mtable><mtr><mtd
   columnalign="left"><mstyle displaystyle="true"
   scriptlevel="0"><mfrac><mrow><mrow><mo>(</mo><mn>1</mn><mo>-</mo><mi>r<
   /mi><mo>)</mo></mrow><msub><mi>w</mi><mrow><mi>u</mi><mi>v</mi></mrow><
   /msub><msub><mi>w</mi><mi>u</mi></msub></mrow><msub><mo>∑</mo><mrow><ms
   up><mi>v</mi><mo>′</mo></msup><mo>∈</mo><msubsup><mi>N</mi><mi>G</mi><m
   o>+</mo></msubsup><mrow><mo>(</mo><mi>u</mi><mo>)</mo></mrow><msub><mi>
   w</mi><mrow><mi>u</mi><msup><mi>v</mi><mo>′</mo></msup></mrow></msub><m
   sub><mi>w</mi><msup><mi>v</mi><mo>′</mo></msup></msub></mrow></msub></m
   frac></mstyle></mtd><mtd columnalign="left"><mrow><mspace
   width="4pt"></mspace><mtext>if</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mi>v</mi><mo>∈</mo><mi>E</mi><mspace
   width="4pt"></mspace><mtext>and</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>∉</mo><mi>T</mi><mo>,</mo></mrow></m
   td></mtr><mtr><mtd
   columnalign="left"><mfrac><mi>r</mi><mrow><mrow><mo>|</mo></mrow><msup>
   <mi>S</mi><mo>′</mo></msup><mrow><mo>|</mo></mrow></mrow></mfrac></mtd>
   <mtd columnalign="left"><mrow><mspace
   width="4pt"></mspace><mtext>if</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>∉</mo><mi>T</mi><mspace
   width="4pt"></mspace><mtext>and</mtext><mspace
   width="4pt"></mspace><mi>v</mi><mo>∈</mo><msup><mi>S</mi><mo>′</mo></ms
   up><mo>,</mo></mrow></mtd></mtr><mtr><mtd
   columnalign="left"><mn>1</mn></mtd><mtd
   columnalign="left"><mrow><mspace
   width="4pt"></mspace><mtext>if</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>=</mo><mi>v</mi><mspace
   width="4pt"></mspace><mtext>and</mtext><mspace
   width="4pt"></mspace><mi>u</mi><mo>∈</mo><mi>T</mi><mo>,</mo></mrow></m
   td></mtr><mtr><mtd columnalign="left"><mrow><mn>0</mn><mspace
   width="4pt"></mspace><mtext>otherwise.</mtext></mrow></mtd></mtr></mtab
   le><mo></mo></mrow></mtd></mtr></mtable> :MATH]
   (2)

   Note that restart transitions may have the same origin/destination as
   an existing transition in M. (Equivalently, in graph theoretical terms,
   one has two arcs between the same two vertices). In that case, the
   probabilities specified in ([82]2) should be added. The transition
   matrix M is a particular case of
   [MATH:
   <msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r</mi></msubsup
   > :MATH]
   when r = 0. Note also that when |S′| = 1, one restarts to a single
   vertex.

   Definition. 1 (State distribution) Consider an initial distribution
   uniform in the set S′. Given any r ∈ [0, 1) and any S′ ⊆ S, the state
   distribution at each step i ≥ 0 is denoted
   [MATH: <mrow><mtable equalcolumns="true"
   equalrows="true"><mtr><mtd><mrow><msubsup><mi>π</mi><mrow><msubsup><mi>
   M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r</mi></msubsup></mrow><mi>i
   </mi></msubsup><mo>=</mo><mfenced><mrow><msubsup><mi>π</mi><mrow><msubs
   up><mi>M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r</mi></msubsup></mro
   w><mi>i</mi></msubsup><mfenced><mrow><msub><mi>v</mi><mn>1</mn></msub><
   /mrow></mfenced><mo>,</mo><mo>…</mo><mo>,</mo><msubsup><mi>π</mi><mrow>
   <msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r</mi></msubsup
   ></mrow><mi>i</mi></msubsup><mfenced><mrow><msub><mi>v</mi><mi>n</mi></
   msub></mrow></mfenced></mrow></mfenced><mo>,</mo></mrow></mtd></mtr></m
   table></mrow> :MATH]

   with
   [MATH:
   <msubsup><mi>π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo><
   /msup><mi>r</mi></msubsup></mrow><mn>0</mn></msubsup><mo
   stretchy="false">(</mo><mi>u</mi><mo
   stretchy="false">)</mo><mo>=</mo><mn>0</mn> :MATH]
   for every u ∉ S′ and
   [MATH:
   <msubsup><mi>π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo><
   /msup><mi>r</mi></msubsup></mrow><mn>0</mn></msubsup><mo
   stretchy="false">(</mo><mi>u</mi><mo
   stretchy="false">)</mo><mo>=</mo><mo>=</mo><mfrac><mn>1</mn><mrow><mo>|
   </mo><msup><mi>S</mi><mo>′</mo></msup><mo>|</mo></mrow></mfrac> :MATH]
   for every u ∈ S′.

   Under the assumption that, from every u ∈ S′ there exists a path in G
   to some v ∈ T, the limit of this vector exists when i → ∞, for every
   value of r ∈ [0, 1). We denote it with
   [MATH:
   <msub><mi>π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></ms
   up><mi>r</mi></msubsup></mrow></msub> :MATH]
   . The probabilities in this vector are commonly known as hitting
   probabilities of the target set T.

   Using the target set, we define:

   Definition. 2 (Hit probability vector) Let T = {t[1], …, t[k]} be the
   target set. Given any r ∈ [0, 1) and any S′ ⊆ S, the hit vector
   [MATH:
   <msub><mi>π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></ms
   up><mi>r</mi></msubsup></mrow></msub><mo>=</mo><mfenced><mrow><msub><mi
   >π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r<
   /mi></msubsup></mrow></msub><mfenced><mrow><msub><mi>t</mi><mn>1</mn></
   msub></mrow></mfenced><mo>,</mo><mo>…</mo><mo>,</mo><msub><mi>π</mi><mr
   ow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></msup><mi>r</mi></msub
   sup></mrow></msub><mfenced><mrow><msub><mi>t</mi><mi>k</mi></msub></mro
   w></mfenced></mrow></mfenced> :MATH]
   is composed of the hitting probabilities for states of T.

   In the following, we abuse the notation writing
   [MATH: <msubsup><mi>M</mi><mi>s</mi><mi>r</mi></msubsup> :MATH]
   instead of
   [MATH:
   <msubsup><mi>M</mi><mrow><mo>{</mo><mi>s</mi><mo>}</mo></mrow><mi>r</mi
   ></msubsup> :MATH]
   . Furthermore, we will write M^r instead of
   [MATH: <msubsup><mi>M</mi><mi>S</mi><mi>r</mi></msubsup> :MATH]
   . Finally, we renormalize the vectors with the hit vector of the chain
   without restart (i.e. r = 0):

   Definition. 3 (Hit score) Given r ∈ [0, 1], define the score from s to
   t as
   [MATH:
   <mrow><msup><mi>Q</mi><mrow><mfenced><mi>r</mi></mfenced></mrow></msup>
   <mfenced><mrow><mi>s</mi><mo>,</mo><mi>t</mi></mrow></mfenced><mo>=</mo
   ><msub><mi>π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′</mo></m
   sup><mi>r</mi></msubsup></mrow></msub><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo><mo>/</mo><msub><mi>π</mi><mrow><msup><mi>M</mi>
   <mn>0</mn></msup></mrow></msub><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo><mo>.</mo></mrow> :MATH]
   (3)

   The hit score vector associated with each source s ∈ S is (Q^(r)(s,
   t[1]), …, Q^(r)(s, t[k])).

   The log score log Q^(r)(s, t) is the natural logarithm of Q^(r)(s, t).

   Finally the rank of the score of a pair (s, t)∈S × T is defined as
   follows:
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>r</mi><mi>a</mi><mi>n</mi><msubsup><mi>k<
   /mi><mrow><mi>S</mi><mi>T</mi></mrow><mrow><mo>(</mo><mi>r</mi><mo>)</m
   o></mrow></msubsup><mrow><mo>(</mo><mi>s</mi><mo>,</mo><mi>t</mi><mo>)<
   /mo></mrow><mo>=</mo><mn>1</mn><mo>+</mo><mrow><mo>|</mo><mrow><mo>{</m
   o><msup><mi>Q</mi><mrow><mo>(</mo><mi>r</mi><mo>)</mo></mrow></msup><mr
   ow><mo>(</mo><msup><mi>s</mi><mo>′</mo></msup><mo>,</mo><msup><mi>t</mi
   ><mo>′</mo></msup><mo>)</mo></mrow><mo>></mo><msup><mi>Q</mi><mrow><mo>
   (</mo><mi>r</mi><mo>)</mo></mrow></msup><mrow><mo>(</mo><mi>s</mi><mo>,
   </mo><mi>t</mi><mo>)</mo></mrow><mo>,</mo><msup><mi>s</mi><mo>′</mo></m
   sup><mo>∈</mo><mi>S</mi><mo>,</mo><msup><mi>t</mi><mo>′</mo></msup><mo>
   ∈</mo><mi>T</mi><mo>,</mo><mrow><mo>(</mo><msup><mi>s</mi><mo>′</mo></m
   sup><mo>,</mo><msup><mi>t</mi><mo>′</mo></msup><mo>)</mo></mrow><mo>≠</
   mo><mrow><mo>(</mo><mi>s</mi><mo>,</mo><mi>t</mi><mo>)</mo></mrow><mo>}
   </mo></mrow><mo>|</mo></mrow><mo>.</mo></mrow></mtd></mtr></mtable>
   :MATH]
   (4)

   Remark 1 The hit score from [83]Eq 3 incorporates three ingredients,
   namely (i) a random walk with restart, (ii) absorbing states, and (iii)
   renormalization by the value obtained without restart. We may define
   other scores by removing any of these ingredients, e.g. the mechanism
   of absorbing states.

3.4 Scores and their symmetry

3.4.1 Scores: X ⇝P versus P ⇝X

   In order to analyze the (lack of) symmetry between paths joining X and
   P and vice-versa, We apply the previous construction to two settings:
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mo>{</mo><mtable><mtr><mtd
   columnalign="left"><mrow><mi>X</mi><mo>⇝</mo><mi>P</mi><mo>:</mo><mo>(<
   /mo><mi>S</mi><mo>=</mo><mi>X</mi><mo>,</mo><mi>T</mi><mo>=</mo><mi>P</
   mi><mo>)</mo></mrow></mtd></mtr><mtr><mtd
   columnalign="left"><mrow><mi>P</mi><mo>⇝</mo><mi>X</mi><mo>:</mo><mo>(<
   /mo><mi>S</mi><mo>=</mo><mi>P</mi><mo>,</mo><mi>T</mi><mo>=</mo><mi>X</
   mi><mo>)</mo></mrow></mtd></mtr></mtable></mrow></mtd></mtr></mtable>
   :MATH]
   (5)

3.4.2 Instances (PPIN, X, P)

   Using a different PPIN, a different experimental gene set X or a
   different pathway gene set P will affect the score Q^(r)(x, p) for a
   given pair (x, p). In order to make it clearer we define an instance of
   execution as the triplet I = (PPIN, X, P), and we refer to the score
   obtained under this instance as
   [MATH:
   <mrow><msubsup><mi>Q</mi><mrow><mi>I</mi></mrow><mrow><mo>(</mo><mi>r</
   mi><mo>)</mo></mrow></msubsup><mrow><mo>(</mo><mi>x</mi><mo>,</mo><mi>p
   </mi><mo>)</mo></mrow></mrow> :MATH]
   . However, for conciseness, we simply denote this score Q^(r)(x, p).

3.4.3 Symmetry at the gene level

   Consider an experimental gene x ∈ X and a pathway gene p ∈ P. Using
   [84]Eq (3), we assess the asymmetry using the ratio of log scores
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msubsup><mi>R</mi><mtext>log</mtext><mrow><m
   o>(</mo><mi>r</mi><mo>)</mo></mrow></msubsup><mrow><mo>(</mo><mi>s</mi>
   <mo>,</mo><mi>t</mi><mo>)</mo></mrow><mo>=</mo><mtext>min</mtext><mrow>
   <mo>(</mo><mtext>log</mtext><msup><mi>Q</mi><mrow><mo>(</mo><mi>r</mi><
   mo>)</mo></mrow></msup><mrow><mo>(</mo><mi>s</mi><mo>,</mo><mi>t</mi><m
   o>)</mo></mrow><mo>,</mo><mtext>log</mtext><msup><mi>Q</mi><mrow><mo>(<
   /mo><mi>r</mi><mo>)</mo></mrow></msup><mrow><mo>(</mo><mi>t</mi><mo>,</
   mo><mi>s</mi><mo>)</mo></mrow><mo>)</mo></mrow><mo>/</mo><mtext>max</mt
   ext><mrow><mo>(</mo><mtext>log</mtext><msup><mi>Q</mi><mrow><mo>(</mo><
   mi>r</mi><mo>)</mo></mrow></msup><mrow><mo>(</mo><mi>s</mi><mo>,</mo><m
   i>t</mi><mo>)</mo></mrow><mo>,</mo><mtext>log</mtext><msup><mi>Q</mi><m
   row><mo>(</mo><mi>r</mi><mo>)</mo></mrow></msup><mrow><mo>(</mo><mi>t</
   mi><mo>,</mo><mi>s</mi><mo>)</mo></mrow><mo>)</mo></mrow><mo>.</mo></mr
   ow></mtd></mtr></mtable> :MATH]
   (6)

3.4.4 Symmetry at the instance level

   Consider an instance I = (PPIN, X, P). (Because throughout this paper
   we make use of a single PPIN (MINT) and the same set P, we will use the
   shorthand (X) for the instance). To study the symmetry at the instance
   level, we consider the proportion of pairs (x, p) ∈ X × P such that
   [MATH:
   <mrow><msubsup><mi>Q</mi><mrow><mi>I</mi></mrow><mrow><mo>(</mo><mi>r</
   mi><mo>)</mo></mrow></msubsup><mrow><mo>(</mo><mi>x</mi><mo>,</mo><mi>p
   </mi><mo>)</mo></mrow><mo>></mo><msubsup><mi>Q</mi><mrow><mi>I</mi></mr
   ow><mrow><mo>(</mo><mi>r</mi><mo>)</mo></mrow></msubsup><mrow><mo>(</mo
   ><mi>p</mi><mo>,</mo><mi>x</mi><mo>)</mo></mrow></mrow> :MATH]
   . Formally, denoting
   [MATH: <msub><mn mathvariant="bold">1</mn><mi>b</mi></msub> :MATH]
   the indicator function of the boolean b, the symmetry of the results
   for I is given by
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>H</mi><mrow><mo>(</mo><mi>I</mi><mo>)</mo
   ></mrow><mo>=</mo><mfrac><mn>1</mn><mrow><mo>|</mo><mi>X</mi><mo>|</mo>
   <mo>*</mo><mo>|</mo><mi>P</mi><mo>|</mo></mrow></mfrac><munder><mo>∑</m
   o><mrow><mo>(</mo><mi>x</mi><mo>,</mo><mi>p</mi><mo>)</mo><mo>∈</mo><mi
   >X</mi><mo>×</mo><mi>P</mi></mrow></munder><msub><mn
   mathvariant="bold">1</mn><mrow><msubsup><mi>Q</mi><mrow><mi>I</mi></mro
   w><mrow><mo>(</mo><mi>r</mi><mo>)</mo></mrow></msubsup><mrow><mo>(</mo>
   <mi>x</mi><mo>,</mo><mi>p</mi><mo>)</mo></mrow><mo>></mo><msubsup><mi>Q
   </mi><mrow><mi>I</mi></mrow><mrow><mo>(</mo><mi>r</mi><mo>)</mo></mrow>
   </msubsup><mrow><mo>(</mo><mi>p</mi><mo>,</mo><mi>x</mi><mo>)</mo></mro
   w></mrow></msub><mo>.</mo></mrow></mtd></mtr></mtable> :MATH]
   (7)

3.5 Genetrank, saturation indices, hits

   Using the scores in the two directions X ⇝P versus P ⇝X ([85]Eq 5), we
   now define a ranking on the genes of X:

   Definition. 4 (Average score) Let 1 ≤ τ ≤ |P| be an integer. Consider a
   fixed value of the restart rate r. For a source x, let the average
   score be the arithmetic mean over the top τ values max(log Q^(r)(x, p),
   log Q^(r)(p, x)) observed for p ∈ P. The gene network ranking
   (Genetrank) of genes in X is the ranking associated with the
   aforementioned average values. The set of top k genes of the ranking is
   denoted
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   .

   Note that when τ = 1, the ranking of a gene in X is determined by its
   largest max score. Averaging scores over τ targets makes intuitive
   sense here when identifying whole connectedness to a pathway.

   To assess the stability of this ranking, we proceed as follows.
   Consider a set of values R = {r[1], …, r[N]}, sorted by increasing or
   decreasing value. We define the set of genes found in
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   up to a given value r[l], with 1 ≤ l ≤ N, by
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><msub><mstyle
   displaystyle="true"
   mathsize="140%"><mo>∪</mo></mstyle><mrow><mi>j</mi><mo>=</mo><mn>1</mn>
   <mo>,</mo><mo>…</mo><mo>,</mo><mi>l</mi></mrow></msub><msubsup><mi>T</m
   i><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mrow><mo
   stretchy="false">(</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup><mo>.</mo></mrow> :MATH]
   (8)

   We now use this set to qualify the speed at which we discover the
   sources in X when increasing the upper bound on the restart rate:

   Definition. 5 (Saturation indices for an increasing sequence of values
   of r.) The saturation index at threshold r[l] is the fraction of
   sources present in
   [MATH:
   <mrow><mo>|</mo><mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>
   k</mi></mrow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo>|</mo></mrow> :MATH]
   , that is:
   [MATH:
   <mrow><mi>S</mi><mi>a</mi><msubsup><mi>t</mi><mrow><mi>τ</mi><mo>,</mo>
   <mi>k</mi></mrow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mfrac><mrow><mfenced
   close="|"
   open="|"><mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi><
   /mrow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow></mfenced></mrow><mrow><
   mfenced close="|"
   open="|"><mi>X</mi></mfenced></mrow></mfrac><mfenced><mrow><mo>≤</mo><m
   n>1</mn></mrow></mfenced><mo>.</mo></mrow> :MATH]
   (9)

   The relative saturation index is the latter normalized by the value of
   k used:
   [MATH: <mrow><msubsup><mrow><mover
   accent="true"><mrow><mi>S</mi><mi>a</mi><mi>t</mi></mrow><mo
   stretchy="true">¯</mo></mover></mrow><mrow><mi>τ</mi><mo>,</mo><mi>k</m
   i></mrow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mfrac><mrow><mi>S</m
   i><mi>a</mi><msubsup><mi>t</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mr
   ow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mi>k</mi></mfrac></mrow
   > :MATH]
   (10)

   In the absence of overlap between consecutive
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   , one would have
   [MATH:
   <mrow><mo>|</mo><mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>
   k</mi></mrow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mo>|</mo><mo>=</mo><mi>
   k</mi><mo>×</mo><mi>l</mi></mrow> :MATH]
   . Thus, normalizing by k provides a measure of the overlap between
   consecutive sets.

   We note in passing that the previous sets can be used to define how
   many hits in a given reference list of genes
   [MATH: <mi mathvariant="script">L</mi> :MATH]
   are obtained:

   Definition. 6 (Hits) Consider a reference list of genes
   [MATH: <mi mathvariant="script">L</mi> :MATH]
   . The number of hits for particular values (r, k) is the size of the
   set
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><mi>r</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mrow><mo>∩</mo><mi
   mathvariant="script">L</mi></mrow> :MATH]
   . For a fixed k, we similarly consider the size of the set
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><mo>→</mo><mi>r</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mrow><mo>∩</mo><mi
   mathvariant="script">L</mi></mrow> :MATH]
   .

   Remark 2 Saturation indices for a decreasing sequence of values of r
   readily generalize from Eqs [86]8, [87]9 and [88]10. In fact, a larger
   (resp. smaller) value of r amounts to zooming in (resp. out) towards
   the sources.

3.6 Graphical representations with radar scatter plots

   We wish to rank genes from X using genes from P, exploiting the
   directions X ⇝P and P ⇝X.

3.6.1 Score radar plots

   The difficulty in working with values LogCount(x), LogFoldChange(x) for
   x ∈ X represented in an MA plot, is that all pairs (x, p) get mapped
   onto the same point. To get around this difficulty, we associate a
   radar plot with each point x ∈ X, yielding an overall score radar
   scatter plot. Each gene score radar plot is defined as follows:
     * the background of the gene radar plot is colored using a heat map
       indexed on the largest (X ⇝P or P ⇝X) log score observed for that
       gene. This background color makes it easy to spot the individual
       radar plots with high scores.
     * the gene radar plot has a number of spokes equal to the top k (user
       defined) scores.
     * on each spoke, two values are found, namely the scores log Q^(r)(x,
       p) and log Q^(r)(p, x).
     * finally, the radar plot title is set set to the gene name
       accompanied by the interval of scores (log scale).

3.6.2 Score radar MA plot

   Displaying all individual score radar plots in the

   LogCount(x), LogFoldChange(x) plane yields the so-called Score radar MA
   plot.

3.7 Complexity and implementation

3.7.1 Complexity

   The running time of one instance i.e. computing the hit scores at a
   fixed r, depends on the sizes of the PPIN, and of sets X and P.

   In contrast with the classical stationary probabilities of ergodic
   Markov chains, hit probabilities are dependent on the initial state.
   Therefore, where a stationary distribution needs to be computed once
   and occupies the space of a vector of dimension n = |V|—the number of
   nodes, hit probabilities need to be computed for each initial state
   envisioned and they occupy a total space of size |X| × |P|. Exact
   formulas for stationary probabilities and for hit probabilities both
   involve matrix inversions [[89]28], at a practical cost of order n^3.
   They are therefore not practical for large sparse graphs, and iterative
   methods are usually preferred [[90]29], leading to an approximate
   result with any desired accuracy. The simplest iterative method
   actually computes state probability vectors
   [MATH:
   <mrow><msubsup><mi>π</mi><mrow><msubsup><mi>M</mi><msup><mi>S</mi><mo>′
   </mo></msup><mi>r</mi></msubsup></mrow><mi>i</mi></msubsup></mrow>
   :MATH]
   (Def. 1) at successive steps until some convergence criterion is met.
   It applies both to stationary probabilities and to hit probabilities.
   It produces an ϵ-approximation with a computation time of order
   C|E|log(1/ϵ), C being a constant related with the spectrum of the
   graph. For Markov chains with restart probability r, this constant
   increases with r and we observe this phenomenon in our experiments.

   Practically, we compute the hit scores of Def. 3 using the C++ Marmote
   library ([[91]30] and [92]https://wiki.inria.fr/MARMOTE/Welcome).
   Specifically, we use from Marmote the iterative method which
   approximates the result with iterations of order m, with m = |E| the
   number of edges. It is faster and also more stable in practice, since
   it involves only positive numbers. The stopping criterion chosen for
   iterations is that the L[1] distance between successive iterates is
   less than 10^−6.

3.7.2 Implementation

   The whole pipeline is implemented in the Genetrank package of the
   Structural Bioinformatics Library ([[93]31] and
   [94]http://sbl.inria.fr), see
   [95]https://sbl.inria.fr/doc/Genetrank-user-manual.html.

   Practically, processing one instance took a few minutes (<5)
   worst-case, on a standard desktop computer (Precision 7920 Tower, 64 Go
   of RAM, Intel(R) Xeon(R) Silver 4214 CPU at 2.20GHz; OS: Fedora Core
   32).

3.8 Tests: Setup

3.8.1 Contenders

   As already noticed (Rmk. 1), the hit score from [96]Eq 3 incorporates
   three ingredients, namely (i) a random walk with restart, (ii)
   absorbing states, and (iii) renormalization by the value obtained
   without restart. To assess the importance of the latter two
   ingredients, three contenders of nested complexity are considered in
   the sequel:
     * pr-affinity: the minimum page rank affinity introduced in [[97]16],
       using a plain random walk with restart model.
     * Genetrank-renorm: score obtained from a random walk with restart
       and renormalization using [98]Eq 3—but no absorbing state.
     * Genetrank-AS: score obtained from a random walk with restart and
       absorbing states—but without the renormalization of [99]Eq 3.
     * Genetrank: score obtained using all ingredients: random walk with
       restart, absorbing states, and the renormalization of [100]Eq 3.

3.8.2 Parameters used

   The following values are used in our experiments:
     * 81 values of r, from r = 0 to r = 0.8 by steps of 0.01,
     * three values of τ (Def. 4): τ ∈ {1, 20, 41}, (recall that |P| =
       41),
     * ten values of k: k ∈ {5, 10, 15, 20, 25, 30, 35, 40, 45, 50},
     * saturation indices for r sorted by increasing/decreasing values
       (Rmk. 2).

4 Results

4.1 Genetrank and saturation indices

   The saturation index makes it possible to study the variation of the
   size of the set of genes selected by the three contenders when r
   increases or decreases. To promote the vicinity of sources in the PPIN,
   we process values r by decreasing value (Rmk. 2). We inspect in turn
   the saturation, relative saturation and number of hits (Figs [101]2 and
   [102]3, S4, S5 Figs in [103]S1 File). We use the median value τ = 20 to
   compute average scores—see the Supplemental for the values τ = 1 and τ
   = 40.

Fig 2. (pr-affinity) saturation plots (Def. 5) for τ = 20.

   [104]Fig 2
   [105]Open in a new tab

   Values of r processed in decreasing order. (Left column) Saturation
   index and slices at r = cst and k = cst (See [106]Eq 9) (Right column)
   Relative saturation index and slices at r = cst and k = cst (See
   [107]Eq 10).

Fig 3. (Genetrank) saturation plots (Def. 5) for τ = 20.

   [108]Fig 3
   [109]Open in a new tab

   Values of r processed in decreasing order. (Left column) Saturation
   index and slices at r = cst and k = cst (See [110]Eq 9) (Right column)
   Relative saturation index and slices at r = cst and k = cst (See
   [111]Eq 10).

   The methods pr-affinity, Genetrank-renorm, and Genetrank-AS yield a
   very similar behavior in two respects. First, the saturation gets
   maximum (one) for large values of k, and is relatively insensitive to
   the value of r ([112]Fig 2(B), S4(B), S5(B) Figs in [113]S1 File).
   Also, the relative saturation drops down to very small values rapidly
   ([114]Fig 2(E), S4(E), S5(E) Figs in [115]S1 File). In sharp contrast,
   the maximum saturation yielded by Genetrank is equal to ∼0.4, and shows
   a marked dependence on the restart rate ([116]Fig 3(B)). The relative
   saturation is also much more sensitive to the value of k used, with a
   coherent behavior as a function of k at fixed values of r ([117]Fig
   3(E) and 3(F)).

   These observations stress the specificity yielded by the combination of
   renormalization and absorbing states in Genetrank. To further these
   insights, we proceed with a more detailed analysis of the incidence of
   the parameters r and τ:
     * (r) The restarting rate r defines the bias towards sources.
       Whatever the value of k, in processing values of r in decreasing
       values, we observe that the saturation increases when r approaches
       zero ([118]Fig 3(C)). The slope of the curves increases for values
       of r ≤ 0.1, showing that for such values of the restart rate, the
       random walks get to explore a larger region of the PPIN. In
       Genetrank, larger values of the restart rate are therefore
       instrumental in promoting a more specific exploration.
     * (τ) Increasing τ consists of averaging on more targets. This
       averaging yields a marked increase of the saturation (whatever the
       value of k), especially for small values of r ([119]Fig 3, S6, S7
       Figs in [120]S1 File).

   Let us now consider the relative saturation ([121]Fig 3(D)–3(F);
   S6(D)-S6(F), S7(D)-S7(F) Figs in [122]S1 File). Sections of the surface
   at r = cst yield a monotonic behavior, that is the smaller the restart
   rate the larger the relative saturation. (We also note that for the
   first value of r processed, that is r = 0, 8, the relative saturation
   is always equal to 1/|X|∼0.0044.) Interestingly, slices at r = cst are
   not monotonic. The valleys crossed on such slices show that increasing
   k increases the saturation but not necessarily the relative saturation.
   This may be interpreted as accessibility scales in the PPIN.

4.2 Evaluating the effect of varying restart rates on scores, using
experimentally validated gene hits

   Consider the list of reference genes (and their products) that had been
   experimentally validated following the single-cell functional genomics
   approach using predictive cell dynamics [[123]1] [124]O15304 (SIVA1),
   [125]P78537 (BLOC1S1), [126]P0CW18 (PRSS56), [127]P53007 (SLC25A1),
   [128]Q9Y2X8 (UBE2D4), DNM1L([129]O00429). Note that [130]Q6UW78(UQCC3)
   cannot be considered here, as it is not present in our PPIN of
   interest. We compute the number of genes from this list found in the
   sets
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><mi>r</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   and
   [MATH:
   <mrow><mo>|</mo><mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>
   k</mi></mrow><mrow><mo
   stretchy="false">(</mo><mo>→</mo><msub><mi>r</mi><mi>l</mi></msub><mo
   stretchy="false">)</mo></mrow></msubsup></mrow></mrow> :MATH]
   .

   We compare the results of the four methods (Figs [131]4 and [132]5, S8,
   S9 Figs in [133]S1 File). Consistent with the analysis in the previous
   section, the methods pr-affinity and Genetrank-renorm are rather non
   specific, with a number of hits yielded by
   [MATH:
   <mrow><msubsup><mi>T</mi><mrow><mi>τ</mi><mo>,</mo><mi>k</mi></mrow><mr
   ow><mo stretchy="false">(</mo><mi>r</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   essentially constant whatever the value of r at a fixed value of k,
   whatever the value of τ (S4, S8 Figs in [134]S1 File). The method
   Genetrank-AS shows a more contrived behavior, with the same number of
   genes uncovered (i.e. 5) for small values of r, yet more variations
   when varying r at fixed values of k (S9 Fig in [135]S1 File).

Fig 4. (pr-affinity) hits (Def. 6) for the list of reference genes
[136]O15304 (SIVA1), [137]P78537 (BLOC1S1), [138]P0CW18 (PRSS56), [139]P53007
(SLC25A1), [140]Q9Y2X8 (UBE2D4), DNM1L([141]O00429).

   [142]Fig 4
   [143]Open in a new tab

Fig 5. (Genetrank) hits (Def. 6) for the list of reference genes [144]O15304
(SIVA1), [145]P78537 (BLOC1S1), [146]P0CW18 (PRSS56), [147]P53007 (SLC25A1),
[148]Q9Y2X8 (UBE2D4), DNM1L([149]O00429).

   [150]Fig 5
   [151]Open in a new tab

   The method Genetrank goes one step beyond, namely discovers genes more
   selectively when increasing k and changing the restart rate ([152]Fig
   5). Small values of τ allow retrieving three genes using fewer values
   of r, clearly showing the specificity/robustness of nodes of X highly
   ranked with a large restart rate. Conversely, using larger values of τ
   in conjunction with smaller values of r (larger excursions in the PPIN)
   allows reporting four genes in total. Except for τ = 1, using large
   values of r requires larger values of k to retrieve the known genes:
   for τ = 20 and τ = 41, a single gene is retrieved when r > 0.5.

   These experiments show the progressiveness yielded by Genetrank is
   exploring the PPIN, as opposed to avoid the fast mixing observed in
   pr-affinity and Genetrank-renorm, and to a lesser extent Genetrank-AS.
   Methods without absorbing states see the whole PPIN in a more
   homogeneous way, due to its small-world nature. This behaviour is even
   more pronounced in our case, as the target genes form a pathway.
   Indeed, a random walk reaching one such node is likely to discover its
   neighbors right after (See Comparison to previous work, Sec. 3.3.) When
   a absorbing states are used, the random walks halts, and the discovery
   of neighbors does not take place. The introduction of absorbing states
   therefore appears crucial to localize the exploration of connexions, in
   conjunction the choice of the restart rate r—generally taken to r = 0.7
   in previous work – see [[153]18] and citations therein.

4.3 Symmetry analysis on a per-source basis: Radar plots

   The symmetry ratio H(I) is a global assessment based on all pairs in
   the Cartesian product X × TP. For an assessment of the asymmetry on a
   per gene basis, we resort to radar plots (Sec. 3.6). Example radar
   plots for genes experimentally validated are provided are provided on
   [154]Fig 6). The individual radar plots can be assembled in a global MA
   plot ([155]Fig 7).

Fig 6. Radar plots for four experimentally validated genes.

   [156]Fig 6
   [157]Open in a new tab

   The range of values covers the range of log scores observed. The two
   bullets on a spoke read as follows: blue dot: direction PX ⇝; green
   dot: direction XP ⇝.

Fig 7. Score radar scatter plot.

   [158]Fig 7
   [159]Open in a new tab

   The individual radar plots ([160]Fig 6) are assembled in a MA plot.
   Zooming on a radar reveals the scores for that gene (80 data points per
   gene in a radar).

4.4 Biological analysis

4.4.1 Differentially expressed genes

   The set of differentially expressed (DE) genes obtained from the
   single-cell multimodal profiling approach fate-seq, allows comparing
   transcriptomic profiles between single cells of an isogenic population,
   grouped by their predicted drug response [[161]1]. Practically, the DE
   genes have been identified for their effect on cell sensitivity at the
   drug IC[50], and represent the signature of drug efficacy. (The DE
   genes here, are potential combination targets to increase the treatment
   efficacy.) Although this response prediction allowed the differential
   analysis using the edgeR likelihood ratio test framework [[162]32] by
   defining two groups (predicted sensitive vs. tolerant cells), the most
   significant DE genes (false discovery rate FDR < 0.1 and |log 2(FC)| >
   2) still constitute a list of more than 60 genes. Such a large list
   defies the expected practical set of potential targets that would serve
   to design co-therapeutic strategies increasing overall treatment
   efficacy. In addition, ranking only on differential expression might
   underestimate a gene for its potential function as regulator of the
   pathway overall activity. As an example, among these gene hits, we have
   observed that, at equal distance (the shortest path between the source
   and the target gene caspase-8), the noisier the gene expression was,
   the larger the effect a gene perturbation had on cell death, and at
   comparable expression variability: the longer the shortest path, the
   larger the effect [[163]1]. We reasoned that diffusion distances could
   also ameliorate ranking of cell-to-cell differential expression
   analyses.

   We use Genetrank to sort genes for their connectedness to molecular
   factors regulating the signaling pathway triggered by the drug of
   interest (TRAIL). We intersect the list of the 65 most significant DE
   genes obtained by edgeR, with the set of genes obtained with Genetrank
   (k = 50, range of values of r: 0.0.8, τ = 41; [164]Eq (8) and S7 Fig in
   [165]S1 File), resulting in 18 genes (S1 Table in [166]S1 File).

   This gene shortlist presents a number of valuable advantages over the
   list of significant DE genes, as it becomes more manageable for
   experimental validation, and drug target discovery. In addition we
   found that this shortlist contained genes that had been previously
   reported to regulate cell death and importantly, it was enriched in
   genes that we had experimentally validated for having an effect on
   TRAIL response.

4.4.2 Previously validated genes

   Out of the 18 genes (S1 Table in [167]S1 File) we prioritized for their
   likelihood of having an effect on the drug mechanism of action (MoA)
   using Genetrank, we found 4 genes that we had previously validated
   experimentally, namely BLOC1S1, DNM1L, UBE2D4, SLC25A1. This result
   indicates that we successfully enriched the list with gene hits that
   have functional relevance as co-drug targets. Indeed, among the target
   genes shortlisted solely based on statistical criteria (differential
   expression analyses between predicted resistant and predicted sensitive
   cells to TRAIL), only DNM1L, had previous reports on TRAIL sensitivity.
   Indeed, DNM1L encoding the dynamin-1-like protein DRP1, involved in
   mitochondrial division and apoptosis, has been reported to increase
   TRAIL sensitivity regardless of the co-treatment strategy (recruiting
   DRP1 at mitochondrial membrane, or inhibiting DRP1), which lead the
   authors to suggest that DRP1 might not be the target of mdivi-1, its
   originally reported inhibitor [[168]33, [169]34]. In our experimental
   screen, cells that were predicted resistant to TRAIL based on low
   caspase-8 early activation rates, showed increased DNM1L expression
   levels [[170]1]. Also in this recent study, we could show that DNM1L
   over expression reduced caspase-8 activation and cell death in response
   to TRAIL, and consistently, that DRP1 or dynamins inhibition (using
   mdivi-1 or dynasore), both increase caspase-8 activation and cell
   death. All together, these results suggest that in addition to the
   pro-apoptotic role of DRP1 at the mitochondrial membrane, DRP1 could
   play an anti-apoptotic role at the receptor level on caspase-8, further
   validating the approach presented here and the relevance of Genetrank.

4.4.3 Novel genes

   Genetrank also puts forward some genes that were initially further down
   the list and therefore in an unfavorable position to command gene
   validation or functional studies. JADE1 for example, has been shown to
   promote apoptosis in renal cancer cells [[171]35], and MUL1 [[172]36,
   [173]37], which should motivate further experimental investigations.

5 Discussion

5.1 Method

   This work presents a gene prioritization method, Genetrank, which can
   be coupled with single-cell functional genomics approaches to rank a
   set of genes X for their likelihood to regulate the cell signaling
   pathway P, which is targeted by the drug of interest (i.e. rank genes
   for their impact on drug sensitivity). While diffusion based distances
   have been used for several problems in interaction network analysis,
   the Genetrank introduces several refinements. The first one is to use
   random walks with restarts and absorbing states to focus on certain
   nodes of the PPIN. The second one is to exploit the asymmetry of random
   walks from X to P and P to X across the PPIN. The third one is to use a
   whole set of restart rates to define a filtration (sequence of nested
   sets) of genes, using saturation indices. The analysis yielded by
   saturation indices shows the ability of Genetrank to progressively
   unveil regions of the PPIN, thanks to absorbing states and our
   renormalization scheme. Instead, classical methods based on random
   walks (with or without restarts) see the whole network in a more
   homogeneous fashion due to its small-world nature. In this context,
   absorbing states force the evaluation of direct connexions between
   sources and targets, and the renormalization scheme makes it possible
   to tone down large weights due to the proximity between sources and
   selected targets. Altogether, these modifications allow Genetrank at
   various restart rates to progressively explore the network, and
   incrementally investigate interactions within a pathway. For gene
   prioritization, our novel ingredients make it possible to perform a
   delicate study of the interplay operating between the different
   parameters defining the RW, providing a stratification of genes of X
   according to their proximity to genes in P.

5.2 Biology

   As a case study, we used Genetrank with a TRAIL-sensitivity gene
   signature obtained from the single-cell functional genomics approach
   using predictive cell dynamics called fate-seq [[174]1]. Here, we show
   that we could enrich the most significant differentially expressed
   genes between predicted sensitive and resistant cells, with genes that
   were experimentally validated for increasing drug treatment efficacy
   (or previously described as doing so).

   The nature and design of large transcriptomic profiling experiments
   (single-cell and bulk) and their analyses, impose a number of limits on
   the biological interpretation of gene expression analyses, especially
   in the context of understanding the determinants of drug sensitivity.
   Firstly, the differential analyses between cell types of varying drug
   sensitivities can be confounded with bystander genes (regarding their
   role in the drug MoA). Secondly, within a cell type, differential
   analyses between treated versus untreated samples lacks specificity
   over genes at the origin of the cell response versus the genes induced
   by the drug in resistant cells (not to mention that sensitive cells are
   rarely recovered in experiments). Moreover, the subsequent analyses
   such as gene set enrichment analysis and pathway-based analyses
   [[175]21, [176]38, [177]39], dependent on prior knowledge of the gene
   functions and their interactions, often determined with the
   aforementioned experimental designs. Gene annotations themselves (from
   Gene Ontology (GO) database for example) might hinder gene-based drug
   sensitivity predictions, by introducing biases related to errors or
   incompleteness due to unknown function, protein moonlighting [[178]40],
   or technical and biological inherent limits [[179]41, [180]42]. Yet,
   gene expression remain a central piece of data in drug sensitivity
   prediction [[181]43, [182]44], providing successful use with cancer
   cell lines to discover gene involved in drug resistance [[183]45] with
   computational methods using prior knowledge [[184]46–[185]50]. Although
   some analyses performed these tasks on basal gene expression [[186]24],
   which aim at harnessing cell states at the origin of drug response (as
   opposed to drug-perturbed gene expression studies), all pursue cell
   lines comparative profiling.

   However, profiling cell lines remains ineffective with respect to
   determining the molecular factors involved in the incomplete –or
   fractional– response of one cell line, due to intrinsic drug resistance
   of non-genetic origins (a phenomenon observed for all drugs at their
   IC[50] in all cell lines). Both single-cell genomics and single-cell
   drug response analyses [[187]8] have revealed a range of heterogeneity
   within isogenic cell populations (within one “cell type”) that has not
   been fully comprehended up to now, for technical reasons [[188]1].
   Therefore, the natural gene expression variability of isogenic cells
   may be referred to as gene expression noise only because of the actual
   deficiency in specific gene sets that define cell states such as
   drug-sensitive or drug-tolerant state (as opposed to a drug-induced
   state in the resistant cells), that could inform on genes likely to
   perturb the MoA of the drug of interest. Therefore, single-cell
   experimental methods to determine the MoA-perturbing genes remain
   critical to increase treatment efficacy (or reduce treatment toxicity).

   Our approach utilizes predicted drug response knowledge from fate-seq
   to rank genes among MoA-perturbing gene signature and associate prior
   knowledge from protein-protein interaction networks to favor protein
   that are connected to the targeted signaling pathway, which may also
   reveal novel biological activities.

5.3 Future work

   Our results suggest that combining same-cell functional
   pharmacogenomics screens such as fate-seq, with gene prioritization
   technique described here, are promising novel methods to improve gene
   definitions in GO with respect to their association to novel drug
   efficacy gene signatures, and help revealing the most effective
   co-targets for combination therapy.

   From the theoretical standpoint, our gene prioritization strategy
   underscore several future challenges, two of which are of direct
   interest in biology and medicine. Given a pair of genes highlighted
   (one source in X, one target in P), the first challenge resides in the
   identification of sets of intermediate nodes accounting for the (high)
   hit score observed between these two nodes. Indeed, such intermediates
   could be used to delineate the biochemistry of interactions (enzymes,
   non covalent interactions, etc), paving the way to quantitative ODE
   based models involving reaction rates and/or affinity constants for
   (sub-) pathways. Specifically, quantitative approaches based on target
   controllable linear systems, which aim at driving an interaction
   network to a desired state based using a linear—ODE based system
   [[189]51, [190]52], could benefit from the interactions and paths
   unveiled by our models. The second challenge relates to the precise
   link between the progressive nature of interactions highlighted by our
   modified diffusion distances, and the hierarchical nature of
   interactions within complex networks. We indeed anticipate that our
   tools will prove useful to unveil certain aspects of multiscale
   interactome models.

Supporting information

   S1 File

   (PDF)
   [191]Click here for additional data file.^ (2.9MB, pdf)

Data Availability

   The source code of the software presented in the paper is freely
   available in the Structural Bioinformatics Library, see
   [192]https://sbl.inria.fr/ and
   [193]https://sbl.inria.fr/doc/Genetrank-user-manual.html.

Funding Statement

   (FC, AJM, DM, JR) Avenir UCA JEDI project, ANR-15-IDEX-01; (FC) the 3IA
   C\^ote d’Azur Investments in the Future project managed by the National
   Research Agency, ANR-19-P3IA-0002; (JR) the INCa Plan Cancer Biologie
   Des Systèmes, ITMO Cancer (proposal IMoDRez, no.18CB001-00).

References