Abstract
Prioritizing genes for their role in drug sensitivity, is an important
step in understanding drugs mechanisms of action and discovering new
molecular targets for co-treatment. To formalize this problem, we
consider two sets of genes X and P respectively composing the gene
signature of cell sensitivity at the drug IC[50] and the genes involved
in its mechanism of action, as well as a protein interaction network
(PPIN) containing the products of X and P as nodes. We introduce
Genetrank, a method to prioritize the genes in X for their likelihood
to regulate the genes in P. Genetrank uses asymmetric random walks with
restarts, absorbing states, and a suitable renormalization scheme.
Using novel so-called saturation indices, we show that the conjunction
of absorbing states and renormalization yields an exploration of the
PPIN which is much more progressive than that afforded by random walks
with restarts only. Using MINT as underlying network, we apply
Genetrank to a predictive gene signature of cancer cells sensitivity to
tumor-necrosis-factor-related apoptosis-inducing ligand (TRAIL),
performed in single-cells. Our ranking provides biological insights on
drug sensitivity and a gene set considerably enriched in genes
regulating TRAIL pharmacodynamics when compared to the most significant
differentially expressed genes obtained from a statistical analysis
framework alone. We also introduce gene expression radars, a
visualization tool embedded in MA plots to assess all pairwise
interactions at a glance on graphical representations of
transcriptomics data. Genetrank is made available in the Structural
Bioinformatics Library
([34]https://sbl.inria.fr/doc/Genetrank-user-manual.html). It should
prove useful for mining gene sets in conjunction with a signaling
pathway, whenever other approaches yield relatively large sets of
genes.
1 Introduction
1.1 Single-cell differential expression analyses
Gene differential expression analyses quantify the changes in gene
expression levels between tested experimental conditions. Although gene
functions can often be derived from this type of analyses, the
associations can be confounded with incidental gene induction.
Therefore interpreting differential expression data with gene set
enrichment analysis (GSEA) and pathway analysis can be misleading
without attentive curation, producing an unfounded link between a gene
differential expression and its functional role in the treatment
response. Single-cell differential expression analyses have elevated
the issue, where gene differential expression between experimental
groups is hindered by gene expression variability between cells.
Although cell-to-cell variability in gene expression is typically
overlooked in most analyses, we and others have observed that these
differences between clonal cells, can impact the overall cell
population response to a stimulation [[35]1–[36]3]. Originating from
stochastic processes such as transcription initiation, cell-to-cell
variability gives rise to an equilibrium of co-existing cellular states
within an isogenic cell population [[37]4, [38]5]. The different phases
of the cell cycles are one illustration of cell states that robustly
proportioned resting cell populations [[39]6], but other functional
cell states can be phenomenologically evidenced by the fractional
response to cytotoxic cancer drugs for example (IC[50], Emax>0, [[40]7,
[41]8]). However, most single-cell technologies are still unable to
access meaningful cell information within clonal populations once cell
cycle states signatures are regressed out, and important functional
cell states remain confounded in gene expression noise [[42]1]. Apart
from cell cycle genes, one hypothesis for the undetected (or
unmeasurable) differences in seemingly homogeneous cell populations is
that they contain cells in a wide variety of possible cell states that
predisposed them to a number of responses or functions such as cell
death, impairing pathway enrichment analyses. To recover the molecular
determinants of clonal cells response to cancer drugs from the measured
gene expression variability, we recently designed a same-cell
functional pharmacogenomics approach, named fate-seq, that couples
prior knowledge on the cell state (predicted drug response) to the
transcriptomic profile of the same cell [[43]1]. With our same-cell
approach, we could reveal the molecular factors regulating the efficacy
of a drug treatment, from differential expression analyses of one
sample of isogenic cells with no gene induction. Although genes
differential expressions can now be linked to their functional role in
drug response using fate-seq, prioritizing genes as best potential
targets for co-treatment remains a difficult task.
1.2 Diffusion distances
Indeed, gene prioritization and protein function prediction are
challenging due to the small-world nature of interaction networks
[[44]9, [45]10], in particular. To go beyond analysis using the direct
neighbors of a node or shortest paths, the value of diffusion distances
has been recognized long ago.
Based on the correlation between the expression profile and the
phenotype, as well as (diffusion) distances, various similarity
measures between genes have been studied [[46]11]. The Diffusion State
Distance (DSD) was defined as the L1 norm of m walks (RW) of k steps
[[47]12]. The DSD was shown to be more effective than shortest-path
distance to transfer functional annotations across nodes in
protein-protein interaction networks (PPIN). The DSD was further
extended to exploit annotations (weights) on edges, and to exploit an
augmented graph incorporating specific interactions [[48]13]. To bias
the random walk towards certain nodes, a random walk restarting (RWR)
at those nodes can be applied, as initially used in the context of
internet surfing [[49]14]. The stationary distribution of the strategy,
called the page rank, depends on the restart probability vector
[[50]15]. In [[51]16], the minimum of the page rank probability between
two nodes is used to qualify the mutual affinity of two proteins in a
network. RWR were also used to predict drug-target interactions on
heterogeneous networks [[52]17], and also for layered/multiplex graphs,
so as to combine complementary pieces of information [[53]18].
Diffusion distances have also been used recently [[54]19] to understand
how drugs treat diseases, using both a network of physical interactions
and a hierarchy of biological functions. However, in the resulting
multiscale interactome, the diffusion profiles are computed using a
standard random walk with restart.
A related topic is the problem of ranking differentially expressed
genes across two conditions [[55]20]. Following the seminal work on
gene set enrichment methods [[56]21], the template of gene set
enrichment analysis (GSEA) consist of three steps, namely computing an
enrichment score for each gene, estimating its statistical
significance, and performing a correction for multiple hypothesis
testing. Setting aside the issue of correlations between genes, such
methods combine feature selection and clustering (one cluster per cell
state/condition) [[57]22], but do not address the question of
connecting two sets of genes using an interaction network.
Finally, yet another related challenge is pathway enrichment analysis
[[58]23]. Given a set of experimentally determined genes and a database
of pathways, the goal here is to find pathways whose genes are
over-represented in the gene set of interest. When pathways are known
exhaustively, such analysis are sufficient to screen gene sets. If this
assumption does not hold, finding intermediates between the gene set
and the molecules in a pathway becomes mandatory.
Adding to other prioritization methods [[59]24], Genetrank utilizes
prior knowledge from functional cell states (transcriptomic profile of
a predicted drug response), protein-protein interactions, and the
expected target signaling pathway of a drug of interest.
1.3 Contributions
We focus on gene prioritization related to a complex phenotype, based
on expression profiles from single-cell RNA-seq. Formally, let X be a
set of proteins associated with differentially expressed (DE) genes,
and P be a set of proteins involved in a signaling pathway. Rather than
to identify DE genes in X that are involved in the pathway of interest,
our goal is to prioritize genes in X because they have been previously
identified to be involved in a pathway of interest. In our case study,
the DE genes have been identified for their effect on cell sensitivity
to TRAIL [[60]1]. This dataset comprised DE genes that are the results
of a drug efficacy screen: it is a gene signature of cell sensitivity
at the drug IC[50]. Therefore the significantly DE genes in this screen
are potential drug co-targets; they are not known to be drug targets
and Genetrank seeks to prioritize them, to suggest the corresponding
gene products as co-targets in combination treatments with the drug
used in the screen. Hence, we prioritize genes in X given the knowledge
of genes in P (which gene products are involved in the drug mechanisms
of action), using an underlying PPIN, to pick out the genes having a
higher likelihood to regulate the pathway. Previous work on diffusion
distances (RW or RWR) has three limitations in this context. First, in
using hit vectors or stationary distributions, all nodes of the network
contribute to the comparison of two sources. Instead, we wish to focus
on nodes in P. Second, RWR use a bias on sources, but in our case, the
sets X and P may be considered on an equal footing, which commands
analysis in both directions, i.e. from X to P and from P to X. Third,
instead of using a single restart rate [[61]18], we study a filtration
(sequence of nested sets) of genes retrieved, in tandem with so-called
saturation indices revealing accessibility scales in the PPIN.
To accommodate this rationale, we present a novel analysis technique
based on random walks with restarts and absorbing states. Recall that
in a Markov chain, an absorbing state is a state which is never exited.
Since stationary distributions are irrelevant in this context [[62]25],
we resort to hitting probabilities for the nodes on the target set P.
As we show, doing so yields scores for pairs of genes in S × T and T ×
TS, from which a ranking of genes in X is defined. As a case study, we
use a dataset of differentially expressed genes involved in the
regulation of a cancer drugs pharmacodynamics [[63]1].
2 Material
2.1 Goal: Formal statement
Consider two sets of genes X and P respectively composing the
predictive gene signature of sensitivity to a drug and the genes
involved in its mechanism of action, as well as a protein interaction
network (PPIN) containing the products of X and P as nodes. We
introduce Genetrank, a method to prioritize the genes in X for their
likelihood to regulate the genes in P.
2.2 Biological problem
The main goal of our approach is to ameliorate the ranking of gene
signatures from differential expression analyses in order to better
select the genes whose products are likely to impact the phenotypic
response of the cells. Using fate-seq, we focus on cancer cells briefly
treated with the TNF-related Apoptosis Inducing Ligand (TRAIL), so for
the drug response to be predicted for each cell that is then profiled
by single-cell RNA sequencing [[64]1]. In such single-cell analyses
performed with isogenic cells treated together, the stable differences
in gene expression between cells from the two groups, namely predicted
sensitive and predicted resistant, are small and otherwise confounded
in gene expression noise. In addition, the short treatment necessary
for the cell response prediction does not induce a detectable genomic
response (Mendeley data Dataset
[65]https://doi.org/10.17632/m289yp5skd.1 [[66]1, [67]26]) that would
confound the functional role of the differentially expressed genes
between the two groups with gene induction, as it is often the case in
other studies.
2.3 Sets X and P
The list of differentially expressed genes obtained in this study
constitutes our set X [[68]1], S2 Table in [69]S1 File. The criteria
used to define the two groups of single-cells compared in the
differential expression analyses giving X, is the activation rate of
caspase-8. Caspase-8 is a protein of the receptor-mediated apoptosis
pathway, which is part of the TRAIL mechanism of action [[70]8].
Therefore, in our case study the set P is a set of target genes, whose
products are proteins of the receptor-mediated apoptosis signaling
pathway.
The sets X and P are of size 320 and 49, respectively. Altogether,
these sets X and P represent a unique dataset to assess the benefit of
our approach in ranking genes based on their likelihood to impact drug
response.
2.4 PPIN
A number of interaction databases coexists, each with specific
features, in particular a trade-off between exhaustivity and
confidence. We use MINT due in particular to the compliance with the
protein naming standards [[71]27].
The PPIN was constructed from interactions downloaded from the MINT
website [72]https://mint.bio.uniroma2.it/. Only proteins with the
species label Homo Sapiens were downloaded. The interacting proteins
are identified in UniProtKB format. The resulting network is called the
MINT network in the sequel. This initial graph, containing 11,672
vertices representing proteins, and 52,839 edges representing
protein—protein interactions, is edited as follows. First, the PPIN
being disconnected, we focus on its largest connected component,
encompassing 11,427 vertices. Second, we remove all self-interactions.
Third, multiple interactions between the same two proteins are
collapsed into a single edge. Summarizing, we obtain a graph with
11,427 vertices and 36,526 edges.
2.5 Sets X and P within the PPIN
Selected genes in the sets X and P being absent from the largest
connected component of the PPIN were removed, finally obtaining sets X
and P of size 227 and 41 (from sets of size 320 and 49 initially).
2.6 Variations on the set X
In order to evaluate the impact of the size difference of X and P on
the scores, we analyse the symmetry H(I) of [73]Eq (7) for instances
involving sets X′ of varying size. Practically, for each s ∈ {|P|, 110,
220} we pick N[r](= 1000) random subsets of X. We then compare the
distributions of H(I) obtained.
3 Methods
3.1 Rationale and positioning with respect to previous work
We consider a set X of experimentally determined genes, and a set P of
genes belonging to a pathway. We introduce methods using random walks
on graphs with two sets of vertices as input, referred to as sources
(S) and targets (T). In order to analyze the (lack of) symmetry between
paths joining X and P and vice-versa, we use our methods twice:
* direction S = X ⇝ T = P: starting from nodes in X to reach nodes in
P;
* direction S = P ⇝ T = X: starting from nodes in P to reach nodes in
X.
To study the relationship between the node sets S and T, our
modifications of diffusion based distances rely on absorbing states and
a renormalization scheme. These modifications are actually motivated by
two structural properties of interaction networks ([74]Fig 1).
Fig 1. Example interaction graph and Markov chain: Two structural properties
motivating absorbing states and the renormalization of hit probabilities.
[75]Fig 1
[76]Open in a new tab
Arcs indicate transitions in the Markov chains—transition probabilities
are omitted. The set of sources is S = {s[1], s[2], s[3]} and the set
of targets is T = {t[1], t[2], t[3], t[4]}. When defining absorbing
states, transitions corresponding to blue arcs are removed. This
implies that t[2] will not be highlighted because t[1] or t[4] must be
visited before any random walk starting from any source s ∈ S.
Furthermore, the normalization aims at reducing the importance of t[3]
in the study (due to its high proximity to the three sources).
The first difficulty owes to a notion of subsidiarity amongst targets,
meaning that if some vertices of T are neighbors (or very close from
one another) in the graph, targets upstream will artificially modify
the weights of targets downstream. Indeed, if a target is just after
another (important) target, then its weight will be large even in the
absence of direct paths from S ([77]Fig 1, target t[4]). In this
context, absorbing states make it possible to stop the exploration
process at such upstream/ancestor nodes, and force direct connexions
from sources to subsidiary targets.
The second difficulty owes to the close vicinity of selected targets to
sources ([78]Fig 1, target t[3]), motivating the introduction of hit
scores (Def. 3). For example, if a target is a neighbor of some sources
in the graph, these hit scores decrease the weights of such direct
paths, stressing the importance of the other non trivial paths. One can
note that this normalization can be done with other functions, e.g. a
distance-based function that could be the average of the |T| lengths of
shortest paths between every source s ∈ S and a given target t.
We now formally introduce our methods.
3.2 Graphs
To connect X and P, we consider a PPIN whose vertices are the
individual molecules, and whose edges represent pairwise interactions.
Such a network is modeled by a vertex-weighted edge-weighted directed
graph G = (V, E). The weight of any vertex u ∈ V is denoted w[u] and
the weight of any edge uv ∈ E is denoted w[uv]. We assume that w[u],
w[uv] ∈ (0, 1] for every u ∈ V and every uv ∈ E. In the unweighted
case, we set w[u] = 1 and w[uv] = 1 for every u ∈ V and every uv ∈ E.
In the undirected case, we have uv ∈ E if and only if vu ∈ E.
Let n = |V| be the number of vertices and let V = {v[1], …, v[n]}. The
set of out-neighbors (resp. in-neighbors) of a vertex u ∈ V is denoted
[MATH:
NG+(
u) :MATH]
(resp.
[MATH:
NG-(
u) :MATH]
).
To analyze paths between vertices of X and vertices of P, we formalize
Markov chains and random walks.
3.3 Random walks and Markov chains with absorbing states and restart
3.3.1 Model
In the sequel, we consider two sets of vertices S and T from the graph
G, with S ∩ T = ∅.
We define a Markov chain for which the set of states is exactly the set
of vertices V. The transition matrix M is defined as follows for every
pair (u, v) ∈ V × TV:
[MATH: M(u,v)={wuv
wv∑<
mrow>v′∈N
G+(u)wuv′<
/msub>wv′
msub>ifuv∈Eandu∉T,1ifu=vandu∈T,0otherwise. :MATH]
(1)
Particular cases of this construction are as follows:
* (Symmetric unweighted case)
[MATH:
M(u,v)
=wu
vwu∑v′∈NG+(u
mi>)wuv′wv′=1
mn>dG(u) :MATH]
(first line of [79]Eq (1)).
* (Symmetric edge-weighted case) [80]Eq (1) also generalizes the
formulae used in [[81]13].
Recall that a state is absorbing if once reached by a walk, it is never
exited. Observe that the set of states T is absorbing in the Markov
chain defined by transition matrix M. We now define the Markov chain
with restart from M and from a subset S′ ⊆ S. Intuitively, for each
vertex u ∈ V\T which is not a target, we add a transition to every
vertex of S′ Formally, given a real number r ∈ [0, 1), the transition
matrix
[MATH:
MS′r :MATH]
is defined as follows for every pair (u, v) ∈ V^2.
[MATH: MS′
r(u,v)={(1-r<
/mi>)wuv<
/msub>wu∑v′∈NG+(u)
wuv′wv′ifuv∈Eandu∉T,r|
S′|
ifu∉Tandv∈S′,1ifu=vandu∈T,0otherwise. :MATH]
(2)
Note that restart transitions may have the same origin/destination as
an existing transition in M. (Equivalently, in graph theoretical terms,
one has two arcs between the same two vertices). In that case, the
probabilities specified in ([82]2) should be added. The transition
matrix M is a particular case of
[MATH:
MS′r :MATH]
when r = 0. Note also that when |S′| = 1, one restarts to a single
vertex.
Definition. 1 (State distribution) Consider an initial distribution
uniform in the set S′. Given any r ∈ [0, 1) and any S′ ⊆ S, the state
distribution at each step i ≥ 0 is denoted
[MATH: π
MS′ri
=πMS′riv1<
/mrow>,…,π
MS′rivn
msub>, :MATH]
with
[MATH:
πMS′<
/msup>r0(u)=0 :MATH]
for every u ∉ S′ and
[MATH:
πMS′<
/msup>r0(u)==1|
S′| :MATH]
for every u ∈ S′.
Under the assumption that, from every u ∈ S′ there exists a path in G
to some v ∈ T, the limit of this vector exists when i → ∞, for every
value of r ∈ [0, 1). We denote it with
[MATH:
πMS′r :MATH]
. The probabilities in this vector are commonly known as hitting
probabilities of the target set T.
Using the target set, we define:
Definition. 2 (Hit probability vector) Let T = {t[1], …, t[k]} be the
target set. Given any r ∈ [0, 1) and any S′ ⊆ S, the hit vector
[MATH:
πMS′r=πMS′r<
/mi>t1
msub>,…,πMS′rtk :MATH]
is composed of the hitting probabilities for states of T.
In the following, we abuse the notation writing
[MATH: Msr :MATH]
instead of
[MATH:
M{s}r :MATH]
. Furthermore, we will write M^r instead of
[MATH: MSr :MATH]
. Finally, we renormalize the vectors with the hit vector of the chain
without restart (i.e. r = 0):
Definition. 3 (Hit score) Given r ∈ [0, 1], define the score from s to
t as
[MATH:
Qr
s,t=πMS′r(t)/πM
0(t). :MATH]
(3)
The hit score vector associated with each source s ∈ S is (Q^(r)(s,
t[1]), …, Q^(r)(s, t[k])).
The log score log Q^(r)(s, t) is the natural logarithm of Q^(r)(s, t).
Finally the rank of the score of a pair (s, t)∈S × T is defined as
follows:
[MATH: rank<
/mi>ST(r)(s,t)<
/mo>=1+|{Q(r)(s′,t′)>Q
(r)(s,
t),s′∈S,t′
∈T,(s′,t′)≠
mo>(s,t)}
|.
:MATH]
(4)
Remark 1 The hit score from [83]Eq 3 incorporates three ingredients,
namely (i) a random walk with restart, (ii) absorbing states, and (iii)
renormalization by the value obtained without restart. We may define
other scores by removing any of these ingredients, e.g. the mechanism
of absorbing states.
3.4 Scores and their symmetry
3.4.1 Scores: X ⇝P versus P ⇝X
In order to analyze the (lack of) symmetry between paths joining X and
P and vice-versa, We apply the previous construction to two settings:
[MATH: {X⇝P:(<
/mo>S=X,T=P
mi>)P⇝X:(<
/mo>S=P,T=X
mi>)
:MATH]
(5)
3.4.2 Instances (PPIN, X, P)
Using a different PPIN, a different experimental gene set X or a
different pathway gene set P will affect the score Q^(r)(x, p) for a
given pair (x, p). In order to make it clearer we define an instance of
execution as the triplet I = (PPIN, X, P), and we refer to the score
obtained under this instance as
[MATH:
QI(r
mi>)(x,p
) :MATH]
. However, for conciseness, we simply denote this score Q^(r)(x, p).
3.4.3 Symmetry at the gene level
Consider an experimental gene x ∈ X and a pathway gene p ∈ P. Using
[84]Eq (3), we assess the asymmetry using the ratio of log scores
[MATH: Rlog(r)(s
,t)=min
(logQ(r<
mo>)(s,t),logQ(<
/mo>r)(t,
mo>s))/max(logQ(<
mi>r)(s,t),logQ(r)(t
mi>,s)). :MATH]
(6)
3.4.4 Symmetry at the instance level
Consider an instance I = (PPIN, X, P). (Because throughout this paper
we make use of a single PPIN (MINT) and the same set P, we will use the
shorthand (X) for the instance). To study the symmetry at the instance
level, we consider the proportion of pairs (x, p) ∈ X × P such that
[MATH:
QI(r
mi>)(x,p
)>QI(r)(p,x) :MATH]
. Formally, denoting
[MATH: 1b :MATH]
the indicator function of the boolean b, the symmetry of the results
for I is given by
[MATH: H(I)=1|X|
*|P|∑(x,p)∈X×P1QI(r)(
x,p)>Q
I(r)
(p,x). :MATH]
(7)
3.5 Genetrank, saturation indices, hits
Using the scores in the two directions X ⇝P versus P ⇝X ([85]Eq 5), we
now define a ranking on the genes of X:
Definition. 4 (Average score) Let 1 ≤ τ ≤ |P| be an integer. Consider a
fixed value of the restart rate r. For a source x, let the average
score be the arithmetic mean over the top τ values max(log Q^(r)(x, p),
log Q^(r)(p, x)) observed for p ∈ P. The gene network ranking
(Genetrank) of genes in X is the ranking associated with the
aforementioned average values. The set of top k genes of the ranking is
denoted
[MATH:
Tτ,k(rl) :MATH]
.
Note that when τ = 1, the ranking of a gene in X is determined by its
largest max score. Averaging scores over τ targets makes intuitive
sense here when identifying whole connectedness to a pathway.
To assess the stability of this ranking, we proceed as follows.
Consider a set of values R = {r[1], …, r[N]}, sorted by increasing or
decreasing value. We define the set of genes found in
[MATH:
Tτ,k(rl) :MATH]
up to a given value r[l], with 1 ≤ l ≤ N, by
[MATH:
Tτ,k(→rl)=∪j=1
,…,lTτ,k(rl). :MATH]
(8)
We now use this set to qualify the speed at which we discover the
sources in X when increasing the upper bound on the restart rate:
Definition. 5 (Saturation indices for an increasing sequence of values
of r.) The saturation index at threshold r[l] is the fraction of
sources present in
[MATH:
|Tτ,
k(→rl)| :MATH]
, that is:
[MATH:
Satτ,
k(→rl)=Tτ,k<
/mrow>(→rl)<
mfenced close="|"
open="|">X≤1. :MATH]
(9)
The relative saturation index is the latter normalized by the value of
k used:
[MATH: Sat¯τ,k(→rl)=Satτ,k(→rl)k :MATH]
(10)
In the absence of overlap between consecutive
[MATH:
Tτ,k(rl) :MATH]
, one would have
[MATH:
|Tτ,
k(→rl)|=
k×l :MATH]
. Thus, normalizing by k provides a measure of the overlap between
consecutive sets.
We note in passing that the previous sets can be used to define how
many hits in a given reference list of genes
[MATH: L :MATH]
are obtained:
Definition. 6 (Hits) Consider a reference list of genes
[MATH: L :MATH]
. The number of hits for particular values (r, k) is the size of the
set
[MATH:
Tτ,k(r)∩L :MATH]
. For a fixed k, we similarly consider the size of the set
[MATH:
Tτ,k(→r)∩L :MATH]
.
Remark 2 Saturation indices for a decreasing sequence of values of r
readily generalize from Eqs [86]8, [87]9 and [88]10. In fact, a larger
(resp. smaller) value of r amounts to zooming in (resp. out) towards
the sources.
3.6 Graphical representations with radar scatter plots
We wish to rank genes from X using genes from P, exploiting the
directions X ⇝P and P ⇝X.
3.6.1 Score radar plots
The difficulty in working with values LogCount(x), LogFoldChange(x) for
x ∈ X represented in an MA plot, is that all pairs (x, p) get mapped
onto the same point. To get around this difficulty, we associate a
radar plot with each point x ∈ X, yielding an overall score radar
scatter plot. Each gene score radar plot is defined as follows:
* the background of the gene radar plot is colored using a heat map
indexed on the largest (X ⇝P or P ⇝X) log score observed for that
gene. This background color makes it easy to spot the individual
radar plots with high scores.
* the gene radar plot has a number of spokes equal to the top k (user
defined) scores.
* on each spoke, two values are found, namely the scores log Q^(r)(x,
p) and log Q^(r)(p, x).
* finally, the radar plot title is set set to the gene name
accompanied by the interval of scores (log scale).
3.6.2 Score radar MA plot
Displaying all individual score radar plots in the
LogCount(x), LogFoldChange(x) plane yields the so-called Score radar MA
plot.
3.7 Complexity and implementation
3.7.1 Complexity
The running time of one instance i.e. computing the hit scores at a
fixed r, depends on the sizes of the PPIN, and of sets X and P.
In contrast with the classical stationary probabilities of ergodic
Markov chains, hit probabilities are dependent on the initial state.
Therefore, where a stationary distribution needs to be computed once
and occupies the space of a vector of dimension n = |V|—the number of
nodes, hit probabilities need to be computed for each initial state
envisioned and they occupy a total space of size |X| × |P|. Exact
formulas for stationary probabilities and for hit probabilities both
involve matrix inversions [[89]28], at a practical cost of order n^3.
They are therefore not practical for large sparse graphs, and iterative
methods are usually preferred [[90]29], leading to an approximate
result with any desired accuracy. The simplest iterative method
actually computes state probability vectors
[MATH:
πMS′
ri
:MATH]
(Def. 1) at successive steps until some convergence criterion is met.
It applies both to stationary probabilities and to hit probabilities.
It produces an ϵ-approximation with a computation time of order
C|E|log(1/ϵ), C being a constant related with the spectrum of the
graph. For Markov chains with restart probability r, this constant
increases with r and we observe this phenomenon in our experiments.
Practically, we compute the hit scores of Def. 3 using the C++ Marmote
library ([[91]30] and [92]https://wiki.inria.fr/MARMOTE/Welcome).
Specifically, we use from Marmote the iterative method which
approximates the result with iterations of order m, with m = |E| the
number of edges. It is faster and also more stable in practice, since
it involves only positive numbers. The stopping criterion chosen for
iterations is that the L[1] distance between successive iterates is
less than 10^−6.
3.7.2 Implementation
The whole pipeline is implemented in the Genetrank package of the
Structural Bioinformatics Library ([[93]31] and
[94]http://sbl.inria.fr), see
[95]https://sbl.inria.fr/doc/Genetrank-user-manual.html.
Practically, processing one instance took a few minutes (<5)
worst-case, on a standard desktop computer (Precision 7920 Tower, 64 Go
of RAM, Intel(R) Xeon(R) Silver 4214 CPU at 2.20GHz; OS: Fedora Core
32).
3.8 Tests: Setup
3.8.1 Contenders
As already noticed (Rmk. 1), the hit score from [96]Eq 3 incorporates
three ingredients, namely (i) a random walk with restart, (ii)
absorbing states, and (iii) renormalization by the value obtained
without restart. To assess the importance of the latter two
ingredients, three contenders of nested complexity are considered in
the sequel:
* pr-affinity: the minimum page rank affinity introduced in [[97]16],
using a plain random walk with restart model.
* Genetrank-renorm: score obtained from a random walk with restart
and renormalization using [98]Eq 3—but no absorbing state.
* Genetrank-AS: score obtained from a random walk with restart and
absorbing states—but without the renormalization of [99]Eq 3.
* Genetrank: score obtained using all ingredients: random walk with
restart, absorbing states, and the renormalization of [100]Eq 3.
3.8.2 Parameters used
The following values are used in our experiments:
* 81 values of r, from r = 0 to r = 0.8 by steps of 0.01,
* three values of τ (Def. 4): τ ∈ {1, 20, 41}, (recall that |P| =
41),
* ten values of k: k ∈ {5, 10, 15, 20, 25, 30, 35, 40, 45, 50},
* saturation indices for r sorted by increasing/decreasing values
(Rmk. 2).
4 Results
4.1 Genetrank and saturation indices
The saturation index makes it possible to study the variation of the
size of the set of genes selected by the three contenders when r
increases or decreases. To promote the vicinity of sources in the PPIN,
we process values r by decreasing value (Rmk. 2). We inspect in turn
the saturation, relative saturation and number of hits (Figs [101]2 and
[102]3, S4, S5 Figs in [103]S1 File). We use the median value τ = 20 to
compute average scores—see the Supplemental for the values τ = 1 and τ
= 40.
Fig 2. (pr-affinity) saturation plots (Def. 5) for τ = 20.
[104]Fig 2
[105]Open in a new tab
Values of r processed in decreasing order. (Left column) Saturation
index and slices at r = cst and k = cst (See [106]Eq 9) (Right column)
Relative saturation index and slices at r = cst and k = cst (See
[107]Eq 10).
Fig 3. (Genetrank) saturation plots (Def. 5) for τ = 20.
[108]Fig 3
[109]Open in a new tab
Values of r processed in decreasing order. (Left column) Saturation
index and slices at r = cst and k = cst (See [110]Eq 9) (Right column)
Relative saturation index and slices at r = cst and k = cst (See
[111]Eq 10).
The methods pr-affinity, Genetrank-renorm, and Genetrank-AS yield a
very similar behavior in two respects. First, the saturation gets
maximum (one) for large values of k, and is relatively insensitive to
the value of r ([112]Fig 2(B), S4(B), S5(B) Figs in [113]S1 File).
Also, the relative saturation drops down to very small values rapidly
([114]Fig 2(E), S4(E), S5(E) Figs in [115]S1 File). In sharp contrast,
the maximum saturation yielded by Genetrank is equal to ∼0.4, and shows
a marked dependence on the restart rate ([116]Fig 3(B)). The relative
saturation is also much more sensitive to the value of k used, with a
coherent behavior as a function of k at fixed values of r ([117]Fig
3(E) and 3(F)).
These observations stress the specificity yielded by the combination of
renormalization and absorbing states in Genetrank. To further these
insights, we proceed with a more detailed analysis of the incidence of
the parameters r and τ:
* (r) The restarting rate r defines the bias towards sources.
Whatever the value of k, in processing values of r in decreasing
values, we observe that the saturation increases when r approaches
zero ([118]Fig 3(C)). The slope of the curves increases for values
of r ≤ 0.1, showing that for such values of the restart rate, the
random walks get to explore a larger region of the PPIN. In
Genetrank, larger values of the restart rate are therefore
instrumental in promoting a more specific exploration.
* (τ) Increasing τ consists of averaging on more targets. This
averaging yields a marked increase of the saturation (whatever the
value of k), especially for small values of r ([119]Fig 3, S6, S7
Figs in [120]S1 File).
Let us now consider the relative saturation ([121]Fig 3(D)–3(F);
S6(D)-S6(F), S7(D)-S7(F) Figs in [122]S1 File). Sections of the surface
at r = cst yield a monotonic behavior, that is the smaller the restart
rate the larger the relative saturation. (We also note that for the
first value of r processed, that is r = 0, 8, the relative saturation
is always equal to 1/|X|∼0.0044.) Interestingly, slices at r = cst are
not monotonic. The valleys crossed on such slices show that increasing
k increases the saturation but not necessarily the relative saturation.
This may be interpreted as accessibility scales in the PPIN.
4.2 Evaluating the effect of varying restart rates on scores, using
experimentally validated gene hits
Consider the list of reference genes (and their products) that had been
experimentally validated following the single-cell functional genomics
approach using predictive cell dynamics [[123]1] [124]O15304 (SIVA1),
[125]P78537 (BLOC1S1), [126]P0CW18 (PRSS56), [127]P53007 (SLC25A1),
[128]Q9Y2X8 (UBE2D4), DNM1L([129]O00429). Note that [130]Q6UW78(UQCC3)
cannot be considered here, as it is not present in our PPIN of
interest. We compute the number of genes from this list found in the
sets
[MATH:
Tτ,k(r) :MATH]
and
[MATH:
|Tτ,
k(→rl) :MATH]
.
We compare the results of the four methods (Figs [131]4 and [132]5, S8,
S9 Figs in [133]S1 File). Consistent with the analysis in the previous
section, the methods pr-affinity and Genetrank-renorm are rather non
specific, with a number of hits yielded by
[MATH:
Tτ,k(r) :MATH]
essentially constant whatever the value of r at a fixed value of k,
whatever the value of τ (S4, S8 Figs in [134]S1 File). The method
Genetrank-AS shows a more contrived behavior, with the same number of
genes uncovered (i.e. 5) for small values of r, yet more variations
when varying r at fixed values of k (S9 Fig in [135]S1 File).
Fig 4. (pr-affinity) hits (Def. 6) for the list of reference genes
[136]O15304 (SIVA1), [137]P78537 (BLOC1S1), [138]P0CW18 (PRSS56), [139]P53007
(SLC25A1), [140]Q9Y2X8 (UBE2D4), DNM1L([141]O00429).
[142]Fig 4
[143]Open in a new tab
Fig 5. (Genetrank) hits (Def. 6) for the list of reference genes [144]O15304
(SIVA1), [145]P78537 (BLOC1S1), [146]P0CW18 (PRSS56), [147]P53007 (SLC25A1),
[148]Q9Y2X8 (UBE2D4), DNM1L([149]O00429).
[150]Fig 5
[151]Open in a new tab
The method Genetrank goes one step beyond, namely discovers genes more
selectively when increasing k and changing the restart rate ([152]Fig
5). Small values of τ allow retrieving three genes using fewer values
of r, clearly showing the specificity/robustness of nodes of X highly
ranked with a large restart rate. Conversely, using larger values of τ
in conjunction with smaller values of r (larger excursions in the PPIN)
allows reporting four genes in total. Except for τ = 1, using large
values of r requires larger values of k to retrieve the known genes:
for τ = 20 and τ = 41, a single gene is retrieved when r > 0.5.
These experiments show the progressiveness yielded by Genetrank is
exploring the PPIN, as opposed to avoid the fast mixing observed in
pr-affinity and Genetrank-renorm, and to a lesser extent Genetrank-AS.
Methods without absorbing states see the whole PPIN in a more
homogeneous way, due to its small-world nature. This behaviour is even
more pronounced in our case, as the target genes form a pathway.
Indeed, a random walk reaching one such node is likely to discover its
neighbors right after (See Comparison to previous work, Sec. 3.3.) When
a absorbing states are used, the random walks halts, and the discovery
of neighbors does not take place. The introduction of absorbing states
therefore appears crucial to localize the exploration of connexions, in
conjunction the choice of the restart rate r—generally taken to r = 0.7
in previous work – see [[153]18] and citations therein.
4.3 Symmetry analysis on a per-source basis: Radar plots
The symmetry ratio H(I) is a global assessment based on all pairs in
the Cartesian product X × TP. For an assessment of the asymmetry on a
per gene basis, we resort to radar plots (Sec. 3.6). Example radar
plots for genes experimentally validated are provided are provided on
[154]Fig 6). The individual radar plots can be assembled in a global MA
plot ([155]Fig 7).
Fig 6. Radar plots for four experimentally validated genes.
[156]Fig 6
[157]Open in a new tab
The range of values covers the range of log scores observed. The two
bullets on a spoke read as follows: blue dot: direction PX ⇝; green
dot: direction XP ⇝.
Fig 7. Score radar scatter plot.
[158]Fig 7
[159]Open in a new tab
The individual radar plots ([160]Fig 6) are assembled in a MA plot.
Zooming on a radar reveals the scores for that gene (80 data points per
gene in a radar).
4.4 Biological analysis
4.4.1 Differentially expressed genes
The set of differentially expressed (DE) genes obtained from the
single-cell multimodal profiling approach fate-seq, allows comparing
transcriptomic profiles between single cells of an isogenic population,
grouped by their predicted drug response [[161]1]. Practically, the DE
genes have been identified for their effect on cell sensitivity at the
drug IC[50], and represent the signature of drug efficacy. (The DE
genes here, are potential combination targets to increase the treatment
efficacy.) Although this response prediction allowed the differential
analysis using the edgeR likelihood ratio test framework [[162]32] by
defining two groups (predicted sensitive vs. tolerant cells), the most
significant DE genes (false discovery rate FDR < 0.1 and |log 2(FC)| >
2) still constitute a list of more than 60 genes. Such a large list
defies the expected practical set of potential targets that would serve
to design co-therapeutic strategies increasing overall treatment
efficacy. In addition, ranking only on differential expression might
underestimate a gene for its potential function as regulator of the
pathway overall activity. As an example, among these gene hits, we have
observed that, at equal distance (the shortest path between the source
and the target gene caspase-8), the noisier the gene expression was,
the larger the effect a gene perturbation had on cell death, and at
comparable expression variability: the longer the shortest path, the
larger the effect [[163]1]. We reasoned that diffusion distances could
also ameliorate ranking of cell-to-cell differential expression
analyses.
We use Genetrank to sort genes for their connectedness to molecular
factors regulating the signaling pathway triggered by the drug of
interest (TRAIL). We intersect the list of the 65 most significant DE
genes obtained by edgeR, with the set of genes obtained with Genetrank
(k = 50, range of values of r: 0.0.8, τ = 41; [164]Eq (8) and S7 Fig in
[165]S1 File), resulting in 18 genes (S1 Table in [166]S1 File).
This gene shortlist presents a number of valuable advantages over the
list of significant DE genes, as it becomes more manageable for
experimental validation, and drug target discovery. In addition we
found that this shortlist contained genes that had been previously
reported to regulate cell death and importantly, it was enriched in
genes that we had experimentally validated for having an effect on
TRAIL response.
4.4.2 Previously validated genes
Out of the 18 genes (S1 Table in [167]S1 File) we prioritized for their
likelihood of having an effect on the drug mechanism of action (MoA)
using Genetrank, we found 4 genes that we had previously validated
experimentally, namely BLOC1S1, DNM1L, UBE2D4, SLC25A1. This result
indicates that we successfully enriched the list with gene hits that
have functional relevance as co-drug targets. Indeed, among the target
genes shortlisted solely based on statistical criteria (differential
expression analyses between predicted resistant and predicted sensitive
cells to TRAIL), only DNM1L, had previous reports on TRAIL sensitivity.
Indeed, DNM1L encoding the dynamin-1-like protein DRP1, involved in
mitochondrial division and apoptosis, has been reported to increase
TRAIL sensitivity regardless of the co-treatment strategy (recruiting
DRP1 at mitochondrial membrane, or inhibiting DRP1), which lead the
authors to suggest that DRP1 might not be the target of mdivi-1, its
originally reported inhibitor [[168]33, [169]34]. In our experimental
screen, cells that were predicted resistant to TRAIL based on low
caspase-8 early activation rates, showed increased DNM1L expression
levels [[170]1]. Also in this recent study, we could show that DNM1L
over expression reduced caspase-8 activation and cell death in response
to TRAIL, and consistently, that DRP1 or dynamins inhibition (using
mdivi-1 or dynasore), both increase caspase-8 activation and cell
death. All together, these results suggest that in addition to the
pro-apoptotic role of DRP1 at the mitochondrial membrane, DRP1 could
play an anti-apoptotic role at the receptor level on caspase-8, further
validating the approach presented here and the relevance of Genetrank.
4.4.3 Novel genes
Genetrank also puts forward some genes that were initially further down
the list and therefore in an unfavorable position to command gene
validation or functional studies. JADE1 for example, has been shown to
promote apoptosis in renal cancer cells [[171]35], and MUL1 [[172]36,
[173]37], which should motivate further experimental investigations.
5 Discussion
5.1 Method
This work presents a gene prioritization method, Genetrank, which can
be coupled with single-cell functional genomics approaches to rank a
set of genes X for their likelihood to regulate the cell signaling
pathway P, which is targeted by the drug of interest (i.e. rank genes
for their impact on drug sensitivity). While diffusion based distances
have been used for several problems in interaction network analysis,
the Genetrank introduces several refinements. The first one is to use
random walks with restarts and absorbing states to focus on certain
nodes of the PPIN. The second one is to exploit the asymmetry of random
walks from X to P and P to X across the PPIN. The third one is to use a
whole set of restart rates to define a filtration (sequence of nested
sets) of genes, using saturation indices. The analysis yielded by
saturation indices shows the ability of Genetrank to progressively
unveil regions of the PPIN, thanks to absorbing states and our
renormalization scheme. Instead, classical methods based on random
walks (with or without restarts) see the whole network in a more
homogeneous fashion due to its small-world nature. In this context,
absorbing states force the evaluation of direct connexions between
sources and targets, and the renormalization scheme makes it possible
to tone down large weights due to the proximity between sources and
selected targets. Altogether, these modifications allow Genetrank at
various restart rates to progressively explore the network, and
incrementally investigate interactions within a pathway. For gene
prioritization, our novel ingredients make it possible to perform a
delicate study of the interplay operating between the different
parameters defining the RW, providing a stratification of genes of X
according to their proximity to genes in P.
5.2 Biology
As a case study, we used Genetrank with a TRAIL-sensitivity gene
signature obtained from the single-cell functional genomics approach
using predictive cell dynamics called fate-seq [[174]1]. Here, we show
that we could enrich the most significant differentially expressed
genes between predicted sensitive and resistant cells, with genes that
were experimentally validated for increasing drug treatment efficacy
(or previously described as doing so).
The nature and design of large transcriptomic profiling experiments
(single-cell and bulk) and their analyses, impose a number of limits on
the biological interpretation of gene expression analyses, especially
in the context of understanding the determinants of drug sensitivity.
Firstly, the differential analyses between cell types of varying drug
sensitivities can be confounded with bystander genes (regarding their
role in the drug MoA). Secondly, within a cell type, differential
analyses between treated versus untreated samples lacks specificity
over genes at the origin of the cell response versus the genes induced
by the drug in resistant cells (not to mention that sensitive cells are
rarely recovered in experiments). Moreover, the subsequent analyses
such as gene set enrichment analysis and pathway-based analyses
[[175]21, [176]38, [177]39], dependent on prior knowledge of the gene
functions and their interactions, often determined with the
aforementioned experimental designs. Gene annotations themselves (from
Gene Ontology (GO) database for example) might hinder gene-based drug
sensitivity predictions, by introducing biases related to errors or
incompleteness due to unknown function, protein moonlighting [[178]40],
or technical and biological inherent limits [[179]41, [180]42]. Yet,
gene expression remain a central piece of data in drug sensitivity
prediction [[181]43, [182]44], providing successful use with cancer
cell lines to discover gene involved in drug resistance [[183]45] with
computational methods using prior knowledge [[184]46–[185]50]. Although
some analyses performed these tasks on basal gene expression [[186]24],
which aim at harnessing cell states at the origin of drug response (as
opposed to drug-perturbed gene expression studies), all pursue cell
lines comparative profiling.
However, profiling cell lines remains ineffective with respect to
determining the molecular factors involved in the incomplete –or
fractional– response of one cell line, due to intrinsic drug resistance
of non-genetic origins (a phenomenon observed for all drugs at their
IC[50] in all cell lines). Both single-cell genomics and single-cell
drug response analyses [[187]8] have revealed a range of heterogeneity
within isogenic cell populations (within one “cell type”) that has not
been fully comprehended up to now, for technical reasons [[188]1].
Therefore, the natural gene expression variability of isogenic cells
may be referred to as gene expression noise only because of the actual
deficiency in specific gene sets that define cell states such as
drug-sensitive or drug-tolerant state (as opposed to a drug-induced
state in the resistant cells), that could inform on genes likely to
perturb the MoA of the drug of interest. Therefore, single-cell
experimental methods to determine the MoA-perturbing genes remain
critical to increase treatment efficacy (or reduce treatment toxicity).
Our approach utilizes predicted drug response knowledge from fate-seq
to rank genes among MoA-perturbing gene signature and associate prior
knowledge from protein-protein interaction networks to favor protein
that are connected to the targeted signaling pathway, which may also
reveal novel biological activities.
5.3 Future work
Our results suggest that combining same-cell functional
pharmacogenomics screens such as fate-seq, with gene prioritization
technique described here, are promising novel methods to improve gene
definitions in GO with respect to their association to novel drug
efficacy gene signatures, and help revealing the most effective
co-targets for combination therapy.
From the theoretical standpoint, our gene prioritization strategy
underscore several future challenges, two of which are of direct
interest in biology and medicine. Given a pair of genes highlighted
(one source in X, one target in P), the first challenge resides in the
identification of sets of intermediate nodes accounting for the (high)
hit score observed between these two nodes. Indeed, such intermediates
could be used to delineate the biochemistry of interactions (enzymes,
non covalent interactions, etc), paving the way to quantitative ODE
based models involving reaction rates and/or affinity constants for
(sub-) pathways. Specifically, quantitative approaches based on target
controllable linear systems, which aim at driving an interaction
network to a desired state based using a linear—ODE based system
[[189]51, [190]52], could benefit from the interactions and paths
unveiled by our models. The second challenge relates to the precise
link between the progressive nature of interactions highlighted by our
modified diffusion distances, and the hierarchical nature of
interactions within complex networks. We indeed anticipate that our
tools will prove useful to unveil certain aspects of multiscale
interactome models.
Supporting information
S1 File
(PDF)
[191]Click here for additional data file.^ (2.9MB, pdf)
Data Availability
The source code of the software presented in the paper is freely
available in the Structural Bioinformatics Library, see
[192]https://sbl.inria.fr/ and
[193]https://sbl.inria.fr/doc/Genetrank-user-manual.html.
Funding Statement
(FC, AJM, DM, JR) Avenir UCA JEDI project, ANR-15-IDEX-01; (FC) the 3IA
C\^ote d’Azur Investments in the Future project managed by the National
Research Agency, ANR-19-P3IA-0002; (JR) the INCa Plan Cancer Biologie
Des Systèmes, ITMO Cancer (proposal IMoDRez, no.18CB001-00).
References