Abstract

   The use of graph theory models is widespread in biological pathway
   analyses as it is often desired to evaluate the position of genes and
   proteins in their interaction networks of the biological systems. In
   this article, we argue that the common standard graph centrality
   measures do not sufficiently capture the informative topological
   organizations of the pathways, and thus, limit the biological
   inference. While key pathway elements may appear both upstream and
   downstream in pathways, standard directed graph centralities attribute
   significant topological importance to the upstream elements and
   evaluate the downstream elements as having no importance.We present a
   directed graph framework, Source/Sink Centrality (SSC), to address the
   limitations of standard models. SSC separately measures the importance
   of a node in the upstream and the downstream of a pathway, as a sender
   and a receiver of biological signals, and combines the two terms for
   evaluating the centrality. To validate SSC, we evaluate the topological
   position of known human cancer genes and mouse lethal genes in their
   respective KEGG annotated pathways and show that SSC-derived
   centralities provide an effective framework for associating higher
   positional importance to the genes with higher importance from a priori
   knowledge. While the presented work challenges some of the modeling
   assumptions in the common pathway analyses, it provides a
   straight-forward methodology to extend the existing models. The SSC
   extensions can result in more informative topological description of
   pathways, and thus, more informative biological inference.

   Keywords: Biological networks, Network analysis, Pathway analysis

Introduction

   Biological pathways represent sets of bio-molecular entities, such as
   genes and proteins, and their cascades of interactions which associate
   with certain cellular functions [[33]1]. The abundance and availability
   of annotated pathways is a key element in bridging the gap between
   molecular level dynamics and high-level biological insight
   [[34]2–[35]5]. Although changes in individual molecules may trigger
   variations in the cellular programs, many biological functions emerge
   from the systematic behaviour of entities and interactions. This
   systems biology concept positions the use of pathways in a significant
   value for discovery, treatment, diagnosis, and prediction in biomedical
   studies [[36]6–[37]9].

   The term “Pathway Analyses” describes a category of models that
   leverage biological interaction networks for the study of
   molecular-level data, such as gene expression. Many of these tools are
   built on the premise of a well-established body of literature which
   indicates that the position of genes/proteins in their associated
   interaction networks can determine their importance in biological
   systems of interest [[38]10–[39]12]. For example, several network-based
   pathway enrichment analysis models (N-PEM) use graph theory concepts to
   prioritize topologically important differential expressions in the
   pathways and produce functional interpretations [[40]13–[41]21].

   Graph centrality models are the premier methods for evaluating the
   topological positions of individual network entities [[42]22]. While
   these models have been successfully utilized in pathway analyses for
   functional interpretation, their abstractions of network organizations
   do not necessarily capture key topological features of pathways,
   suggesting a potential for a more biologically relevant assessment of
   pathways. Biological pathways, particularly signaling pathways, appear
   as an upstream-to-downstream organization, indicating a temporal and
   biochemical order of interactions between associated genes and
   proteins. In a directed graph model, upstream pathway elements are
   mostly represented as nodes with no incoming edges and downstream
   elements are represented as nodes with no out-going edges.
   Subsequently, standard centrality model for directed graphs, such as
   PageRank and Katz, do not assign any topological importance of the
   downstream elements, many of which have been shown to be key elements
   of biological functions.

   The goal of this study is to quantitatively show the limitation of the
   standard centrality models and provide a plausible alternative to
   improve the utility of topological evaluations of pathways. We
   hypothesize that a directed centrality model which accounts for the
   topological position of key elements at downstream and upstream ends of
   pathways can provide a more meaningful characterization of biological
   networks. To achieve our goal, we first formalize the standard
   centrality models into three categories of Source, Sink, and undirected
   frameworks. The Source framework indicates a version of centrality
   models that can capture the importance of a node as a sender of
   information, which relates to the directed graph models used in typical
   pathway analyses. The Sink framework aims to capture to identify
   important receivers of biological information/signals. We then
   introduce Source/Sink centrality (SSC) concept, which is a flexible
   framework that works by applying any centrality model to a graph and
   its transposed graph simultaneously, and combining the two resulting
   profiles. SSC produces a centrality score for each node in a network
   that quantifies the importance of each gene both upstream and
   downstream of a pathway while accounting for the order and the
   direction of the interactions.

   In a recent preliminary study, we reported that the SSC framework of
   common centrality models provides a more informative characterization
   of key pathway elements’ positions in contrast to the standard directed
   models [[43]23]. In particular, we showed that the centrality scores
   produced by SSC have a stronger correlation and more descriptive linear
   relationship with the probability of a gene to be important based on a
   priori known biological functions. In another study, we showed an
   application of SSC modification of Katz centrality in network-based
   enrichment analysis to prioritize the differential expression
   topologically important genes. In that case, we showed that the SSC
   framework produces a more biologically relevant functional
   interpretation of disease genomic data [[44]24]. Following that in a
   recent study, Zaffaroni et al. leveraged the SSC modification of Katz
   centrality for predicting the driver pathways of cellular transition
   [[45]25, [46]26].

   In this study, we expand SSC modeling to multiple spectral centrality
   models and validate it using additional and updated background data. In
   particular, we investigate a battery of standard graph centrality
   models and their SSC extensions for describing the organization of a
   priori known important genes. For a priori important genes, we focus on
   human cancer genes and mouse lethal genes, also known as essential
   genes, i.e. genes whose knockdown results in the death of organisms.
   The rationale for choosing the cancer-related genes is the intuition
   that cancers are regarded as diseases of pathways, i.e. cancers are
   primarily driven by perturbation/alteration of pathways [[47]7,
   [48]27]. Subsequently, the dysfunction of one or more cancer-related
   genes can result in dysfunction of their associated pathways [[49]7].
   Therefore, understanding the topological position of cancer associated
   genes may reveal insight regarding the topological organization of key
   pathway drivers/regulators. The rationale for choosing mouse lethal
   genes is the existence of an extensive literature on the relationship
   between centrality and lethality in protein-protein interaction
   networks, where it has been shown that higher centrality correlates
   with higher probability of being lethal (essential) [[50]28–[51]30].

   From multiple perspectives, we show that the SSC extensions, in
   comparison to the standard models, produce a more descriptive
   topological framework for the positions of cancer gene in human
   pathways, as well as that of essential genes in mouse pathways. These
   results show that the SSC methodology contributes to the biological
   pathway analyses and inference methods by providing a more realistic
   framework for measuring network positions.

Material and methods

Graph modeling of pathways

   Let a directed graph, G=(V,E), represent a pathway where
   [MATH: <mi>V</mi><mo>(</mo><mi>G</mi><mo>)</mo><mo>=</mo><mfenced
   close="}"
   open="{"><mrow><msub><mrow><mi>v</mi></mrow><mrow><mn>1</mn></mrow></ms
   ub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mn>2</mn></mrow></msub
   ><mo>,</mo><mo>…</mo><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>n
   </mi></mrow></msub></mrow></mfenced> :MATH]
   is the set of nodes and
   [MATH: <mi>E</mi><mo>(</mo><mi>G</mi><mo>)</mo><mo>=</mo><mfenced
   close="}"
   open="{"><mrow><msub><mrow><mi>e</mi></mrow><mrow><mn>1</mn></mrow></ms
   ub><mo>,</mo><msub><mrow><mi>e</mi></mrow><mrow><mn>2</mn></mrow></msub
   ><mo>,</mo><mo>…</mo><mo>,</mo><msub><mrow><mi>e</mi></mrow><mrow><mi>m
   </mi></mrow></msub></mrow></mfenced> :MATH]
   is the set of edges. Each edge, e[k]=(v[i],v[j]), is an ordered pair
   that indicates a directed relationship from gene-encoded element v[i]
   to v[j]. A graph can be alternatively represented as an undirected
   graph where the edges are unordered pairs.

   For any graph, the neighborhood of a node v[i], N(v[i]), is the set of
   all adjacent nodes of v[i], N[G](v[i])={v[j]|(v[i],v[j])∈E(G)}. The
   degree of a node is defined as the size of its neighborhood,
   deg(v[i])=|N[G](v[i])|. For a directed graph, the former notion of
   degree is referred to as out-degree, deg^+(v). For a directed graph,
   neighborhood and degree can be also defined based on in-coming edges,
   i.e. in-degree, deg^−(v[i])=|{v[j]∣(v[j],v[i])∈E}|. A graph with n
   vertices has an equivalent representation of a n×nadjacency matrix,
   A[G]. Formally:
   [MATH:
   <msub><mrow><mo>[</mo><msub><mrow><mi>A</mi></mrow><mrow><mi>G</mi></mr
   ow></msub><mo>]</mo></mrow><mrow><mtext
   mathvariant="italic">ij</mtext></mrow></msub><mo>=</mo><mfenced
   open="{"><mrow><mtable><mtr><mtd><mn>1</mn><mo>,</mo></mtd><mtd><mfence
   d close=")"
   open="("><mrow><msub><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub><mo>,</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>j</mi></mrow></msub
   ></mrow></mfenced><mo>∈</mo><mi>E</mi></mtd></mtr><mtr><mtd><mn>0</mn><
   mo>,</mo></mtd><mtd><mtext>otherwise</mtext></mtd></mtr></mtable></mrow
   ></mfenced> :MATH]
   1

   The transpose of a graph, G^T, is a graph with reversed edge
   directions, where V(G^T)=V(G) and E(G^T)={(u,v)|(v,u)∈E(G)}, thus
   [MATH:
   <msub><mrow><mi>A</mi></mrow><mrow><msup><mrow><mi>G</mi></mrow><mrow><
   mi>T</mi></mrow></msup></mrow></msub><mo>=</mo><msubsup><mrow><mi>A</mi
   ></mrow><mrow><mi>G</mi></mrow><mrow><mi>T</mi></mrow></msubsup> :MATH]
   . A graph centrality is a function, C(v),
   [MATH:
   <mi>C</mi><mo>:</mo><mi>V</mi><mo>(</mo><mi>G</mi><mo>)</mo><mo>→</mo><
   mi>ℝ</mi> :MATH]
   , for describing a topological scoring (importance) of the nodes in a
   network [[52]22].

   Degree Centrality of each node is the size of its neighborhood. Studies
   have shown that the degree of nodes in protein-protein interaction
   networks of different organisms correlates with their essentiality,
   meaning the likelihood of a protein’s removal, e.g. knockdown, to be
   lethal for the model organism [[53]10, [54]29, [55]30]. Here, we
   calculated degree centrality as the sum of in-degree and out-degree,
   which the same as the degree centrality in the underlying undirected
   graph:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">deg</mtext></mrow></msub><mo>(</mo><mi>v</mi><mo>)
   </mo><mo>=</mo><mtext
   mathvariant="italic">de</mtext><msup><mrow><mi>g</mi></mrow><mrow><mo>+
   </mo></mrow></msup><mo>(</mo><mi>v</mi><mo>)</mo><mo>+</mo><mtext
   mathvariant="italic">de</mtext><msup><mrow><mi>g</mi></mrow><mrow><mo>−
   </mo></mrow></msup><mo>(</mo><mi>v</mi><mo>)</mo> :MATH]
   2

   PageRank Centrality is a spectral centrality measure where the
   importance of a node is a function of the centrality of its neighbors.
   In its original definition, PageRank describes the probability
   distribution of a uniform random walk with restart being present at
   each node of a graph after a large number of steps [[56]22, [57]31,
   [58]32]. In graph theory terms, the PageRank of a node v is based on
   the PageRank of the nodes with links to v, divided by their out
   degrees. Formally:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow></msub><mo>(</mo><msub><mrow><mi
   >v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>=</mo><msub><
   mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><mi>α</mi
   ><munder><mrow><mo>∑</mo></mrow><mrow><msub><mrow><mi>v</mi></mrow><mro
   w><mi>j</mi></mrow></msub><mo>|</mo><msub><mrow><mi>v</mi></mrow><mrow>
   <mi>i</mi></mrow></msub><mo>∈</mo><msub><mrow><mi>N</mi></mrow><mrow><m
   i>G</mi></mrow></msub><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>
   j</mi></mrow></msub><mo>)</mo></mrow></munder><mfrac><mrow><msub><mrow>
   <mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow></msub><mo>(</mo><msub><mrow><mi
   >v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo></mrow><mrow><mo>
   |</mo><msub><mrow><mi>N</mi></mrow><mrow><mi>G</mi></mrow></msub><mo>(<
   /mo><msub><mrow><mi>v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</m
   o><mo>|</mo></mrow></mfrac> :MATH]
   3

   β[i]’s are constant values that relate the probability of restarting at
   node v[i]. The parameter α is a dampening factor that relates to the
   transition probability of the random walk. The Formula [59]3 can be
   expressed in a vectorized format as following:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow></msub><mo>=</mo><mi>β</mi><mo>+
   </mo><mi>α</mi><msup><mrow><mi>A</mi></mrow><mrow><mi>T</mi></mrow></ms
   up><msup><mrow><mi>D</mi></mrow><mrow><mo>−</mo><mn>1</mn></mrow></msup
   ><msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow></msub> :MATH]
   4

   where C[pgr] is the vector of centralities and β is the vector of
   initial values. D is the diagonal (out) degree matrix such that
   [D][ii]=max(deg^(+)(v[i]),1). A closed form solution of Formula [60]3
   is achieved by solving for C[pgr] [[61]22]. Formally:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow></msub><mo>=</mo><msup><mrow><mf
   enced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>α</mi><msup><mrow><mi>A</mi></mr
   ow><mrow><mi>T</mi></mrow></msup><msup><mrow><mi>D</mi></mrow><mrow><mo
   >−</mo><mn>1</mn></mrow></msup></mrow></mfenced></mrow><mrow><mo>−</mo>
   <mn>1</mn></mrow></msup><mi>β</mi> :MATH]
   5

   PageRank can be used for both directed and undirected graphs. Closely
   related notions of PageRank have been used in applications of pathway
   analysis [[62]14, [63]17].

   We define the PageRank Sink centrality as the standard PageRank of a
   directed graph. The original concept of PageRank, as described by Brin
   and Page, measures the importance of a website based on the importance
   of the websites that have a link to it [[64]31]. Likewise, in the Sink
   component of the PageRank, the downstream nodes have the higher
   importance. This is because a random walk will not be present at any
   node without incoming edges, unless by a restart event. The PageRank
   Sink centrality captures the importance of a node as a receiver of
   information. Formally we define the Sink PageRank centrality (
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup> :MATH]
   ) as:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>:</mo><mo>=</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow></msub><mo>(</mo><mi>v</mi><mo>)
   </mo> :MATH]
   6

   To modify PageRank in such a way that captures the importance of nodes
   as source of signal, we derive a PageRank score when applied to the
   transpose of a graph. Formally, we define the PageRank Source
   [MATH: <mfenced close=")"
   open="("><mrow><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup></mrow></mfenced>
   :MATH]
   as:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>(</mo><msub><mrow><
   mi>v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>=</mo><msub
   ><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><mi>α</
   mi><munder><mrow><mo>∑</mo></mrow><mrow><msub><mrow><mi>v</mi></mrow><m
   row><mi>j</mi></mrow></msub><mo>∣</mo><msub><mrow><mi>v</mi></mrow><mro
   w><mi>i</mi></mrow></msub><mo>∈</mo><msub><mrow><mi>N</mi></mrow><mrow>
   <msup><mrow><mi>G</mi></mrow><mrow><mi>T</mi></mrow></msup></mrow></msu
   b><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>j</mi></mrow></msub>
   <mo>)</mo></mrow></munder><mfrac><mrow><msubsup><mrow><mi>C</mi></mrow>
   <mrow><mtext mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>(</mo><msub><mrow><
   mi>v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo></mrow><mrow><m
   o>|</mo><msub><mrow><mi>N</mi></mrow><mrow><msup><mrow><mi>G</mi></mrow
   ><mrow><mi>T</mi></mrow></msup></mrow></msub><mo>(</mo><msub><mrow><mi>
   v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo><mo>|</mo></mrow><
   /mfrac> :MATH]
   7

   β[i] and α are constants that relate to the restart and transition
   probabilities. The PageRank Source of a node is calculated based on the
   centrality of a its neighbors in the transposed graphs. Define the
   diagonal in-degree matrix, D^′, of G such that
   [D^′][ii]=max(1,deg^−(v[i])). Similar to the equations for deriving the
   standard PageRank, the Source component can be solved as following:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>=</mo><msup><mrow><
   mfenced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>αA</mi><msup><mrow><msup><mrow><
   mi>D</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow><mrow><mo>−</mo><m
   n>1</mn></mrow></msup></mrow></mfenced></mrow><mrow><mo>−</mo><mn>1</mn
   ></mrow></msup><mi>β</mi> :MATH]
   8

   Directed centralities only gives importance to either upstream nodes or
   downstream ones. To address this issue we define the Source/Sink
   PageRank. The fundamental concept of Source/Sink modeling is to measure
   the centrality of nodes as both sources and sinks of information. We
   adapt the Source/Sink concept to the PageRank by calculating Source and
   Sink Centrality values individually and summing them:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">SS</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>=</mo><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>+</mo><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo> :MATH]
   9

   The above definition has no limitation of using different constant
   parameters for
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup> :MATH]
   and
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">pgr</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup> :MATH]
   , this study uses the same values of α and β for both components.

   Katz Centrality is another spectral centrality model where the
   importance of a node is calculated relative to the sum of centrality of
   its neighbors. Formally:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow></msub><mo>(</mo><msub><mrow><mi
   >v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>=</mo><msub><
   mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><mi>α</mi
   ><munder><mrow><mo>∑</mo></mrow><mrow><msub><mrow><mi>v</mi></mrow><mro
   w><mi>j</mi></mrow></msub><mo>∈</mo><msub><mrow><mi>N</mi></mrow><mrow>
   <mi>G</mi></mrow></msub><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><m
   i>i</mi></mrow></msub><mo>)</mo></mrow></munder><msub><mrow><mi>C</mi><
   /mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow></msub><mo>(</mo><msub><mrow><mi
   >v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo> :MATH]
   10

   In the above formula, β is a constant factor and α is dampening factor.
   The convergence of the Formula [65]10 depends on the largest eigenvalue
   of the adjacency matrix. In particular, α<1/λ[1] is a sufficient
   condition for convergence, with λ[1] being the largest positive
   eigenvalue of the adjacency matrix. Rearranging Formula [66]10 gives a
   closed form solution of Katz centrality. Formally:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow></msub><mo>=</mo><msup><mrow><mf
   enced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>αA</mi></mrow></mfenced></mrow><
   mrow><mo>−</mo><mn>1</mn></mrow></msup><mi>β</mi> :MATH]
   11

   where C[ktz] is the vector of centrality values. Katz centrality is
   closely related to the formulations of Cdist and NetGSA for pathway
   enrichment analysis [[67]16, [68]18, [69]24]. Throughout this document,
   Katz centrality refers to the directed graph. Although Katz centrality
   is well defined on undirected graphs we will not analyze this model on
   the dataset of this study since it would impose a small global α. We
   define Source Katz (
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup> :MATH]
   ) component as the standard Katz centrality of a directed graph.
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>:</mo><mo>=</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow></msub><mo>(</mo><mi>v</mi><mo>)
   </mo> :MATH]
   12

   Next, we define the Sink Katz
   [MATH: <mfenced close=")"
   open="("><mrow><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup></mrow></mfenced>
   :MATH]
   component as the Katz centrality of the transposed graph. In this
   context, the centrality of a node relates to the centrality of its
   neighbors in the transposed graph. Formally:
   [MATH: <mtable><mtr><mtd><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>(</mo><msub><mrow><
   mi>v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo></mtd><mtd><mo>
   :</mo><mo>=</mo><msub><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></m
   sub><mo>+</mo><mi>α</mi><munder><mrow><mo>∑</mo></mrow><mrow><msub><mro
   w><mi>v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>∈</mo><msubsup><mr
   ow><mi>N</mi></mrow><mrow><mi>G</mi></mrow><mrow><mi>T</mi></mrow></msu
   bsup><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub><mo>)</mo></mrow></munder><msubsup><mrow><mi>C</mi></mrow><mrow><mte
   xt mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>(</mo><msub><mrow><
   mi>v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo></mtd></mtr><mt
   r><mtd><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup></mtd><mtd><mo>=</mo><m
   sup><mrow><mfenced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>α</mi><msup><mrow><mi>A</mi></mr
   ow><mrow><mi>T</mi></mrow></msup></mrow></mfenced></mrow><mrow><mo>−</m
   o><mn>1</mn></mrow></msup><mi>β</mi></mtd></mtr></mtable> :MATH]
   13

   In a similar fashion to Source/Sink PageRank. we define the Source/Sink
   Katz as the direct summation of the two individual Source and Sink
   components. Katz Source/Sink Centrality is then defined as:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">SS</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>=</mo><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>+</mo><msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">ktz</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo> :MATH]
   14

   Although the above definition has no limitation of using different
   constant parameters for individual Source and Sink centralities, this
   study uses the same values of α and β for both components. Also, it can
   be shown that Source and Sink components have the same convergence
   criteria when using the same β and α.

   Laplacian graph influence measures are a family of models that capture
   the amount of effect a node has on the other nodes. These measures are
   the core of the heat diffusion kernels of graphs as they relate to the
   Laplacians of the graph, D−A [[70]32–[71]35]. Graph Laplacians are
   generally defined for undirected graphs [[72]32, [73]35]. There are
   modifications for directed graphs either on strongly connected graphs
   or directed acyclic graphs [[74]34, [75]36]. In this study, we use a
   specific version for directed graphs that is related to the model used
   by Shojaie and Michailidis for pathway enrichment analysis (proof of
   model equivalence in the [76]Appendix)[[77]19]. Though this model is
   not discussed in the literature with any specific names, to the best of
   our knowledge, we will refer to it as Laplacian Centrality, for the
   lack of a better terminology. The Laplacian model in this study
   indicates that the centrality of a node relates to the average
   centrality of its neighbors. Formally:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow></msub><mo>(</mo><msub><mrow><mi
   >v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>=</mo><msub><
   mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>+</mo><mi>α</mi
   ><munder><mrow><mo>∑</mo></mrow><mrow><msub><mrow><mi>v</mi></mrow><mro
   w><mi>j</mi></mrow></msub><mo>∈</mo><msub><mrow><mi>N</mi></mrow><mrow>
   <mi>G</mi></mrow></msub><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><m
   i>i</mi></mrow></msub><mo>)</mo></mrow></munder><mfrac><mrow><msub><mro
   w><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow></msub><mo>(</mo><msub><mrow><mi
   >v</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>)</mo></mrow><mrow><mo>
   |</mo><mi>N</mi><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>i</mi>
   </mrow></msub><mo>)</mo><mo>|</mo></mrow></mfrac> :MATH]
   15

   By expressing the above formulation in matrix format and rearranging
   for the vector of centralities we get
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow></msub><mo>:</mo><mo>=</mo><msup
   ><mrow><mfenced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>α</mi><msup><mrow><mi>D</mi></mr
   ow><mrow><mo>−</mo><mn>1</mn></mrow></msup><mi>A</mi></mrow></mfenced><
   /mrow><mrow><mo>−</mo><mn>1</mn></mrow></msup><mi>β</mi> :MATH]
   16

   We define the Laplacian Source component as the Laplacian centrality of
   the directed graph:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>:</mo><mo>=</mo><ms
   ub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow></msub> :MATH]
   17

   Similar to the other measures, we define the Laplacian Sink component
   is the Laplacian centrality of the transposed graph:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>:</mo><mo>=</mo><ms
   up><mrow><mfenced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>α</mi><msup><mrow><msup><mrow><m
   i>D</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow><mrow><mo>−</mo><mn
   >1</mn></mrow></msup><msup><mrow><mi>A</mi></mrow><mrow><mi>T</mi></mro
   w></msup></mrow></mfenced></mrow><mrow><mo>−</mo><mn>1</mn></mrow></msu
   p><mi>β</mi> :MATH]
   18

   The Source/Sink Laplacian is then defined as the sum of the two
   components:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow><mrow><mtext
   mathvariant="italic">SS</mtext></mrow></msubsup><mo>:</mo><mo>=</mo><ms
   ubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>+</mo><msubsup><mro
   w><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup> :MATH]
   19

   The Undirected Laplacian is trivial for the connected components of the
   graphs since all members of a component can have any equal value for
   centrality. Therefore, the Laplacian model defined in this report will
   be only used in directed formats.

   The definition of Source/Sink models can be generalized into a format
   where the contribution of the Source and the Sink components to the
   total centrality value are weighted. Interested readers may refer to
   [[78]23, [79]24] for examples of the weighted definitions– where we
   have shown that equal weights (as assumed in this article) can generate
   a SSC value that is most different the individual Source and Sink
   components.

Background pathways and genes

Human pathways and cancer genes

   Human pathways from Kyoto Encyclopedia of Genes and Genomes (KEGG) were
   retrieved (n = 330, April 2019). We pre-processed pathways and excluded
   the entries that exhibited 1- more than 1000 nodes and more than 4000
   interactions (n = 2) 2- less or equal to 20 nodes or 20 edges (n = 86)
   3- largest eigenvalues larger than 10 (n = 16). In addition, pathways
   with a single unique value for any of the centrality measures (e.g. all
   degrees being 10) were excluded from the analysis, resulting in 216
   pathways passing the initial quality criteria.

   Human cancer genes were retrieved from relevant classifications of
   Broad Institute’s MSigDB: Oncogenes, Tumor Suppressors, and
   Translocated cancer genes (n = 417, June 2018) [[80]37]. Cancer Gene
   Census from Sanger Institute was used as an additional reference list
   for cancer-related genes (n= 719, June 2018) [[81]38].

   Pathways with 5 or less cancer associated genes were excluded from
   analysis for consistency of p-value calculations (n = 61). The final
   set of pathways contained 157 entries. The pathways were retrieved and
   analyzed using R-packages “KEGGGraph” and “Pathview” [[82]39, [83]40].

Mouse pathways and lethal genes

   Mouse pathways were retrieved from KEGG databased (n= 326, April 2019).
   We used the same quality criteria (edges, nodes, and eigenvalue
   limits), resulting in 219 pathways. Mouse lethal genes were retrieved
   from International Mouse Phenotyping Consortium (IMPC) through its
   online portal (n = 1053, June 2018) [[84]41]. The IMPC’s gene family
   classifications of the related genes were Lethal, Viable, and
   Sub-viable.

   Pathways with 5 or less Lethal genes were excluded from analysis for
   consistency of p-value calculations. The final set of pathways
   contained 123 entries.

Experimental methods and analyses

   We examine different formats of each centrality model through three
   aspects. 1– The linear relationship between the centrality scores and
   the percentage of the genes that are biologically important. 2– The
   distribution of centrality scores of important genes and other genes
   (normal). 3– The mean difference between the centrality scores of
   important genes versus normal genes for each pathway. Since the
   subjects of study are multiple pathways, rather than a single global
   graph, normalization and ranking procedures were used to create a
   unified framework.

   This study uses β=0.15·𝟙[n×1], and α=0.85 for different formats of
   PageRank, as previously recommended in the original PageRank paper
   [[85]31]. For all Katz centrality formats, the parameter setting was
   α=0.1, β=𝟙[n×1]. Katz models parameters are restricted to be smaller
   than one over the largest eigenvalue of a graph [[86]24], and thus, we
   chose the aforementioned parameter to allow for analysis of a
   reasonable number of pathways. We did not analyze for Undirected Katz
   because of limitation of the largest eigenvalues. For all Laplacian
   centrality formats, the parameters were β=𝟙[n×1], and α=0.85. The
   Laplacian model parameters were chosen to be consistent with PageRank
   parameters.

Regression analysis

   Our regression analysis pipeline initially ranks the node of each
   pathway using one centrality measure at a time. The pipeline then
   aligns the centrality ranks of within each pathway through 100
   quantiles. The 100th quantile indicates most central genes in a pathway
   and 1st quantile indicates the lowest importance. Formally, let
   C[a,j](v[i]) denote the centrality of a node v[i] in pathway j using
   model a. The quantile ranking of a node i, Q[j](v[i]), is then defined
   as:
   [MATH:
   <msub><mrow><mi>Q</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>(</mo><m
   sub><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>
   =</mo><mfenced close="]"
   open="["><mrow><mn>100</mn><mo>×</mo><mfrac><mrow><msub><mrow><mi>C</mi
   ></mrow><mrow><mi>a</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>(</mo><ms
   ub><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo></mro
   w><mrow><mo>|</mo><msub><mrow><mi>V</mi></mrow><mrow><mi>j</mi></mrow><
   /msub><mo>|</mo></mrow></mfrac></mrow></mfenced> :MATH]
   20

   In the above formula, V[j] is the set of nodes in pathway j. The
   quantile ranking allows to compare the centrality rankings among all
   pathways because different pathways have different number of nodes by
   assigning the most central node in the highest quantile score. To
   investigate the relationship between a priori importance of a gene and
   its centrality, the proportion of important genes were calculated on
   each quantile across all pathways, with the important genes coming from
   the cancer genes and lethal genes. The relationship between the
   centrality and importance were investigated separately based on the
   gene type, once for cancer genes in human pathways and once for lethal
   genes in mouse pathways.

   Let Q[ij] denote the set of genes belonging to i-th quantile in pathway
   j — Q[ij]={v∣v∈V[j],Q[j](v)=i}. Let R denote the set of all genes in a
   class of a priori important genes, i.e., cancer or lethal. We define
   the percentage of the a priori important genes in i-th quantile,
   F[i]^c, as:
   [MATH:
   <msup><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>i</mi></mrow></msub>
   </mrow><mrow><mi>c</mi></mrow></msup><mo>=</mo><mn>100</mn><mo>×</mo><m
   frac><mrow><munder><mrow><mo>∑</mo></mrow><mrow><mi>j</mi></mrow></mund
   er><mfenced close="|" open="|"><mrow><mfenced close="}"
   open="{"><mrow><mi>v</mi><mo>∣</mo><mi>v</mi><mo>∈</mo><mi>R</mi><mo>∩<
   /mo><msub><mrow><mi>Q</mi></mrow><mrow><mtext
   mathvariant="italic">ij</mtext></mrow></msub></mrow></mfenced></mrow></
   mfenced></mrow><mrow><munder><mrow><mo>∑</mo></mrow><mrow><mi>j</mi></m
   row></munder><mfenced close="|" open="|"><mrow><mfenced close="}"
   open="{"><mrow><mi>v</mi><mo>∣</mo><mi>v</mi><mo>∈</mo><msub><mrow><mi>
   Q</mi></mrow><mrow><mtext
   mathvariant="italic">ij</mtext></mrow></msub></mrow></mfenced></mrow></
   mfenced></mrow></mfrac> :MATH]
   21

   Although some genes were occurring in multiple pathways, each
   occurrence was treated as an unique gene because the purpose was to
   evaluate the centrality with respect to pathways. F[i]^c was then
   tested against the level of quantile for assessing linear
   relationships. In the below formula, i indicates the index value of a
   quantile group, e.g. 1 for the 1st quantile and 10 for the 10th
   quantile. Let a[1] and a[0] be the coefficients of the linear
   regression. Formally:
   [MATH:
   <msup><mrow><msub><mrow><mi>F</mi></mrow><mrow><mi>i</mi></mrow></msub>
   </mrow><mrow><mi>c</mi></mrow></msup><mo>=</mo><msub><mrow><mi>a</mi></
   mrow><mrow><mn>1</mn></mrow></msub><mo>·</mo><mi>i</mi><mo>+</mo><msub>
   <mrow><mi>a</mi></mrow><mrow><mn>0</mn></mrow></msub> :MATH]
   22

   For each centrality measure the above linear regression was fitted and
   the adjusted r-squared (coefficient of determination) were evaluated.
   The above procedures were applied to lethal mouse genes and human
   cancer genes separately in their respective annotated and pre-processed
   pathways.

Comparison of cumulative densities

   To compare the distribution of centrality values from a global
   perspective, the centrality scores were normalized within each pathway
   using the following formula:
   [MATH:
   <msub><mrow><mi>N</mi></mrow><mrow><mi>a</mi><mo>,</mo><mi>j</mi></mrow
   ></msub><mo>(</mo><msub><mrow><mi>v</mi></mrow><mrow><mi>i</mi></mrow><
   /msub><mo>)</mo><mo>=</mo><mfrac><mrow><msub><mrow><mi>C</mi></mrow><mr
   ow><mi>a</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>(</mo><msub><mrow><m
   i>v</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>−</mo><msub>
   <mrow><mi>μ</mi></mrow><mrow><mi>a</mi><mo>,</mo><mi>j</mi></mrow></msu
   b></mrow><mrow><msub><mrow><mi>σ</mi></mrow><mrow><mi>a</mi><mo>,</mo><
   mi>j</mi></mrow></msub></mrow></mfrac> :MATH]
   23

   where μ[a,j] and σ[a,j] are the mean and standard deviation of
   centrality scores of pathway j using method a. Accordingly,
   N[a,j](v[i]) is the normalized centrality score of node v[i] in pathway
   j, using the centrality method a. The normalized score for all pathways
   were placed in 100 quantiles. The distribution of quantile scores for
   the types of genes “Important” and “Others” were compared by
   Kolmogorov-Smirnov (KS) test on cumulative distribution function (CDF)
   of important and normal genes. The p-values were calculated based on
   the alternative hypothesis of the CDF of the important genes lying
   below that of the normal. In this test, the CDF of all genes combined
   would follow a straight line. The described procedure was ran
   separately for mouse lethal genes and human cancer genes on their
   respective pre-processed pathways.

Within pathways two-Sample testing

   For each pathway, the difference of the mean raw centrality values
   between a priori important genes and other genes were evaluated using
   Welch’s t-test. Formally:
   [MATH: <mtable><mtr><mtd><mi>t</mi><mspace
   width="1em"></mspace></mtd><mtd><mo>=</mo><mfrac><mrow><mspace
   width="1em"></mspace><mspace width="1em"></mspace><mspace
   width="1em"></mspace><msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >a</mi><mo>,</mo><mi>c</mi></mrow></msub><mo>−</mo><msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >a</mi><mo>,</mo><mi>n</mi></mrow></msub></mrow><mrow><msqrt><mrow><mfr
   ac><mrow><msubsup><mrow><mi>s</mi></mrow><mrow><mi>a</mi><mo>,</mo><mi>
   c</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mrow><msub><mrow>
   <mi>N</mi></mrow><mrow><mi>c</mi></mrow></msub></mrow></mfrac><mo>+</mo
   ><mfrac><mrow><msubsup><mrow><mi>s</mi></mrow><mrow><mi>a</mi><mo>,</mo
   ><mi>n</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow><mrow><msub><
   mrow><mi>N</mi></mrow><mrow><mi>n</mi></mrow></msub></mrow></mfrac></mr
   ow></msqrt></mrow></mfrac></mtd></mtr><mtr><mtd><msub><mrow><mi>H</mi><
   /mrow><mrow><mn>0</mn></mrow></msub></mtd><mtd><mo>:</mo><mspace
   width="1em"></mspace><mspace width="1em"></mspace><mspace
   width="1em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>a</mi><mo>,
   </mo><mi>c</mi></mrow></msub><mo>=</mo><msub><mrow><mi>μ</mi></mrow><mr
   ow><mi>a</mi><mo>,</mo><mi>n</mi></mrow></msub></mtd></mtr><mtr><mtd><m
   sub><mrow><mi>H</mi></mrow><mrow><mi>A</mi></mrow></msub></mtd><mtd><mo
   >:</mo><mspace width="1em"></mspace><mspace
   width="1em"></mspace><mspace
   width="1em"></mspace><msub><mrow><mi>μ</mi></mrow><mrow><mi>a</mi><mo>,
   </mo><mi>c</mi></mrow></msub><mo>></mo><msub><mrow><mi>μ</mi></mrow><mr
   ow><mi>a</mi><mo>,</mo><mi>n</mi></mrow></msub></mtd></mtr></mtable>
   :MATH]
   24

   where
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >a</mi><mo>,</mo><mi>c</mi></mrow></msub> :MATH]
   and
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>μ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >a</mi><mo>,</mo><mi>n</mi></mrow></msub> :MATH]
   are the estimated means of centrality values for cancer and normal
   genes by model a. Similarly,
   [MATH:
   <msubsup><mrow><mi>s</mi></mrow><mrow><mi>a</mi><mo>,</mo><mi>c</mi></m
   row><mrow><mn>2</mn></mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>s</mi></mrow><mrow><mi>a</mi><mo>,</mo><mi>n</mi></m
   row><mrow><mn>2</mn></mrow></msubsup> :MATH]
   are the variance estimates of the centrality scores of important and
   normal genes, using model a. N[c] and N[n] denote the sample size of
   important genes and normal genes. H[0] is the null hypothesis of
   important and normal genes having the same mean. H[A] is the
   alternative hypothesis where the important genes have a higher mean.

   Since the underlying distribution of the centrality values is unknown,
   we also used Wilcox non-parametric test to evaluate the null hypothesis
   of important and non-important genes having the same mean. Wilcox test
   ranks individual observations and evaluates the difference between the
   sum of the ranking in two classes of the hypothesis. While Wilcox test
   is the more appropriate choice for testing this study’s hypotheses, we
   present both parametric (Welch) and non-parametric (Wilcox) two-sample
   tests for comparison.

   For each centrality model, the p-values from Formula [87]24 and Wilcox
   test were calculated across all pathways. Because of the large number
   of pathways, multiple-hypothesis testing corrected criterion was used
   to determine significant p-values. In particular, Benjamini-Hochberg
   False Discovery Rate was applied to all calculated p-values for each
   centrality method to control type-I error at %5 (FDR<0.05) [[88]42].
   The same procedure was applied to both parametric and non-parametric
   approaches. The sets of significant pathways for each centrality model
   were contrasted against each other. The described procedure in this
   subsection was applied to human cancer genes and mouse lethal genes in
   separate analyses.

Results

Pathway centrality of human cancer genes

   The regression analysis shows an evident increase in the percentage of
   cancer genes with the increased centrality scores from Degree, Katz
   Source/Sink, Laplacian Source/Sink, and all PageRank formats
   (Fig. [89]1, Table [90]1). For all of the other models, the analysis
   fails to identify any linear relationship between the centrality scores
   and the percentage.

Fig. 1.

   Fig. 1
   [91]Open in a new tab

   Linear regression fit of the quantile-normalized centrality scores (Eq.
   [92]20) and the percentage human pathway genes that are cancer-related
   (Eq. [93]21). The Source/Sink extension of the centrality models show
   higher slope and adjusted coefficient of determination (Adjusted
   r-squared) in comparison to the standard variations of the centrality
   models (Table [94]1)

Table 1.

   Linear regression fit of the quantile-normalized centrality scores (Eq.
   [95]20) and the percentage human pathway genes that are cancer-related
   Centrality      term        estimate  std.error statistic p.value
   Degree          (Intercept) 1.50e+01  1.30e+00  1.15e+01  4.98e-20
   Degree          Coefficient 1.37e-01  2.24e-02  6.11e+00  2.01e-08
   Katz-Sink       (Intercept) 2.55e+01  2.31e+00  1.11e+01  8.26e-19
   Katz-Sink       Coefficient 6.23e-03  3.93e-02  1.58e-01  8.74e-01
   Katz-Source     (Intercept) 2.84e+01  2.57e+00  1.10e+01  1.49e-18
   Katz-Source     Coefficient -5.97e-02 4.28e-02  -1.39e+00 1.67e-01
   Katz-SSC        (Intercept) 1.45e+01  1.12e+00  1.30e+01  4.23e-23
   Katz-SSC        Coefficient 1.38e-01  1.93e-02  7.13e+00  1.64e-10
   Lap-Sink        (Intercept) 2.49e+01  1.89e+00  1.32e+01  1.92e-23
   Lap-Sink        Coefficient 1.65e-02  3.23e-02  5.12e-01  6.09e-01
   Lap-Source      (Intercept) 2.91e+01  2.50e+00  1.17e+01  8.94e-20
   Lap-Source      Coefficient -7.29e-02 4.14e-02  -1.76e+00 8.18e-02
   Lap-SSC         (Intercept) 1.27e+01  1.03e+00  1.24e+01  8.89e-22
   Lap-SSC         Coefficient 1.78e-01  1.78e-02  9.99e+00  1.12e-16
   PageRank-Sink   (Intercept) 1.33e+01  2.28e+00  5.85e+00  7.09e-08
   PageRank-Sink   Coefficient 1.91e-01  3.86e-02  4.93e+00  3.43e-06
   PageRank-Source (Intercept) 1.27e+01  1.54e+00  8.23e+00  1.37e-12
   PageRank-Source Coefficient 1.77e-01  2.55e-02  6.94e+00  5.88e-10
   PageRank-SSC    (Intercept) 7.58e+00  9.78e-01  7.76e+00  7.88e-12
   PageRank-SSC    Coefficient 2.67e-01  1.69e-02  1.58e+01  8.22e-29
   PageRank-Und    (Intercept) 9.06e+00  9.39e-01  9.65e+00  6.36e-16
   PageRank-Und    Coefficient 2.33e-01  1.62e-02  1.43e+01  6.49e-26
   [96]Open in a new tab

   For the Degree centrality, as shown in Fig. [97]1, the higher values of
   quantile scores correspond to higher percentages points of cancer
   genes, and low quantile scores exhibit lower percentage of cancer
   genes. The analysis supports this observation by showing a linear
   relationship between the scores and the percentage of genes that are
   cancer-related with an adjusted r-squared (Adj r^2) of 0.27. The
   regression analysis shows a statistically significant positive
   coefficient of 1.37×10^−1 for the quantile scores (Adj p-value =
   3.96×10^−8, Table [98]1).

   The standard directed variation of Katz centrality (the Source
   component) fails to identify an evidence (Adj p-value = 0.167) for
   linear relationship (Fig. [99]1), and the linear regression model
   accounts for an insignificant fraction of the variance (Adj r^2 =
   0.021). Similarly, the Katz Sink Component produces an Adj r^2 =
   0.00026 and Adj p-value = 0.874. In contrast, the combined value of the
   two components, Source/Sink Katz, shows that the linear relationship
   explains a statistically significant portion of the variance (Adj r^2 =
   0.34). In this case, the regression analysis shows a statistically
   significant positive coefficient of 1.38×10^−1 for the quantile scores
   (Adj p-value = 4.50×10^−10, Table [100]1).

   For different formats of Katz centrality, comparison of cumulative
   distribution function (CDF) between the cancer genes and normal genes
   shows that the CDF of cancer genes lies below that of the normal genes
   (Fig. [101]2). Although all of the differences are statistically
   significant, amount of differences depends on the specific variation of
   centrality. For example, more than 65% of the cancer genes have a
   normalized quantile score greater than 50 when measuring centrality
   using Katz Source/Sink method. The pathway-by-pathway two-sample tests
   also show that each variation of Katz centrality is able to detect a
   number of pathways to have a higher mean of centrality for cancer genes
   (Tables [102]2 and [103]3). The number of identified pathways in the
   non-parametric model are higher in comparison to a regular t-test. For
   example, the two-sample t-test detects five pathways with higher
   centrality of cancer genes using Katz Source/Sink, while the Wilcox
   rank-sum test identifies 13 pathways (FDR<0.05).

Fig. 2.

   [104]Fig. 2
   [105]Open in a new tab

   Comparison of the cumulative density between cancer-related genes and
   normal genes. The data points represent the quantile-scores calculated
   based on normalized centrality (Formula [106]23) across all pathways.
   Each panel includes the p-value of Kolmogorov-Smirnov test for the
   hypothesis of the CDF of cancer genes being below that of the normal
   genes. The panels show that cancer genes tend to have higher centrality
   values according to all of the models. This indicates the individual
   values of source and sink components for capturing the topological
   importance of cancer genes. Asterisk marks denote the p-values generate
   by the KS test method in R

Table 2.

   Pathways identified with higher mean centrality for cancer genes by
   t-test
               Deg Katz        Lap         Pgr
               Deg Sink So SSC Sink So SSC So Si SSC Und
   Degree      5   0    2  4   0    2  1   0  1  0   2
   Katz-Sink       6    0  1   5    0  1   2  0  2   1
   Katz-Source          2  2   0    2  1   0  1  0   0
   Katz-SSC                5   1    2  1   0  1  0   1
   Lap-Sink                    17   0  4   3  0  3   1
   Lap-Source                       8  2   0  1  0   0
   Lap-SSC                             15  2  1  2   1
   Pgr-Sink                                5  1  3   1
   Pgr-Source                                 2  1   1
   Pgr-SSC                                       5   2
   Pgr-Und                                           7
   [107]Open in a new tab

Table 3.

   Pathways identified with higher mean centrality for cancer genes by
   Wilcox test
               Deg Katz        Lap         Pgr
               Deg Sink So SSC Sink So SSC So Si SSC Und
   Degree      9   1    4  9   1    2  2   1  3  3   9
   Katz-Sink       13   1  2   10   1  3   12 2  10  7
   Katz-Source          9  5   1    5  2   4  6  4   8
   Katz-SSC                13  2    2  2   3  4  5   11
   Lap-Sink                    20   2  6   13 2  11  6
   Lap-Source                       8  3   4  7  4   5
   Lap-SSC                             15  7  5  6   3
   Pgr-Sink                                25 8  20  13
   Pgr-Source                                 16 12  9
   Pgr-SSC                                       31  15
   Pgr-Und                                           29
   [108]Open in a new tab

   Comparison of different PageRank centrality formats also shows that the
   SSC framework produces a better description of the pathway
   organizations compared to the individual Source and Sink components.
   The centrality values of individual Source (Adj p-value = 1.29×10^−9
   and Adj r^2=0.35) and Sink (Adj p-value = 5.39×10^−6, and Adj r^2=0.20)
   components of PageRank have linear relationship with the percentage of
   cancer genes. The combination of the two components as in the
   Source/Sink PageRank produces a more descriptive and stronger linear
   relationship in term of both the adjusted r-squared and the regression
   coefficient (Adj p-value = 9.04×10^−28 and Adj r^2=0.72). Undirected
   PageRank also provides a stronger linear relationship in comparison to
   the individual Source and Sink components (Adj p-value = 3.57×10^−25
   and Adj r^2=0.68).

   For different formats of PageRank, comparison of the CDF between the
   cancer genes and normal genes shows that the CDF of cancer genes lies
   below that of the normal genes (Fig. [109]2). Although all of the
   differences are statistically significant, amount of differences
   depends on the specific variation of centrality. This indicates that
   each format of PageRank shows higher values of centrality for cancer
   genes in the distribution of the scores, even though the distinction
   may not be apparent according to the linear regression model. For
   example, more than 70% of the cancer genes have a normalized quantile
   score greater than 50 when measuring centrality using PageRank
   Source/Sink method. The pathway-by-pathway two-sample tests also show
   that each variation of PageRank centrality is able to detect pathways
   with a higher mean of centrality for cancer genes (Tables [110]2 and
   [111]3). In this case, the number of identified pathways in the
   non-parametric model are higher in comparison to a regular t-test. For
   example, the two-sample t-test detects five pathways with higher
   centrality of cancer genes using PageRank Source/Sink, while the Wilcox
   rank-sum test identifies 31 pathways (FDR<0.05).

   The Laplacian model also detects a linear relationship only when
   considering the Source/Sink formatting (Adj p-value = 4.12×10^−16 and
   Adj r^2=0.50). For different formats of Laplacian, comparison of the
   CDF between the cancer genes and normal genes shows that the CDF of
   cancer genes lies below that of the normal genes (Fig. [112]2). The
   pathway-by-pathway two-sample tests also show that each variation of
   Laplacian centrality is able to detect a number of pathways to have a
   higher mean of centrality for cancer genes (Tables [113]2, and [114]3).
   In this case, the number of identified pathways in the non-parametric
   model are similar to a regular t-test. For example, the two-sample
   t-test detects fifteen pathways with higher centrality of cancer genes
   using Laplacian Source/Sink, the same number as the Wilcox rank-sum
   (FDR<0.05).

   We wondered whether the differences between the SSC framework and other
   variations were sensitive to the choice of model-specific parameters.
   To address this question, we repeated the regression analysis for the
   four variations of PageRank across the values of α∈[0.1,0.9] with 0.01
   increments (across all the pathways, n = 157). We then compared the
   adjusted r-squared of the linear regression models according to Formula
   [115]22 (Fig. [116]3). We used Fisher’s Z-transformation of correlation
   coefficients of the linear fits to measure the statistical difference
   between SSC and undirected PageRank across the range of α. The results
   are suggestive that PageRank SSC produces the highest adjusted
   r-squared among all the variations for the most of the search range,
   and as well as the undirected model for some part of the range. In
   particular, SSC produces the highest r-squared for any variation of
   PageRank at any alpha at α=0.58.

Fig. 3.

   [117]Fig. 3
   [118]Open in a new tab

   Sensitivity analysis of PageRank variations in the linear regression
   analysis with respect to the α parameter. Panel A shows the adjusted
   r-squared of the linear fit per α(Formula [119]22). Panel B displays
   the negative log p-value of the difference between the correlation
   coefficients of SSC PageRank versus undirected PageRank. The red line
   in panel B denotes the significance threshold of p-value=0.05

Fig. 4.

   Fig. 4
   [120]Open in a new tab

   Linear regression fit of the quantile-normalized centrality scores (Eq.
   [121]20) and the percentage genes that are cancer-related (Eq.
   [122]21). The Source/Sink extension of the centrality models show
   higher slope and adjusted coefficient of determination (Adjusted
   r-squared) in comparison to the standard variations of the centrality
   models

   We also tested the linear regression framework for closeness
   centrality, as an example of a centrality method that is not spectral
   (details in [123]Appendix). Our analysis shows that Source/Sink
   closeness centrality is able to identify a linear relationship between
   quantile-scores and percentage points of cancer genes (Adj r^2=0.27)
   while individual Source and Sink components fail to identify. Also,
   undirected closeness centrality shows a lower coefficient of
   determination (Adj r^2=0.11) in comparison to the SSC variation.

Pathway centrality of mouse lethal genes

   The regression analysis shows an evident increase in the percentage of
   the genes that are lethal in mouse pathways with the increased
   centrality scores from PageRank Source/Sink, Laplacian Source/Sink, and
   undirected PageRank (Fig. [124]5). For all of the other models, the
   analysis fails to identify any linear relationship between the
   centrality scores and the percentages.

Fig. 5.

   [125]Fig. 5
   [126]Open in a new tab

   Linear regression fit of the quantile-normalized centrality scores (Eq.
   [127]20) and the percentage mouse pathway genes that are lethal (Eq.
   [128]21). The Source/Sink extension of the centrality models show
   higher slope and adjusted coefficient of determination (Adjusted
   r-squared) in comparison to the standard variations of the centrality
   models (Table [129]4)

   The Source PageRank centrality fails to identify any evidence (Adj
   p-value = 0.649) for linear relationship (Fig. [130]5), and the linear
   regression model accounts for an insignificant fraction of the variance
   (Adj r^2 = 0.0022), similar to values for the PageRank Sink Component
   (Adj r^2 = 0.004 and Adj p-value = 0.55). In contrast, the combined
   value of the two components, Source/Sink PageRank, shows that the
   linear relationship explains a statistically significant portion of the
   variance (Adj r^2 = 0.21). In this case, the regression analysis shows
   a statistically significant positive regression coefficient of
   5.24×10^−2 (Table [131]4).

Table 4.

   Linear regression fit of the quantile-normalized centrality scores (Eq.
   [132]20) and the percentage of mouse pathway genes that are lethal
   Centrality       term        estimate  std.error statistic p.value
   Degree           (Intercept) 8.08e+00  5.86e-01  1.38e+01  1.19e-24
   Degree           Coefficient 1.14e-02  1.01e-02  1.13e+00  2.62e-01
   Katz-Sink        (Intercept) 1.08e+01  1.07e+00  1.01e+01  9.91e-17
   Katz-Sink        Coefficient -2.27e-02 1.79e-02  -1.27e+00 2.08e-01
   Katz-Source      (Intercept) 1.32e+01  1.09e+00  1.21e+01  6.70e-21
   Katz-Source      Coefficient -5.63e-02 1.83e-02  -3.08e+00 2.69e-03
   Katz-Source-Sink (Intercept) 8.47e+00  6.78e-01  1.25e+01  4.66e-22
   Katz-Source-Sink Coefficient 6.40e-03  1.17e-02  5.46e-01  5.86e-01
   Lap-Sink         (Intercept) 1.03e+01  1.15e+00  8.89e+00  4.62e-14
   Lap-Sink         Coefficient -1.06e-02 1.94e-02  -5.48e-01 5.85e-01
   Lap-Source       (Intercept) 1.30e+01  1.15e+00  1.13e+01  3.82e-19
   Lap-Source       Coefficient -4.50e-02 1.93e-02  -2.33e+00 2.20e-02
   Lap-SSC          (Intercept) 5.51e+00  7.05e-01  7.82e+00  5.70e-12
   Lap-SSC          Coefficient 6.65e-02  1.22e-02  5.46e+00  3.54e-07
   PageRank-Sink    (Intercept) 8.72e+00  1.05e+00  8.28e+00  1.17e-12
   PageRank-Sink    Coefficient 1.04e-02  1.74e-02  6.01e-01  5.50e-01
   PageRank-Source  (Intercept) 9.15e+00  1.15e+00  7.93e+00  4.85e-12
   PageRank-Source  Coefficient 8.83e-03  1.94e-02  4.56e-01  6.49e-01
   PageRank-SSC     (Intercept) 5.92e+00  5.84e-01  1.01e+01  5.26e-17
   PageRank-SSC     Coefficient 5.24e-02  1.01e-02  5.20e+00  1.09e-06
   PageRank-Und     (Intercept) 7.12e+00  5.26e-01  1.35e+01  2.81e-24
   PageRank-Und     Coefficient 2.67e-02  9.08e-03  2.94e+00  4.03e-03
   [133]Open in a new tab

   For different formats of all centrality models, comparison of CDF
   between the cancer genes and normal genes shows that the CDF of lethal
   genes lies below that of the normal genes (Fig. [134]6). The two-sample
   tests shows statistical power in detecting pathways with higher
   centrality of lethal genes using a less conservative FDR threshold
   (Tables [135]5, and [136]6). The pathway-by-pathway two-sample tests
   also show that each variation of Laplacian centrality and some formats
   of PageRank are able to detect a number of pathways to have a higher
   mean of centrality for cancer genes. For Laplacian centrality, the
   number of identified pathways in the non-parametric model is similar in
   comparison to a regular t-test.

Fig. 6.

   Fig. 6
   [137]Open in a new tab

   Comparison of the cumulative density between lethal genes and normal
   (non-lethal) genes. The data points represent the quantile-scores
   calculated based on normalized centrality (Formula [138]23) across all
   pathways. Each panel includes the p-value of Kolmogorov-Smirnov test
   for the hypothesis of the CDF of lethal genes being below that of the
   normal genes. The panels show that lethal genes tend to have higher
   centrality values according to some of the models, including
   Source/Sink PageRank and Source/Sink Laplacian

Table 5.

   Pathways identified with higher mean centrality for mouse lethal genes
   by Wilcox test (FDR<0.25)
              Lap         Pgr
              Sink So SSC Sink So SSC
   Lap-Sink   1    0  1   0    1  0
   Lap-Source      1  0   0    1  1
   Lap-SSC            4   0    2  0
   Pgr-Sink               2    0  0
   Pgr-Source                  6  2
   Pgr-SSC                        4
   [139]Open in a new tab

Table 6.

   Pathways identified with higher mean centrality for mouse lethal genes
   by t-test (FDR<0.25)
            Lap      Pgr
            Sink SSC Sink
   Lap-Sink 1    1   0
   Lap-SSC       6   0
   Pgr-Sink          1
   [140]Open in a new tab

Discussion

   Regression analysis of the topological position of cancer genes in
   human pathways shows that the graph centrality models can account for
   the percentage of the known important genes, particularly when
   formulated in Source/Sink modeling. Individual source or sink
   components of Katz and Laplacian fail to identify evidence for the
   linear relationship of centrality with the percentage of the genes that
   are cancer-related, noting these models have been applied in different
   pathway analysis methods [[141]15, [142]16, [143]18]. In contrast, the
   SSC format of Katz and Laplacian models exhibit a statistically
   significant linear relationship between the centrality and the
   percentage of the genes that are cancer-related (Table [144]1). This
   improvement is due to SSC assigning centrality values to nodes that are
   downstream terminal but topologically important as receivers of
   information.

   Similarly, SSC PageRank shows significant improvement in comparison to
   the standard directed formats in explaining the topological importance
   of cancer genes. These observations provide a noteworthy insight as
   different tools leverage directed formats of PageRank in pathway
   analysis applications [[145]14, [146]17]. The higher adjusted r^2 of
   SSC compared to other standard variations (Fig. [147]3) may be
   explained by noting that SSC PageRank is sensitive both to the
   directionality and the position at the upstream/downstream organization
   of pathways. Consistently, for every one of the centrality models the
   adjusted r^2 and the slope of the linear regression coefficient
   increase when using the Source/Sink framework.

   The analysis of lethal genes in mouse pathways provide additional
   validation for SSC methodology. The association of lethal genes with
   topological importance in biological networks has been extensively
   studied in the context of protein-protein interaction networks, namely
   centrality-lethality rule [[148]28–[149]30]. Our results provide an
   account for the centrality-lethality rule in the biological pathways,
   noting that the pattern is statistically significant when leveraging
   SSC modeling. As evident in Fig. [150]5, SSC formats of Laplacian and
   PageRank, and undirected PageRank to some extent provide evidence for
   the centrality-lethality rule.

   Lack of linear relationship in topological importance versus the
   percentage of biologically important genes may not dismiss the utility
   of a centrality model. As evident in CDF analysis and two-samples
   tests, a centrality model may exhibit distinct patterns between cancer
   (lethal) and non-cancer (non-lethal) genes. This CDF evidence may
   explain why the combination of the source and sink components is more
   informative. We believe that the evidence of increasing linear
   relationship between the topological importance and the percentage of
   biologically important genes provides a critical insight with respect
   to the appropriate choice of directed graph modeling in pathways. In
   particular, our results strongly indicate that the knowledge of the
   topological importance of downstream nodes is as valuable as that of
   upstream nodes, and should not be dismissed as irrelevant as assumed by
   current pathway analysis models [[151]14, [152]15, [153]18]. In fact,
   our results demonstrate a setting wherein the use of SSC and Undirected
   modeling is superior to the directed formats. In these conditions, the
   PageRank SSC can provide a better explanation of the linear
   relationship in comparison to the undirected model for several possible
   values of α.

   A limitation of the present study is the requirement of having
   sufficient information on the underlying network of pathways. Pathway
   databases can contain several entries with insufficient number of
   interaction between genes and other bio-molecular entities. In such
   cases, topological analysis is not feasible and some intermediate
   steps, such as data-driven prediction of interactions, may be necessary
   before using any network-based modeling [[154]20]. Another limitation
   of SSC modeling is the requirement of having information on the
   directionality of interactions, which can be absent in some pathway
   datasets.

   While SSC modeling is simple and straightforward, any potential
   application in other centrality methods should be handled with caution.
   We only focused on spectral centrality models because of their
   widespread use in biological network analysis and their compatibility
   to theoretically express the SSC framework. For other centrality types,
   SSC framework may or may not be the best option. For example,
   betweenness centrality – which measure importance of a node relative to
   the number of shortest paths that pass through it – does not produce a
   meaningful Source Sink variation. If a node k falls within a vu
   shortest path in a graph G, it also falls within the uv shortest path
   in G^T. In another example in the [155]Appendix, we demonstrate the
   linear regression model of cancer genes in human pathways using SSC
   modeling in closeness centrality, which is consistent with the
   presented results.

   The presented results are concordant with our hypothesis that
   accounting for the upstream and downstream organization of pathways
   provides more biologically relevant assessment of organization of
   pathways. The presented results also explain the success of Source/Sink
   modeling in achieving higher sensitivity and biological relevance in
   the enrichment analysis and functional interpretation of genomic data
   as was presented in our previous research [[156]24]. When considering
   the biological context, our results formalize the intuitive observation
   that the key pathway elements may appear at any stage of the pathways.
   The presented results also highlight a disadvantage of the directed
   pathway analysis models that fundamentally assume a higher importance
   for the upstream pathway elements and neglect the changes/perturbations
   of downstream elements. The appropriate choice of centrality measures
   for biological network analyses may vary depending on model assumptions
   and the underlying data. However, when applicable, our results
   recommend adapting SSC framework for fully leveraging the underlying
   structure of the networks.

Conclusion

   This study investigated the explanatory power of different centrality
   models with respect to a priori important pathway genes. We tested
   standard and novel centrality models, and presented a novel alternative
   with a better topological description of the pathways that accounts for
   the importance of the pathway elements with respect to the upstream and
   downstream positions. The two case examples in this study were
   cancer/non-cancer genes in human and lethal/viable genes in mouse. For
   both groups there exists literature on their positions and importance
   in their corresponding biological networks.

   Regression analysis, subsequent comparison of CDFs, and two-sample
   tests of the pathways show that spectral importance determines the
   topological importance of cancer genes. In particular, the SSC modeling
   results in more distinct and clear separation of the a priori important
   genes. These results show that using directions while giving importance
   to terminal nodes in pathways may give higher explanatory power which
   should be of particular interest to the research in biological networks
   and pathway analysis.

Appendix

Closeness centrality

   Closeness centrality describes a model where the importance of each
   node is calculated as the sum of its shortest distance from all the
   other nodes. Formally:
   [MATH: <msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow></msub><mo>(</mo><mi>v</mi><mo>)
   </mo><mo>:</mo><mo>=</mo><munder><mrow><mo>∑</mo></mrow><mrow><mi>u</mi
   ><mo>∈</mo><mi>V</mi><mo>(</mo><mi>G</mi><mo>)</mo></mrow></munder><mfr
   ac><mrow><mn>1</mn></mrow><mrow><mi>d</mi><mo>(</mo><mi>v</mi><mo>,</mo
   ><mi>u</mi><mo>)</mo></mrow></mfrac> :MATH]
   25

   where d(v,u) denotes the length of the shortest paths between v and u.
   Similar to the other models, we define source closeness as the standard
   closeness centrality.
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>:</mo><mo>=</mo><msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow></msub><mo>(</mo><mi>v</mi><mo>)
   </mo> :MATH]
   26

   We define sink closeness as:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup><mo>(</mo><mi>v</mi><mo
   >)</mo><mo>:</mo><mo>=</mo><munder><mrow><mo>∑</mo></mrow><mrow><mi>u</
   mi><mo>∈</mo><mi>V</mi><mo>(</mo><mi>G</mi><mo>)</mo></mrow></munder><m
   frac><mrow><mn>1</mn></mrow><mrow><mi>d</mi><mo>(</mo><mi>u</mi><mo>,</
   mo><mi>v</mi><mo>)</mo></mrow></mfrac> :MATH]
   27

   Subsequently, we defined the Source Sink closeness centrality as:
   [MATH: <msubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow><mrow><mtext
   mathvariant="italic">SS</mtext></mrow></msubsup><mo>:</mo><mo>=</mo><ms
   ubsup><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow><mrow><mtext
   mathvariant="italic">So</mtext></mrow></msubsup><mo>+</mo><msubsup><mro
   w><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">cls</mtext></mrow><mrow><mtext
   mathvariant="italic">Si</mtext></mrow></msubsup> :MATH]
   28

   We analyzed the linear regression fit of quantile-normalized closeness
   centrality values versus the percentage of genes that are
   cancer-related in human pathways. The results show a higher adjusted
   r-squared for the SSC variation compared to the other models.

Proofs

   For an adjacency matrix of a directed graph, A, define the weight
   normalized matrix L using a positive real value d as following:
   [MATH: <msub><mrow><mi>L</mi></mrow><mrow><mtext
   mathvariant="italic">ij</mtext></mrow></msub><mo>(</mo><mi>d</mi><mo>)<
   /mo><mo>=</mo><mfrac><mrow><msub><mrow><mi>A</mi></mrow><mrow><mtext
   mathvariant="italic">ij</mtext></mrow></msub></mrow><mrow><mi>d</mi><mo
   >+</mo><munderover><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn
   >1</mn></mrow><mrow><mi>n</mi></mrow></munderover><mo>|</mo><msub><mrow
   ><mi>A</mi></mrow><mrow><mtext
   mathvariant="italic">ij</mtext></mrow></msub><mo>|</mo></mrow></mfrac>
   :MATH]
   29
   [MATH:
   <mi>L</mi><mo>=</mo><msub><mrow><mo>lim</mo></mrow><mrow><mi>d</mi><mo>
   →</mo><mn>0</mn></mrow></msub><mi>L</mi><mo>(</mo><mi>d</mi><mo>)</mo>
   :MATH]
   30

   Define the influence matrix, L^∗, as the geometric series of L. In the
   case of undirected graphs, this notion is related to the concept of
   normalized Laplacian and heat diffusion kernels [[157]32].
   [MATH:
   <msup><mrow><mi>L</mi></mrow><mrow><mo>∗</mo></mrow></msup><mo>=</mo><m
   underover><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>0</mn></
   mrow><mrow><mi>∞</mi></mrow></munderover><msup><mrow><mi>L</mi></mrow><
   mrow><mi>i</mi></mrow></msup> :MATH]
   31

   On the condition of convergence, the above summation can be written as:
   [MATH:
   <msup><mrow><mi>L</mi></mrow><mrow><mo>∗</mo></mrow></msup><mo>=</mo><m
   sub><mrow><mo>lim</mo></mrow><mrow><mi>d</mi><mo>→</mo><mn>0</mn></mrow
   ></msub><msup><mrow><mo>(</mo><mi>I</mi><mo>−</mo><mi>L</mi><mo>(</mo><
   mi>d</mi><mo>)</mo><mo>)</mo></mrow><mrow><mo>−</mo><mn>1</mn></mrow></
   msup> :MATH]
   32

   According to Shojaie and Michailidis, choice of d as small as 0.01
   would produce consistent and stable results. However, to eliminate the
   need for the parameter d, we rewrite an equivalent formulation for the
   matrix L as :
   [MATH:
   <mi>L</mi><mo>:</mo><mo>=</mo><msup><mrow><mi>D</mi></mrow><mrow><mo>−<
   /mo><mn>1</mn></mrow></msup><mi>A</mi> :MATH]
   33

   where D is the diagonal degree matrix with the same definition as in D
   of PageRank. As noted in [[158]22], for undirected graphs, the solution
   to the matrix L in a matrix geometric series uniquely exist. That is,
   the matrix L^∗ from Formula [159]31 is only guaranteed to uniquely
   exist when we use the symmetric matrix of the undirected graph.

   However, the case might be different for directed graphs. Therefore,
   including a shrinking factor, α<1, that ensures the convergence in a
   geometric summation. We then re-define:
   [MATH:
   <mi>L</mi><mo>:</mo><mo>=</mo><mi>α</mi><msup><mrow><mi>D</mi></mrow><m
   row><mo>−</mo><mn>1</mn></mrow></msup><mi>A</mi> :MATH]
   34

   Using the above Formula, we define the Laplacian centrality of a node
   as the aggregated influence of a node i on all other nodes. This is
   obtained from Formula [160]31:
   [MATH: <mtable><mtr><mtd><msub><mrow><mi>C</mi></mrow><mrow><mtext
   mathvariant="italic">lap</mtext></mrow></msub></mtd><mtd><mo>=</mo><msu
   p><mrow><mi>L</mi></mrow><mrow><mo>∗</mo></mrow></msup><mi>𝟙</mi></mtd>
   </mtr><mtr><mtd><mo>=</mo><msup><mrow><mfenced close=")"
   open="("><mrow><mi>I</mi><mo>−</mo><mi>α</mi><msup><mrow><mi>D</mi></mr
   ow><mrow><mo>−</mo><mn>1</mn></mrow></msup><mi>A</mi></mrow></mfenced><
   /mrow><mrow><mo>−</mo><mn>1</mn></mrow></msup><mi>𝟙</mi></mtd></mtr></m
   table> :MATH]
   35

Acknowledgements