Abstract
Background
Although there are huge volumes of genomic data, how to decipher them
and identify driver events is still a challenge. The current methods
based on network typically use the relationship between genomic events
and consequent changes in gene expression to nominate putative driver
genes. But there may exist some relationships within the
transcriptional network.
Methods
We developed MECoRank, a novel method that improves the recognition
accuracy of driver genes. MECoRank is based on bipartite graph to
propagates the scores via an iterative process. After iteration, we
will obtain a ranked gene list for each patient sample. Then, we
applied the Condorcet voting method to determine the most impactful
drivers in a population.
Results
We applied MECoRank to three cancer datasets to reveal candidate driver
genes which have a greater impact on gene expression. Experimental
results show that our method not only can identify more driver genes
that have been validated than other methods, but also can recognize
some impactful novel genes which have been proved to be more important
in literature.
Conclusions
We propose a novel approach named MECoRank to prioritize driver genes
based on their impact on the expression in the molecular interaction
network. This method not only assesses mutation’s effect on the
transcriptional network, but also assesses the differential
expression’s effect within the transcriptional network. And the results
demonstrated that MECoRank has better performance than the other
competing approaches in identifying driver genes.
Keywords: Driver genes, Cancer, Transcriptional networks
Background
Recent advances in deep sequencing have provided us with an
unprecedented amount of cancer genomics data. With the rapid
accumulation of huge volumes of genomic data, we have tremendous
opportunities to better understand cancer initiation, progression and
development [[33]1]. However, it is still a challenge to decipher those
data (e.g., single nucleotide variants (SNVs), small insertions or
deletions (indels), large copy-number variations (CNVs) and structural
aberrations, etc.) and use them to distinguish driver mutations which
contribute to cancer development from passenger mutations that have
accumulated in somatic cells but without functional consequences
[[34]2, [35]3]. In fact, in the early stages, single data, such as
somatic aberrations data is the most commonly used data to identify
driver genes. For example, frequency-based methods, such as MuSiC
[[36]3] and MutSigCV [[37]4], are the common approach which relies on
the frequency of aberration of a given gene or locus in a population of
tumors [[38]5]. In addition, machine learning approaches based on
alterations knowledge are also used to identify driver genes. For
instance, CHASM adopts random forest which use alterations trained from
known cancer-causing somatic missense mutations to classify driver
mutations [[39]6, [40]7]. Recently, many mathematical and statistical
methods which are based on data integration were proposed to search for
driver genes, driver pathways or core modules [[41]8]. Bayesian
network-based methods such as CONEXIC integrated copy number change and
gene expression data to identify potential driver genes which are
located in some amplified or deleted regions in tumors [[42]9].
With the developing of the research of cancer, we have recognized that
the development and progression of cancer can be promoted by driver
mutations or gene perturbing signaling, regulatory or metabolic
pathways [[43]1]. Thus, several methods that use network and pathway to
understand drivers have been proposed, e.g., MEMo [[44]10] and Dendrix
[[45]11]. MEMo uses the mutual exclusivity of gene mutations to detect
mutated subnetworks critical to carcinogenesis [[46]10]. As well as
Dendrix was designed to identify subnetworks with potential driver
activity which have high coverage and high mutual exclusivity [[47]11].
Another method, MUFFINN measures the functional impact of the network
neighbors of mutated genes, and scores the investigated genes by
considering the influence of either the most frequently mutated
neighbor or all direct neighbors [[48]12]. Although the aforementioned
approaches have achieved great achievements in distinguishing driver
genes, improving the identification accuracy of driver genes is still a
challenge.
In this work, we propose a method named MECoRank to prioritize driver
genes of single patient sample based on their impact on the expression
in the molecular interaction. Our method not only assess mutation’s
effect on gene expression, but also assess the differential
expression’s effect within a transcriptional network. We first
construct a bipartite graph model to formulate associations between
expression and somatic SNVs using protein-protein interaction (PPI)
network. A bipartite graph is a graph whose vertices can be partitioned
into two subsets. In our work, vertices on the left partition of the
bipartite graph correspond to individual gene expression status and
vertices on the right partition represent individual mutated genes. And
then an iterative process was used to propagate the effect of somatic
SNVs and differential expression. After iteration, we will obtain a
ranked gene list for each patient sample. Finally, we applied the
Condorcet voting method to determine the most impactful drivers in a
population. To test the performance of our approach, we analyzed three
datasets which are breast cancer dataset (BRCA), kidney renal clear
cell carcinoma (KIRC) and lung squamous cell carcinoma (LUSC). From
TCGA Data Portal ([49]https://portal.gdc.cancer.gov/), we collected the
data of somatic SNVs. And from UCSC [[50]13], we obtained gene
expression data. Experimental results show that our method not only can
identify more driver genes that have been validated than other methods,
but also can recognize some impactful novel genes. Although these genes
are not presented in Cancer Gene Census (CGC), some evidences show that
these candidate genes have functional roles in cancer or cancer-related
biological processes.
Methods
Method overview
The proposed MECoRank method prioritizes driver genes based on the
impact of their mutations and differential expression on the expression
in the molecular interaction during a single patient. An overview of
its workflow is presented in Fig. [51]1. In the following section, we
first present the bipartite graph model and then give the iterative
framework. Finally, we introduce Condorcet voting method for rank
aggregation.
Fig. 1.
[52]Fig. 1
[53]Open in a new tab
A schematic of MECoRank framework. a The data we used including gene
expression of cancer and normal patients, somatic SNVs and PPI network.
b The left partition of bipartite graph represents an individual
patient’s expression set U where the transition matrices W^UU indicates
the relations in U. The right of the bipartite graph represents the
patient’s mutation set V where the transition matrices W^VU indicates
the interactions between U and V. c We can obtain a score matrix in
which each gene of every patient has a score. d We used the Condorcet
voting to obtain the final rank of the genes
A bipartite graph model
In this section, we detail the bipartite graph model used in our work.
Consider a bipartite graph G = (U ∪ V, E), where U and V represent the
individual expression and mutation respectively. Each edge in E
connects a vertex in U and one in V. Let U = {u[1], u[2], …, u[m]} and
V = {v[1], v[2], …, v[n]} be the two sets of m and n genes. We use u[i]
to denote the i − th vertex in U, and v[j] to denote the j − th vertex
in V, where 1 ≤ i ≤ m and 1 ≤ j ≤ n. For each patient, an edge between
the nodes on the left and right partitions of the graph is drawn if
u[i]and v[j] are known to interact according to PPI network. We
constructed the bipartite graph for each patient by the same way. The
edges between U and V are represented as the transition probability
W^VU [[54]14]. And the edges within U are represented as W^UU. If there
is an edge connecting u[i] and v[j] in PPI,
[MATH: wijvu=1 :MATH]
; otherwise,
[MATH: wijvu=0 :MATH]
. As such, we can describe all edges weights of the graph as a m × n
matrix
[MATH: WVU=wijvu :MATH]
. Similarly, we can easily obtain the transition matrix W^UU. For each
vertex v[j], we denote its degree (sum of connected edges’ weights) as
d[j], and use a diagonal matrix D[v] to denote the degrees of all
vertices in V; and similarly, for d[i] and D[u]. Note that in this
paper, we deal with undirected bipartite graphs.
Iterative framework
To rank vertices based on the graph structure, seminal algorithms like
PageRank [[55]15] and HITS [[56]16] have been proposed [[57]17].
PageRank is an algorithm used by Google Search to rank websites in
their search engine results [[58]18]. This approach estimates the
importance score of vertices as the stationary distribution of a random
walk process – starting from a vertex, the surfer randomly jumps to a
neighbor according to the edge weight [[59]17]. HITS algorithm is
similar to PageRank in some aspects. This method assumes each vertex
has two roles: hub and authority [[60]16]. If a vertex is linked by
many vertices with hub score, this vertex has high authority and vice
versa [[61]17]. These two methods are focus on unipartite graphs. Our
iterative framework references HITS, PageRank and their variants which