Abstract

Background

   Although there are huge volumes of genomic data, how to decipher them
   and identify driver events is still a challenge. The current methods
   based on network typically use the relationship between genomic events
   and consequent changes in gene expression to nominate putative driver
   genes. But there may exist some relationships within the
   transcriptional network.

Methods

   We developed MECoRank, a novel method that improves the recognition
   accuracy of driver genes. MECoRank is based on bipartite graph to
   propagates the scores via an iterative process. After iteration, we
   will obtain a ranked gene list for each patient sample. Then, we
   applied the Condorcet voting method to determine the most impactful
   drivers in a population.

Results

   We applied MECoRank to three cancer datasets to reveal candidate driver
   genes which have a greater impact on gene expression. Experimental
   results show that our method not only can identify more driver genes
   that have been validated than other methods, but also can recognize
   some impactful novel genes which have been proved to be more important
   in literature.

Conclusions

   We propose a novel approach named MECoRank to prioritize driver genes
   based on their impact on the expression in the molecular interaction
   network. This method not only assesses mutation’s effect on the
   transcriptional network, but also assesses the differential
   expression’s effect within the transcriptional network. And the results
   demonstrated that MECoRank has better performance than the other
   competing approaches in identifying driver genes.

   Keywords: Driver genes, Cancer, Transcriptional networks

Background

   Recent advances in deep sequencing have provided us with an
   unprecedented amount of cancer genomics data. With the rapid
   accumulation of huge volumes of genomic data, we have tremendous
   opportunities to better understand cancer initiation, progression and
   development [[33]1]. However, it is still a challenge to decipher those
   data (e.g., single nucleotide variants (SNVs), small insertions or
   deletions (indels), large copy-number variations (CNVs) and structural
   aberrations, etc.) and use them to distinguish driver mutations which
   contribute to cancer development from passenger mutations that have
   accumulated in somatic cells but without functional consequences
   [[34]2, [35]3]. In fact, in the early stages, single data, such as
   somatic aberrations data is the most commonly used data to identify
   driver genes. For example, frequency-based methods, such as MuSiC
   [[36]3] and MutSigCV [[37]4], are the common approach which relies on
   the frequency of aberration of a given gene or locus in a population of
   tumors [[38]5]. In addition, machine learning approaches based on
   alterations knowledge are also used to identify driver genes. For
   instance, CHASM adopts random forest which use alterations trained from
   known cancer-causing somatic missense mutations to classify driver
   mutations [[39]6, [40]7]. Recently, many mathematical and statistical
   methods which are based on data integration were proposed to search for
   driver genes, driver pathways or core modules [[41]8]. Bayesian
   network-based methods such as CONEXIC integrated copy number change and
   gene expression data to identify potential driver genes which are
   located in some amplified or deleted regions in tumors [[42]9].

   With the developing of the research of cancer, we have recognized that
   the development and progression of cancer can be promoted by driver
   mutations or gene perturbing signaling, regulatory or metabolic
   pathways [[43]1]. Thus, several methods that use network and pathway to
   understand drivers have been proposed, e.g., MEMo [[44]10] and Dendrix
   [[45]11]. MEMo uses the mutual exclusivity of gene mutations to detect
   mutated subnetworks critical to carcinogenesis [[46]10]. As well as
   Dendrix was designed to identify subnetworks with potential driver
   activity which have high coverage and high mutual exclusivity [[47]11].
   Another method, MUFFINN measures the functional impact of the network
   neighbors of mutated genes, and scores the investigated genes by
   considering the influence of either the most frequently mutated
   neighbor or all direct neighbors [[48]12]. Although the aforementioned
   approaches have achieved great achievements in distinguishing driver
   genes, improving the identification accuracy of driver genes is still a
   challenge.

   In this work, we propose a method named MECoRank to prioritize driver
   genes of single patient sample based on their impact on the expression
   in the molecular interaction. Our method not only assess mutation’s
   effect on gene expression, but also assess the differential
   expression’s effect within a transcriptional network. We first
   construct a bipartite graph model to formulate associations between
   expression and somatic SNVs using protein-protein interaction (PPI)
   network. A bipartite graph is a graph whose vertices can be partitioned
   into two subsets. In our work, vertices on the left partition of the
   bipartite graph correspond to individual gene expression status and
   vertices on the right partition represent individual mutated genes. And
   then an iterative process was used to propagate the effect of somatic
   SNVs and differential expression. After iteration, we will obtain a
   ranked gene list for each patient sample. Finally, we applied the
   Condorcet voting method to determine the most impactful drivers in a
   population. To test the performance of our approach, we analyzed three
   datasets which are breast cancer dataset (BRCA), kidney renal clear
   cell carcinoma (KIRC) and lung squamous cell carcinoma (LUSC). From
   TCGA Data Portal ([49]https://portal.gdc.cancer.gov/), we collected the
   data of somatic SNVs. And from UCSC [[50]13], we obtained gene
   expression data. Experimental results show that our method not only can
   identify more driver genes that have been validated than other methods,
   but also can recognize some impactful novel genes. Although these genes
   are not presented in Cancer Gene Census (CGC), some evidences show that
   these candidate genes have functional roles in cancer or cancer-related
   biological processes.

Methods

Method overview

   The proposed MECoRank method prioritizes driver genes based on the
   impact of their mutations and differential expression on the expression
   in the molecular interaction during a single patient. An overview of
   its workflow is presented in Fig. [51]1. In the following section, we
   first present the bipartite graph model and then give the iterative
   framework. Finally, we introduce Condorcet voting method for rank
   aggregation.

Fig. 1.

   [52]Fig. 1
   [53]Open in a new tab

   A schematic of MECoRank framework. a The data we used including gene
   expression of cancer and normal patients, somatic SNVs and PPI network.
   b The left partition of bipartite graph represents an individual
   patient’s expression set U where the transition matrices W^UU indicates
   the relations in U. The right of the bipartite graph represents the
   patient’s mutation set V where the transition matrices W^VU indicates
   the interactions between U and V. c We can obtain a score matrix in
   which each gene of every patient has a score. d We used the Condorcet
   voting to obtain the final rank of the genes

A bipartite graph model

   In this section, we detail the bipartite graph model used in our work.
   Consider a bipartite graph G = (U ∪ V, E), where U and V represent the
   individual expression and mutation respectively. Each edge in E
   connects a vertex in U and one in V. Let U = {u[1], u[2], …, u[m]} and
   V = {v[1], v[2], …, v[n]} be the two sets of m and n genes. We use u[i]
   to denote the i − th vertex in U, and v[j] to denote the j − th vertex
   in V, where 1 ≤ i ≤ m and 1 ≤ j ≤ n. For each patient, an edge between
   the nodes on the left and right partitions of the graph is drawn if
   u[i]and v[j] are known to interact according to PPI network. We
   constructed the bipartite graph for each patient by the same way. The
   edges between U and V are represented as the transition probability
   W^VU [[54]14]. And the edges within U are represented as W^UU. If there
   is an edge connecting u[i] and v[j] in PPI,
   [MATH: <msubsup><mi>w</mi><mi mathvariant="italic">ij</mi><mi
   mathvariant="italic">vu</mi></msubsup><mo>=</mo><mn>1</mn> :MATH]
   ; otherwise,
   [MATH: <msubsup><mi>w</mi><mi mathvariant="italic">ij</mi><mi
   mathvariant="italic">vu</mi></msubsup><mo>=</mo><mn>0</mn> :MATH]
   . As such, we can describe all edges weights of the graph as a m × n
   matrix
   [MATH: <msup><mi>W</mi><mi
   mathvariant="italic">VU</mi></msup><mo>=</mo><mfenced close="]"
   open="["><msubsup><mi>w</mi><mi mathvariant="italic">ij</mi><mi
   mathvariant="italic">vu</mi></msubsup></mfenced> :MATH]
   . Similarly, we can easily obtain the transition matrix W^UU. For each
   vertex v[j], we denote its degree (sum of connected edges’ weights) as
   d[j], and use a diagonal matrix D[v] to denote the degrees of all
   vertices in V; and similarly, for d[i] and D[u]. Note that in this
   paper, we deal with undirected bipartite graphs.

Iterative framework

   To rank vertices based on the graph structure, seminal algorithms like
   PageRank [[55]15] and HITS [[56]16] have been proposed [[57]17].
   PageRank is an algorithm used by Google Search to rank websites in
   their search engine results [[58]18]. This approach estimates the
   importance score of vertices as the stationary distribution of a random
   walk process – starting from a vertex, the surfer randomly jumps to a
   neighbor according to the edge weight [[59]17]. HITS algorithm is
   similar to PageRank in some aspects. This method assumes each vertex
   has two roles: hub and authority [[60]16]. If a vertex is linked by
   many vertices with hub score, this vertex has high authority and vice
   versa [[61]17]. These two methods are focus on unipartite graphs. Our
   iterative framework references HITS, PageRank and their variants which