Abstract
With the development of various high throughput technologies and
analysis methods, researchers can study different aspects of a
biological phenomenon simultaneously or one aspect repeatedly with
different experimental techniques and analysis methods. The output from
each study is a rank list of components of interest. Aggregation of the
rank lists of components, such as proteins, genes and single nucleotide
variants (SNV), produced by these experiments has been proven to be
helpful in both filtering the noise and bringing forth a more complete
understanding of the biological problems. Current available rank
aggregation methods do not consider the network information that has
been observed to provide vital contributions in many data integration
studies. We developed network tuned rank aggregation methods
incorporating network information and demonstrated its superior
performance over aggregation methods without network information.
The methods are tested on predicting the Gene Ontology function of
yeast proteins. We validate the methods using combinations of three
gene expression data sets and three protein interaction networks as
well as an integrated network by combining the three networks. Results
show that the aggregated rank lists are more meaningful if protein
interaction network is incorporated. Among the methods compared,
CGI_RRA and CGI_Endeavour, which integrate rank lists with networks
using CGI [[31]1] followed by rank aggregation using either robust rank
aggregation (RRA) [[32]2] or Endeavour [[33]3] perform the best.
Finally, we use the methods to locate target genes of transcription
factors.
Keywords: Rank Aggregation, Network, CGI, Gene Rank, Endeavour, RRA
Introduction
As new biotechnologies, such as microarray, genotyping, and next
generation sequencing (NGS), continue to be developed and the cost of
biological experiments decreases, a biological phenomenon can be
studied using different technologies or by different research groups.
Each study can yield a rank list of genes according to the strength of
association of genes with the biological phenomenon of interest. These
studies can include the detection of differentially expressed genes
under certain perturbations, disease status or treatments; association
of single nucleotide variants (SNV), copy number variation (CNV),
methylation, and alternative splicing with phenotypes; or
identification of target genes of transcription factors or other
regulatory molecules including various RNAs. For example, in the study
of associating SNVs to a disease, SNVs can be ordered using the
p-values based on the χ^2 test for equal allele frequency between the
cases and controls. Similarly, in the study of differentially expressed
genes for a disease status, genes can be ranked by comparing the genes'
expression levels between cases and controls. Regulatory targets of a
transcription factor can be ranked according to the differential
expression levels of genes from knock out experiments of the
transcription factor.
In order to have a more complete understanding of the biological
phenomena of interest taking into account the multiple studies rather
than using individual ones, efficient aggregation of the results from
the individual studies is needed. Since these studies can use different
technologies and are generally carried out from different laboratories,
global normalization of the data across these studies can be
challenging. Thus, aggregation of the rank lists of the genes provides
a promising approach for integrating the results from these studies
allowing scientists to better interpret experimental results,
understand the potential mechanism and determine following up
experiments [[34]4-[35]6]. The inputs of rank aggregation are gene (or
SNP, domain, protein, pathway) rank lists and the output is a combined
rank list that is anticipated to be more meaningful than any single
rank list.
Many methods have been developed for rank list aggregation using the
same type of data such as gene expression profiles, which we refer as
horizontal data integration. Manor et al. [[36]7] proposed a method to
rank the risk of SNVs causing disease, in which the data is re-sampled
multiple times and a SNV ranking is generated for each such sample,
with the final SNV ranking being an aggregation of rankings from all
the samples. Aerts et al. [[37]3] integrated disease gene rank lists
from multiple data sources using order statistics [[38]8]. Alder et al.
[[39]9] used rank aggregation method to merge information from
different microarray data sets into a single global rank list to show
the expression level similarity. Kolde et al. [[40]2] developed a new
rank aggregation method intended to improve the order statistic from
[[41]3]. Jiang et al. [[42]10] used a similar rank aggregation method
as in [[43]3] to integrate multiple information sources to rank causal
genes for livestock diseases. Sun et al. [[44]11] developed a software
package for integrating rank lists from pathway enrichment analysis of
different data sets, such as transcriptomics, proteomics, metabolomics
and genome wide association studies (GWAS). Trepper et al. [[45]12]
found genes showing mRNA-level response to changes in DAF-16 activity
for further experimental validation by integrating publicly available
data from genome wide association studies with a rank integration
method. Lin [[46]13] reviewed the available rank aggregation methods at
that time and classified them into three categories. The first category
is based on Thurstone's model [[47]14] that is originally designed for
marketing and advertisement research. The second category is based on
heuristic algorithms including Borda's method [[48]15] and Markov chain
based methods [[49]16]. The third category is based on stochastic
optimization minimizing the average distance between the input rank
lists and the aggregated rank list [[50]17]. In a more recent study,
Deng et al. [[51]18] developed a Bayesian approach for rank aggregation
that can estimate the reliability of different rank lists and use the
reliability to weight the different lists.
Another type of integration is to combine different types of data such
as gene expression profiles and molecular networks, which we refer as
vertical data integration. Realizing that genes that are in the same
pathway or close together in networks are more likely to perform
similar functions and thus have similar ranks, investigators have
developed a variety of different methods to integrate different types
of data such as gene expression and networks [[52]19-[53]21] or
association results of genes with molecular networks [[54]22-[55]24].
Wang et al. [[56]21] proposed a network guided gene ranking method by
firstly utilizing networks to find coordinative components representing
the underlying biological processes or pathways and then projecting
gene expression data onto the coordinative components to estimate the
association strengths of genes to a biological phenomenon. These
association strengths are then used to rank the genes. Lavi et al.
[[57]20] introduced a kernel based on the network topology and then
used support vector machine to detect disease biomarkers from gene
expression data. Garcia et al. [[58]19] proposed an approach to find
the sub-network component associated with extreme values of a list of
genes. Hofree et al. [[59]22] developed a method to stratify the
different cancer types into informative subtypes by clustering together
patients with mutations in similar network regions based on integration
of somatic mutation profiles and networks. Novarino et al. [[60]24]
studied the genetic basis of hereditary spastic paraplegias (HSP) by
using whole-exome sequencing in combination with network analysis. They
identified 18 previously unknown putative HSP genes and generated a
host of other candidate genes for future study. Jia and Zhao [[61]23]
put forward a cancer driver gene prediction method by firstly
identifying the significantly mutated genes with generalized additive
models based on sample-specific mutation profiles and then collected
the novel interaction neighbors of the mutant genes in the network with
the help of random walk with restart. Mutant genes and the
corresponding interaction neighbors for each sample are heuristically
integrated to present the final prediction result.
Despite the many studies on either horizontal or vertical data
integration, no studies are available to consider both multiple rank
lists and networks simultaneously. In this paper, we study whether we
can improve the performance of rank list aggregation by incorporating
network information. Our basic assumption is that genes that are close
together in networks have similar ranks in the list. Therefore, we
firstly tune each given rank list with network and then aggregate the
updated rank lists. "Combining Gene expression with Interaction (CGI)"
[[62]1] and "GeneRank (GR)" [[63]25] are two widely recognized methods
for integrating network information with gene expression data. We
modify these methods so that they can be applicable to rank lists.
Endeavour [[64]3] and RRA [[65]2] are then used to aggregate the rank
lists leveraged by the network. We study the effectiveness of
incorporating networks in rank list aggregation by comparing the
performance of all the combinations, CGI_Endeavour, CGI_RRA,
GR_Endeavour, and GR_RRA using three yeast gene expression data sets
and three protein interaction networks as well as the integrated
network. We found that by incorporating network information, the
performance of rank list aggregation is improved significantly in most
cases. Among these methods, CGI_RRA and CGI_Endeavour perform the best.
Focusing on CGI_RRA and CGI_Endeavour, we show that the improvement
really originates from the network information by implementing the
methods on network with shuffled labels. By running the methods on
networks with noisy interactions, we show the performance of CGI_RRA
and CGI_Endeavour decreases as the noise level increases as expected,
but is relatively robust to noise when the noise level is as high as
40%. At last, we show the applicability of our methods on predicting
pathway members.
Materials and methods
In this section, we present methods to integrate multiple rank lists
with network information for gene prioritization. For each rank list,
we first modify two methods, Combining Gene expression and protein
Interaction (CGI) [[66]1] and GeneRank (GR) [[67]25], for improving the
gene rank by integrating the original rank list with the protein
interaction network. We then aggregate the updated rank lists using
Endeavour [[68]3] and Robust Rank Aggregation (RRA) [[69]2].
Assuming that there are m rank lists of n genes, r[j ]= (r[1,j],
r[2,j],⋯,r[n],[j])', j = 1,⋯, m, where r[i,j ]is the rank of the i-th
gene in the j-th list. First, each rank list is normalized into rank
ratio list, ra[i,j ]= r[i,j]/[n], i = 1,⋯, n. Second, rank ratio lists
are transformed to score lists based on the inverse of standard normal
distribution function.
[MATH: F(x)= ∫
-∞xe-s<
mrow>22ds,zi,j=-F-1(rai,j). :MATH]
(1)
If ra[i,j ]= 1, it is substituted by ra[i,j ]= 0.9999 so that the value
of z[i,j ]is well defined. The value of z[i,j ]ranges in (-∞, +∞) and
high value of z[i,j ]corresponds to high rank. This transformation
makes the distribution of z[i,j ]for fixed j to be standard normal so
that the integration with the network is more stable.
In addition to a set of rank lists, we also assume that there is a
protein interaction network. Let H denote the adjacency matrix of the
input network where the nodes indicate proteins and the edges indicate
interactions. If two nodes u and v are connected in the network, H[u,v
]= 1; Otherwise, H[u,v ]= 0.
Update rank lists using CGI
Ma et al. [[70]1] developed a method, CGI, to integrate gene expression
profiles with protein interaction network for gene prioritization. The
basic idea is that if a gene and most of its neighbors are associated
with a phenotype of interest, the gene is more likely to be true causal
genes for the phenotype compared to another gene with non-associated
neighbor genes. Therefore, Ma et al. [[71]1] proposed to use the
weighted average association strength of the gene and its neighbors
with the phenotype to measure the likelihood that the gene is related
to the phenotype. Close neighbors are weighted more heavily than
distant neighbors. One key question is how to define closeness between
nodes in the network. The authors showed that diffusion kernel defined
in [[72]26] performed well compared to direct neighbors or shortest
distance on networks for gene prioritization.
In this study, we extend CGI so that it can be applied to rank based
data. CGI first needs to define a similarity measure between two genes
based on the protein interaction network. Many different similarity
measures can be defined including direct neighbors, shortest distance,
and diffusion kernels as reviewed recently in [[73]27]. Previous
studies have shown the superiority of diffusion kernel similarity
measure over direct neighbor and shortest distance similarity measures.
Therefore, we concentrate on the use of diffusion kernel in this study.
The diffusion kernel is defined as S = e^-τL, where r is a tuning
parameter and L = D - H is the Laplacian matrix of H, where D is a
diagonal matrix with the diagonal elements containing the node degrees.
The elements in score matrix S is normalized as
[MATH: Ku,v=Su,v/Su,uS<
/mi>v,v
:MATH]
. The diffusion kernel represents a global similarity between nodes in
a network, with higher values representing closer relationship. CGI
corresponding to CGI[3 ]in [[74]1] is defined as
[MATH: Ri,j=z<
mrow>i,j+λ∑<
mrow>l=1,l≠in|zl,j|Kl,i1
+λ∑<
mrow>l=1,l≠inKl,i,i=1,2,⋯,n, :MATH]
(2)
where λ is a tuning parameter that controls the contribution of the
neighbor genes and z[i,j ]is defined in equation 1. The updated value
of R[i,j ]indicates the strength of association of gene i with the
phenotype in the j-th list. Large value of R[i,j ]corresponds to high
rank.
Update rank lists using GR
GeneRank (GR) [[75]25] is a widely used method for gene prioritization
by integrating gene expression profiles with a network. GR is based on
similar ideas as PageRank originally designed for ranking web pages and
was extended to gene expression analysis in [[76]25]. Let c[i ]be the
absolute value of the correlation coefficient between the i-th gene
expression levels with the phenotype,
[MATH:
ri(0)=ci/∑i<
mrow>ci :MATH]
, and
[MATH: r(0)=(r1(0),r
2(0),⋯,r
n(0))′ :MATH]
. The ranking of the genes is based on the solution to
[MATH: (I-dHD-1)X=(I-d)r(0). :MATH]
Here we adapt the method for integrating gene ranks with a network. For
the above transformed z score, we further transform it following the
procedures for GR. For the i-th element in the j-th rank list, let
[MATH: Ri,j(0)=zi,j/||zj||1
:MATH]
, where
[MATH: ||zj||1=∑i|zi,j| :MATH]
. For the j-th rank list, let
[MATH:
Rj(k)=(1-d)zj+dHD<
/mrow>-1
Rj(k-1),k=1,2,⋯, :MATH]
where d is a tuning parameter, z[j ]is a column vector with the i-th
element as z[i,j ]and
[MATH:
Rj(k) :MATH]
is a column vector with the i-th element as
[MATH:
Rij
(k) :MATH]
. If d is small, more weight is assigned to the original score. While
if d is large, more weight is given to the neighboring genes and
network information will play a more important role. When
[MATH:
Rj(k) :MATH]
finally stabilized (i.e
[MATH: maxi|Rij(k)-Ri
j(k-1)|<1×10-6 :MATH]
) or the number of iterations reaches a large number (1000 in this
paper), the iteration stops. The limit of
[MATH:
Rj(k) :MATH]
as k tends to infinity denoted as R[j ]satisfies the equation
[MATH: (I-dHD<
/mrow>-1)Rj=(1-d)zj,j=1,2,⋯,m. :MATH]
We use the component values of R[j ]to rank the genes with larger
values ranked higher for the j-th list.
Aggregate rank lists using Endeavour
The component scores of R[j ]from CGI and GR can be used to re-rank the
genes. A new set of rank lists
[MATH:
rjn
ew,j=1,⋯,m :MATH]
, incorporating network information are produced, which are then
aggregated in the second step. Let
[MATH: ri,jnew :MATH]
be the rank of the i-th gene in the j-th list. Stuart et al. [[77]8]
proposed a rank list aggregation statistic calculated from all rank
ratios using the joint cumulative distribution of a multidimensional
order statistic. Assuming
[MATH: r(i)ne<
mi>w={ri,1new,⋯,r
i,mnew} :MATH]
is ordered rank ratio vector for gene i across m rank lists in
ascending order. Let
[MATH: Q(r(i)ne<
mi>w)=m! ∫
0ri,1new∫
s1<
/msub>ri,2new⋯ ∫
sm-1ri,mbewdsm⋯ds2
ds1
mn>. :MATH]
Aerts et al. [[78]3] developed an efficient method to calculate the
statistic using the iteration
[MATH:
Vk=∑l
=1k(-1)l-1
Vk-ll<
/mi>!r
mrow>i,m-k+1ne
w,Q(r(i)ne<
mi>w)=m!Vm
mi>,V0=1 :MATH]
. This iteration method is used in our paper to calculate
[MATH: Q(r(i)ne<
mi>w) :MATH]
, which are then used to rank genes with smaller values of
[MATH: Q(r(i)ne<
mi>w) :MATH]
ranked higher.
Aggregate rank lists using RRA
Kolde et al. [[79]2] proposed another rank aggregation method that
assumes the null distribution of all the rank ratios as uniform on the
unit interval. The same rank ratio vector
[MATH: r(i)ne<
mi>w={ri,1new,⋯,r
i,mnew} :MATH]
for the i-th gene is taken as input. For the j-th rank, let
[MATH: βj,m(r(i)ne<
mi>w) :MATH]
denote the probability of
[MATH: r
^j≤ri
,jnew :MATH]
under the null distribution and
[MATH: βj,m(r(i)ne<
mi>w)=∑l
=jmCml(ri,jnew)l(1-ri
,jnew)m-l :MATH]
Since some rank lists may contain large variation from the true rank,
Kolde et al. [[80]2] proposed to use the minimum value of
[MATH: βj,m(r(i)ne<
mi>w) :MATH]
over j = 1, 2,⋯,m, that is,
[MATH: ρ(r(i)ne<
mi>w)=min{β<
mi>j,m(r(i)ne<
mi>w),j=1,⋯,m} :MATH]
to rank the genes.
The values of
[MATH: Q(r(i)ne<
mi>w) :MATH]
and
[MATH: ρ(r(i)ne<
mi>w) :MATH]
are then used to aggregate the updated rank lists using Endeavour and
RRA, respectively. Combining the two network incorporation procedures
and two rank aggregation procedures together, we have four methods:
CGI_Endeavour, CGI_RRA, GR_Endeavour, and GRI_RRA. We are interested in
knowing which combinations of network rank list updating methods (CGI
or GR) and rank list aggregation methods (Endeavour or RRA) yield the
most biologically meaningful rank.
Set the parameters for CGI and GR
The tuning parameters (τ,λ) for CGI and d for GR need to be set before
implementing the methods. To do so, we need a set of genes, referred as
training genes, known to be associated with a trait of interest. A good
ranking method should put the training genes on the top. Therefore, for
fixed parameters in CGI or GR, we use Wilcox rank sum statistic to test
the hypothesis that the training genes are ranked higher. The ranks of
these training genes are compared to that of the other genes and a
p-value is obtained. Lower p-value corresponds to better performance of
a rank aggregation method. Therefore, we choose the parameters that
yield the lowest p-value.
Data: gene expression profiles and protein interactions
Three gene expression data sets are used in this study: yeast
compendium knockout (Compendium) [[81]28], cell cycle (Cycle)
[[82]29,[83]30] and stress response (Stress) [[84]31]. The Compendium
data consists of yeast gene expression profiles across 300 conditions.
The cell cycle data includes expression profiles of yeast genes at 17
time points across five cell cycles. The stress response data contains
expression profiles across 173 diverse environmental conditions such as
temperature shocks, amino acid starvation and nitrogen source
depletion.
To study the effects of different networks, three yeast protein
interaction networks from BioGRID [[85]32], DIP [[86]33] and MIPS
[[87]34] are used. The BioGRID (3.2.101 version) network contains 6,226
proteins and 220,823 interactions. The DIP [[88]33] network (20130707
version) consists of 4,767 proteins and 22,162 interactions. The MIPS
[[89]34] network (18052006 version) includes 4,554 proteins and 12,526
interactions. It can be seen that the BioGRID network contains many
more genes and interactions than either DIP or MIPS. In addition, we
also form an integrated network by combing all the interactions from
the three networks which contains 6,567 proteins and 229,773
interactions.
Evaluation criteria for the different rank aggregation methods
We study the effectiveness of integrating networks with rank lists
using C-GLEndeavour, CGI_RRA, GR_Endeavour, and GR_RRA based on yeast
gene expression and protein interactions and compare their performance
with the corresponding Endeavour and RRA without integrating network
information. In addition, we also study the performance of first
integrating the rank lists by either Endeavour or RRA and then
integrating the resulting list with the network by CGI or GR. These
methods are denoted as Endeavour_CGI, Endeavour_GR, RRA_CGI, and
RRA_GR, respectively. We conceive a study of genes having the same gene
ontology (GO) as a set of transcription factors (TF) having the same GO
term. For a GO term, we first collect all the TFs having the term and
take all the genes having the same GO term as the TFs as the golden
positive set. Genes with highly correlated expression profiles with the
TF are more likely to have the same GO function as the TF. Therefore,
we rank the genes according to their co-expression level with the TF
with high absolute correlation ranked high on the list. For each TF
with a given GO term, a rank list is obtained and thus, a set of rank
lists is obtained for each GO term. A total of 269 TFs from [[90]35]
are used in this study.
Given the gene expression, protein-protein interaction network and
known associations between genes and GO terms, we evaluate whether the
network tuned rank aggregation methods (CGI_Endeavour, CGI_RRA,
GR_Endeavour and GR_RRA) perform better than the original aggregation
methods (Endeavour and GR). We adopt a large scale 10-fold
cross-validation for the comparison. More specifically, we randomly
divide the known genes of a GO term into 10 subsets with roughly equal
number of genes. Therefore, each subset contains about 10% of known
genes having the GO term. In each run, we assume that one subset of
genes is already known and use them as training set to select the
parameters for CGI and GR. The remaining 9 subsets of genes are used as
validation set and ranked against the other genes based on the selected
parameters. Repeating this procedure for each subset, we obtain 10 rank
lists and 10 p-values by comparing the ranks of the validation set of
genes with the remaining genes using Wilcox rank sum test for each GO
term. We repeat this process 10 times to obtain a total of 100 rank
lists and p-values for the combination of each GO term and each method.
Lower p-values indicate better performance of an ranking method.
We also study the effect of incompleteness, false positive and false
negative interactions in networks on the performance of the integration
methods by permuting some of the interactions in the network.
Results
We mainly present our results with the combination of the Compendium
gene expression data and BioGRID interaction data in the main text. The
results based on other expression and protein interaction data are
presented in the Additional Files. We download the TFs for each GO term
and the known associations between protein and GO terms from GO slim in
the Saccharomyces Genome Database [[91]36]. We retain the GO terms with
at least 30 known genes and 5 TFs so that appropriate statistical
analysis can be carried out. The GO terms with more than 40 TFs are
filtered out to control the computation time. We only keep the GO terms
indicating significant result (p-value < 1 × 10^-4) with the original
rank aggregation method by either Endeavour or RRA. As a result, a
total of 18 GO terms are retained.
Comparison of network tuned rank aggregation methods with the original rank
aggregation methods of Endeavour and RRA
We first compare the performance of Endeavour and GR using gene
expression data only. We then compare the performance of CGI_Endeavour,
CGI_RRA, GR_Endeavour, and GR_RRA, together with Endeavour and RRA,
using the evaluation procedures described in subsection 2.7.
Figure [92]1 shows the relative performance of Endeavour and RRA based
on the three different expression data sets. Each point corresponds to
- log[10](p - value) of a GO term from one of 10 rounds of 10-fold
cross validation experiments. It can be seen that Endeavour outperforms
RRA in most situations across all three expression data sets as the
value of - log[10](p - value) from Endeavour is generally larger than
that from RRA.
Figure 1.
Figure 1
[93]Open in a new tab
Comparison of the performance of Endeavour and RRA based on the
Compendium, Stress and Cell Cycle expression data sets. The x-axis
indicates - log[10](p-value) from Endeavour and the y-axis is -
log[10](p-value) from RRA. The - log[10](p-value) from Endeavour is
generally larger than that from RRA indicating slightly better
performance of Endeavour over RRA based on these data sets.
Figure [94]2 shows the average - log[10](p - value) together with the
standard errors of the four integrative approaches, CGI_Endeavour,
CGI_RRA, GR_Endeavour and GR_RRA, by incorporating network information
with CGI and GR based on the combination of the compendium expression
data and the BioGRID protein interactions. For comparison, the results
for Endeavour and RRA are also shown. We have the following
observations. First, except for GO term 7 (invasive growth in response
to glucose limitation), the average - log(p - value) based on both
C-GLEndeavour and GR_Endeavour are higher than that for Endeavour,
indicating the usefulness of incorporating network information for rank
aggregation. Second, the improvement of CGI_Endeavour over Endeavour is
much higher than that for GR_Endeavour indicating that CGI makes more
effective use of network than GR integrating rank lists with the
network. Third, the relative performance of RRA, CGI_RRA, and GR_RRA
are similar to that among Endeavour, CGI_Endeavour, and GR_Endeavour.
Fourth, the performances of CGI_Endeavour and CGI_RRA are similar
across the 18 GO categories under study.
Figure 2.
Figure 2
[95]Open in a new tab
The average - log[10](p - value) together with standard errors of six
rank aggregation methods: Endeavour, CGI_Endeavour, GR_Endeavour, RRA,
CGI_RRA, and GR_RRA based on the Compendium expression data and BioGRID
interactions. The x-axis shows the selected 18 GO function terms and
the y-axis indicates - log[10](p - value) of the different methods.
Higher value corresponds to better performance.
Figures S1-S8 in the Additional File show the corresponding results
with other combinations of gene expression data and protein interaction
networks. The relative performance of the six rank aggregation methods
are the same as presented above. However, it is noted that the average
- log(p - value) based on the DIP and MIPS protein interaction data
sets are generally smaller than that based on the BioGRID network. The
results are most likely due to the relative sparse yeast networks in
DIP and MIPS compared to BioGRID. Thus, for the purpose of rank
aggregation, we recommend the use of the BioGRID interaction data set.
We will mainly focus on CGI_Endeavour and CGI_RRA in the rest of this
paper. The average selected parameters for CGI_Endeavour and CGI_RRA
are presented in Tables [96]1 and [97]2 in Additional File. We will use
the same parameters in the following analysis.
Table 1.
The improvement by incorporating network using CGI_Endeavour and
CGI_RRA is significantly correlated with the number of training genes.
CGI_Endeavour CGI_RRA
Expression Network correlation p-value correlation p-value
__________________________________________________________________
Compendium BioGRID
DIP
MIPS 0.6614(0.0943)
0.722(0.0796)
0.4212(0.1339) 0.0041(0.0065)
0.0015(0.0040)
0.0668(0.0830) 0.6581(0.0920)
0.7485(0.0855)
0.5232(0.0995) 0.0041(0.0072)
0.0013(0.0046)
0.0217(0.0271)
Stress BioGRID
DIP
MIPS 0.6313(0.0865)
0.6636(0.13300)
0.4620(0.1000) 0.0060(0.0119)
0.0084(0.0212)
0.0387(0.0397) 0.6387(0.0874)
0.7060(0.1333)
0.3961(0.1160) 0.0055(0.0114)
0.0062(0.0162)
0.0725(0.0739)
Cycle BioGRID
DIP
MIPS 0.7776(0.0894)
0.6497(0.1376)
0.5997(0.0823) 0.0007(0.0019)
0.0104(0.0248)
0.0082(0.0131) 0.6585(0.1071)
0.5645(0.1482)
0.5414(0.0890) 0.0063(0.0151)
0.0227(0.0342)
0.0170(0.0233)
[98]Open in a new tab
The table shows the average Spearman correlation between log-p-fold and
the number of training genes together with the standard deviation (3rd
and 5th columns) and the mean p-value and its standard deviation (4th
and 6th columns) using all combinations of gene expression and
interaction data sets.
Elements in the parenthesis are standard deviation.
Table 2.
Computational time of implementing CGI_Endeavour, CGI_RRA, GR_Endeavour
and GR_RRA with the Compendium expression data and the BioGRID
interaction network for the GO term of ATPase activity.
Method Time
CGI_Endeavour 1:11:31
__________________________________________________________________
CGI_RRA 43:14
__________________________________________________________________
GR_Endeavour 6:54:33
__________________________________________________________________
GR_RRA 6:53:07
[99]Open in a new tab
The effect of the number of the training genes on the performance of rank
aggregation methods
In our studies, we determine the parameters using training genes so
that they are most likely to be ranked on the top. It is expected that
the improvement of our methods using networks over the ones without
networks increases with the number of training genes. We use the
log-ratio of the p-value from the method without using the network over
the p-value from the method integrating the network to evaluate the
improvement of our integrative method and denote this quantity as
"log-p-fold". The higher this value is, the more effective of using the
network information becomes. Positive value of log-p-fold indicates
improvement of integrating networks. Therefore, for a given method of
combining express data and network, we calculate the Spearman
correlation between the log-p-fold values and the number of training
genes across the 18 GO terms for each of the 10 rounds of 10-fold cross
validation experiments. The mean and standard deviation of these
Spearman correlations are calculated. The results for all combinations
of integration method, expression data and network are given in Table
[100]1. All the correlations are positive indicating that the
improvement of using networks increases with the number of training
genes for CGI_Endeavour and CGI_RRA. All mean p-values calculated based
on the alternative hypothesis that the correlation is positive
indicates that the correlations are all significantly higher than zero.
From the above analysis, we see that the standard deviations calculated
across 10 rounds of the 10-fold cross validation experiments are small.
Considering the huge time consuming burden for implementing 10 rounds
of 10-fold experiments, we present the following results with only one
fixed classification of training gene set and validation gene set.
The contribution of network to the network tuned rank aggregation methods
We study if the observed improvement of rank aggregation methods
incorporating networks was due to the added biological information in
the protein interaction network. Following the procedures implemented
in [[101]20], we randomly permute gene names in the network, and use
the rank lists together with the permuted network for incorporating
network information and aggregating the updated rank lists. The
randomized networks preserve the topology of the original network, but
dissociate any correlation between the network and the rank lists.
Figure [102]3 shows the relationship between the - log[10](p - value)
using the randomized network based on BioGRID and the Compendium
expression data using CGI_Endeavour and CGI_RRA and that without using
the network. It can be seen that the - log[10](p - value) using the
randomized network is smaller than that without the network, indicating
that the randomized network can decrease the performance if the network
is random as expected. The results also show that the observed improved
performance using true networks is due to the contributions of the
molecular network. We also carry out the same analysis using other gene
expression and protein interaction data sets and the results are given
in Additional File (Figures S9-S16).
Figure 3.
Figure 3
[103]Open in a new tab
The effects of randomized gene labels on the performance of
CGI_Endeavour and CGI_RRA based on the Compendium expression and the
BioGRID network. The x-axis is - log[10](p - value) from the original
method. The y-axis is - log[10](p - value) from the methods
incorporating network information.
The effects of false interactions in PPI on the performance of network tuned
rank aggregation methods
It is well known that most current available PPI networks are
incomplete and consist of many false interactions. We investigate how
false interactions in PPI networks affect the performance of network
tuned rank aggregation methods. Therefore, we study CGI_Endeavour and
CGI_RRA based on networks with different levels of false interactions.
As in [[104]1], we randomly select α% of the edges and replaced them
with randomly selected protein pairs with no interactions in the
original data set. Then we implement CGI_Endeavour and CGI_RRA based on
the new network and input rank lists. We perform this procedure for
noisy level of 20% and 40% of the total edges for each network. Figure
[105]4 shows the log-p-fold for different GO terms based on
CGI_Endeavour and CGI_RRA using the noisy network generated from
BioGRID and the compendium expression data. For most of GO function
terms, the performance of CGI_Endeavour and CGI_RRA decreases as the
noise level increases as expected. Even at 40% noise level, their
performance is still better than not using the networks. The results
based on other combinations of noisy networks and expression data sets
are presented in the Additional File as Figures S17-S24.
Figure 4.
Figure 4
[106]Open in a new tab
The effect of noise in PPI network on the performance of CG_Eendeavour
and CGI_RRA based on the Compendium expression and the BioGRID network.
The x-axis shows the selected 18 GO function terms and the y-axis
indicates - log[10](p - value) from the different methods. Higher value
corresponds to better performance.
Performance of CGI_Endeavour, GFLEndeavour, CGI_RRA and GR_RRA using the
integrated network
The performances of our integration methods based on the BioGRID
network are better than that based on the DIP and the MIPS networks. We
wonder if the integrated network by combining all the interactions from
these three networks can further improve the performance of our
integration approaches. Therefore, we merge the three networks together
and run our methods based on the integrated network. Figure [107]5
shows the performance of a) CGI_Endeavour, b) GR_Endeavour, c) CGI_RRA
and d) GR_RRA based on the integrated, BioGRID, DIP and MIPS networks
and the Compendium expression data. We can see that the performances of
the integration method based on the integrated network are better than
that using any individual network.
Figure 5.
Figure 5
[108]Open in a new tab
Comparison of the performance of the integration methods based on the
integrated and the individual networks. The four sub-figures (from top
to bottom) indicate the performance of CGI_Endeavour, GR_Endeavour,
CGI_RRA & GR_RRA based on the integrated, BioGRID, DIP and MIPS
interaction networks and the Compendium gene expression data.
An alternative strategy of firstly aggregating rank lists and then
integrating network
In the above studies, we first integrate each individual rank list with
the network and then aggregate the updated rank lists. A natural
alternative approach is to first aggregate rank lists and then update
the integrated rank list with the network. Corresponding to our four
methods, we name the four methods following the alternative approach as
Endeavour_CGI, Endeavour_GR, RRA_CGI and RRA_GR. Based on the BioGRID
network and the Compendium expression data, we compare the performance
of these four methods with the approaches studied above. Figure [109]6
shows the relative performances of the Endeavour based methods (upper
panel) and the RRA based methods (lower panel), respectively. It shows
that CGI_Endeavour and CGI_RRA are the best performers in each class,
respectively. While CGI_Endeavour and CGI_RRA perform similarly across
all the 18 GO categories.
Figure 6.
Figure 6
[110]Open in a new tab
Comparison of the performance of CGI_Endeavour, GR_Endeavour, CGI_RRA
and GR_RRA with the corresponding methods first integrating the rank
lists followed with network update. The BioGRID network and the
Compendium gene expression data are used.
Computational time
The computational time depends on the number of individual rank lists
to be integrated. We compare the computational time of CGI_Endeavour,
GR_Endeavour, CGI_RRA and GR_RRA for a specific GO function: ATPase
activity, based on the BioGRID network and the Compendium expression
data. There are 5 known transcription factors for this function
indicating 5 individual rank lists to be aggregated. Table [111]2 shows
the time spent by these four methods using a computer of CPU speed of
2.3 GHz and 16 Gb memory. CGI_Endeavour and CGI_RRA are more time
efficient than GR_Endeavour and GR_RRA. RRA related methods are more
time efficient than Endeavour related methods, consistent with the
conclusion in [[112]2]. We do not include the diffusion kernel matrix
computation time in the time of CGI related method. Because once we
have got the diffusion kernel matrix from the network, we can use it
repeatedly for network integration of different rank lists. Whereas
GeneRank need to be rerun each time for each network integration.
Therefore, CGI related methods are more time efficient.
Case studies: identification of pathway genes from knock-out experiments
Kolde et al. [[113]2] validated their RRA method by predicting the
pathway members with knock-out data. Genes with large absolute
differential expression levels before and after a TF is knocked out are
ranked high. Rank lists of genes whose expression level were most
affected by knockout of a TF are directly extracted from [[114]37].
For a specific GO function term (interchangeable with pathway in their
paper), Kolde et al. [[115]2] identified the known TFs and collected
the corresponding rank lists. They showed that the integrated rank list
resulted from RRA is better than any single rank list corresponding to
a TF. We validate our method by comparing the results from direct
aggregation using RRA or Endeavour and aggregation after incorporating
the BioGRD network information using 10% of the genes having the same
GO term as training set. We select the same 18 GO terms as above.
Figure [116]7 shows - log[10](p - value) for the Endeavour,
CGI_Endeavour, RRA, and CGI_RRA based on BioGRID. It can be seen that
CGI_Endeavour and CGI_RRA perform much better than Endeavour and RRA,
respectively. The results using MIPS and DIP interaction networks are
given in Additional File (Figures S25-S26).
Figure 7.
Figure 7
[117]Open in a new tab
Comparison of the performance of Endeavour, CGI_Endeavour, RRA and
CGI_RRA on predicting pathway members based on TF knockout data from
[[118]37]and the BioGRID interactions. The x-axis shows the selected 18
GO function terms and the y-axis indicates - log[10](p - value) from
the different methods. Higher value corresponds to better performance.
Discussion and conclusion
Gene prioritization continues to play important roles in the
identification of genes responsible for complex traits, finding targets
of gene regulators such as TFs, microRNAs or long non-coding RNAs, and
construction of gene regulation networks. Many methods have been
developed to integrate multiple rank lists from similar types of
studies such as gene expression profiles or one gene expression study
with a network. However, integrating multiple rank lists with one or
more networks for gene prioritization is understudied. Here we propose
to first update each rank list with a network using similar ideas as in
CGI or GR and then integrate the updated gene rank lists using
Endeavour or RRA. Using three yeast gene expression data sets and three
protein interaction networks, we study the effectiveness of the
approaches for integrating multiple lists with networks. We show that
integrating multiple rank lists with a PPI network can indeed improve
the performance of rank list aggregation over corresponding methods
without using network information. As expected, the extent of
improvement depends on the network used. It is shown that among the
three yeast interaction networks studied in this paper, the BioGRID
network achieves the largest improvement compared to DIP and MIPS, most
likely due to the large size of the BioGRID network. It is also shown
that the combination of CGI with Endeavour or RRA generally outperforms
GR combined with Endeavour or RRA. On the other hand, the performance
of CGI_Endeavour is similar to CGI_RRA. The conclusions hold for all
the three gene expression data sets studied indicating the generality
of these results.
We then show that the improvement of CGI_Endeavour or CGI_RRA
incorporating network information over Endeavour or RRA, respectively,
is indeed due to the incorporation of network. If the gene labels are
randomized while keeping the network structure unchanged, it is shown
that the performance of CGI_Endeavour or CGI_RRA would be worse than
Endeavour or RRA, respectively. Finally, we show that the performance
of CGI_Endeavour or CGI_RRA is relatively stable with many false
positive and false negative interactions.
Our methods are validated based on rank lists obtained according to the
gene co-expression level with the TF for a given GO term. Actually,
these methods are applicable to rank lists acquired from other methods.
Our study has several limitations. First, we use yeast gene expression
and protein interaction data for our study because the interaction data
is relatively rich compared to other organisms such as human. It is not
clear whether the conclusions derived from this study are valid to many
other organisms such as human where our knowledge of the interaction
network is much less than for yeast. On the other hand, more and more
interaction data will be generated for different organisms and we
expect that our results will be applicable as more interaction data
become available. Second, the PPI networks including BioGRID, DIP and
MIPS we study in this paper are static and do not indicate where and
when the interactions occur. In reality, protein interactions are
dynamic depending on particular cellular locations, tissues, and
environment. For certain protein properties, if the conditions or
tissues for the properties to show up are known, we may restrict to
such networks under these conditions instead of the whole static
network. Third, we determine the parameters in CGI or GR based on a set
of known genes associated with a trait of interest. It is not clear how
to best set the parameters when training genes are not available.
Finally, we only consider one network in this study. Multiple networks
are usually available and it is not clear whether one should first
derive one comprehensive network and then use the approaches developed
here or alternatively, design new approaches to simultaneously
integrate multiple networks. These are problems for further studies.
In conclusion, integrating multiple rank lists together with network
information using CGLEndeavor or CGI_RRA can be effectively used for
gene prioritization. The extent of improvement over corresponding
approaches without using network information depends on the
completeness and accuracy of the network. Even for relative sparse
networks containing a significant fraction of false positives,
CGI_Endeavour and CGI_RRA still outperform their corresponding
counterparts, Endeavour and RRA, respectively.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
FS conceived and designed the study. WW carried out the experiments,
analyzed the results, and drafted the paper. XZ and ZL helped with the
experimental design and interpretation of the results. All authors
participated in finalizing the paper, read the manuscript and approved
the final version.
Supplementary Material
Additional file 1
Supplementary.pdf includes all the supplementary figures and tables.
[119]Click here for file^ (457.2KB, pdf)
Acknowledgements