Abstract
Background
Big data is becoming ubiquitous in biology, and poses significant
challenges in data analysis and interpretation. RNAi screening has
become a workhorse of functional genomics, and has been applied, for
example, to identify host factors involved in infection for a panel of
different viruses. However, the analysis of data resulting from such
screens is difficult, with often low overlap between hit lists, even
when comparing screens targeting the same virus. This makes it a major
challenge to select interesting candidates for further detailed,
mechanistic experimental characterization.
Results
To address this problem we propose an integrative bioinformatics
pipeline that allows for a network based meta-analysis of viral
high-throughput RNAi screens. Initially, we collate a human protein
interaction network from various public repositories, which is then
subjected to unsupervised clustering to determine functional modules.
Modules that are significantly enriched with host dependency factors
(HDFs) and/or host restriction factors (HRFs) are then filtered based
on network topology and semantic similarity measures. Modules passing
all these criteria are finally interpreted for their biological
significance using enrichment analysis, and interesting candidate genes
can be selected from the modules.
Conclusions
We apply our approach to seven screens targeting three different
viruses, and compare results with other published meta-analyses of
viral RNAi screens. We recover key hit genes, and identify additional
candidates from the screens. While we demonstrate the application of
the approach using viral RNAi data, the method is generally applicable
to identify underlying mechanisms from hit lists derived from
high-throughput experimental data, and to select a small number of most
promising genes for further mechanistic studies.
Electronic supplementary material
The online version of this article (doi:10.1186/s13015-015-0035-7)
contains supplementary material, which is available to authorized
users.
Keywords: Network analysis, RNAi screening, Virus-host interactions
Background
RNA interference (RNAi) has become an important workhorse of functional
genomics, and genome-wide RNAi screens have been employed for example
to identify genes involved in cell growth and viability, proliferation,
differentiation, signaling or trafficking [[27]1-[28]9]. The technology
has furthermore accelerated the discovery of novel host dependency
factors (HDF) and host restriction factors (HRF) in viral infection
[[29]10-[30]19]. However, while RNAi is a very powerful tool to
identify genes involved in a specific biological process, the placement
of hits in their functional and spatiotemporal context in the
underlying molecular processes remains a major challenge
[[31]20,[32]21]. The interpretation of RNAi data in particular for
virus screens is complicated further by the observed low overlap
between identified host factors, even in different screens targeting
the same virus [[33]22-[34]24]. This low overlap has been explained by
different experimental conditions such as host cell type and viral
strain used, transfection, incubation and infection time, and siRNA
library used [[35]24] as well as by technical artifacts arising from
cell population context [[36]25,[37]26]. Furthermore, due to the
typical setup of RNAi experiments with primary screens followed by
secondary validation assays, it is likely that published hit lists are
highly specific, but not very sensitive, further explaining the low
overlap observed between different screens at the level of individual
genes [[38]27]. This, however, severely restricts a comparative
analysis of inter-species RNAi screens [[39]28]. On the other hand,
protein interaction networks, virus-host interaction networks and other
heterogeneous data have increased tremendously [[40]29-[41]34]. This
offers novel ways to interpret hit lists from RNAi experiments from a
network perspective, by integrating individual hits in their systemic
context. It has been shown that this approach increases the overlap
between different screens for the same virus at the pathway level
[[42]24], and the method can be extended to meta-analysis of screens
targeting different viruses. Being less dependent on individual genes,
but rather focusing on pathways, may shed new light onto virus-specific
and generic host processes facilitating or restricting infection, and
may prove a promising approach to identify potential host targets for
antiviral drug development.
Several meta-analyses of RNAi screens have been conducted, albeit most
work focused on integrating different screens targeting a single virus
[[43]24,[44]28,[45]35,[46]36]. A notable exception is the study by
Snijder et al., including 45 screens targeting 17 different mammalian
viruses [[47]37]. The authors show that accounting for cellular
heterogeneity improves gene overlaps between screens, but the study
does not focus on functional regions within the host protein network
targeted by different viruses. In contrast, Navratil et al. study
virus-host protein interactions in the human interferon network
[[48]32], throwing light on how viruses of different families target
the innate immune system. Other similar analyses focused largely on
HIV, for example, Murali et al. employed a semi-supervised machine
learning approach mapping RNAi hits onto a protein interaction network
to predict new HDFs [[49]38]. Macpherson et al. and similarly Maulik et
al. mine the HIV-1 human protein interaction network using
biclustering, and identify biclusters enriched with GO terms and RNAi
hits [[50]39,[51]40]. Several authors have furthermore used
protein-protein interaction (PPI) networks to identify topological
properties of proteins targeted by pathogens. Dyer et al. characterized
host proteins targeted by 190 different pathogens, including 35
viruses, 17 bacterial and two protozoan groups [[52]29]. One of the
major outcomes of this analysis was that pathogens preferentially
target proteins with high node betweenness (bottlenecks) or high degree
(hubs). Similarly, the studies by Dijk et al. and Dickerson et al. both
showed that HIV preferentially targets hub and bottleneck genes in the
human protein network [[53]30,[54]31]. Further characterizing the
neighborhood of HDFs, Gulbahce et al. showed that proteins translated
from genes involved in viral diseases are most likely located in the
neighborhood of their corresponding viral targets [[55]33].
Given the typically low overlap between different RNAi screens at the
gene level and the relatively long hit lists resulting from individual
screens, a central problem is how to select most promising candidates
for functional characterization and detailed biochemical follow-up
experiments. When looking for putative antiviral drug targets, one is
typically interested in candidates that have a significant impact on
infection outcome in the specific virus under consideration, or
possibly even in several different viral species if e.g. broadly acting
antivirals are sought for. Corresponding target pathways should
therefore be “enriched” by hit genes from the RNAi data, while at the
same time it is desirable that the respective targets are centrally
located in the virus-host interaction network.
In this manuscript, we present a comparative analysis of RNAi hits for
different viruses in the context of functional modules of protein
interaction networks. The main purpose of our work is in hit
prioritization, that is, we strive to identify a small set of
candidates for further detailed follow-up experiments. We cluster the
host protein network to identify functional host modules, and then use
a statistical test to identify modules enriched with hits from seven
genome-wide RNAi screens for three different viruses. Network
topological characteristics are used to filter relevant subnetworks
further, and resulting modules and their neighborhoods are annotated
and interpreted. Using this approach, we identified several interesting
candidate pathways for human immunodeficiency virus 1 (HIV-1) and
hepatitis C virus (HCV), including known targets such as the mediator
complex or members of the heterogeneous nuclear ribonucleoprotein
subunits (hnRNPs) in HIV infection, or MAP kinases and heat shock
proteins in HCV infection. Furthermore, using our approach, we predict
that SERCA1 and Tankyrase-1 (TNKS1) may be interesting targets for
further characterization in HCV infection.
Materials and methods
An overview of the data analysis pipeline used is shown in Figure
[56]1. In brief, we collate information from 11 different public
protein-protein interaction (PPI) data repositories, and integrate them
into a large human PPI network. Subsequently, we use a
cohesiveness-based greedy clustering algorithm to identify –possibly
overlapping– clusters in the protein network, which are then tested for
enrichment of hits from one or several RNAi screens. Significant
modules are then filtered further using topological properties and
semantic similarity, and functionally characterized using gene ontology
and Reactome pathways. Using tissue-specific expression data, we
predict novel putative host factors based on neighborhood relations in
identified modules. We describe each of these steps in more detail in
the following.
Figure 1.
Figure 1
[57]Open in a new tab
Overview of the data analysis pipeline. (1) Protein interactions from
public databases are collated to build an integrated human PPI network.
(2) Greedy unsupervised clustering is used to identify relevant,
possibly overlapping, submodules in the PPI network. (3) Hits from one
or several RNAi screens are mapped to these modules and modules are
filtered for significant enrichment. (4) Subnetworks are further
filtered based on network topology and semantic similarity values. (5)
Resulting modules are visualized as subnetworks, color-coded for hits,
non-hits, and (6a,b) are then functionally characterized based on GO
and Reactome pathway. (6c) Lastly, using gene expression data from
different tissues, tissue-specific putative novel host factors are
predicted.
Human protein interaction network:
The human protein interaction network was collated from two major
resources: the iRefIndex database, a meta-database comprising data from
ten resources (DIP, IntAct, MINT, BioGRID, BIND, CORUM, MPact, HPRD,
MPPI, OPHID [[58]41-[59]51]), and the String v9.0 database [[60]52]
which includes both experimentally validated as well as computationally
predicted interactions. The union of reported interactions in these
databases was used to establish our PPI network. We utilized a score
filter of 0.75 on the STRING interactions as a tradeoff between
reliability of included interactions and sufficient network density for
further computations. Different thresholds between 0.6 and 0.9 were
tested for the predicted interactions from STRING. For higher scores,
the predicted interactions did not add much to the existing pool of
interactions, and subsequent clustering resulted in few to no
subnetworks. Conversely, for lower scores, the subnetworks included
broad networks with multiple, non-specific functional annotations. A
score of 0.75 led to optimal subnetworks that were functionally
specific, and returned a reasonable number of subnetworks for further
analysis. The overall procedure resulted in a protein interaction
network comprising 15,383 proteins and 337,413 interactions from STRING
and iRefIndex.
RNAi screening data:
We then mapped data from seven published genome-wide RNAi screens to
the PPI network, including three human immunodeficiency virus-1 (HIV-1)
screens [[61]10,[62]12,[63]53], three Hepatitis C Virus (HCV) screens
[[64]13,[65]18,[66]54] and one west Nile virus (WNV) screen [[67]11].
Further data analysis was then performed individually using only
screens targeting the same virus (intra-species), as well as across all
seven screens (inter-species).
Submodule identification and statistical testing:
We used the ClusterONE algorithm to detect overlapping subnetworks in
the human PPI network. ClusterONE is a neighborhood-expansion, greedy
graph clustering algorithm [[68]55]. It is able to take edge weights
corresponding to confidence scores into account in the clustering, and
allows overlapping clusters where individual proteins may be part of
more than one cluster. We used default values for most parameters of
the ClusterONE algorithm, except for the merge-method parameter which
was set to multi to merge highly overlapping clusters, as well as the
minimum cluster size parameter, which we varied between 25 and 100. The
variation of the cluster size parameter leads to clusters of different
granularity, from very small, highly cohesive clusters, to larger and
more heterogeneous clusters. Both may be desirable for the analysis of
virus-targeted subnetworks, we therefore continued analysis with a
redundant set of larger and smaller, overlapping clusters; we label
this set of clusters C^all in the following. Note that these clusters
are not merged or integrated further, but rather C^all is a set of
different clusters. After clustering, we tested for significant
enrichment of RNAi hits within each cluster in C^all using Fisher’s
exact test, with significance level α=0.05, resulting in the set
C^hit⊂C^all of clusters significantly enriched with RNAi hits. We note
that the clusters in C^hitmay still overlap and may even contain
clusters that are subsets/supersets of one another.
Submodule filtering and cluster selection:
We next used additional filtering criteria to select a small number of
relevant clusters from C^hit for further manual analysis. The
underlying idea is to choose clusters that differ significantly from
non-significant clusters not only based on their enrichment with RNAi
hits, but also with respect to their “importance” in the underlying
host PPI network. We selected seven network centrality measures and two
further similarity measures for this filtering step. We briefly review
these measures in the following, but before repeat some elementary
definitions from graph theory.
Let G=(V,E) be an undirected graph with nodes v∈V corresponding to
proteins and undirected edges e∈E corresponding to interactions between
proteins. As we consider undirected edges only, let e[i,j]=e[j,i]. We
define a pathP between two nodes s,t∈V in a graph G=(V,E) as a sequence
v[0], e[0], v[1], e[1],..., v[k−1], e[k−1], v[k] of nodes v[i]∈V and
edges e[i]∈E, where edge e[i] connects nodes v[i] and v[i+1], where
v[i]≠v[j] for all nodes in P, and where v[0]:=s and v[k]:=t. The length
of P is defined as the number of edges in the path P.
When clustering the graph G using a graph clustering algorithm such as
ClusterONE, the nodes V in G are grouped into different clusters. Let
V[C]⊆V be one such cluster. This cluster induces a subnetwork
S[C]=(V[C],E[C]) on G, where E[C]={e[i,j]∈E:v[i],v[j]∈V[C]}, i.e., the
induced subnetwork consists of the subset V[C] of nodes, and all edges
in E between these nodes in the original graph G. Hereafter, we use the
term subnetwork to denote the full subnetwork S[C]=(V[C],E[C]), whereas
by cluster we refer only to the subset of nodes V[C]⊆V.
To filter significant clusters V[C]∈C^hit further, we used the
following topological properties of the nodes in V[C] respectively
their induced subnetwork S[C]:
1. Average node degree: The node degree of a vertex v in a graph
G=(V,E) is given by
[MATH:
deg(v,G)
mo>:=|{ev,w∈E
|∀w∈V}
|, :MATH]
i.e., it is the number of edges in E adjacent to v. The average
node degree of a subnetwork S[C]=(V[C],E[C]) of G is the average
degree of all nodes in V[C]:
[MATH:
CD(<
/mo>SC)=1|<
mi>VC|
∑v∈V
Cdeg(v,S<
mi>C), :MATH]
where |V[C]| denotes the number of nodes in V[C]. Note that we
compute the degree with respect to the edge set E[C] of the
subgraph S[C], and not the full graph G.
2. Average node betweenness: The node betweenness of a node v∈V is the
ratio of the number of shortest paths between any two nodes s, t in
G that pass through v, to the total number of shortest paths
between any two nodes in G. Let Ψ(v) be the set of ordered pairs
(s,t) in V×V, so that s, t and v are distinct. Then,
[MATH:
CB(<
/mo>v,G)=∑(s,t
mi>)∈Ψ(v,G)σ(s
,t|v,G)<
/mrow>σ(s,t|G), :MATH]
where σ(s,t|G) is the total number of s,t-shortest paths in G, and
σ(s,t|v,G) is the number of shortest paths from s to t in G that
pass through node v. The average node betweennessC[B](S[C]) of a
subgraph S[C] is the average node betweenness of all nodes v∈V[C]
in the subgraph S[C],
[MATH:
CB(<
/mo>SC)=1|<
mi>VC|
∑v∈V
C<
mi>CB(v,
SC)
mo>. :MATH]
3. Average node closeness: The normalized closeness of a node v∈V is
defined as
[MATH: CClo(v,
G)=1
|V|−1
∑w∈V,
mo>w≠vd(v,w|G)−1,
:MATH]
where d(v,w|G) is the length of the shortest path between two nodes
v,w∈V. The average node closenessC[Clo](S[C]) of a subgraph
S[C]=(V[C],E[C]) is
[MATH: CClo(SC)=
1|VC|∑v∈V
C<
mi>CClo(v,
SC)
mo>. :MATH]
4. Average eigenvector centrality: Let A=(a[i,j]) be the adjacency
matrix of G=(V,E), i.e., A is a symmetric |V|×|V| matrix with entry
a[i,j]=1 if v[i,j]∈E and a[i,j]=0 otherwise. The eigenvector
centralityC[E] of a node v∈V is
[MATH:
CE(<
/mo>v,G)=
1λ∑w∈V
aw,v
mi>CE
msub>(w,G), :MATH]
where λ is the (absolute) largest eigenvalue of A. The average
eigenvector centralityC[E](S[C]) for a subgraph S[C]=(V[C],E[C]) is
defined as
[MATH:
CE(<
/mo>SC)=∑v∈V
C
1|VC|CE(v,<
mrow>SC). :MATH]
Eigenvector centrality is based on the idea that importance of a
node is determined by the importance of its neighbors: a node
becomes more important the more important its neighbors are.
5. Average clustering coefficient: Let N[v]={w∈V:(v,w)∈E} be the set
of all neighbors of a node v∈V. The local clustering coefficient of
v is then defined as
[MATH: CClu(v,
G)=|{ej,k
msub>∈E:j,k∈<
/mo>Nv}||N<
mi>v|(|N<
/mi>v|−1
)/2. :MATH]
For a given subgraph S[C]=(V[C],E[C]), we define the average
clustering coefficientC[Clu](S[C]) as the mean of C[Clu](v,S[C])
over all v∈V[C].
6. Mean path length: The mean path length for a subgraph
S[C]=(V[C],E[C]) is the average length of all shortest paths
between all pairs of nodes s,t∈V[C] in the graph S[C]:
[MATH:
CP(<
/mo>SC)=1|<
mi>VC|(|
VC|
mo>−1)∑s,t∈
mo>VC
munder>d(s,t|SC),<
/mo> :MATH]
where d(s,t|S[C]) is the length of the shortest path between nodes
s and t in the subgraph S[C].
In addition to the network centrality measures above, we also used the
following similarity coefficients to filter clusters:
1. Dice similarity coefficient: For any given node v∈V in a graph G,
let
[MATH:
EvG<
/mi>:={
ev,w
∈E} :MATH]
be the set of edges adjacent to v. The dice similarity coefficient
of the edge sets
[MATH:
EvG<
/mi> :MATH]
and
[MATH:
EwG<
/mi> :MATH]
of two nodes v,w∈V is defined as
[MATH: CDS(v,<
/mo>w,G)=
2|Ev
G⋂EwG||EvG
|+|EwG|. :MATH]
The average dice similarity of a subnetwork S[C]=(V[C],E[C]),
V[C]⊆V, is
[MATH: CDS(
SC)=<
mrow>2|VC|(|VC|−1
mn>)∑v,w∈
mo>VC
munder>CDS(v,<
/mo>w,SC
mrow>). :MATH]
2. Wang similarity coefficient: This coefficient is biologically
motivated and is based on similarity between gene ontology terms.
Wang similarity takes the hierarchical structure of the GO graph
into account by aggregating the information of ancestor terms when
comparing two GO annotations [[69]56]. Writing C[G](v,w) for the
Wang similarity between the GO annotations of nodes v and w, we
compute the within-cluster similarity C[G](S[C]) as the average
Wang similarity C[G](v,w) between all pairs of genes v,w in the
subnetwork S[C].
We note that a number of different measures have been proposed to
compute the semantic similarity between two GO terms, for a
comprehensive review see Pesquita et al. [[70]57]. The choice of GO
semantic similarity measure and a comparative evaluation of different
measures are still subject to debate in the literature, as no gold
standard exists, and different studies come to different conclusions
[[71]57]. The choice of similarity measure is therefore somewhat
arbitrary and a matter of personal preferences. We opted for Wang