Abstract Background Big data is becoming ubiquitous in biology, and poses significant challenges in data analysis and interpretation. RNAi screening has become a workhorse of functional genomics, and has been applied, for example, to identify host factors involved in infection for a panel of different viruses. However, the analysis of data resulting from such screens is difficult, with often low overlap between hit lists, even when comparing screens targeting the same virus. This makes it a major challenge to select interesting candidates for further detailed, mechanistic experimental characterization. Results To address this problem we propose an integrative bioinformatics pipeline that allows for a network based meta-analysis of viral high-throughput RNAi screens. Initially, we collate a human protein interaction network from various public repositories, which is then subjected to unsupervised clustering to determine functional modules. Modules that are significantly enriched with host dependency factors (HDFs) and/or host restriction factors (HRFs) are then filtered based on network topology and semantic similarity measures. Modules passing all these criteria are finally interpreted for their biological significance using enrichment analysis, and interesting candidate genes can be selected from the modules. Conclusions We apply our approach to seven screens targeting three different viruses, and compare results with other published meta-analyses of viral RNAi screens. We recover key hit genes, and identify additional candidates from the screens. While we demonstrate the application of the approach using viral RNAi data, the method is generally applicable to identify underlying mechanisms from hit lists derived from high-throughput experimental data, and to select a small number of most promising genes for further mechanistic studies. Electronic supplementary material The online version of this article (doi:10.1186/s13015-015-0035-7) contains supplementary material, which is available to authorized users. Keywords: Network analysis, RNAi screening, Virus-host interactions Background RNA interference (RNAi) has become an important workhorse of functional genomics, and genome-wide RNAi screens have been employed for example to identify genes involved in cell growth and viability, proliferation, differentiation, signaling or trafficking [[27]1-[28]9]. The technology has furthermore accelerated the discovery of novel host dependency factors (HDF) and host restriction factors (HRF) in viral infection [[29]10-[30]19]. However, while RNAi is a very powerful tool to identify genes involved in a specific biological process, the placement of hits in their functional and spatiotemporal context in the underlying molecular processes remains a major challenge [[31]20,[32]21]. The interpretation of RNAi data in particular for virus screens is complicated further by the observed low overlap between identified host factors, even in different screens targeting the same virus [[33]22-[34]24]. This low overlap has been explained by different experimental conditions such as host cell type and viral strain used, transfection, incubation and infection time, and siRNA library used [[35]24] as well as by technical artifacts arising from cell population context [[36]25,[37]26]. Furthermore, due to the typical setup of RNAi experiments with primary screens followed by secondary validation assays, it is likely that published hit lists are highly specific, but not very sensitive, further explaining the low overlap observed between different screens at the level of individual genes [[38]27]. This, however, severely restricts a comparative analysis of inter-species RNAi screens [[39]28]. On the other hand, protein interaction networks, virus-host interaction networks and other heterogeneous data have increased tremendously [[40]29-[41]34]. This offers novel ways to interpret hit lists from RNAi experiments from a network perspective, by integrating individual hits in their systemic context. It has been shown that this approach increases the overlap between different screens for the same virus at the pathway level [[42]24], and the method can be extended to meta-analysis of screens targeting different viruses. Being less dependent on individual genes, but rather focusing on pathways, may shed new light onto virus-specific and generic host processes facilitating or restricting infection, and may prove a promising approach to identify potential host targets for antiviral drug development. Several meta-analyses of RNAi screens have been conducted, albeit most work focused on integrating different screens targeting a single virus [[43]24,[44]28,[45]35,[46]36]. A notable exception is the study by Snijder et al., including 45 screens targeting 17 different mammalian viruses [[47]37]. The authors show that accounting for cellular heterogeneity improves gene overlaps between screens, but the study does not focus on functional regions within the host protein network targeted by different viruses. In contrast, Navratil et al. study virus-host protein interactions in the human interferon network [[48]32], throwing light on how viruses of different families target the innate immune system. Other similar analyses focused largely on HIV, for example, Murali et al. employed a semi-supervised machine learning approach mapping RNAi hits onto a protein interaction network to predict new HDFs [[49]38]. Macpherson et al. and similarly Maulik et al. mine the HIV-1 human protein interaction network using biclustering, and identify biclusters enriched with GO terms and RNAi hits [[50]39,[51]40]. Several authors have furthermore used protein-protein interaction (PPI) networks to identify topological properties of proteins targeted by pathogens. Dyer et al. characterized host proteins targeted by 190 different pathogens, including 35 viruses, 17 bacterial and two protozoan groups [[52]29]. One of the major outcomes of this analysis was that pathogens preferentially target proteins with high node betweenness (bottlenecks) or high degree (hubs). Similarly, the studies by Dijk et al. and Dickerson et al. both showed that HIV preferentially targets hub and bottleneck genes in the human protein network [[53]30,[54]31]. Further characterizing the neighborhood of HDFs, Gulbahce et al. showed that proteins translated from genes involved in viral diseases are most likely located in the neighborhood of their corresponding viral targets [[55]33]. Given the typically low overlap between different RNAi screens at the gene level and the relatively long hit lists resulting from individual screens, a central problem is how to select most promising candidates for functional characterization and detailed biochemical follow-up experiments. When looking for putative antiviral drug targets, one is typically interested in candidates that have a significant impact on infection outcome in the specific virus under consideration, or possibly even in several different viral species if e.g. broadly acting antivirals are sought for. Corresponding target pathways should therefore be “enriched” by hit genes from the RNAi data, while at the same time it is desirable that the respective targets are centrally located in the virus-host interaction network. In this manuscript, we present a comparative analysis of RNAi hits for different viruses in the context of functional modules of protein interaction networks. The main purpose of our work is in hit prioritization, that is, we strive to identify a small set of candidates for further detailed follow-up experiments. We cluster the host protein network to identify functional host modules, and then use a statistical test to identify modules enriched with hits from seven genome-wide RNAi screens for three different viruses. Network topological characteristics are used to filter relevant subnetworks further, and resulting modules and their neighborhoods are annotated and interpreted. Using this approach, we identified several interesting candidate pathways for human immunodeficiency virus 1 (HIV-1) and hepatitis C virus (HCV), including known targets such as the mediator complex or members of the heterogeneous nuclear ribonucleoprotein subunits (hnRNPs) in HIV infection, or MAP kinases and heat shock proteins in HCV infection. Furthermore, using our approach, we predict that SERCA1 and Tankyrase-1 (TNKS1) may be interesting targets for further characterization in HCV infection. Materials and methods An overview of the data analysis pipeline used is shown in Figure [56]1. In brief, we collate information from 11 different public protein-protein interaction (PPI) data repositories, and integrate them into a large human PPI network. Subsequently, we use a cohesiveness-based greedy clustering algorithm to identify –possibly overlapping– clusters in the protein network, which are then tested for enrichment of hits from one or several RNAi screens. Significant modules are then filtered further using topological properties and semantic similarity, and functionally characterized using gene ontology and Reactome pathways. Using tissue-specific expression data, we predict novel putative host factors based on neighborhood relations in identified modules. We describe each of these steps in more detail in the following. Figure 1. Figure 1 [57]Open in a new tab Overview of the data analysis pipeline. (1) Protein interactions from public databases are collated to build an integrated human PPI network. (2) Greedy unsupervised clustering is used to identify relevant, possibly overlapping, submodules in the PPI network. (3) Hits from one or several RNAi screens are mapped to these modules and modules are filtered for significant enrichment. (4) Subnetworks are further filtered based on network topology and semantic similarity values. (5) Resulting modules are visualized as subnetworks, color-coded for hits, non-hits, and (6a,b) are then functionally characterized based on GO and Reactome pathway. (6c) Lastly, using gene expression data from different tissues, tissue-specific putative novel host factors are predicted. Human protein interaction network: The human protein interaction network was collated from two major resources: the iRefIndex database, a meta-database comprising data from ten resources (DIP, IntAct, MINT, BioGRID, BIND, CORUM, MPact, HPRD, MPPI, OPHID [[58]41-[59]51]), and the String v9.0 database [[60]52] which includes both experimentally validated as well as computationally predicted interactions. The union of reported interactions in these databases was used to establish our PPI network. We utilized a score filter of 0.75 on the STRING interactions as a tradeoff between reliability of included interactions and sufficient network density for further computations. Different thresholds between 0.6 and 0.9 were tested for the predicted interactions from STRING. For higher scores, the predicted interactions did not add much to the existing pool of interactions, and subsequent clustering resulted in few to no subnetworks. Conversely, for lower scores, the subnetworks included broad networks with multiple, non-specific functional annotations. A score of 0.75 led to optimal subnetworks that were functionally specific, and returned a reasonable number of subnetworks for further analysis. The overall procedure resulted in a protein interaction network comprising 15,383 proteins and 337,413 interactions from STRING and iRefIndex. RNAi screening data: We then mapped data from seven published genome-wide RNAi screens to the PPI network, including three human immunodeficiency virus-1 (HIV-1) screens [[61]10,[62]12,[63]53], three Hepatitis C Virus (HCV) screens [[64]13,[65]18,[66]54] and one west Nile virus (WNV) screen [[67]11]. Further data analysis was then performed individually using only screens targeting the same virus (intra-species), as well as across all seven screens (inter-species). Submodule identification and statistical testing: We used the ClusterONE algorithm to detect overlapping subnetworks in the human PPI network. ClusterONE is a neighborhood-expansion, greedy graph clustering algorithm [[68]55]. It is able to take edge weights corresponding to confidence scores into account in the clustering, and allows overlapping clusters where individual proteins may be part of more than one cluster. We used default values for most parameters of the ClusterONE algorithm, except for the merge-method parameter which was set to multi to merge highly overlapping clusters, as well as the minimum cluster size parameter, which we varied between 25 and 100. The variation of the cluster size parameter leads to clusters of different granularity, from very small, highly cohesive clusters, to larger and more heterogeneous clusters. Both may be desirable for the analysis of virus-targeted subnetworks, we therefore continued analysis with a redundant set of larger and smaller, overlapping clusters; we label this set of clusters C^all in the following. Note that these clusters are not merged or integrated further, but rather C^all is a set of different clusters. After clustering, we tested for significant enrichment of RNAi hits within each cluster in C^all using Fisher’s exact test, with significance level α=0.05, resulting in the set C^hit⊂C^all of clusters significantly enriched with RNAi hits. We note that the clusters in C^hitmay still overlap and may even contain clusters that are subsets/supersets of one another. Submodule filtering and cluster selection: We next used additional filtering criteria to select a small number of relevant clusters from C^hit for further manual analysis. The underlying idea is to choose clusters that differ significantly from non-significant clusters not only based on their enrichment with RNAi hits, but also with respect to their “importance” in the underlying host PPI network. We selected seven network centrality measures and two further similarity measures for this filtering step. We briefly review these measures in the following, but before repeat some elementary definitions from graph theory. Let G=(V,E) be an undirected graph with nodes v∈V corresponding to proteins and undirected edges e∈E corresponding to interactions between proteins. As we consider undirected edges only, let e[i,j]=e[j,i]. We define a pathP between two nodes s,t∈V in a graph G=(V,E) as a sequence v[0], e[0], v[1], e[1],..., v[k−1], e[k−1], v[k] of nodes v[i]∈V and edges e[i]∈E, where edge e[i] connects nodes v[i] and v[i+1], where v[i]≠v[j] for all nodes in P, and where v[0]:=s and v[k]:=t. The length of P is defined as the number of edges in the path P. When clustering the graph G using a graph clustering algorithm such as ClusterONE, the nodes V in G are grouped into different clusters. Let V[C]⊆V be one such cluster. This cluster induces a subnetwork S[C]=(V[C],E[C]) on G, where E[C]={e[i,j]∈E:v[i],v[j]∈V[C]}, i.e., the induced subnetwork consists of the subset V[C] of nodes, and all edges in E between these nodes in the original graph G. Hereafter, we use the term subnetwork to denote the full subnetwork S[C]=(V[C],E[C]), whereas by cluster we refer only to the subset of nodes V[C]⊆V. To filter significant clusters V[C]∈C^hit further, we used the following topological properties of the nodes in V[C] respectively their induced subnetwork S[C]: 1. Average node degree: The node degree of a vertex v in a graph G=(V,E) is given by [MATH: deg(v,G):=|{ev,wE |wV} |, :MATH] i.e., it is the number of edges in E adjacent to v. The average node degree of a subnetwork S[C]=(V[C],E[C]) of G is the average degree of all nodes in V[C]: [MATH: CD(< /mo>SC)=1|< mi>VC| vV Cdeg(v,S< mi>C), :MATH] where |V[C]| denotes the number of nodes in V[C]. Note that we compute the degree with respect to the edge set E[C] of the subgraph S[C], and not the full graph G. 2. Average node betweenness: The node betweenness of a node v∈V is the ratio of the number of shortest paths between any two nodes s, t in G that pass through v, to the total number of shortest paths between any two nodes in G. Let Ψ(v) be the set of ordered pairs (s,t) in V×V, so that s, t and v are distinct. Then, [MATH: CB(< /mo>v,G)=(s,t)Ψ(v,G)σ(s ,t|v,G)< /mrow>σ(s,t|G), :MATH] where σ(s,t|G) is the total number of s,t-shortest paths in G, and σ(s,t|v,G) is the number of shortest paths from s to t in G that pass through node v. The average node betweennessC[B](S[C]) of a subgraph S[C] is the average node betweenness of all nodes v∈V[C] in the subgraph S[C], [MATH: CB(< /mo>SC)=1|< mi>VC| vV C< mi>CB(v, SC). :MATH] 3. Average node closeness: The normalized closeness of a node v∈V is defined as [MATH: CClo(v, G)=1 |V|1 wV,wvd(v,w|G)1, :MATH] where d(v,w|G) is the length of the shortest path between two nodes v,w∈V. The average node closenessC[Clo](S[C]) of a subgraph S[C]=(V[C],E[C]) is [MATH: CClo(SC)= 1|VC|vV C< mi>CClo(v, SC). :MATH] 4. Average eigenvector centrality: Let A=(a[i,j]) be the adjacency matrix of G=(V,E), i.e., A is a symmetric |V|×|V| matrix with entry a[i,j]=1 if v[i,j]∈E and a[i,j]=0 otherwise. The eigenvector centralityC[E] of a node v∈V is [MATH: CE(< /mo>v,G)= 1λwV aw,vCE(w,G), :MATH] where λ is the (absolute) largest eigenvalue of A. The average eigenvector centralityC[E](S[C]) for a subgraph S[C]=(V[C],E[C]) is defined as [MATH: CE(< /mo>SC)=vV C 1|VC|CE(v,< mrow>SC). :MATH] Eigenvector centrality is based on the idea that importance of a node is determined by the importance of its neighbors: a node becomes more important the more important its neighbors are. 5. Average clustering coefficient: Let N[v]={w∈V:(v,w)∈E} be the set of all neighbors of a node v∈V. The local clustering coefficient of v is then defined as [MATH: CClu(v, G)=|{ej,kE:j,k∈< /mo>Nv}||N< mi>v|(|N< /mi>v|1 )/2. :MATH] For a given subgraph S[C]=(V[C],E[C]), we define the average clustering coefficientC[Clu](S[C]) as the mean of C[Clu](v,S[C]) over all v∈V[C]. 6. Mean path length: The mean path length for a subgraph S[C]=(V[C],E[C]) is the average length of all shortest paths between all pairs of nodes s,t∈V[C] in the graph S[C]: [MATH: CP(< /mo>SC)=1|< mi>VC|(| VC|1)s,tVCd(s,t|SC),< /mo> :MATH] where d(s,t|S[C]) is the length of the shortest path between nodes s and t in the subgraph S[C]. In addition to the network centrality measures above, we also used the following similarity coefficients to filter clusters: 1. Dice similarity coefficient: For any given node v∈V in a graph G, let [MATH: EvG< /mi>:={ ev,wE} :MATH] be the set of edges adjacent to v. The dice similarity coefficient of the edge sets [MATH: EvG< /mi> :MATH] and [MATH: EwG< /mi> :MATH] of two nodes v,w∈V is defined as [MATH: CDS(v,< /mo>w,G)= 2|Ev GEwG||EvG |+|EwG|. :MATH] The average dice similarity of a subnetwork S[C]=(V[C],E[C]), V[C]⊆V, is [MATH: CDS( SC)=< mrow>2|VC|(|VC|1)v,wVCCDS(v,< /mo>w,SC). :MATH] 2. Wang similarity coefficient: This coefficient is biologically motivated and is based on similarity between gene ontology terms. Wang similarity takes the hierarchical structure of the GO graph into account by aggregating the information of ancestor terms when comparing two GO annotations [[69]56]. Writing C[G](v,w) for the Wang similarity between the GO annotations of nodes v and w, we compute the within-cluster similarity C[G](S[C]) as the average Wang similarity C[G](v,w) between all pairs of genes v,w in the subnetwork S[C]. We note that a number of different measures have been proposed to compute the semantic similarity between two GO terms, for a comprehensive review see Pesquita et al. [[70]57]. The choice of GO semantic similarity measure and a comparative evaluation of different measures are still subject to debate in the literature, as no gold standard exists, and different studies come to different conclusions [[71]57]. The choice of similarity measure is therefore somewhat arbitrary and a matter of personal preferences. We opted for Wang