Abstract Despite efforts for extensive molecular characterization of cancer patients, such as the international cancer genome consortium (ICGC) and the cancer genome atlas (TCGA), the heterogeneous nature of cancer and our limited knowledge of the contextual function of proteins have complicated the identification of targetable genes. Here, we present Aberration Hub Analysis for Cancer (AbHAC) as a novel integrative approach to pinpoint aberration hubs, i.e. individual proteins that interact extensively with genes that show aberrant mutation or expression. Our analysis of the breast cancer data of the TCGA and the renal cancer data from the ICGC shows that aberration hubs are involved in relevant cancer pathways, including factors promoting cell cycle and DNA replication in basal-like breast tumors, and Src kinase and VEGF signaling in renal carcinoma. Moreover, our analysis uncovers novel functionally relevant and actionable targets, among which we have experimentally validated abnormal splicing of spleen tyrosine kinase as a key factor for cell proliferation in renal cancer. Thus, AbHAC provides an effective strategy to uncover novel disease factors that are only identifiable by examining mutational and expression data in the context of biological networks. Keywords: cancer, genomics, computational biology, systems biology, target discovery INTRODUCTION Understanding the molecular etiology of cancer is challenging with various complexities including the multifactorial nature of the disease as well as the heterogeneity that exists at both genome and phenome levels. Advances in next-generation sequencing (NGS) have made it possible to profile multiple levels (e.g. genomic, epigenomic, and transcriptomic) of the molecular landscapes of patient samples at a high resolution. This has enabled us to identify driver abnormalities of several cancers, in particular those with a less heterogeneous molecular landscape [[50]1–[51]3]. However, identifying the aberrations that are functionally relevant from among the plethora of abnormal genomic patterns, particularly given the presence of many passenger events, has remained challenging for many cancers. Integrative analysis of molecular data using computational methods and prior biological knowledge has been suggested as an effective approach to find cancer driver factors [[52]4]. However, most bioinformatics approaches that are currently used to address this issue may miss important factors that are not affected by abnormal genetic or transcriptome patterns, but are nonetheless important for development and progression of cancer, and thus, can represent novel therapeutic targets. Integrating the protein interaction network (the “interactome”) with genomic data has emerged as a promising approach to identify novel factors that are not captured by pathway analysis [[53]5–[54]8]. This approach is based on the “guilt by association” principle, which assumes shared function among interacting partners in protein interaction network [[55]9]. Interpreting protein interactions as definitive indications of shared function is an association fallacy and often false [[56]10]; however, these interactions provide testable hypotheses which can explain the underlying biology of disease. Several such approaches have been developed that generally follow the procedure proposed by Ideker et al [[57]11] to map p-values or z-scores obtained from differential expression or somatic mutation analyses onto the interactome in order to identify subgraphs that are highly enriched for these aberrations [[58]12, [59]13]. For example, HotNet finds concentrated subnetworks of recurrent mutations by calculating an influence measure between all pairs of mutated genes, and has been successful in exploring defective interaction modules in several cancer types [[60]13]. A more recent approach [[61]14] uses data of individual patients independently in order to define affected subnetworks, and to distinguish driver mutated genes from passengers. Here, we use a novel approach, Aberration Hub Analysis for Cancer (AbHAC), to identify functionally-relevant and actionable factors in cancer. AbHAC examines all individual proteins, including but not limited to hub proteins in the interaction network [[62]15], for their direct connectivity with genes that are either significantly affected by somatic mutations or de-regulated at the mRNA level, and identifies “aberration hubs”, i.e. proteins with abnormally high interactions with genes that show aberrant mutation or expression. Therefore, AbHAC highlights candidate actionable proteins individually, and is capable of identifying factors that are not affected by genome or transcriptome aberrations themselves but are important players in cancer according to the significantly high number of deregulated proteins interacting with them. We use several lines of evidence to show that these aberration hubs represent important cancer-related factors. First, we show that aberration hubs can distinguish different cancers and can identify relevant proteins in different subtypes of breast cancer using data from The Cancer Genome Atlas (TCGA) [[63]16]. Next, we apply AbHAC to the International Cancer Genome Consortium (ICGC) Cancer Genomics of the Kidney (CAGEKID) data [[64]17] for the clear cell subtype of renal cell carcinoma (ccRCC). We show that AbHAC can identify known activated molecular networks such as VEGF and Src pathways, but additionally uncovers new candidate factors. These include spleen tyrosine kinase (SYK), which we further validate as a key proliferation regulatory factor in renal cancer cells. RESULTS Implementation of AbHAC The concept of AbHAC is to identify proteins whose local neighborhoods, constituted by their direct interacting partners, are significantly enriched for aberrant proteins (e.g. those encoded by mutated genes or translated from aberrantly expressed transcripts). AbHAC is based on the hypothesis that proteins with a significantly high number of deregulated interacting partners are likely important hubs that contribute to cancer. To implement AbHAC, we first constructed a whole human protein interaction network using the PSICQUIC [[65]18] query system (see Methods). After excluding proteins that do not have experimentally verified annotated interactions, the interaction network included a total of 11,851 proteins with an average of 10 interactions per protein (median of 3). Then, for a particular cancer dataset, we calculated for each protein in the network the number of its direct interacting partners with and without abnormal genetic patterns (e.g. the number of interacting partners with genes identified as being significantly mutated) or abnormal expression patterns (differential expression between tumor and normal samples). As we are interested in the subset of proteins that have more abnormal partners than expected by chance, we calculated a p-value for each protein for the observed numbers of normal and abnormal partners using a one-sided Fisher's exact test (Figure [66]1). To correct for multiple testing, considering the complex dependencies in the protein interaction network and the resulting correlations among the p-values, we randomized the protein interaction network by permuting proteins that have similar number of interaction partners (see Methods and [67]Supplementary Figure 1). Enrichment of various molecular aberrations can be examined by adjusting the inclusion criteria for deregulated partners. For example, we can focus uniquely on proteins whose direct interacting neighborhoods are enriched in genes deregulated at either RNA level (up- or down-regulated genes) or DNA level (significantly mutated genes), or we can undertake an integrated analysis to look for enrichment of aberrations in both RNA and DNA. Table [68]1 presents further examples of different aberration category queries. Figure 1. AbHAC algorithm. [69]Figure 1 [70]Open in a new tab For all the proteins in the interaction network, we assess if their local neighborhood is enriched for proteins whose coding genes are significantly mutated or/and aberrantly expressed (see Table [71]1 for different aberration categories). We consider each protein as an independent hypothesis, and use Fisher's exact test to evaluate over-representation of aberrant molecules among interacting partners of a given protein. We generate 100 permuted networks to correct for multiple testing. Table 1. Definition of the aberration categories. Aberration category Interacting partners are enriched in Proteins enter the analysis if they have following interacting partners UP Upregulated genes At least one upregulated DOWN Downregulated genes At least one downregulated MUT Mutated genes At least one mutated MUT.UP Upregulated or mutated genes At least one mutated and one upregulated MUT.DOWN Downregulated or mutated genes At least one mutated and one downregulated DE Upregulated or downregulated genes At least one differentially expressed MUT.DE Differentially expressed or mutated genes At least one mutated and one differentially expressed [72]Open in a new tab Aberration hubs are characteristics of tumor types To test whether the aberration hubs are associated with clinical or phenotypic variations, we applied AbHAC to TCGA breast cancer [[73]16] and the ICGC clear cell renal cell carcinomas (ccRCC) [[74]17] datasets ([75]Supplementary Tables 1-[76]8), and investigated the distribution of aberration hubs across these samples. We calculated AbHAC p-values for all proteins in patients with breast or renal cancer (see Methods for details), and performed principal component analysis (PCA) using these p-values. This analysis showed that “aberration hubs” are different between breast cancer and ccRCC, and can be used to clearly separate patients with different cancers (Figure [77]2a). To confirm that this observation was not due to differences in the tissue of origin (kidney vs. breast) and rather represent cancer-associated effects, we further examined clustering of breast cancer samples using AbHAC p-values by PCA. This analysis revealed that the first three principal components differentiating between breast cancers are significantly associated with known PAM50 subtypes of breast cancer (ANOVA p < 10^-5), confirming that the main diversity in AbHAC p-values is related to the subtype-specific differences (Figure [78]2b). Following this observation, we used a support vector classifier trained on half of the samples to predict the PAM50 subtype of remaining samples. This classifier achieved very high specificity and sensitivity in one-versus-all classification (see Methods and Figure [79]2c). Therefore, these results showed that cancer aberration hubs are not random and can provide clinical and biologically-relevant information. Figure 2. Aberration hubs distinguish between different tumor types and subtypes. [80]Figure 2 [81]Open in a new tab (a) Principal component analysis on AbHAC p-value matrices of TCGA breast cancer samples and ICGC renal cell carcinomas differentiates them based on tumor tissue of origin. First two principal components of AbHAC p-values for proteins enriched with differentially expressed (DE), downregulated (Down) or upregulated (UP) genes (in tumors relative to normal) among their direct interacting partners are shown. (b) Principal component analysis on AbHAC p-value matrices of TCGA breast cancers. For each patient, differentially expressed genes are identified as the 5 percent highest and lowest values when normalized to average values of non-tumor samples. These genes are then used for AbHAC analysis. P-values are calculated by ANOVA test. First three principal components differ among PAM50 subtypes of breast cancer. (c) The support vector machine classifier was trained on half of the breast tumors by grid search and 4-fold cross validation. The parameters identified by this approach were then applied on the other half of the breast tumors to predict their subtypes. The area under the curve (AUC) analyses show high sensitivity and specificity of AbHAC to predict subtype of breast cancer. Aberration hubs in breast cancer Breast cancer is a heterogeneous disorder with different subtypes, each characterized by distinct molecular alterations and clinical behavior [[82]19]. The TCGA datasets provide comprehensive molecular landscapes of intrinsic subtypes as defined by PAM50 classifier [[83]20]. Using these datasets ([84]Supplementary Tables 1-[85]6), we applied our statistical approach to the data from the PAM50 subtypes of breast cancer to identify proteins that might be relevant for each subtype (see Methods for details). Our analysis identified 74 aberration hubs using different aberration categories in different PAM50 classes (False discovery rate (FDR) < 0.05; [86]Supplementary Table 9). To verify if any of the 74 proteins had previously been associated with breast cancer, we conducted a literature review using MeSHOP [[87]21] with “Breast Neoplasms” as the MeSH term to count how often the proteins are reported to be implicated in breast cancer. We found a total of 1294 reported proteins of which 39 were in our list. This represents 4.7-fold enrichment of breast cancer-associated proteins among the identified aberration hubs compared to the background of other proteins in the interaction network (p = 6.9 × 10^-11, Fisher's exact test; [88]Supplementary Table 9), supporting the efficacy of AbHAC to identify relevant molecules in a given cancer dataset. We further questioned if genes encoding any of these aberration hubs are among significantly mutated or abnormally expressed genes in breast cancer according to TCGA datasets. We observed that only one of these 74 molecules (CDKN1B) is significantly mutated in breast cancers, and that 50 of them are not differentially expressed at mRNA level in tumors (all together or at subtype-level) when compared to non-tumor control samples ([89]Supplementary Table 9). In addition, only 15 of the 39 aberration hubs, which were connected to breast cancer based on previous literature, have significant differential expression at mRNA level or a mutation. This further illustrates that AbHAC is capable of identifying relevant factors that cannot be identified directly by interrogating mutational or expression patterns of genes. Our further analysis showed that 53 of the 74 breast cancer aberration hubs are specific to one of the PAM50 subtypes in a given aberration category (Table [90]2, [91]Supplementary Table 9 and [92]Supplementary Figure 2). Pathway analysis of these proteins, using KEGG datasets, revealed that aberration hubs found in luminal breast tumors are enriched in PI3K-Akt pathway and FOXA1 regulatory network ([93]Supplementary Table 10), in line with a high prevalence of PIK3CA and FOXA1 mutations in this subtype [[94]22, [95]23]. In addition, hubs of basal-like subtype are enriched in DNA replication and cell cycle pathways (FDR < 8.75 × 10-5), which confirms previous reports on importance of these pathways in basal-like tumors [[96]24]. Specifically, we observed several components of “origin recognition complex” (ORC) including ORC1, 2, 3, 5 and 6, and “mini-chromosome maintenance” (MCM) such as MCM3, 6 and 7, which are important factors for initiation of genome replication, among aberration hubs of basal-like tumors ([97]Supplementary Table 10). Table 2. Aberration hubs identified specific to a PAM50 subtype of breast cancer in a given aberration category by AbHAC (FDR < 0.05). PAM50 subtype Molecules with supporting literature in the same PAM50 subtype Potentially novel relevant factors Basal-Like CDC45, MCM7, AURKA, PCNA, CHEK1, TFDP1 HIST1H4A, MCM3, MCM6, MCMBP, ORC6, CDC6, XRCC6, ORC2, ORC1, ORC5, WRN, NEK6, MAD2L1BP, EIF6, XRCC5, OSM, ZNF652, MYBPC2, AIRE, CHD1L, HDGF, SNW1 HER2-enriched CXCR3, RACGAP1 ORC3, TONSL, COL1A2, CEACAM6, CCR3, KRT32, SEZ6L2, DPP8 Luminal A HSPB8 OSM, COL5A1, PDE4DIP, EGFR, ECM1, COL5A2 Luminal B ANAPC4, OPRK1, GOPC, CDC27, CDKN1B, S100A9, BRCA1, STK4, CDK3 [98]Open in a new tab List of all aberration hubs identified in breast cancer, their associated aberration categories, and references to the literature are