Abstract

   Despite efforts for extensive molecular characterization of cancer
   patients, such as the international cancer genome consortium (ICGC) and
   the cancer genome atlas (TCGA), the heterogeneous nature of cancer and
   our limited knowledge of the contextual function of proteins have
   complicated the identification of targetable genes. Here, we present
   Aberration Hub Analysis for Cancer (AbHAC) as a novel integrative
   approach to pinpoint aberration hubs, i.e. individual proteins that
   interact extensively with genes that show aberrant mutation or
   expression. Our analysis of the breast cancer data of the TCGA and the
   renal cancer data from the ICGC shows that aberration hubs are involved
   in relevant cancer pathways, including factors promoting cell cycle and
   DNA replication in basal-like breast tumors, and Src kinase and VEGF
   signaling in renal carcinoma. Moreover, our analysis uncovers novel
   functionally relevant and actionable targets, among which we have
   experimentally validated abnormal splicing of spleen tyrosine kinase as
   a key factor for cell proliferation in renal cancer. Thus, AbHAC
   provides an effective strategy to uncover novel disease factors that
   are only identifiable by examining mutational and expression data in
   the context of biological networks.

   Keywords: cancer, genomics, computational biology, systems biology,
   target discovery

INTRODUCTION

   Understanding the molecular etiology of cancer is challenging with
   various complexities including the multifactorial nature of the disease
   as well as the heterogeneity that exists at both genome and phenome
   levels. Advances in next-generation sequencing (NGS) have made it
   possible to profile multiple levels (e.g. genomic, epigenomic, and
   transcriptomic) of the molecular landscapes of patient samples at a
   high resolution. This has enabled us to identify driver abnormalities
   of several cancers, in particular those with a less heterogeneous
   molecular landscape [[50]1–[51]3]. However, identifying the aberrations
   that are functionally relevant from among the plethora of abnormal
   genomic patterns, particularly given the presence of many passenger
   events, has remained challenging for many cancers. Integrative analysis
   of molecular data using computational methods and prior biological
   knowledge has been suggested as an effective approach to find cancer
   driver factors [[52]4]. However, most bioinformatics approaches that
   are currently used to address this issue may miss important factors
   that are not affected by abnormal genetic or transcriptome patterns,
   but are nonetheless important for development and progression of
   cancer, and thus, can represent novel therapeutic targets.

   Integrating the protein interaction network (the “interactome”) with
   genomic data has emerged as a promising approach to identify novel
   factors that are not captured by pathway analysis [[53]5–[54]8]. This
   approach is based on the “guilt by association” principle, which
   assumes shared function among interacting partners in protein
   interaction network [[55]9]. Interpreting protein interactions as
   definitive indications of shared function is an association fallacy and
   often false [[56]10]; however, these interactions provide testable
   hypotheses which can explain the underlying biology of disease. Several
   such approaches have been developed that generally follow the procedure
   proposed by Ideker et al [[57]11] to map p-values or z-scores obtained
   from differential expression or somatic mutation analyses onto the
   interactome in order to identify subgraphs that are highly enriched for
   these aberrations [[58]12, [59]13]. For example, HotNet finds
   concentrated subnetworks of recurrent mutations by calculating an
   influence measure between all pairs of mutated genes, and has been
   successful in exploring defective interaction modules in several cancer
   types [[60]13]. A more recent approach [[61]14] uses data of individual
   patients independently in order to define affected subnetworks, and to
   distinguish driver mutated genes from passengers.

   Here, we use a novel approach, Aberration Hub Analysis for Cancer
   (AbHAC), to identify functionally-relevant and actionable factors in
   cancer. AbHAC examines all individual proteins, including but not
   limited to hub proteins in the interaction network [[62]15], for their
   direct connectivity with genes that are either significantly affected
   by somatic mutations or de-regulated at the mRNA level, and identifies
   “aberration hubs”, i.e. proteins with abnormally high interactions with
   genes that show aberrant mutation or expression. Therefore, AbHAC
   highlights candidate actionable proteins individually, and is capable
   of identifying factors that are not affected by genome or transcriptome
   aberrations themselves but are important players in cancer according to
   the significantly high number of deregulated proteins interacting with
   them.

   We use several lines of evidence to show that these aberration hubs
   represent important cancer-related factors. First, we show that
   aberration hubs can distinguish different cancers and can identify
   relevant proteins in different subtypes of breast cancer using data
   from The Cancer Genome Atlas (TCGA) [[63]16]. Next, we apply AbHAC to
   the International Cancer Genome Consortium (ICGC) Cancer Genomics of
   the Kidney (CAGEKID) data [[64]17] for the clear cell subtype of renal
   cell carcinoma (ccRCC). We show that AbHAC can identify known activated
   molecular networks such as VEGF and Src pathways, but additionally
   uncovers new candidate factors. These include spleen tyrosine kinase
   (SYK), which we further validate as a key proliferation regulatory
   factor in renal cancer cells.

RESULTS

Implementation of AbHAC

   The concept of AbHAC is to identify proteins whose local neighborhoods,
   constituted by their direct interacting partners, are significantly
   enriched for aberrant proteins (e.g. those encoded by mutated genes or
   translated from aberrantly expressed transcripts). AbHAC is based on
   the hypothesis that proteins with a significantly high number of
   deregulated interacting partners are likely important hubs that
   contribute to cancer. To implement AbHAC, we first constructed a whole
   human protein interaction network using the PSICQUIC [[65]18] query
   system (see Methods). After excluding proteins that do not have
   experimentally verified annotated interactions, the interaction network
   included a total of 11,851 proteins with an average of 10 interactions
   per protein (median of 3). Then, for a particular cancer dataset, we
   calculated for each protein in the network the number of its direct
   interacting partners with and without abnormal genetic patterns (e.g.
   the number of interacting partners with genes identified as being
   significantly mutated) or abnormal expression patterns (differential
   expression between tumor and normal samples). As we are interested in
   the subset of proteins that have more abnormal partners than expected
   by chance, we calculated a p-value for each protein for the observed
   numbers of normal and abnormal partners using a one-sided Fisher's
   exact test (Figure [66]1). To correct for multiple testing, considering
   the complex dependencies in the protein interaction network and the
   resulting correlations among the p-values, we randomized the protein
   interaction network by permuting proteins that have similar number of
   interaction partners (see Methods and [67]Supplementary Figure 1).
   Enrichment of various molecular aberrations can be examined by
   adjusting the inclusion criteria for deregulated partners. For example,
   we can focus uniquely on proteins whose direct interacting
   neighborhoods are enriched in genes deregulated at either RNA level
   (up- or down-regulated genes) or DNA level (significantly mutated
   genes), or we can undertake an integrated analysis to look for
   enrichment of aberrations in both RNA and DNA. Table [68]1 presents
   further examples of different aberration category queries.

Figure 1. AbHAC algorithm.

   [69]Figure 1
   [70]Open in a new tab

   For all the proteins in the interaction network, we assess if their
   local neighborhood is enriched for proteins whose coding genes are
   significantly mutated or/and aberrantly expressed (see Table [71]1 for
   different aberration categories). We consider each protein as an
   independent hypothesis, and use Fisher's exact test to evaluate
   over-representation of aberrant molecules among interacting partners of
   a given protein. We generate 100 permuted networks to correct for
   multiple testing.

Table 1. Definition of the aberration categories.

   Aberration category Interacting partners are enriched in Proteins enter
   the analysis if they have following interacting partners
   UP Upregulated genes At least one upregulated
   DOWN Downregulated genes At least one downregulated
   MUT Mutated genes At least one mutated
   MUT.UP Upregulated or mutated genes At least one mutated and one
   upregulated
   MUT.DOWN Downregulated or mutated genes At least one mutated and one
   downregulated
   DE Upregulated or downregulated genes At least one differentially
   expressed
   MUT.DE Differentially expressed or mutated genes At least one mutated
   and one differentially expressed
   [72]Open in a new tab

Aberration hubs are characteristics of tumor types

   To test whether the aberration hubs are associated with clinical or
   phenotypic variations, we applied AbHAC to TCGA breast cancer [[73]16]
   and the ICGC clear cell renal cell carcinomas (ccRCC) [[74]17] datasets
   ([75]Supplementary Tables 1-[76]8), and investigated the distribution
   of aberration hubs across these samples. We calculated AbHAC p-values
   for all proteins in patients with breast or renal cancer (see Methods
   for details), and performed principal component analysis (PCA) using
   these p-values. This analysis showed that “aberration hubs” are
   different between breast cancer and ccRCC, and can be used to clearly
   separate patients with different cancers (Figure [77]2a). To confirm
   that this observation was not due to differences in the tissue of
   origin (kidney vs. breast) and rather represent cancer-associated
   effects, we further examined clustering of breast cancer samples using
   AbHAC p-values by PCA. This analysis revealed that the first three
   principal components differentiating between breast cancers are
   significantly associated with known PAM50 subtypes of breast cancer
   (ANOVA p < 10^-5), confirming that the main diversity in AbHAC p-values
   is related to the subtype-specific differences (Figure [78]2b).
   Following this observation, we used a support vector classifier trained
   on half of the samples to predict the PAM50 subtype of remaining
   samples. This classifier achieved very high specificity and sensitivity
   in one-versus-all classification (see Methods and Figure [79]2c).
   Therefore, these results showed that cancer aberration hubs are not
   random and can provide clinical and biologically-relevant information.

Figure 2. Aberration hubs distinguish between different tumor types and
subtypes.

   [80]Figure 2
   [81]Open in a new tab

   (a) Principal component analysis on AbHAC p-value matrices of TCGA
   breast cancer samples and ICGC renal cell carcinomas differentiates
   them based on tumor tissue of origin. First two principal components of
   AbHAC p-values for proteins enriched with differentially expressed
   (DE), downregulated (Down) or upregulated (UP) genes (in tumors
   relative to normal) among their direct interacting partners are shown.
   (b) Principal component analysis on AbHAC p-value matrices of TCGA
   breast cancers. For each patient, differentially expressed genes are
   identified as the 5 percent highest and lowest values when normalized
   to average values of non-tumor samples. These genes are then used for
   AbHAC analysis. P-values are calculated by ANOVA test. First three
   principal components differ among PAM50 subtypes of breast cancer. (c)
   The support vector machine classifier was trained on half of the breast
   tumors by grid search and 4-fold cross validation. The parameters
   identified by this approach were then applied on the other half of the
   breast tumors to predict their subtypes. The area under the curve (AUC)
   analyses show high sensitivity and specificity of AbHAC to predict
   subtype of breast cancer.

Aberration hubs in breast cancer

   Breast cancer is a heterogeneous disorder with different subtypes, each
   characterized by distinct molecular alterations and clinical behavior
   [[82]19]. The TCGA datasets provide comprehensive molecular landscapes
   of intrinsic subtypes as defined by PAM50 classifier [[83]20]. Using
   these datasets ([84]Supplementary Tables 1-[85]6), we applied our
   statistical approach to the data from the PAM50 subtypes of breast
   cancer to identify proteins that might be relevant for each subtype
   (see Methods for details). Our analysis identified 74 aberration hubs
   using different aberration categories in different PAM50 classes (False
   discovery rate (FDR) < 0.05; [86]Supplementary Table 9). To verify if
   any of the 74 proteins had previously been associated with breast
   cancer, we conducted a literature review using MeSHOP [[87]21] with
   “Breast Neoplasms” as the MeSH term to count how often the proteins are
   reported to be implicated in breast cancer. We found a total of 1294
   reported proteins of which 39 were in our list. This represents
   4.7-fold enrichment of breast cancer-associated proteins among the
   identified aberration hubs compared to the background of other proteins
   in the interaction network (p = 6.9 × 10^-11, Fisher's exact test;
   [88]Supplementary Table 9), supporting the efficacy of AbHAC to
   identify relevant molecules in a given cancer dataset.

   We further questioned if genes encoding any of these aberration hubs
   are among significantly mutated or abnormally expressed genes in breast
   cancer according to TCGA datasets. We observed that only one of these
   74 molecules (CDKN1B) is significantly mutated in breast cancers, and
   that 50 of them are not differentially expressed at mRNA level in
   tumors (all together or at subtype-level) when compared to non-tumor
   control samples ([89]Supplementary Table 9). In addition, only 15 of
   the 39 aberration hubs, which were connected to breast cancer based on
   previous literature, have significant differential expression at mRNA
   level or a mutation. This further illustrates that AbHAC is capable of
   identifying relevant factors that cannot be identified directly by
   interrogating mutational or expression patterns of genes.

   Our further analysis showed that 53 of the 74 breast cancer aberration
   hubs are specific to one of the PAM50 subtypes in a given aberration
   category (Table [90]2, [91]Supplementary Table 9 and [92]Supplementary
   Figure 2). Pathway analysis of these proteins, using KEGG datasets,
   revealed that aberration hubs found in luminal breast tumors are
   enriched in PI3K-Akt pathway and FOXA1 regulatory network
   ([93]Supplementary Table 10), in line with a high prevalence of PIK3CA
   and FOXA1 mutations in this subtype [[94]22, [95]23]. In addition, hubs
   of basal-like subtype are enriched in DNA replication and cell cycle
   pathways (FDR < 8.75 × 10-5), which confirms previous reports on
   importance of these pathways in basal-like tumors [[96]24].
   Specifically, we observed several components of “origin recognition
   complex” (ORC) including ORC1, 2, 3, 5 and 6, and “mini-chromosome
   maintenance” (MCM) such as MCM3, 6 and 7, which are important factors
   for initiation of genome replication, among aberration hubs of
   basal-like tumors ([97]Supplementary Table 10).

Table 2. Aberration hubs identified specific to a PAM50 subtype of breast
cancer in a given aberration category by AbHAC (FDR < 0.05).

   PAM50 subtype Molecules with supporting literature in the same PAM50
   subtype Potentially novel relevant factors
   Basal-Like CDC45, MCM7, AURKA, PCNA, CHEK1, TFDP1 HIST1H4A, MCM3, MCM6,
   MCMBP, ORC6, CDC6, XRCC6, ORC2, ORC1, ORC5, WRN, NEK6, MAD2L1BP, EIF6,
   XRCC5, OSM, ZNF652, MYBPC2, AIRE, CHD1L, HDGF, SNW1
   HER2-enriched CXCR3, RACGAP1 ORC3, TONSL, COL1A2, CEACAM6, CCR3, KRT32,
   SEZ6L2, DPP8
   Luminal A HSPB8 OSM, COL5A1, PDE4DIP, EGFR, ECM1, COL5A2
   Luminal B ANAPC4, OPRK1, GOPC, CDC27, CDKN1B, S100A9, BRCA1, STK4, CDK3
   [98]Open in a new tab

   List of all aberration hubs identified in breast cancer, their
   associated aberration categories, and references to the literature are