Abstract

   Clear cell renal cell carcinoma (ccRCC) is the most common and most
   aggressive form of renal cell cancer (RCC). The incidence of RCC has
   increased steadily in recent years. The pathogenesis of renal cell
   cancer remains poorly understood. Many of the tumor suppressor genes,
   oncogenes, and dysregulated pathways in ccRCC need to be revealed for
   improvement of the overall clinical outlook of the disease. Here, we
   developed a systems biology approach to prioritize the somatic mutated
   genes that lead to dysregulation of pathways in ccRCC. The method
   integrated multi-layer information to infer causative mutations and
   disease genes. First, we identified differential gene modules in ccRCC
   by coupling transcriptome and protein-protein interactions. Each of
   these modules consisted of interacting genes that were involved in
   similar biological processes and their combined expression alterations
   were significantly associated with disease type. Then, subsequent gene
   module-based eQTL analysis revealed somatic mutated genes that had
   driven the expression alterations of differential gene modules. Our
   study yielded a list of candidate disease genes, including several
   known ccRCC causative genes such as BAP1 and PBRM1, as well as novel
   genes such as NOD2, RRM1, CSRNP1, SLC4A2, TTLL1 and CNTN1. The
   differential gene modules and their driver genes revealed by our study
   provided a new perspective for understanding the molecular mechanisms
   underlying the disease. Moreover, we validated the results in
   independent ccRCC patient datasets. Our study provided a new method for
   prioritizing disease genes and pathways.

   Abbreviations: RCC, Renal cell cancer; ccRCC, Clear cell renal cell
   carcinoma; eQTL, Expression quantitative trait loci; SVM, Support
   vector machine; TCGA, The Cancer Genome Atlas; KEGG, Kyoto Encyclopedia
   of Genes and Genomes; DEG, Differentially expressed gene; DGM,
   Differential gene module; AUC, Area Under Curve; ROC, Receiver
   Operating Characteristic

   Keywords: ccRCC, Causative mutation, Pathways, Protein-protein
   interaction, Gene module, eQTL

1. Introduction

   Kidney cancer is the sixth most common form of cancer for men and the
   tenth most common form of cancer for women. In 2016, over 63,000 newly
   diagnosed cases and 14,400 kidney cancer deaths were reported in the
   United States [34][1]. The vast majority of kidney cancers are renal
   cell carcinomas (RCC), among which nearly 75% are clear cell renal cell
   carcinomas (ccRCC) [35][2]. Despite recent advances, metastatic RCC
   remains largely an incurable disease [36][3], [37][4]. Patients with
   this disease often have no apparent symptoms or laboratory
   abnormalities in the early stages. The incidence of ccRCC has been
   rising steadily in recent years due to the prevalence of adverse
   lifestyle changes and exposure to toxins such as smoke [38][5].

   ccRCC is characterized by the presence of VHL gene mutation in most
   cases [39][6]. However, the loss of VHL alone is not sufficient for
   tumor initiation and survival, and a fraction of ccRCCs contain
   wild-type VHL genes, suggesting additional genetic alterations are
   required in the course of tumor development. Recent large-scale
   sequencing studies of ccRCC, including TCGA (The Cancer Genome Atlas)
   project have discovered several new and prevalent genomic mutations
   such as PBRM1 and BAP1 [40][7], [41][8], [42][9]. Despite these
   findings, the mortality rate of ccRCC has not significantly decreased,
   indicating that the genetic basis of the disease occurrence and
   development remains to be elucidated. Additionally, previous studies
   have shown that ccRCC is a highly heterogeneous disease [43][10],
   [44][11], creating the need to identify new disease genes and pathways.

   The expression quantitative trait loci (eQTL) analysis has been used to
   identify single-nucleotide polymorphisms (SNPs) that are significantly
   associated with gene expressions [45][12], [46][13], [47][14]. Most
   eQTL analysis performed testing on transcript-SNP pairs to identify
   genetic mutations that significantly affected individual gene
   expression. Here, we presented a gene module-based eQTL method to
   identify the somatic mutations that are associated with gene clusters,
   which potentially function in the same pathway. We first identified
   differentially expressed gene modules (DGMs). The DGMs are comprised of
   a set of interacting genes based on protein-protein interactions and
   expression profile. The Gene Ontology analysis suggested that majority
   DGMs contained genes involved in the same biological processes.
   Additionally, the genes inside the same DGM tended to be co-expressed.
   Hence, these gene modules most likely contained genes function together
   in the disease-affected pathway. Disease genes are not always
   differentially expressed. The integration of gene expressions and
   protein interactions empower the discovery of disease genes, as disease
   genes without significant expression alterations could be revealed by
   DGMs through interacting with the differentially expressed genes in the
   gene modules. The subsequent eQTL analysis further established the
   linkages of somatic mutations with the DGMs. Collectively, the DGMs and
   their associated genetic mutations lead to the identification of novel
   disease genes and pathways. Moreover, we examined the DGMs on four
   independent ccRCC patient cohorts. The results showed DGMs accurately
   classified the tissue types blindly.

2. Results

   An interacting pathway regulates the expression of a group of genes
   that often perform certain functions together. When a pathway is
   perturbed by genetic mutations, then expression levels of interacting
   genes associated with the pathway can be altered accordingly and can
   further contribute to malignant transformation. By integrating gene
   expression and protein-protein interactions here, we developed a new
   method of identifying gene clusters in the pathways impacted by the
   disease. Then, we performed an eQTL analysis to infer potential driver
   mutations and disease affected pathways. The procedure of our study was
   illustrated in [48]Fig. 1.

Fig. 1.

   [49]Fig. 1
   [50]Open in a new tab

   The procedure of our study. After the differentially expressed gene
   modules were identified by coupling PPI with gene expression, somatic
   mutations were linked with the DGMs using eQTL analysis. Here, SMA-DGM
   refers to somatic mutations associated the DGMs.

2.1. Differentially Expressed Gene Modules Identification

   The RNA-Seq expression profile of 19,768 protein-coding genes was
   obtained from TCGA 539 ccRCC and 72 paired normal tissue samples. After
   filtering out the genes with very low expression levels (Methods), a
   total of 16,343 genes remained for the subsequent analysis. Then, we
   coupled gene expression and protein-protein using a network approach to
   systematically reveal gene modules that were differentially expressed
   in ccRCC. At first, each individual gene was employed as the seed of a
   module, and new genes were added to the module in an iterative manner.
   At each step, all genes that interacted with any gene member of the
   module were assessed using an activity score. A higher activity score
   suggested the expression level of the corresponding module was more
   likely associated with the tissue phenotype (Methods, [51]Fig. 1).
   Hence, the gene that maximized the activity score was selected and
   added to the module. After a gene module was built, we applied three
   statistical tests to evaluate the significance of the module compared
   to background. The three tests included permuting tissue phenotype,
   randomizing genes in the module, and randomizing genes in the module
   with the same seed protein, respectively (Methods). Finally, we
   identified 1066 significant gene modules with activity scores equal or
   larger than 0.34 (P-value < 0.001 in all three statistical tests,
   [52]Fig. 1). We referred to these gene modules as differential gene
   modules (DGMs).

2.2. Performance Evaluation Classification Based on Differential Gene Modules

   The DGMs represented gene clusters that were significantly associated
   with tissue phenotypes. Thus, we hypothesized that the expression
   levels of DGMs can be utilized as features to distinguish ccRCC tissue
   from normal tissue samples. We examined the hypothesis using the
   TCGA-ccRCC dataset and three independent ccRCC patient datasets
   obtained from GEO (Methods, [53]Table 1) [54][4], [55][9]. The TCGA
   dataset contained an imbalance between ccRCC and normal samples (539
   ccRCC versus 72 normal samples), whereas the other three data sets
   contained more balanced samples ([56]Table 1).

Table 1.

   The performance of DGM and DEG based hierarchy clustering and SVM
   classifiers on the TCGA ccRCC patient group and three independent ccRCC
   datasets.
   ccRCC patient cohorts Normal Tumor Misclustered tissue samples
     __________________________________________________________________

   AUC of the classifiers
     __________________________________________________________________

   DGM-based DEG-based DGM-based DEG-based
   TCGA-ccRCC 72 539 2 4 0.942 0.767
   [57]GSE36895 23 29 0 0 0.923 1.0
   [58]GSE46699 63 67 9 15 0.953 0.949
   [59]GSE40435 101 101 0 0 0.956 0.997
   [60]Open in a new tab

   The DGM based classifier significantly outperformed the DGE based
   classifier by 22.8%((0.942 -0.767) / 0.767, [61]Table 1 bold number) on
   the TCGA-ccRCC which is an imbalanced data set (72 normal vs 539 tumor
   samples).

   The differentially expressed genes (DEG) based evaluation was performed
   for comparison as well. The TCGA expression profile was generated using
   RNA-Seq data, whereas the expression profiles of the other three
   independent ccRCC patient cohorts were produced using Microarray data
   (Methods). We used edgeR for the TCGA RNA-Seq dataset, and t-test
   followed by multiple-test correction for microarray datasets to perform
   differential expression analyses (Methods, Supp. Fig. 1).

   We conducted hierarchical clustering analysis, using DGMs and DEGs,
   respectively, on the four ccRCC datasets including the TCGA dataset,
   [62]GSE36895, [63]GSE40435 and [64]GSE46699. For [65]GSE36895 and
   [66]GSE40435, both using DGMs and DEGs yielded distinctive tumor and
   normal tissue clusters with perfect homogeneity ([67]Table 1). However,
   for the TCGA data and [68]GSE46699, the clusters yielded by DGMs tended
   to be more homogeneous as compared to the clusters generated by the
   DEGs. Four tumor tissue samples were misclustered using DEGs ([69]Fig.
   2A top panel, [70]Table 1), whereas the number of misclustered tumor
   tissues was reduced to two using DGMs for the TCGA data ([71]Fig. 2A
   bottom panel, [72]Table 1). For the [73]GSE46699, using DGM resulted in
   9 misclustered tumor samples ([74]Fig. 2B top panel, [75]Table 1),
   whereas using DGE yielded 15 misclustered ccRCC samples ([76]Fig. 2B
   bottom, [77]Table 1).

Fig. 2.

   [78]Fig. 2
   [79]Open in a new tab

   The performance comparisons of clustering and classification based on
   DGMs and DEGs. (A) Hierarchical clustering of TCGA 539 ccRCC and 72
   normal tissues based on the expression of DGMs (top) and DEGs (bottom).
   (B) Hierarchical clustering of an independent ccRCC ([80]GSE46699)
   tumor and normal tissues based on DGMs (top) and DEGs (bottom). (C) The
   ROC curves of the classifiers using the expression of DGM and DEG for
   the TCGA dataset locate at left panel, for [81]GSE46699 locate at right
   panel.

   Moreover, we built SVM-based (Support Vector Machine) classifiers to
   predict tissue type using the expression levels of DGMs and DEGs genes
   as input features, respectively. The area under curve (AUC) of the
   receiver operating characteristic (ROC) curve generated by three-fold
   validation was measured for classification performance assessment. The
   AUC of the classifier using DGM as features is 0.942, which is
   significantly higher than 0.767 for the classifier using DEG as
   features, in predicting TCGA dataset tissue types ([82]Fig. 2C left
   panel). However, additional classifications on three independent ccRCC
   datasets, which included fewer but more balanced tissue samples than
   the TCGA dataset, showed that the DEG-based classifiers performed
   slightly better or very similar compared to the performance of DGM
   based classifiers ([83]Table 1, [84]Fig. 2C right panel). Nevertheless,
   the DGM yielded good performance more robustly in both classification
   and clustering and suggests that genes in the differential modules were
   significantly associated with ccRCC; they are coordinately expressed;
   and, they likely function together in the disease pathways.

2.3. Functional Assessment of the Gene Modules

   We conducted Gene Ontology (GO) analysis on individual DGMs. The GO
   terms enrichment of were assessed by hypergeometric test (P < 0.01).
   Given that the median size of the gene modules is 6, we found that
   90.4%, 65.4% and 39.9% (963/1065, 696/1065, and 425/1065) of these
   modules contained at least two, three, and four genes that participated
   in the same significantly enriched biological process, respectively. In
   contrast, none of the random modules that had the same topology and
   size as the DEMs contained more than one gene in the same biological
   process.

   Additionally, our expression analysis showed that the majority of the
   genes in the DGMs appeared to be co-expressed (74.5%, 793/1065). Thus,
   the significant modules more likely consisted of genes functioning
   together in the disease-related pathways.

   We found a total of 22 enriched biological process terms that were
   significantly associated with at least 18.9% (201/1065) of the DGMs
   (Supp. Table 1), including several known cancer-related biological
   processes. For instance, 209 gene modules were prevalent in the
   neurotrophin Tropomyosin Receptor Kinase (TRK) receptor signaling
   pathway, a pathway involving malignant gliomas [85][15].

   Moreover, we identified 26 Kyoto Encyclopedia of Genes and Genomes
   (KEGG) pathways that were significantly enriched in at least 8.5%
   (90/1065) of the gene modules (Supp. Table 2). Thyroid cancer pathway,
   a top-affected pathway in our list, was significantly associated with
   27.9% (297/1065) of gene modules (P < 0.05, hypergeometric test). It
   has been reported that ccRCC is most frequent of origin of thyroid
   metastases and represents 12 to 34% of all secondary thyroid tumors
   [86][16], [87][17], [88][18]. 18.4% (196/1065) modules included genes
   that are significantly prevalent in fatty acid degradation pathways.
   Cellular proliferation requires fatty acids for synthesis of membranes
   and signaling molecules. Dysregulation of cellular proliferation is
   associated with the occurrence of cancer.

2.4. Gene Module-based eQTL Analysis

   We performed eQTL analysis on differential gene modules. The mutation
   of VHL, a known ccRCC causative gene, was found to be significantly
   associated with multiple DGMs (FDR < 0.03, [89]Fig. 3A and B). These
   modules were enriched of genes in MAPK signaling pathway, apoptosis,
   pathways in cancer (P < 0.03, hypergeometric test).

Fig. 3.

   [90]Fig. 3
   [91]Open in a new tab

   The examples of differentially expressed gene modules. The genes
   colored in red were up-regulated, whereas genes colored in green were
   down-regulated in ccRCC. The intensity of the color is proportioned to
   log2 fold-change of the gene expression. Circle nodes refer to the
   expression levels of genes that were significantly changed, whereas
   diamond nodes refer to the genes without significantly altered
   expression levels. (For interpretation of the references to color in