Abstract Clear cell renal cell carcinoma (ccRCC) is the most common and most aggressive form of renal cell cancer (RCC). The incidence of RCC has increased steadily in recent years. The pathogenesis of renal cell cancer remains poorly understood. Many of the tumor suppressor genes, oncogenes, and dysregulated pathways in ccRCC need to be revealed for improvement of the overall clinical outlook of the disease. Here, we developed a systems biology approach to prioritize the somatic mutated genes that lead to dysregulation of pathways in ccRCC. The method integrated multi-layer information to infer causative mutations and disease genes. First, we identified differential gene modules in ccRCC by coupling transcriptome and protein-protein interactions. Each of these modules consisted of interacting genes that were involved in similar biological processes and their combined expression alterations were significantly associated with disease type. Then, subsequent gene module-based eQTL analysis revealed somatic mutated genes that had driven the expression alterations of differential gene modules. Our study yielded a list of candidate disease genes, including several known ccRCC causative genes such as BAP1 and PBRM1, as well as novel genes such as NOD2, RRM1, CSRNP1, SLC4A2, TTLL1 and CNTN1. The differential gene modules and their driver genes revealed by our study provided a new perspective for understanding the molecular mechanisms underlying the disease. Moreover, we validated the results in independent ccRCC patient datasets. Our study provided a new method for prioritizing disease genes and pathways. Abbreviations: RCC, Renal cell cancer; ccRCC, Clear cell renal cell carcinoma; eQTL, Expression quantitative trait loci; SVM, Support vector machine; TCGA, The Cancer Genome Atlas; KEGG, Kyoto Encyclopedia of Genes and Genomes; DEG, Differentially expressed gene; DGM, Differential gene module; AUC, Area Under Curve; ROC, Receiver Operating Characteristic Keywords: ccRCC, Causative mutation, Pathways, Protein-protein interaction, Gene module, eQTL 1. Introduction Kidney cancer is the sixth most common form of cancer for men and the tenth most common form of cancer for women. In 2016, over 63,000 newly diagnosed cases and 14,400 kidney cancer deaths were reported in the United States [34][1]. The vast majority of kidney cancers are renal cell carcinomas (RCC), among which nearly 75% are clear cell renal cell carcinomas (ccRCC) [35][2]. Despite recent advances, metastatic RCC remains largely an incurable disease [36][3], [37][4]. Patients with this disease often have no apparent symptoms or laboratory abnormalities in the early stages. The incidence of ccRCC has been rising steadily in recent years due to the prevalence of adverse lifestyle changes and exposure to toxins such as smoke [38][5]. ccRCC is characterized by the presence of VHL gene mutation in most cases [39][6]. However, the loss of VHL alone is not sufficient for tumor initiation and survival, and a fraction of ccRCCs contain wild-type VHL genes, suggesting additional genetic alterations are required in the course of tumor development. Recent large-scale sequencing studies of ccRCC, including TCGA (The Cancer Genome Atlas) project have discovered several new and prevalent genomic mutations such as PBRM1 and BAP1 [40][7], [41][8], [42][9]. Despite these findings, the mortality rate of ccRCC has not significantly decreased, indicating that the genetic basis of the disease occurrence and development remains to be elucidated. Additionally, previous studies have shown that ccRCC is a highly heterogeneous disease [43][10], [44][11], creating the need to identify new disease genes and pathways. The expression quantitative trait loci (eQTL) analysis has been used to identify single-nucleotide polymorphisms (SNPs) that are significantly associated with gene expressions [45][12], [46][13], [47][14]. Most eQTL analysis performed testing on transcript-SNP pairs to identify genetic mutations that significantly affected individual gene expression. Here, we presented a gene module-based eQTL method to identify the somatic mutations that are associated with gene clusters, which potentially function in the same pathway. We first identified differentially expressed gene modules (DGMs). The DGMs are comprised of a set of interacting genes based on protein-protein interactions and expression profile. The Gene Ontology analysis suggested that majority DGMs contained genes involved in the same biological processes. Additionally, the genes inside the same DGM tended to be co-expressed. Hence, these gene modules most likely contained genes function together in the disease-affected pathway. Disease genes are not always differentially expressed. The integration of gene expressions and protein interactions empower the discovery of disease genes, as disease genes without significant expression alterations could be revealed by DGMs through interacting with the differentially expressed genes in the gene modules. The subsequent eQTL analysis further established the linkages of somatic mutations with the DGMs. Collectively, the DGMs and their associated genetic mutations lead to the identification of novel disease genes and pathways. Moreover, we examined the DGMs on four independent ccRCC patient cohorts. The results showed DGMs accurately classified the tissue types blindly. 2. Results An interacting pathway regulates the expression of a group of genes that often perform certain functions together. When a pathway is perturbed by genetic mutations, then expression levels of interacting genes associated with the pathway can be altered accordingly and can further contribute to malignant transformation. By integrating gene expression and protein-protein interactions here, we developed a new method of identifying gene clusters in the pathways impacted by the disease. Then, we performed an eQTL analysis to infer potential driver mutations and disease affected pathways. The procedure of our study was illustrated in [48]Fig. 1. Fig. 1. [49]Fig. 1 [50]Open in a new tab The procedure of our study. After the differentially expressed gene modules were identified by coupling PPI with gene expression, somatic mutations were linked with the DGMs using eQTL analysis. Here, SMA-DGM refers to somatic mutations associated the DGMs. 2.1. Differentially Expressed Gene Modules Identification The RNA-Seq expression profile of 19,768 protein-coding genes was obtained from TCGA 539 ccRCC and 72 paired normal tissue samples. After filtering out the genes with very low expression levels (Methods), a total of 16,343 genes remained for the subsequent analysis. Then, we coupled gene expression and protein-protein using a network approach to systematically reveal gene modules that were differentially expressed in ccRCC. At first, each individual gene was employed as the seed of a module, and new genes were added to the module in an iterative manner. At each step, all genes that interacted with any gene member of the module were assessed using an activity score. A higher activity score suggested the expression level of the corresponding module was more likely associated with the tissue phenotype (Methods, [51]Fig. 1). Hence, the gene that maximized the activity score was selected and added to the module. After a gene module was built, we applied three statistical tests to evaluate the significance of the module compared to background. The three tests included permuting tissue phenotype, randomizing genes in the module, and randomizing genes in the module with the same seed protein, respectively (Methods). Finally, we identified 1066 significant gene modules with activity scores equal or larger than 0.34 (P-value < 0.001 in all three statistical tests, [52]Fig. 1). We referred to these gene modules as differential gene modules (DGMs). 2.2. Performance Evaluation Classification Based on Differential Gene Modules The DGMs represented gene clusters that were significantly associated with tissue phenotypes. Thus, we hypothesized that the expression levels of DGMs can be utilized as features to distinguish ccRCC tissue from normal tissue samples. We examined the hypothesis using the TCGA-ccRCC dataset and three independent ccRCC patient datasets obtained from GEO (Methods, [53]Table 1) [54][4], [55][9]. The TCGA dataset contained an imbalance between ccRCC and normal samples (539 ccRCC versus 72 normal samples), whereas the other three data sets contained more balanced samples ([56]Table 1). Table 1. The performance of DGM and DEG based hierarchy clustering and SVM classifiers on the TCGA ccRCC patient group and three independent ccRCC datasets. ccRCC patient cohorts Normal Tumor Misclustered tissue samples __________________________________________________________________ AUC of the classifiers __________________________________________________________________ DGM-based DEG-based DGM-based DEG-based TCGA-ccRCC 72 539 2 4 0.942 0.767 [57]GSE36895 23 29 0 0 0.923 1.0 [58]GSE46699 63 67 9 15 0.953 0.949 [59]GSE40435 101 101 0 0 0.956 0.997 [60]Open in a new tab The DGM based classifier significantly outperformed the DGE based classifier by 22.8%((0.942 -0.767) / 0.767, [61]Table 1 bold number) on the TCGA-ccRCC which is an imbalanced data set (72 normal vs 539 tumor samples). The differentially expressed genes (DEG) based evaluation was performed for comparison as well. The TCGA expression profile was generated using RNA-Seq data, whereas the expression profiles of the other three independent ccRCC patient cohorts were produced using Microarray data (Methods). We used edgeR for the TCGA RNA-Seq dataset, and t-test followed by multiple-test correction for microarray datasets to perform differential expression analyses (Methods, Supp. Fig. 1). We conducted hierarchical clustering analysis, using DGMs and DEGs, respectively, on the four ccRCC datasets including the TCGA dataset, [62]GSE36895, [63]GSE40435 and [64]GSE46699. For [65]GSE36895 and [66]GSE40435, both using DGMs and DEGs yielded distinctive tumor and normal tissue clusters with perfect homogeneity ([67]Table 1). However, for the TCGA data and [68]GSE46699, the clusters yielded by DGMs tended to be more homogeneous as compared to the clusters generated by the DEGs. Four tumor tissue samples were misclustered using DEGs ([69]Fig. 2A top panel, [70]Table 1), whereas the number of misclustered tumor tissues was reduced to two using DGMs for the TCGA data ([71]Fig. 2A bottom panel, [72]Table 1). For the [73]GSE46699, using DGM resulted in 9 misclustered tumor samples ([74]Fig. 2B top panel, [75]Table 1), whereas using DGE yielded 15 misclustered ccRCC samples ([76]Fig. 2B bottom, [77]Table 1). Fig. 2. [78]Fig. 2 [79]Open in a new tab The performance comparisons of clustering and classification based on DGMs and DEGs. (A) Hierarchical clustering of TCGA 539 ccRCC and 72 normal tissues based on the expression of DGMs (top) and DEGs (bottom). (B) Hierarchical clustering of an independent ccRCC ([80]GSE46699) tumor and normal tissues based on DGMs (top) and DEGs (bottom). (C) The ROC curves of the classifiers using the expression of DGM and DEG for the TCGA dataset locate at left panel, for [81]GSE46699 locate at right panel. Moreover, we built SVM-based (Support Vector Machine) classifiers to predict tissue type using the expression levels of DGMs and DEGs genes as input features, respectively. The area under curve (AUC) of the receiver operating characteristic (ROC) curve generated by three-fold validation was measured for classification performance assessment. The AUC of the classifier using DGM as features is 0.942, which is significantly higher than 0.767 for the classifier using DEG as features, in predicting TCGA dataset tissue types ([82]Fig. 2C left panel). However, additional classifications on three independent ccRCC datasets, which included fewer but more balanced tissue samples than the TCGA dataset, showed that the DEG-based classifiers performed slightly better or very similar compared to the performance of DGM based classifiers ([83]Table 1, [84]Fig. 2C right panel). Nevertheless, the DGM yielded good performance more robustly in both classification and clustering and suggests that genes in the differential modules were significantly associated with ccRCC; they are coordinately expressed; and, they likely function together in the disease pathways. 2.3. Functional Assessment of the Gene Modules We conducted Gene Ontology (GO) analysis on individual DGMs. The GO terms enrichment of were assessed by hypergeometric test (P < 0.01). Given that the median size of the gene modules is 6, we found that 90.4%, 65.4% and 39.9% (963/1065, 696/1065, and 425/1065) of these modules contained at least two, three, and four genes that participated in the same significantly enriched biological process, respectively. In contrast, none of the random modules that had the same topology and size as the DEMs contained more than one gene in the same biological process. Additionally, our expression analysis showed that the majority of the genes in the DGMs appeared to be co-expressed (74.5%, 793/1065). Thus, the significant modules more likely consisted of genes functioning together in the disease-related pathways. We found a total of 22 enriched biological process terms that were significantly associated with at least 18.9% (201/1065) of the DGMs (Supp. Table 1), including several known cancer-related biological processes. For instance, 209 gene modules were prevalent in the neurotrophin Tropomyosin Receptor Kinase (TRK) receptor signaling pathway, a pathway involving malignant gliomas [85][15]. Moreover, we identified 26 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways that were significantly enriched in at least 8.5% (90/1065) of the gene modules (Supp. Table 2). Thyroid cancer pathway, a top-affected pathway in our list, was significantly associated with 27.9% (297/1065) of gene modules (P < 0.05, hypergeometric test). It has been reported that ccRCC is most frequent of origin of thyroid metastases and represents 12 to 34% of all secondary thyroid tumors [86][16], [87][17], [88][18]. 18.4% (196/1065) modules included genes that are significantly prevalent in fatty acid degradation pathways. Cellular proliferation requires fatty acids for synthesis of membranes and signaling molecules. Dysregulation of cellular proliferation is associated with the occurrence of cancer. 2.4. Gene Module-based eQTL Analysis We performed eQTL analysis on differential gene modules. The mutation of VHL, a known ccRCC causative gene, was found to be significantly associated with multiple DGMs (FDR < 0.03, [89]Fig. 3A and B). These modules were enriched of genes in MAPK signaling pathway, apoptosis, pathways in cancer (P < 0.03, hypergeometric test). Fig. 3. [90]Fig. 3 [91]Open in a new tab The examples of differentially expressed gene modules. The genes colored in red were up-regulated, whereas genes colored in green were down-regulated in ccRCC. The intensity of the color is proportioned to log2 fold-change of the gene expression. Circle nodes refer to the expression levels of genes that were significantly changed, whereas diamond nodes refer to the genes without significantly altered expression levels. (For interpretation of the references to color in