Abstract Genome-wide association studies (GWAS) have emerged as the method of choice for identifying common variants affecting complex disease. In a GWAS, particular attention is placed, for obvious reasons, on single-nucleotide polymorphisms (SNPs) that exceed stringent genome-wide significance thresholds. However, it is expected that many SNPs with only nominal evidence of association (e.g., P < 0.05) truly influence disease. Efforts to extract additional biological information from entire GWAS datasets have primarily focused on pathway-enrichment analyses. However, these methods suffer from a number of limitations and typically fail to lead to testable hypotheses. To evaluate alternative approaches, we performed a systems-level analysis of GWAS data using weighted gene coexpression network analysis. A weighted gene coexpression network was generated for 1918 genes harboring SNPs that displayed nominal evidence of association (P ≤ 0.05) from a GWAS of bone mineral density (BMD) using microarray data on circulating monocytes isolated from individuals with extremely low or high BMD. Thirteen distinct gene modules were identified, each comprising coexpressed and highly interconnected GWAS genes. Through the characterization of module content and topology, we illustrate how network analysis can be used to discover disease-associated subnetworks and characterize novel interactions for genes with a known role in the regulation of BMD. In addition, we provide evidence that network metrics can be used as a prioritizing tool when selecting genes and SNPs for replication studies. Our results highlight the advantages of using systems-level strategies to add value to and inform GWAS. Keywords: genome-wide association study (GWAS), systems biology, coexpression network, osteoporosis __________________________________________________________________ Genome-wide association studies (GWAS) have revolutionized complex disease genetics. 
In just the last few years, GWAS have been used to identify hundreds of variants affecting a diverse range of common diseases and disease associated quantitative traits (for a summary, see [26]http://www.genome.gov/gwastudies/). Although GWAS have proven extremely effective at identifying common variants with relatively large effects, the first wave of data suggests that for many diseases, this class of variation accounts for only a small fraction of the genetic risk. For example, a large-scale, meta-analysis of ∼32,000 individuals identified 56 loci associated with bone mineral density (BMD), a strong predictor of osteoporotic fracture. However, in aggregate these single-nucleotide polymorphisms (SNPs) only explained 5.8% of the variance in femoral neck BMD ([27]Estrada et al. 2012). It is possible that for most diseases, the missing heritability is attributable to a combination of many more common variants with increasingly smaller effect sizes and rare variants, both of which are difficult to detect with GWAS in its current form ([28]Altshuler et al. 2008). It has been suggested that additional genes and biological mechanisms underlying a disease process could be extracted from GWAS data by searching lists of genes harboring nominally significant (e.g., P < 0.05) associations. Most of the initial attempts to identify such pathways have used gene ontology (GO) and pathway-enrichment tools to compare the number of genes in a specific pathway harboring nominally significant SNPs to the number expected at random. This approach has been applied to several GWAS datasets with varying results ([29]Askland et al. 2009; [30]Baranzini et al. 2009; [31]Elbers et al. 2009a; [32]O’Dushlaine et al. 2009; [33]Peng et al. 2010; [34]Ritchie 2009; [35]Torkamani and Schork 2009; [36]Torkamani et al. 2008; [37]Wang et al. 2007). Several issues complicate pathway analysis. First, enrichment results can vary widely across software tools ([38]Elbers et al. 2009b). 
Second, enrichment analyses are biased toward what we already know concerning pathway membership, and most predefined gene categories are very general in nature, making it more difficult to develop testable hypotheses with the goal of investigating specific disease mechanisms. Third, these strategies fail to provide information on the relationships between associated genes. Such information is critical to understanding how networks of polymorphic genes work together to promote or provide protection against disease. Recently, [39]Baranzini et al. 2009 used protein−protein interaction data to address this latter point by identifying interacting partners that were nominally associated with multiple sclerosis. However, missing from this approach was the ability to incorporate network concepts with clinical information. The specific goal of this study was to address these issues. Weighted gene coexpression network analysis (WGCNA) is a widely used analytical method that identifies functional connections between genes using microarray gene expression data ([40]Chen et al. 2008; [41]Gargalovic et al. 2006; [42]Ghazalpour et al. 2006; [43]Horvath et al. 2006; [44]Oldham et al. 2008; [45]van Nas et al. 2009; [46]Winden et al. 2009). WGCNA groups genes into modules on the basis of their coexpression similarities across a population of samples. The resulting modules have been shown to be comprised of genes that share similar functions or are involved in the same pathway [as examples: ([47]Ghazalpour et al. 2006; [48]Horvath et al. 2006; [49]Oldham et al. 2008; [50]van Nas et al. 2009)]. The advantage of WGCNA is that connections between genes can be established in an unbiased manner using disease-relevant expression data. In the present work we used WGCNA to perform a systems-level analysis of GWAS data. 
The analysis was performed by combining SNP-level association data from a large BMD GWAS with microarray expression data from a disease-relevant cell type from subjects with known BMD status (low vs. high). Using WGCNA, we identified modules composed of genes that were highly interconnected with one another and displayed nominal evidence of association with BMD. Through the characterization of module content and topology, our approach identified biological mechanisms, modules, individual genes, and network concepts that likely play an important role in the regulation of BMD.

Materials and Methods

Converting SNP lists to gene lists using ProxyGeneLD

Several caveats complicate the conversion of a list of SNPs with association P-values into gene-wide P-values using raw GWAS data. The primary confounders are linkage disequilibrium (LD) and biases due to gene size and the number of SNPs typed per gene. LD makes gene identification difficult because many nominally significant SNPs will be in LD with multiple genes. In addition, larger genes and genes with a greater density of typed SNPs have an increased probability of harboring nominally significant SNPs just by chance. Recently, [51]Hong et al. 2009 developed an algorithm (referred to as ProxyGeneLD) that reduces these biases by accounting for LD when annotating genes. ProxyGeneLD works by identifying clusters of GWAS SNPs (referred to as proxy clusters) in high LD (r² ≥ 0.80) using HapMap data. It then assigns proxy clusters and singleton SNPs (those that did not group within a proxy cluster) to the nearest gene. The unadjusted gene-wide P-value is then the minimum P-value of any SNP assigned to the gene, whether a singleton or a member of a proxy cluster. P-value adjustments are made by multiplying the unadjusted P-value by the number of SNPs assigned to that gene. We used precomputed P-values from a recently published GWAS performed by deCODE ([52]Styrkarsdottir et al. 2008).
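The gene-wide adjustment described for ProxyGeneLD can be illustrated with a minimal Python sketch. It is not the ProxyGeneLD implementation: the function name, the input format, and the example values are hypothetical, each input entry is assumed to count as one assigned unit (a proxy cluster or a singleton SNP), and capping the adjusted value at 1.0 is an added assumption.

```python
from collections import defaultdict


def gene_wide_p_values(assignments):
    """Return {gene: (unadjusted_p, adjusted_p)}.

    `assignments` is an iterable of (gene, p_value) pairs, one pair per unit
    (proxy cluster or singleton SNP) assigned to that gene. Treating each
    unit as one test is an assumption of this sketch.
    """
    by_gene = defaultdict(list)
    for gene, p_value in assignments:
        by_gene[gene].append(p_value)

    result = {}
    for gene, p_values in by_gene.items():
        unadjusted = min(p_values)                        # best P-value for the gene
        adjusted = min(1.0, unadjusted * len(p_values))   # scale by number of units; cap is an assumption
        result[gene] = (unadjusted, adjusted)
    return result


# Hypothetical input: GENE_A has three assigned units, GENE_B has one.
example = [("GENE_A", 0.01), ("GENE_A", 0.20), ("GENE_A", 0.50), ("GENE_B", 0.04)]
gene_p = gene_wide_p_values(example)
nominally_significant = sorted(g for g, (_, adj) in gene_p.items() if adj <= 0.05)
```

In this toy input both genes pass the adjusted P ≤ 0.05 filter, although GENE_A pays a threefold penalty for having more assigned units. This penalty is how the adjustment counteracts gene length and SNP density.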
These data are available for download from [53]http://content.nejm.org/cgi/content/full/NEJMoa0801197/DC1 as individual text files. The GWAS consisted of 5,861 Icelandic subjects phenotyped for hip (HBMD) and spine (SBMD) BMD and genotyped at 301,019 SNPs ([54]Styrkarsdottir et al. 2008). All SNPs for both traits were annotated using ProxyGeneLD. LD patterns were determined using CEU HapMap samples, and genes were defined as the transcript plus a 1-kbp extension upstream to include promoter regions. P-values were assigned to a total of 16,878 genes. Genes with an adjusted P ≤ 0.05 for at least one of the two BMD traits were referred to as the nominally significant GWAS geneset (NSGG).

GO and pathway-enrichment analysis

We performed GO and pathway-enrichment analysis for the NSGG and network modules by using the Database for Annotation, Visualization and Integrated Discovery [DAVID ([55]Dennis et al. 2003; [56]Huang da et al. 2009)]. Each analysis was performed using the functional annotation charting and functional annotation clustering options. Functional annotation charting tests each individual GO or pathway term for enrichment. In contrast, functional annotation clustering combines single categories with a significant overlap in gene content and then assigns an enrichment score (ES; defined as the –log10 of the geometric mean of the P-values for each single term in the cluster) to each cluster, making interpretation of the results more straightforward. Functional annotation clustering cannot be performed for more than 3000 genes. Because the NSGG contained 3083 genes, we used the top 3000 genes ranked on adjusted P-value for this analysis. The search was limited to KEGG and Biocarta pathways, PFAM protein domains, and GO terms in the “Molecular Function,” “Biological Process,” and “Cellular Component” categories. Single categories were considered significantly enriched at a false discovery rate (FDR) ≤ 5%.
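The enrichment score defined above (the –log10 of the geometric mean of the single-term P-values in a cluster) is equivalent to the arithmetic mean of the individual –log10 P-values. A minimal Python sketch of that definition follows; the function name is an assumption for illustration and is not part of DAVID.

```python
import math


def enrichment_score(p_values):
    """-log10 of the geometric mean of the P-values in one annotation cluster.

    Computed as the mean of -log10(p), which is algebraically identical and
    avoids underflow when many small P-values are multiplied together.
    """
    if not p_values:
        raise ValueError("a cluster needs at least one P-value")
    return -sum(math.log10(p) for p in p_values) / len(p_values)


# A cluster whose terms all sit at P = 0.05 scores about 1.3.
borderline = enrichment_score([0.05, 0.05, 0.05])
# Two terms at P = 1e-3 and P = 1e-5 have geometric mean 1e-4, hence a score of 4.
strong = enrichment_score([1e-3, 1e-5])
```

On this scale a score of about 1.3 corresponds to a geometric-mean P of 0.05.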
To assess the significance of functional clusters, we created 10 sets of 3000 genes randomly selected from the aforementioned list of 16,878 genes with assigned P-values. Functional annotation clustering was performed for all 10 random gene sets. The maximum ES observed among the random sets was 2.75. Therefore, we used an ES cutoff of ≥3.0 as the threshold for significance in all analyses.

Gene expression data processing

To generate gene coexpression networks we used previously published microarray data from 26 healthy Chinese females ages 20−45 yr, with a mean age of 27.3 yr ([57]Lei et al. 2009). In this study expression profiles were generated from circulating monocytes that were isolated and purified from subjects with low (n = 12) and high (n = 14) BMD. We downloaded the Affymetrix CEL files from the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus ([58]GSE7158; [59]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE7158). The raw data were imported and processed using the affy package ([60]Gautier et al. 2004) for the R Language and Environment for Statistical Computing ([61]Ihaka and Gentleman 1996). The robust multiarray average (RMA) algorithm was used to normalize the data and generate probe-level expression values ([62]Irizarry et al. 2003).

WGCNA

Network analysis was performed using the WGCNA R package ([63]Langfelder and Horvath 2008). An extensive overview of WGCNA, including numerous tutorials, can be found at [64]http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/. To begin, we identified all probes assaying the expression of NSGG genes. To eliminate noise due to genes that were not expressed, we selected NSGG probes whose levels exceeded the median level of expression across the entire array. As part of our quality control, we performed clustering and principal components analyses based on the expression of these probes. Two samples from the high BMD group, [65]GSM172405 and [66]GSM172418, were significant outliers and were removed from the analysis.
A preliminary calculation of network connectivity was used to identify the most connected probe for each gene. A WGCNA network for the selected probes was generated exactly as described in ([67]Farber 2010). Gene Significance (GS) for each network gene was defined as the absolute value of its Pearson correlation with BMD status. Module Membership (MM) was calculated as the Pearson correlation between each gene’s expression and its module eigengene, calculated using Singular Value Decomposition ([68]Alter et al. 2000). Network depictions were constructed using Cytoscape ([69]Shannon et al. 2003).

In silico replication

To compare replication success rates in hubs and in genes with the most significant GWAS P-values, we used data from a second GWAS, the Framingham Osteoporosis Study [FOS ([70]Kiel et al. 2007)]. The FOS GWAS consisted of 1141 subjects genotyped at ∼100,000 SNPs. We downloaded the association data [in the form of SNPs and precomputed P-values generated using generalized estimating equation models ([71]Kiel et al. 2007)] for three BMD traits (femoral neck, lumbar spine, and trochanter) from the database of Genotype and Phenotype at NCBI ([72]http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap). SNP lists for each of the three traits were converted to gene lists using ProxyGeneLD precisely as described previously. A gene was considered successfully replicated if it had an unadjusted P ≤ 0.05 for at least one of the three BMD traits. The percentage of successfully replicated genes was calculated in the blue, magenta, greenyellow, and brown modules for the top 20%, 10%, and 5% of genes based on intramodular connectivity (k.in). These rates were compared with those for the top 20%, 10%, and 5% of GWAS network genes selected based on adjusted P-value from the deCODE ([73]Styrkarsdottir et al. 2008) GWAS or GS.

Results

Identifying genes with nominally significant genome-wide associations

An overview of the systems-level analysis of GWAS data is presented in [74]Figure 1.
The first step in the analysis was the identification of genes displaying evidence of association using data from a BMD GWAS [n = 5861 ([75]Styrkarsdottir et al. 2008)]. We used the ProxyGeneLD algorithm ([76]Hong et al. 2009), which takes LD patterns into account when assigning SNPs to genes and adjusts for gene length and SNP density biases (see Materials and Methods), to generate gene-wide adjusted P-values for two osteoporosis-related traits, HBMD and SBMD. Gene-wide P-values were calculated for a total of 16,878 genes. Of these, 1777 and 1861 had gene-wide adjusted P ≤ 0.05 for HBMD and SBMD, respectively. By combining the two lists, 3083 unique genes were identified with adjusted P ≤ 0.05 for at least one of the BMD traits. We refer to these genes as NSGG.

Figure 1. Overview of the systems-level analysis of GWAS data.

To determine whether gene length and SNP density were potential confounders in the NSGG, we calculated the correlation between these two variables and HBMD unadjusted (defined as the minimum P-value for proxy clusters and single SNPs assigned to a particular gene) and adjusted P-values. As described previously, 1777 genes had adjusted P ≤ 0.05. In contrast, 5228 genes had unadjusted P ≤ 0.05. In the latter gene set, we observed a strong correlation between unadjusted P and gene length (r = 0.46, P = 0) and SNP density (r = 0.50, P = 0). However, this correlation was not observed after adjustment for gene length (r = −0.01, P = 0.88) or SNP density (r = −0.01, P = 0.74). Thus, our network analysis of GWAS genes should not be influenced by these systematic biases.

Conventional pathway enrichment fails to pinpoint specific biological mechanisms

We next determined whether the NSGG was enriched for “biological themes” using the conventional approach of GO and pathway enrichment analysis. DAVID ([79]Dennis et al. 2003; [80]Huang da et al.
2009) was used for this analysis, although we also used WebGestalt ([81]Zhang et al. 2005) and observed similar results. A total of 24 individual terms, all of which were GO categories, were significantly enriched in the NSGG at an FDR ≤ 5% (Supporting Information, [82]File S1). The most significant term was protein binding (GO:0005515; FDR = 1.7 × 10^−10). Other significant categories included developmental process (GO:0032502; FDR = 9.5 × 10^−5), cation binding (GO:0043169; FDR = 2.5 × 10^−3), and cell differentiation (GO:0030154; FDR = 2.7 × 10^−2). DAVID also generates category clusters by condensing sets of related terms ([83]Dennis et al. 2003; [84]Huang da et al. 2009). This condenses redundant categories, identifies terms containing a smaller number of genes that on their own would require higher fold enrichments to reach statistical significance, and makes interpreting the results much easier. Each cluster receives an ES, which is defined as the geometric mean (on a –log10 scale) of the P-values for all single terms in the cluster. A total of 32 clusters had ESs > 1.3 (equivalent to a nominal P ≤ 0.05); however, it was unclear whether this was an appropriate significance cutoff. To determine the distribution of ESs observed using a set of random genes we created 10 sets of 3000 genes randomly selected from the whole genome and ran each through DAVID. ESs for the random gene sets ranged from 1.36 to 2.75. Therefore, we selected an ES cutoff of ≥3.0. Using this threshold, a total of five significant clusters were identified in the NSGG ([85]Table 1 and [86]File S2). 
The top GO terms in each of the five clusters were “intracellular part,” “metal ion binding,” “developmental process,” “intracellular organelle part,” and “organelle inner membrane.” These data indicate that the NSGG is enriched for groups of genes sharing similar functionality; however, because the identified categories are very general in nature this analysis does little to pinpoint specific biological mechanisms underlying variation in BMD.

Table 1. Gene category and pathway enrichment analysis of NSGG genes.

| Functional Group | Top GO Term | Top Term FDR | ES^a |
|---|---|---|---|
| 1 | GO:0044424∼intracellular part | 9.5 × 10^−7 | 6.3 |
| 2 | GO:0046872∼metal ion binding | 1.3 × 10^−4 | 5.9 |
| 3 | GO:0032502∼developmental process | 9.5 × 10^−5 | 5.6 |
| 4 | GO:0044446∼intracellular organelle part | 5.5 × 10^−2 | 4.2 |
| 5 | GO:0019866∼organelle inner membrane | 2.1 × 10^−1 | 3.3 |

^a ES, enrichment score, defined as the –log10 of the geometric mean of the uncorrected P-values for all single categories in each DAVID cluster.

Generation of a weighted gene coexpression network for NSGG genes

WGCNA reveals connections between genes using microarray expression data by grouping genes based on a topological overlap measure [TOM ([89]Dong and Horvath 2007; [90]Zhang and Horvath 2005)]. Two genes have a high TOM if they are highly interconnected with the same set of genes ([91]Dong and Horvath 2007; [92]Zhang and Horvath 2005). To evaluate the coexpression relationships between NSGG genes in a disease-relevant context we used microarray expression profiles of purified circulating monocytes isolated from individuals with discordant levels of BMD ([93]Lei et al. 2009). The dataset included 24 profiles from young (mean age = 27.3 years) Chinese females, 12 with low BMD (mean Z-score = −1.72) and 12 with high BMD (mean Z-score = 1.57).
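To make the topological overlap measure concrete, the following Python sketch computes the weighted TOM of Zhang and Horvath (2005) from a small correlation matrix. It illustrates the published formula and is not the WGCNA package code. The function names are hypothetical, and the soft-thresholding power `beta` is left as a required parameter because the power used in this study is not stated here.

```python
def soft_threshold_adjacency(cor, beta):
    """Weighted adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [[abs(cor[i][j]) ** beta if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij).

    l_ij is the sum over u (u != i, j) of a_iu * a_uj, i.e. the connection
    strength shared through common neighbours, and k_i is the sum of a_iu.
    """
    n = len(adj)
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[0.0] * n for _ in range(n)]
    for i in range(n):
        tom[i][i] = 1.0
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            tom[i][j] = tom[j][i] = (shared + adj[i][j]) / (
                min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom


# Toy unweighted network: genes 0 and 1 are not directly connected,
# but both are connected to gene 2.
toy_adj = [[0.0, 0.0, 1.0],
           [0.0, 0.0, 1.0],
           [1.0, 1.0, 0.0]]
toy_tom = topological_overlap(toy_adj)
```

In the toy network genes 0 and 1 have no direct edge, yet their topological overlap is nonzero because they share a neighbour. This is the sense in which two genes have a high TOM when they are interconnected with the same set of genes. The dissimilarity used for clustering is then 1 − TOM.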
We chose to use this dataset because it represents the largest study performed to date with both expression profiles for a cell type relevant to BMD [monocytes are precursors to bone-resorbing osteoclasts ([94]Fujikawa et al. 1996)] and clinical information on the subjects. After excluding non- and lowly expressed genes we identified probes representing 1918 (62%) of the 3083 NSGG genes and applied the WGCNA algorithm to generate a GWAS network. The resulting network was composed of 13 distinct gene modules ([95]Figure 2). Sixty-three of the genes failed to fit within a distinct group and were assigned to the “grey” module. The modules ranged in size from 40 (salmon module) to 356 genes (turquoise module). A complete list of module assignments and network metrics for all genes is included in [96]File S3.

Figure 2. WGCNA coexpression network composed of BMD GWAS genes. Shown is the hierarchical clustering dendrogram for all 1918 genes used in the analysis. Each line is an individual gene. Genes were clustered based on a dissimilarity measure (1 − TOM). The branches correspond to modules of highly interconnected groups of genes. The tips of the branches represent genes that are the least dissimilar and thus share the most similar network connections. Below the dendrogram each gene is color coded to indicate its module assignment.

The WGCNA approach has been used to generate robust networks in several diverse applications ([99]Chen et al. 2008; [100]Gargalovic et al. 2006; [101]Ghazalpour et al. 2006; [102]Horvath et al. 2006; [103]Oldham et al. 2008; [104]van Nas et al. 2009; [105]Winden et al. 2009), including experiments with a similar or smaller number of samples relative to this study ([106]Gargalovic et al. 2006; [107]Gong et al. 2007). Most WGCNA analyses, however, use a series of preliminary filtering steps to select the most biologically meaningful genes for network construction ([108]Ghazalpour et al. 2006).
In such studies, the expression data exclusively determine which genes are used in the analysis. Because our network genes were not selected entirely based on expression profiles, we wanted to ensure that the resulting modules were cohesive and robust. To test cohesiveness, we calculated the mean MM for each module. MM is the correlation between each gene in a module and its module eigengene. Thus, it is a measure of how tightly a particular gene fits into its module. The greater the mean MM for a module, the more similar the coexpression relationships are across the module. The mean MM ± SEM ranged from 0.60 ± 0.01 (brown module) to 0.74 ± 0.01 (tan module), indicating that modules consisted of genes sharing highly similar expression patterns. We addressed robustness, as described previously ([109]Ghazalpour et al. 2006), by randomly splitting the dataset in half 1000 times and calculating k.in in each half. The analysis was performed for the largest (turquoise) and smallest (salmon) modules. The mean correlation ± SEM between the real and random k.in values was 0.65 ± 0.05 and 0.52 ± 0.03 in the turquoise and salmon modules, respectively. Thus, the GWAS network modules are cohesive and robust to exclusion of half the data.

Characterization of module content reveals a key role for oxidative phosphorylation in the regulation of BMD

One way in which network analysis can inform GWAS is to expose pathway enrichments that were not observed in a large set of nominally significant genes, such as the NSGG. We expected that by parsing genes based on coexpression similarities, more refined functions would be condensed within modules, revealing enrichments for more specific processes. This would improve the process of converting a detectable enrichment into a testable hypothesis. To determine whether specific modules were enriched for novel gene categories or pathways we repeated the DAVID analysis for each module.
Of the 13 modules, five had at least one cluster with an ES ≥ 3.0. Interestingly, the turquoise module stood out as displaying detailed enrichments that were not observed in the analysis of the entire NSGG ([110]Table 2 and [111]File S4). In the turquoise module, significant enrichments were observed for six clusters with the following top terms: “cytoplasmic part” (ES = 8.1), “mitochondrion” (ES = 7.3), “electron carrier activity” (ES = 4.9), “electron carrier activity” (ES = 4.2), “hydro-lyase activity” (ES = 3.9), and “RNA splicing” (ES = 3.5). Within each cluster there were a number of terms that were not significant in the entire NSGG, suggesting that partitioning genes into coexpression modules can reveal hidden enrichments.

Table 2. Network modules with significant DAVID enrichments.

| Module | Number of Genes | Top Term for Each Cluster | Top Term FDR | ES^a |
|---|---|---|---|---|
| Pink | 112 | GO:0044446∼intracellular organelle part | 0.78 | 3.1 |
| | | GO:0019538∼protein metabolic process | 6.0 × 10^−2 | 3.0 |
| Black | 134 | GO:0043231∼intracellular membrane-bound organelle | 2.0 × 10^−2 | 3.0 |
| Red | 134 | GO:0044429∼mitochondrial part | 2.0 × 10^−2 | 3.4 |
| | | GO:0044446∼intracellular organelle part | 1.3 × 10^−1 | 3.1 |
| Blue | 297 | GO:0043231∼intracellular membrane-bound organelle | 2.7 × 10^−6 | 5.9 |
| | | hsa00040:Pentose and glucuronate interconversions | 2.6 × 10^−6 | 4.1 |
| | | GO:0005634∼nucleus | 5.9 × 10^−6 | 3.7 |
| Turquoise | 356 | GO:0044444∼cytoplasmic part | 7.1 × 10^−10 | 8.1 |
| | | GO:0005739∼mitochondrion | 6.0 × 10^−8 | 7.3 |
| | | GO:0009055∼electron carrier activity | 7.5 × 10^−5 | 4.9 |
| | | GO:0009055∼electron carrier activity | 7.5 × 10^−5 | 4.2 |
| | | GO:0016836∼hydro-lyase activity | 2.0 × 10^−2 | 3.9 |
| | | GO:0008380∼RNA splicing | 9.2 × 10^−2 | 3.5 |

^a ES, enrichment score, defined as the –log10 of the geometric mean of the uncorrected P-values for all single categories in each DAVID cluster.

To investigate the enrichments in more detail, we focused on a single enriched term in cluster 2, the KEGG pathway “oxidative phosphorylation” (oxphos), because it represented one of the most specific enriched terms.
This single term was not enriched in the NSGG (FDR = 99.8); however, its enrichment in the turquoise module was significant (FDR = 1.1 × 10^−3). Of the 356 turquoise module genes, 16 (4.5%) were involved in oxphos ([114]Table 3). To determine whether this enrichment was specific to the GWAS network, we generated 100 random networks. Each network was created by selecting 3083 genes at random and applying the same gene filtering steps and network parameters used to construct the real network. A total of 114 of the 20,080 genes (0.6%) with unique gene identifiers on the array belonged to the KEGG oxphos pathway. As shown above, 16 of the 356 turquoise module genes (4.5%) were involved in oxphos. Using Fisher’s exact test, this enrichment is highly significant (4.5% vs. 0.6%; P = 1.8 × 10^−9). We then performed this same test for each of the 1709 modules belonging to the 100 random networks. None of the random module enrichment P-values was more significant than the P-value for the real turquoise module, indicating that this enrichment is specific to the BMD GWAS network.

Table 3. Members of the turquoise module involved in oxidative phosphorylation.

| Gene | Unadjusted GWAS P-value | k.in^a rank | k.total^b rank | r^c |
|---|---|---|---|---|
| NDUFB6 | 1.0 × 10^−2 | 1 | 8 | −0.10 |
| COX5B | 4.9 × 10^−3 | 2 | 9 | −0.22 |
| COX8A | 5.0 × 10^−3 | 3 | 5 | −0.22 |
| COX7A2 | 4.2 × 10^−3 | 6 | 22 | −0.21 |
| NDUFA13 | 7.8 × 10^−3 | 9 | 27 | −0.16 |
| ATP5J2 | 9.0 × 10^−4 | 14 | 54 | −0.20 |
| NDUFS7 | 3.2 × 10^−2 | 15 | 60 | −0.35 |
| COX6B1 | 1.4 × 10^−2 | 20 | 49 | −0.25 |
| ATP5G2 | 1.2 × 10^−3 | 24 | 41 | −0.13 |
| NDUFB1 | 6.0 × 10^−3 | 29 | 70 | −0.08 |
| NDUFA2 | 3.8 × 10^−2 | 32 | 128 | −0.39 |
| NDUFA11 | 6.1 × 10^−3 | 36 | 113 | −0.14 |
| COX17 | 1.0 × 10^−2 | 54 | 199 | −0.19 |
| NDUFV2 | 8.0 × 10^−4 | 55 | 111 | −0.12 |
| NDUFA7 | 8.0 × 10^−3 | 69 | 252 | −0.30 |
| ATP6V1H | 5.6 × 10^−4 | 181 | 491 | 0.48 |

^a k.in = Intramodule (the turquoise module) connectivity.
^b k.total = Total network connectivity.
^c r = Pearson correlation between expression of gene in monocytes and BMD status (low vs. high).
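The enrichment test for oxphos genes in the turquoise module can be sketched in Python using only the counts given in the text. The sketch computes a one-sided hypergeometric tail probability, which equals a one-sided Fisher’s exact test on the corresponding 2 × 2 table. The exact contingency table and sidedness behind the reported P-value are not stated, so treating the 356 module genes as a draw from the 20,080 array genes is an assumption, and the result need not reproduce the reported value. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` items are sampled without replacement
    from `population` items, of which `successes` belong to the category."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    # Exact integer arithmetic; the final division yields a float.
    return favourable / total


# Counts from the text: 114 oxphos genes among 20,080 array genes,
# and 16 oxphos genes among the 356 turquoise module genes.
expected_oxphos = 356 * 114 / 20080   # about 2 genes expected by chance
p_enrichment = hypergeom_upper_tail(20080, 114, 356, 16)
```

Roughly two oxphos genes would be expected in a module of this size by chance, against the 16 observed. The same function could be applied to each module of a random network to build the empirical comparison described above.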
Oxphos genes were also among the most connected in both the turquoise module and the whole network ([119]Table 3). In fact, the three most connected turquoise hubs were oxphos genes. In addition, of the 16 total genes, 15 were in the top 20% of genes when ranked on k.in ([120]Table 3). Another observation was that the expression of all 15 highly connected oxphos genes was negatively correlated with BMD status ([121]Table 3). Thus, by exploring the content of the turquoise module, we have identified an association between genetic variation in oxphos genes and BMD, determined that oxphos genes are module and network hubs, and determined that oxphos gene expression in monocytes was inversely correlated with BMD levels.

Discovery of a turquoise submodule highly correlated with BMD status

In addition to content, module topology (the unique distribution of edges among nodes) can also be evaluated in WGCNA networks. We investigated turquoise module topology by generating a network view showing all edges with a TOM ≥ 0.15 and their corresponding nodes ([122]Figure 3). The network consisted of 88 nodes and 256 edges. An initial inspection indicated that most nodes were grouped into a central core (containing many of the oxphos genes identified previously in this article) with two small submodules radiating from COX5B, an oxphos gene and the second most connected node in the module. We then overlaid information regarding the correlation between each gene’s expression and BMD status in the monocyte expression study. We suggest that correlation is a meaningful measure of biological significance, especially when considering GWAS genes, because it is likely that the correlations reflect subtle genetically regulated differences in expression that are associated with alterations in BMD. As shown in [123]Figure 3, most of the genes were either not correlated (nodes shaded white) or slightly negatively correlated with BMD (nodes shaded light green).
None of the genes were significantly positively correlated (max correlation in the turquoise module is 0.10). Interestingly, the genes in one of the submodules were among the most negatively correlated (shaded dark green) in the turquoise module and the entire network ([124]Table 4). One of the submodule genes, IFI35, was the second most negatively correlated (r = −0.58, P = 2.7 × 10^−3) with BMD in the NSGG network, and 4 of the 8 genes in the sub-module were in the top 50. The average correlation for this group was −0.42. To determine the probability of randomly observing a group of 8 genes this negatively correlated ([125]Table 4), we created 10^6 sets of 8 genes selected at random from the turquoise module. None of the random gene sets had an average correlation as extreme as that of the turquoise sub-module (most negative random average r = −0.36).

Figure 3. Network view of the turquoise module reveals a submodule of genes negatively correlated with BMD status. This network contains all turquoise module edges with TOM ≥ 0.15 and their corresponding nodes. Genes are shaded based on their correlation with BMD from white (no correlation) to dark green (strong negative correlation). Node sizes are proportional to each gene’s –log10 GWAS P (most significant unadjusted GWAS P-value for either HBMD or SBMD). The submodule of interest is on the right-hand side of the figure. Notice that this group of genes is highly interconnected and negatively correlated with BMD status.

Table 4. Genes comprising the turquoise sub-module.
| Gene | Description | Unadjusted GWAS P-Value | r^a | r P-Value | Meta-analysis Distance, Kbp^b | Meta-analysis P-Value^c |
|---|---|---|---|---|---|---|
| IFI35 | Interferon-induced protein 35 | 1.0 × 10^−2 | −0.58 | 2.7 × 10^−3 | 742 | 5.1 × 10^−7 |
| TAP1 | Transporter 1, ATP-binding cassette, subfamily B (MDR/TAP) | 9.9 × 10^−4 | −0.48 | 1.7 × 10^−2 | | |
| EPSTI1 | Epithelial stromal interaction 1 (breast) | 8.0 × 10^−4 | −0.48 | 1.8 × 10^−2 | 510 | 9.8 × 10^−8 |
| CMPK2 | Cytidine monophosphate (UMP-CMP) kinase 2, mitochondrial | 9.6 × 10^−3 | −0.47 | 2.2 × 10^−2 | | |
| PARP12 | Poly (ADP-ribose) polymerase family, member 12 | 1.9 × 10^−4 | −0.42 | 4.0 × 10^−2 | | |
| ZCCHC2 | Zinc finger, CCHC domain containing 2 | 3.3 × 10^−3 | −0.37 | 7.5 × 10^−2 | 172 | 4.9 × 10^−9 |
| LYSMD2 | LysM, putative peptidoglycan-binding, domain containing 2 | 6.0 × 10^−3 | −0.35 | 9.0 × 10^−2 | 564 | 1.4 × 10^−6 |
| LOC26010 | Spermatogenesis associated, serine-rich 2-like | 1.3 × 10^−3 | −0.24 | 2.6 × 10^−1 | | |

^a r, Pearson correlation between expression of gene in monocytes and BMD status (low vs. high).
^b The distance between the TSS for each respective gene and the location of a genome-wide suggestive or significant BMD association identified by ([132]Estrada et al. 2012).
^c The P-value for the associations identified by ([133]Estrada et al. 2012).

Using gene information and literature searches, we found no obvious functional connection between the genes that comprised this subnetwork. However, using expression data from a panel of mouse tissues [[134]http://www.biogps.org ([135]Lattin et al. 2008; [136]Su et al. 2002, [137]2004)] we did observe that six of the genes are expressed in osteoclasts (EPSTI1, IFI35, PARP12, CMPK2, ZCCHC2, and TAP1) and the other two are expressed in osteoblasts (LOC26010 and LYSMD2). The group of osteoclast genes is also the most negatively correlated with BMD ([138]Table 4).
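The resampling test used to assess the eight-gene sub-module (comparing its average correlation with that of random sets of eight turquoise genes) can be sketched as follows. The eight correlations are those listed in Table 4. The remaining turquoise-module correlations are not given in the text, so the background values below are randomly generated placeholders and the resulting P-value is purely illustrative. The function name and the reduced number of resamples are also assumptions.

```python
import random


def resampling_p_value(all_r, observed_mean, set_size, n_sets, seed=0):
    """Fraction of random gene sets whose mean correlation is at least as
    negative as the observed mean (a one-sided empirical P-value)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sets):
        sample = rng.sample(all_r, set_size)   # sampling without replacement
        if sum(sample) / set_size <= observed_mean:
            hits += 1
    return hits / n_sets


# Correlations with BMD status for the eight sub-module genes (Table 4).
submodule_r = [-0.58, -0.48, -0.48, -0.47, -0.42, -0.37, -0.35, -0.24]
observed_mean = sum(submodule_r) / len(submodule_r)

# PLACEHOLDER background: 348 made-up correlations standing in for the other
# turquoise genes (356 in total); real values would come from File S3.
_rng = random.Random(1)
background_r = [_rng.uniform(-0.4, 0.1) for _ in range(348)] + submodule_r

# 10,000 resamples keep the sketch fast; the study used 10^6.
empirical_p = resampling_p_value(background_r, observed_mean, 8, 10_000)
```

With the real module correlations in place of the placeholder background, an empirical P-value of zero over 10^6 resamples would correspond to the result reported above, where no random set was as negatively correlated as the sub-module.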
Next, we determined whether any of the eight genes were located in close proximity to suggestive or significant GWAS loci (P < 1.0 × 10^−5) identified in a recent meta-analysis of BMD (Estrada et al. 2012). Interestingly, the transcription start sites of four of the eight genes (EPSTI1, IFI35, ZCCHC2, and LYSMD2) are less than 750 Kbp from a GWAS association (Table 4). Thus, these genes form a highly interconnected submodule whose expression is negatively correlated with BMD. Together, these data suggest that they play a role in the regulation of BMD. Again, as demonstrated above, the functional interconnections between genes in this submodule, and its correlation with BMD, were only revealed by network analysis.

Identifying functional connections between known and novel genes

One of the advantages of our approach is the ability to identify connections between novel genes with evidence of association and those with a previously established role in disease. This information can be used in two ways. First, it can identify new pathways in which a known gene may participate; second, it can identify novel genes through “guilt by association.” To investigate the network connections of a known gene, we focused on tumor necrosis factor (TNF), the most highly connected gene in the NSGG network with a known role in BMD. TNF was the 13th most connected gene in the entire network, with a total network connectivity (k.total) of 29.0 (max k.total = 35.2). It was the 6th most connected gene in the blue module, with a k.in of 27.6 (max blue module k.in = 30.8). TNF is known to play a prominent role in osteoclastogenesis (Lam et al. 2000), and several studies have found associations between TNF polymorphisms and BMD (Fontova et al. 2002; Kim et al. 2009). In the deCODE GWAS, it was associated with HBMD and SBMD with unadjusted P-values of 1.2 × 10^−3 and 1.6 × 10^−2, respectively.
The fact that TNF is one of the hubs of a monocyte network provides additional support for the biological relevance of the GWAS network. We created a TNF submodule by identifying all edges within the blue module involving TNF with a TOM ≥ 0.15. The submodule contained 99 genes (Figure 4). Using DAVID, we identified three significant clusters enriched in the submodule, with terms related to “nuclear proteins” (ES = 4.0), “gene expression” (ES = 3.6), and “regulation of transcription” (ES = 3.0) (File S5). Of the 99 genes, 47 belonged to the GO cellular component category “nucleus” (FDR = 3.82 × 10^−6, 1.9-fold enrichment), and 32 were in the GO molecular function category “transcription factor activity” (FDR = 1.8 × 10^−4, 3.5-fold enrichment). In support of its disease relevance, the submodule included several genes with known roles in bone metabolism, such as nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor; NR3C1); protein tyrosine phosphatase, receptor type, E (PTPRE); CD44 molecule (Indian blood group; CD44); NLR family, pyrin domain containing 3 (NLRP3); FBJ murine osteosarcoma viral oncogene homolog B (FOSB); and dual-specificity phosphatase 6 (DUSP6). Thus, our network analysis rediscovered TNF as a key intracellular signaling “hub” gene important in bone metabolism. More importantly, this network can be mined in future studies to identify novel genes that interact with TNF (e.g., downstream targets of TNF signaling) to affect bone mass.

Figure 4. Characterizing the coexpression relationships for a highly connected known BMD gene. This TNF-centered network shows all edges connected to TNF with a TOM ≥ 0.15 and their corresponding nodes. Genes are color-coded based on their correlation with BMD: white (−0.20 < r < 0.20), blue (r ≥ 0.20), and yellow (r ≤ −0.20). Node sizes are proportional to each gene’s −log10 GWAS P (most significant unadjusted GWAS P-value for either HBMD or SBMD).

Relating network concepts to measures of biological relevance

Exploring GWAS genes in the context of an expression network also allows one to relate network concepts, such as MM, to a measure of biological relevance. If a network property inherent to a specific module is associated with disease, this suggests that the module serves an important biological role. It may also be possible to use the property as a gene screening tool to select genes for downstream studies. We focused on the association between the network concept MM and GS, a measure of biological relevance. GS was defined as the absolute value of the correlation between a gene’s expression and BMD status. Of the 13 network modules, significant (P < 0.003 after adjusting for the number of modules) positive correlations were observed between MM and GS in the magenta (r = 0.44, P = 9.9 × 10^−5), greenyellow (r = 0.66, P = 1.6 × 10^−10), and brown (r = 0.36, P = 1.9 × 10^−7) modules (Figure 5).

Figure 5. Correlation between MM and GS for each of the 13 distinct GWAS modules. MM (defined as the correlation between each gene’s expression and its module eigengene) for each module is plotted against GS (defined as each gene’s correlation with BMD status). MM in the blue, magenta, greenyellow, and brown modules is significantly (P < 0.003) correlated with GS.

On the basis of the correlations between MM and GS, we hypothesized that hub genes from these three modules were the most biologically relevant and thus the most likely to represent true positive associations with BMD. If true, this suggests that selecting genes based on MM may result in greater replication success rates in subsequent studies than selecting genes using the traditional metric, GWAS P-value.
To test this, we performed an in silico replication study using data from a second BMD GWAS [FOS (Kiel et al. 2007)]. Of the 1918 total network genes, 1264 were annotated in FOS using ProxyGeneLD. Genes were considered successfully replicated if their gene-wide association P-values were below the significance thresholds defined below for any one of three BMD traits (femoral neck, lumbar spine, and trochanter BMD). From the 1264 network genes annotated in both studies, we compared the FOS replication rates for three groups of genes: (1) hub genes (based on k.in) from the magenta, greenyellow, and brown modules; (2) network genes ranked on GS; and (3) network genes ranked on P-value in the deCODE GWAS. The replication rates were compared for the top 20%, 10%, and 5% of genes within each group at three different significance levels: P ≤ 0.05, P ≤ 0.01, and P ≤ 0.001. As shown in Table 5, selecting genes on k.in resulted in replication rates at least as high as those of the other two metrics in every comparison, and higher than those obtained with GWAS P-value in all of them. The difference in replication rate between k.in and GWAS P-value increased as the definition of a hub gene became more stringent. For example, when comparing the top 5% of hubs with the top 5% of genes based on P-value at P ≤ 0.05, the replication rate was twofold higher for hubs (60.0% vs. 30.1%). Although validation studies will be needed, these data suggest that k.in may be a better metric than GWAS P-value for selecting genes for subsequent replication studies.

Table 5. Replication rates of network genes selected using intramodular connectivity (k.in), gene significance (GS), or P-value.

Selection metric | Top 20%^a, P ≤ 0.05 | Top 20%^a, P ≤ 0.01 | Top 20%^a, P ≤ 0.001 | Top 10%, P ≤ 0.05 | Top 10%, P ≤ 0.01 | Top 10%, P ≤ 0.001 | Top 5%, P ≤ 0.05 | Top 5%, P ≤ 0.01 | Top 5%, P ≤ 0.001
k.in | 35.0% | 15.0% | 2.5% | 57.9% | 26.3% | 5.3% | 60.0% | 10.0% | 10.0%
GS | 35.0% | 2.5% | 0.0% | 21.4% | 5.3% | 0.0% | 20.0% | 0.0% | 0.0%
P-value | 32.0% | 14.6% | 0.0% | 34.1% | 17.5% | 0.0% | 30.1% | 9.5% | 0.0%

^a Genes selected for replication were in the top 20%, 10%, and 5% based on k.in or GS in the magenta, greenyellow, and brown modules or P-value using all network genes.

Discussion

In this study, we have applied network theory to a list of genes with evidence of association with BMD using disease-relevant microarray gene expression data in subjects with known BMD status. We demonstrate that network analysis can group genes into modules that are enriched for specific biological processes. In some cases the enrichments were unique to modules and were more detailed and specific than those identified in the entire gene set. We also show that module topology can be used to identify groups of interconnected genes strongly associated with a clinical trait. Not only can this approach be used to reveal hidden enrichments, but it can also identify potentially important coexpression relationships for genes that exceed genome-wide significance thresholds or that have been previously associated with the disease. We also demonstrate that for three of the modules there was a significant correlation between MM and GS. We go on to provide evidence suggesting that hub genes replicate at a higher rate than genes selected using GWAS P-value or GS. This study provides a framework for combining network analysis and gene expression data to extract additional biological information from GWAS data. One of the limitations of GWAS is that it does not provide functional information for associated genes.
Our systems-level approach does so by grouping genes using expression data from a cell type or tissue that is relevant to the disease in subjects with clinical data. Our discovery of the turquoise submodule of eight genes negatively correlated with BMD is a good example. Importantly, the interconnections between genes in this group could only have been identified by studying their relationships in a disease context. This information, combined with the knowledge that they are expressed in mouse osteoclasts, can be used to guide in vitro and in vivo experiments to validate their role in bone.

The major bottleneck in any analysis using GWAS data is generating gene lists. Because of the nature of GWAS data, many SNPs with nominally significant P-values will be false positives. This, coupled with the difficulties in converting SNP-based to gene-based P-values, leads to gene lists that contain a considerable level of noise. What is clear from this study and others (Hong et al. 2009) is that potential biases have to be taken into consideration. In addition, our data suggest that functional grouping using coexpression similarities is an effective approach for separating noise from real biological signal. We support this by showing that the inherent network concept MM is correlated with GS in three of the 13 modules.

The main purpose of any analysis designed to mine GWAS data is the generation of testable hypotheses. We believe a systems-level approach offers many advantages over other strategies for this purpose. For example, we demonstrate that parsing GWAS gene lists into functional groups identified a key role for oxidative phosphorylation, which can now be experimentally validated. Additionally, we identified novel genes based on their connection to known bone genes, membership in an enriched pathway, or connectivity in one of the modules in which MM was correlated with BMD.
Such genes can be tested to validate their associations and to investigate their biological role in functional genomics and replication studies.

Oxidative stress is known to be increased in age-related diseases such as osteoporosis. It is also known that oxphos plays a direct and key role in bone metabolism (Bratic and Trifunovic 2010; Kousteni 2011). In bone modeling and remodeling, osteoclasts resorb mineral by acidifying the bone matrix (Blair 1998). This process requires significant energetic resources, which are primarily generated through the oxidative phosphorylation of glucose (Williams et al. 1997). Recently, it has been demonstrated that increased oxidative phosphorylation occurs in osteoclast precursors as they differentiate into mature osteoclasts (Kim et al. 2007). Importantly, our data suggest that genetic variation in multiple oxphos genes influences bone mass. Moreover, the expression of these genes in monocytes is inversely correlated with bone mass, suggesting that increased oxphos in monocytes/osteoclasts results in decreased bone mass.

Our analysis focused on osteoporosis; however, it is likely applicable to any disease with GWAS data and the appropriate gene expression profiles. GWAS have been performed for a myriad of diseases. As an example, our search of the Gene Expression Omnibus database at NCBI using the term “cancer” resulted in 344 datasets, suggesting that for many diseases relevant gene expression data that can be used for network analysis are already available.

In conclusion, this study provides proof-of-principle that a systems-level analysis of GWAS data is capable of adding significant value to existing datasets and future studies. This analysis provides a straightforward approach to identify pathways, individual genes, gene modules, and network concepts that play an important role in disease.

Supplementary Material

Supporting Information
supp_3_1_119__index.html (1.3KB, html)

Acknowledgments