Abstract
Background
Genome-wide association studies (GWAS) have been successful in
identifying disease-associated genetic variants. Recently, an
increasing number of GWAS summary statistics have been made available
to the research community, providing extensive repositories for studies
of human complex diseases. In particular, cross-trait associations at
the genetic level can be beneficial from large-scale GWAS summary
statistics by using genetic variants that are associated with multiple
traits. However, direct assessment of cross-trait associations using
susceptibility loci has been challenging due to the complex genetic
architectures in most diseases, calling for advantageous methods that
could integrate functional interpretation and imply biological
mechanisms.
Results
We developed an analytical framework for systematic integration of
cross-trait associations. It incorporates two different approaches to
detect enriched pathways and requires only summary statistics. We
demonstrated the framework using 25 traits belonging to four phenotype
groups. Our results revealed an average of 54 significantly associated
pathways (ranged between 18 and 175) per trait. We further proved that
pathway-based analysis provided increased power to estimate cross-trait
associations compared to gene-level analysis. Based on Fisher’s Exact
Test (FET), we identified a total of 24 (53) pairs of trait-trait
association at adjusted p[FET] < 1 × 10^− 3 (p[FET] < 0.01) among the
25 traits. Our trait-trait association network revealed not only many
relationships among the traits within the same group but also novel
relationships among traits from different groups, which warrants
further investigation in future.
Conclusions
Our study revealed that risk variants for 25 different traits
aggregated in particular biological pathways and that these pathways
were frequently shared among traits. Our results confirmed known
mechanisms and also suggested several novel insights into the etiology
of multi-traits.
Electronic supplementary material
The online version of this article (10.1186/s12864-018-5373-7) contains
supplementary material, which is available to authorized users.
Keywords: GWAS, Pathway enrichment analysis, Multi-dimensional scaling,
Cross-trait association, Summary statistics, Pleiotropy abbreviations
Background
Genetic variants that affect multiple traits are often called
pleiotropy [[35]1]. It has been increasingly recognized that
pleiotropic effects are likely widespread in human complex traits
[[36]1, [37]2]. For example, individuals carrying schizophrenia risk
alleles tended to be associated with increased risk of Crohn’s Disease
and Ulcerative Colitis [[38]3]; men with cystic fibrosis are often
infertile because of congenital absence of the vas deferens [[39]4].
The past decade has witnessed a wave of genome-wide association studies
(GWAS), generating rich resources of genetic variants in large cohorts
with various clinical phenotypes or traits. GWAS data have been
successful in identifying disease-associated genetic variants,
pinpointing biological mechanisms, explaining heritability [[40]5], and
further, they have greatly promoted the investigation of pleiotropy
systematically. One of the most well-known loci is the Major
Histocompatibility Complex (MHC) region on chromosome 6, which was
identified to be associated with many traits through GWAS [[41]6].
Recently, Pickrell et al. used large GWAS data to perform a systematic
search for genetic variants that influence pairs of 42 traits and to
interpret the associations [[42]3]. Mancuso et al. introduced a method
to estimate the local genetic correlation based on gene expression and
utilized it to 30 traits [[43]7]. We recently applied functional and
regulatory approaches to identify novel regulatory variants in
large-scale genome-phenome (GWAS-PheWAS) studies [[44]8]. Notably, the
cross-trait association is highly relevant, but not identical, to
pleiotropy because trait association derived from shared genetic
variants does not require the underlying causal mechanism. Yet
pleiotropy occurs when a genetic variant causes multiple traits. In
this study, we utilized GWAS data to estimate cross-trait association
at the pathway level, aiming to reveal biological and functional
insights towards the understanding of pleiotropic effects.
GWAS data provide systematic, unbiased measurement of genome-wide
variants. However, the marginal effect of each single nucleotide
polymorphism (SNP) is often small to moderate, making it challenge to
study cross-trait associations at the SNP level. Furthermore, SNP level
analyses are typically less informative for biological interpretation.
For example, > 80% of disease-associated variants identified from GWAS
are located in non-coding regions and for the few variants that are
located in coding regions, the biological mechanisms often remain
unclear [[45]9, [46]10]. There is thus an urgent need for methods that
could fill in the gap between GWAS discoveries and clinical
applications, i.e., methods of precision medicine [[47]11]. Grouping of
multiple SNPs across genes or well-established pathways can be
beneficial to improve statistical power and gain insight into a
biological system [[48]12–[49]14]. Over the past few years, numerous
pathway enrichment methods have emerged to greatly boost GWAS data
discovery and better understanding of the molecular basis of phenotypes
[[50]15, [51]16], such as gene set enrichment analysis (GSEA) and
ALIGATOR (Association LIst Go AnnoTatOR) [[52]17]. These algorithms
have greatly contributed to investigate the common associated pathways
in different traits and diseases. For example, the Network and Pathway
Analysis Subgroup of the Psychiatric Genomics Consortium (PGC) has
employed five pathway enrichment methods to study the common pathways
across multiple psychiatric disorders, providing novel insights into
the etiology of psychiatric disorders [[53]11]. However, methods to
investigate the combined effect of multiple SNPs, either at the gene or
the pathway levels, are universally subjected to influential factors
such as gene length, pathway size, SNP density, and linkage
disequilibrium (LD) [[54]18, [55]19]. In addition, such strategies have
not been extended to cross-trait associations systematically.
In this work, we collected GWAS summary statistics data for 25 traits
and developed an analysis framework to investigate the cross-trait
associations using two complementary pathway-based enrichment analysis
methods. We utilized a comprehensive annotation of pathways while
controlling for pathway redundancy. Our results reported molecular
pathways that were implied by genetic variants underlying multiple
traits and provided insights for the understanding of cross-trait
association and future clinical applications.
Results
Analysis workflow
Figure [56]1 illustrated our analysis framework. We started with the
GWAS summary statistics for 25 traits downloaded from nine consortia.
We grouped these traits into four general groups (Table [57]1): (i) six
anthropometric and social traits: Body Mass Index (BMI), Bone Mineral
Density in Female Neck (FN-BMD), Bone Mineral Density in Lumbar Spine
(LS-BMD), Educational attainment (EDU), Height, and Waist–Hip Ratio
(WHR); (ii) three immune-related traits: Crohn’s Disease (CD),,
Rheumatoid Arthritis (RA), and Ulcerative Colitis (UC); (iii) 10
metabolic traits: Age At Menarche (AAM), Coronary Artery Disease (CAD),
Fasting Glucose (FG), Fasting Insulin (FI), High-density Lipoproteins
(HDL), Low-density Lipoproteins (LDL), Total Cholesterol (TC),
Triglycerides (TG), Type 1 Diabetes (T1D), and Type 2 Diabetes (T2D);
and (iv) six neurological/neuropsychiatric phenotypes: Alzheimer’s
disease (ALZ), Attention Deficit-Hyperactivity Disorder (ADHD), Autism
Spectrum Disorder (ASD), Bipolar Disorder (BD), Major Depressive
Disorder (MDD), and schizophrenia (SCZ). We summarized features of
these datasets in Table [58]1, including sample size, the number of
SNPs available with statistical significance, and the publication year.
All data were conducted using samples of European ancestry. The number
of SNPs with summary statistics ranged from ~ 1 million to ~ 12 million
per trait.
Fig. 1.
Fig. 1
[59]Open in a new tab
Overview of the analysis workflow. Details are provided in the
Materials and methods section
Table 1.
Summary of the 25 traits
Phenotype Abbreviation # individuals (cases/controls, if applicable) #
Total SNPs # Sig. SNPs^1 Generation year
1 Neurological/Neuropsychiatric phenotypes
Alzheimer’s disease ALZ 17,008 / 37,154 7,055,881 108,745 2013
Attention deficit-hyperactivity disorder ADHD 1947 trio / 1947 trio &
840 / 688 1,230,535 13,120 2013
Autism spectrum disorder ASD 4788 / 4788 1,245,864 15,781 2013
Bipolar disorder BD 6990 / 4820 1,233,533 18,723 2013
Major depressive disorder MDD 9227 / 7383 1,232,793 14,864 2013
Schizophrenia SCZ 9379 / 7736 1,237,958 24,361 2013
2 Anthropometric and social traits
Body mass index BMI 322,154 2,554,637 63,111 2015
Bone mineral density (femoral neck) FN-BMD 32,735 10,586,899 117,114
2015
Bone mineral density (lumbar spine) LS-BMD 28,498 10,582,866 120,390
2015
Educational attainment EDU 293,723 8,146,840 311,622 2015
Height HEIGHT 253,288 2,550,858 239,924 2014
Waist–hip ratio WHR 142,762 2,560,788 38,995 2015
3 Immune-related traits
Crohn’s disease CD 5956 / 14,927 12,276,505 185,729 2015
Rheumatoid arthritis RA 18,136 / 49,724 8,747,962 122,601 2014
Ulcerative colitis UC 6968 / 20,464 12,255,196 181,092 2015
4 Metabolic phenotypes
Age at menarche AAM 182,416 2,441,815 44,202 2015
Coronary artery disease CAD 22,233 / 64,762 2,420,360 41,279 2011
Fasting glucose FG 46,186 2,470,476 32,818 2010
Fasting insulin FI 38,238 2,461,105 30,044 2010
High-density lipoproteins HDL 99,900 2,692,429 39,833 2010
Low-density lipoproteins LDL 95,454 2,692,564 40,261 2010
Total cholesterol TC 100,184 2,692,413 44,112 2010
Triglycerides TG 96,598 2,692,560 40,574 2010
Type 1 diabetes T1D 9934 / 16,956 2,048,237 45,430 2011
Type 2 diabetes T2D 12,171 / 56,862 2,473,440 39,081 2012
[60]Open in a new tab
^1Significant SNPs (p < 0.01)
Our ultimate goal is to investigate the shared genetic components of
multiple traits at the pathway level. To this end, we employed two
methods to conduct the pathway-based analysis, PASCAL (PAthway SCoring
ALgorithm) [[61]12] and INRICH (INterval-based enRICHment) [[62]20].
PASCAL evaluates pathways based on the combined effect of multiple
variants and genes, which may individually represent weak associations
with the investigated trait [[63]12]. Conversely, INRICH considers
genes by using the most significant SNP located within the gene and
detects pathway enrichment based on genes with strong association
[[64]20]. We chose SNPs with p-value < 0.01 for the following analysis.
A comprehensive collection of pathways was included, consisting of 1077
pathways from BioCarta, KEGG, and Reactome databases. To reduce the
redundancy among pathways, we conducted pairwise comparison and chose
those pathways with size within the range of 10–200 genes and sharing
no more than 30% genes with other pathways. This step resulted in 355
pathways considered as nearly non-redundant, including 79 BioCarta, 67
KEGG, and 209 Reactome pathways. Pathways that were identified as
significantly associated with a trait by both methods were considered
as with robust evidence. We calculated a combined p-value for each
pathway using Fisher’s combined probability test [[65]21]. With these
combined p-values per pathway per trait, we then applied Fisher’s Exact
Test (FET) to estimate whether any two traits shared a significant
number of pathways that were significantly associated with each of
them. We used Benjamini and Hochberg’s method [[66]22] to correct
multiple tests.
Pathway enrichment analysis
To explore the pathways that were respectively identified as
significantly associated with each trait by PASCAL and INRICH, we
conducted a multi-dimensional scaling (MDS) analysis [[67]23]. As shown
in Fig. [68]2, traits from the same group in general were located
closer to each other than to traits from different groups. For example,
immune-related traits CD and UC as the most common forms of
inflammatory bowel disease (IBD) and RA formed a unique cluster that
was quite distinct from other trait groups based on both the PASCAL
results and the INRICH results, as well as the combined method. The
other three groups did not show clear boundaries in location but a
trend of cluster was observed. For example, in the PASCAL results, 7
out of 10 metabolic traits were distinguishable from other groups,
while T2D, FI, and FG were scattered in other areas. In the combined
results, the six anthropometric and social traits formed a concise
cluster that can be visually identified, although their locations in
the MDS plot overlapped with the metabolic traits and
neurological/neuropsychiatric traits. The six
neurological/neuropsychiatric phenotypes were not closely located in
all three result sets, but sub-clusters were observed. For example, SCZ
and BD were relatively closer to each other in all cases, which is
consistent with previous reports that shared genetic liabilities exist
between SCZ and BD [[69]24]. ALZ appeared as an outlier located far
away from any other neuropsychiatric traits. This is expected because,
although all six traits are neurological/neuropsychiatric related, ALZ
is the only neurodegenerative disease that involves the degeneration of
multiple regions in brain. Thus, the pathways involved in ALZ are
likely different from those involved in ADHD, ASD, BP, MDD, and SCZ.
Fig. 2.
[70]Fig. 2
[71]Open in a new tab
Multi-dimensional scaling (MDS) plot for all 25 traits using the
results of PASCAL, INRICH, and the combined method. The x-axis and
y-axis represent the first and second dimension from MDS results,
respectively. The ellipse indicates the confidence limit standard
deviations at 0.95 multiplied with the corresponding value found from
the chi-squared distribution with 2 degrees of freedom
Although PASCAL and INRICH differ in the selection of gene signals and
the association tests, the results from the two methods are overall
consistent and the main patterns were reproducible. The number of
trait-associated pathways identified by PASCAL and INRICH showed a
spearman correlation of 0.621 (p-value = 7 × 10^− 4), indicating the
robustness of our results. For example, the immune group tended to have
more pathways than other disease groups did and this trend was detected
by both methods.
The most significant pathways associated with traits
To further explore what pathways were associated with each trait, we
identified the most significantly associated pathway in each trait.
Figure [72]3 summarizes the whole distribution of all pathways by
traits. The detailed results were provided in Additional file [73]1:
Table S1. At p[combined] < 0.01, we found an average of 54 pathways per
trait (ranged between 18 and 175). Among them, only HDL and BD had < 25
disease-associated pathways. LS-BMD, CD and UC each had more than 100
disease-associated pathways. Interestingly, the most significant
pathways of traits that belong to the same category tended to be the
same or similar. For example, in neurological/neuropsychiatric traits,
the KEGG pathway “arrhythmogenic right ventricular cardiomyopathy
(ARVC)” was the most significantly associated with ADHD
(p[combined] = 1.52 × 10^− 5), ASD (p[combined] = 8.76 × 10^− 5), and
MDD (p[combined] = 4.25 × 10^− 6). The ARVC pathway describes several
processes that are related to heart failure, arrhythmia, and sudden
death. Although the pathway itself did not show a clear relationship
with neurological/neuropsychiatric traits, several genes in this
pathway are frequently observed to be associated with neuronal and
brain functions, such as calcium voltage-gated channel subunit genes
(CACNA1C, CACNA1D), catenin genes (CTNNA1, CTNNA2) and SLC8A1. The ARVC
pathway has also been previously observed in SCZ [[74]19] and MDD
[[75]25]. Among the immune-related traits, the KEGG pathway “JAK-STAT
signaling pathway” was ranked the most significant in CD
(p[combined] = 2.76 × 10^− 7), RA (p[combined] = 1.62 × 10^− 5), and UC
(p[combined] = 2.48 × 10^− 9). This pathway has been previously
reported to be involved in the adaptive and mucosal immunity as well as
epithelial repair and regeneration [[76]26–[77]28] and is biologically
relevant with those immune-related traits. Among the metabolic traits,
the Reactome pathway “HDL mediated lipid transport” was found as the
most significant pathway in HDL (p[combined] = 2.93 × 10^− 6) and TC
(p[combined] = 7.72 × 10^− 5), while another much related pathway from
Reactome, “chylomicron mediated lipid transport”, was mostly associated
with TG (p[combined] = 1.31 × 10^− 6).
Fig. 3.
[78]Fig. 3
[79]Open in a new tab
Distribution of pathway-based analysis results in each trait. The
x-axis represents the -log[10] transformed p[combined] for pathways.
The most significantly associated pathway with each trait was labeled.
The first letter in each pathway name indicates the sources (K for
KEGG, R for Reactome, and B for BioCarta, respectively)
Genes in the most significantly associated pathways
We next examined the individual genes in each of the most significantly
associated pathways. We chose those genes that had a gene-based p-value
as calculated by PASCAL (denoted by p[gene]) less than 1 × 10^− 4. For
the most significantly associated pathways in each of the 25 traits, an
average of 13 genes had p[gene] < 1 × 10^− 4, ranged between 2 (SCZ and
LDL) to 69 (HEIGHT). The ratio of genes with p[gene] < 1 × 10^− 4 in
each pathway ranged between 3.8% (AAM) and 60% (HDL), with an average
19% (Additional file [80]2: Table S2). For example, in the KEGG pathway
“folate biosysthesis”, which was the most significantly associated
pathway with SCZ (p[PASCAL] = 9.99 × 10^− 4, p[INRICH] = 4.30 × 10^− 2,
p[combined] = 4.80 × 10^− 4), there were only two genes that were
significantly associated with the trait at the gene level: SPR
(p[gene] = 5.02 × 10^− 5) and GGH (p[gene] = 8.06 × 10^− 5). In
contrast, in the Reactome pathway “HDL mediated lipid transport”, which
was most significantly associated with HDL (p[PASCAL] = 3.39 × 10^− 7,
p[INRICH] = 0.52, p[combined] = 2.93 × 10^− 6), nine out of 15 genes
(60%) in this pathway showed gene-based significance
(p[gene] < 1 × 10^− 4).
One of the advantages of pathway-based analysis is that the combined
effect of multiple genes, instead of the effects of each individual
gene, was examined for disease association. We thus asked whether the
same genes contributed to the disease association at the pathway level.
For the neurological/neuropsychiatric traits, we chose the ARVC pathway
as an example since this pathway was most significantly associated with
3 out of the 6 traits (the combined p-value: p[ALZ] = 4.35 × 10^− 2,
p[ADHD] = 1.52 × 10^− 5, p[ASD] = 8.76 × 10^− 5, p[BD] = 6.44 × 10^− 4,
p[MDD] = 4.25 × 10^− 6, p[SCZ] = 7.49 × 10^− 4). There were a total of
76 genes in this pathway. As shown in Fig. [81]4a, a total of 15 genes
had gene-based significance (p[gene] < 1 × 10^− 4) in at least one
neuropsychiatric trait. Seven genes were significant in two traits.
Each trait has one or two private genes that were uniquely associated
with the trait itself but not with other traits. Similarly, we chose
the JAK-STAT signaling pathway to examine shared genes for the three
immune-related traits (Fig. [82]4b). There were 155 genes in this
pathway, in which 55 genes were associated with at least one trait at
p[gene] < 1 × 10^− 4. We checked the genes that were associated with
multiple traits as well as the genes that were uniquely associated with
one single trait. Notably, seven genes from this pathway were found to
be associated with all three traits: IL12RB2 (location in the human
genome: 1p31.3), IL2 (4q27), IL21 (4q27), IL23R (1p31.3), IL2RA
(10p15.1), LIF (22q12.2), and TYK2 (19p13.2). Among them, IL12RB2 was
previously reported to be associated with an autoimmune disease
Systemic Lupus Erythematosus in Chinese population [[83]29]; IL21 plays
key roles in humoral immunity and autoimmune diseases [[84]30]; IL23R
is located in a genomic region that is associated with RA [[85]31]; and
TYK2 had been considered as a potential drug target for certain common
autoimmune disorders [[86]32]. However, these seven genes only
accounted for 13% of all (55) associated genes in the “JAK-STAT
signaling pathway”. Another 14 genes showed gene-based significance
across two intestinal diseases (CD, and UC), but not RA. Nevertheless,
there were genes private to each of the three traits that were
contributive to the pathway, implying heterogamous genetic risks at the
gene level which converged at the pathway level. By exploring the
disease-associated pathways in all the 25 traits, we found that in most
cases, the enrichment of pathway was due to the combined effect of
multiple genes, rather than being driven by single genes.
Fig. 4.
[87]Fig. 4
[88]Open in a new tab
Venn diagram to compare genes in the most significantly associated
pathway. a Trait-associated genes (p[gene] < 1 × 10^− 4) from five
neuropsychiatric traits in the ARVC pathway. b Trait-associated genes
from three immune-related traits in the JAK-STAT signaling pathway
Cross-trait association at the pathway level
We next explored how multiple traits were related at the pathway level.
We first conducted a preliminary screen using the full-spectrum of
p-values for all pathways and the hierarchical clustering analysis. As
shown in Additional file [89]3: Figure S1, we observed an obvious trend
that biologically relevant traits tended to cluster together. To
accurately estimate the functional overlapping of genetic variants at
the pathway level, we next exploited FET for all possible pairs of
traits. Results from FET were summarized in Additional file [90]4:
Table S3, including the lists of shared pathways and all statistics.
The magnitude of overlapping pathways varied greatly. At adjusted
p[FET] < 1 × 10^− 3, the number of shared significant pathways
(p[combined] < 0.01) between any pair of traits ranged between 12 (FG
and FI) to 99 (UC and CD). These shared pathways proved that
trait-associated genetic variants indeed converged on functional
pathways that potentially contributed to the traits. Because these
pathways were nearly non-redundant (i.e., with < 30% overlapping genes
per pair), they represent assessment of independent functional
processes.
Using hierarchical clustering, the 25 traits could be visually
distinguishable through forming sub-groups that were quite consistent
with their pre-assigned group labels (Fig. [91]5). The immune-related
traits all clustered with each other and they are closest to the
anthropometric and social group. The anthropometric and social traits
also formed a sub-group, except BMI that was distantly located. The
immune group and the neurological/neuropsychiatric group were close to
each other, which was expected because many previous works have
reported that immune related processes are widely involved in several
neurological/neuropsychiatric traits such as SCZ [[92]33] and ALZ
[[93]34].
Fig. 5.
[94]Fig. 5
[95]Open in a new tab
Heatmap of 25 traits. Hierarchical clustering was performed using
p-values from Fisher’s Exact Test
To further explore which pathways contributed to the trait
associations, we next examined the 20 most significantly associated
pathways within each group of traits (ordered by the average
p[combined]across traits within group, Additional file [96]5: Figure
S2). In the neurological/neuropsychiatric trait group, we found several
pathways that were shared by multiple traits and were also previously
implicated. For example, the KEGG “calcium signaling pathway”
(p[ALZ] = 5.59 × 10^− 5, p[ADHD] = 2.96 × 10^− 4,
p[ASD] = 1.88 × 10^− 1, p[BD] = 8.61 × 10^− 5, p[MDD] = 6.20 × 10^− 5,
p[SCZ] = 1.62 × 10^− 3), the KEGG “axon guidance” pathway
(p[ALZ] = 1.07 × 10^− 3, p[ADHD] = 8.12 × 10^− 4,
p[ASD] = 6.22 × 10^− 4, p[BD] = 1.13 × 10^− 2, p[MDD] = 2.62 × 10^− 3,
p[SCZ] = 2.05 × 10^− 3), the “alanine aspartate and glutamate
metabolism” (p[ALZ] = 9.60 × 10^− 3, p[ADHD] = 8.22 × 10^− 2,
p[ASD] = 1.22 × 10^− 2, p[BD] = 5.31 × 10^− 3, p[MDD] = 6.93 × 10^− 3,
p[SCZ] = 1.70 × 10^− 2), and “the voltage gated potassium channels”
(p[ALZ] = 2.04 × 10^− 3, p[ADHD] = 2.33 × 10^− 4,
p[ASD] = 3.16 × 10^− 4, p[BD] = 2.86 × 10^− 2, p[MDD] = 4.69 × 10^− 3,
p[SCZ] = 5.80 × 10^− 3) were nominally significantly associated with
multiple traits and were all previously implied in neuropsychiatric
disorders [[97]19, [98]35]. In the metabolism group, we observed
pathways related to lipid metabolism and were shared by all four
traits: “chylomicron mediated lipid transport” (p[HDL] = 3.54 × 10^− 4,
p[LDL] = 1.09 × 10^− 3, p[TC] = 1.09 × 10^− 4, p[TG] = 1.32 × 10^− 6),
“HDL mediated lipid transport” (p[HDL] = 2.93 × 10^− 4,
p[LDL] = 1.97 × 10^− 3, p[TC] = 7.72 × 10^− 4, p[TG] = 2.71 × 10^− 6),
“alpha Q signaling events” (p[HDL] = 2.35 × 10^− 3,
p[LDL] = 3.16 × 10^− 3, p[TC] = 3.12 × 10^− 3, p[TG] = 1.75 × 10^− 3),
and “ubiquitin mediated proteolysis” (p[HDL] = 1.96 × 10^− 4,
p[LDL] = 8.92 × 10^− 4, p[TC] = 9.90 × 10^− 4, p[TG] = 2.75 × 10^− 3).
To better represent the relationship among the multi-traits, we built a
trait-trait association network based on the FET result using two
thresholds: adjusted p[FET] < 1 × 10^− 3 (referred as highly
significantly connected network) and adjusted p[FET] < 0.01 (referred
as moderately significantly connected network). After applying the
adjusted p[FET] < 1 × 10^− 3 (Fig. [99]6a), we observed strong
associations (i.e., edges) among pairs of immune-related traits and
anthropometric and social traits. The immune related traits in general
had more associated pathways individually and more shared pathways
across traits (Fig. [100]6c). UC and CD as the two most common form of
IBD shared 99 pathways (adjusted p[FET] = 8.30 × 10^− 17). In addition,
we observed crosstalk between bone mineral density (LS-BMD and FN-BMD)
and immune-related traits (UC and CD) [[101]36]. FN-BMD shared 25 to 68
pathways with traits from other groups (i.e., UC, CD, T1D, T2D and
MDD). This observation is consistent with increased likelihood of
osteoporosis among those with type 1 diabetes from the civilian
community in the United States [[102]37]. We observed 25 shared
pathways between FN-BMD and MDD. This strong sharing, although
unexpected, had some support from previous work that bone mineral
density was reduced in patients with depressive disorders [[103]38].
For other closely related traits, we found LDL and TC shared 18
pathways (adjusted p[FET] = 2.29 × 10^− 6) and FG and FI shared 12
pathways (adjusted p[FET] = 7.36 × 10^− 4). Interestingly, at adjusted
p[FET] < 1 × 10^− 3, we did not observe any significant association
among neurological/neuropsychiatric traits, although association
between ALZ, ADHD, and MDD with traits of other groups were observed
(e.g., MDD versus FN-BMD, ADHD and FG).
Fig. 6.
[104]Fig. 6
[105]Open in a new tab
Trait-trait association network based on Fisher’s Exact Test. a Network
obtained using adjusted p[FET] < 1 × 10^− 3. b Network obtained using
adjusted p[FET] < 0.01. Each node represents a trait. Node color
indicates trait group (purple: neurological/neuropsychiatric traits,
orange: anthropometric and social traits, blue: immune-related traits,
and green: metabolic traits). Edge indicates the two end traits shared
a significant number of pathways. Edge width is proportional to the
-log[10]-transformed adjusted p[FET]. Edge labels represent the number
of shared associated pathways. (C) The number of pathways significantly
associated with each trait (p[combined] < 0.01)
When we used a relaxed threshold (adjusted p[FET] < 0.01), we observed
more associations (Fig. [106]6b). In addition to the immune-related
group and the social related group, traits from the metabolic group and
traits from the neurological/neuropsychiatric group were each found to
have strong within-group connections. For example, SCZ shared 11
pathways with ALZ (adjusted p[FET] = 3.89 × 10^− 3) and MDD and ADHD
shared 13 pathways (adjusted p[FE] = 5.61 × 10^− 3). Interestingly, SCZ
and UC shared 25 associated pathways (adjusted p[FET] = 1.52 × 10^− 3).
This observation is consistent with increased rate of autoimmune
diseases in schizophrenia patients in Denmark [[107]39, [108]40] and
with molecular evidence for a partial autoimmune etiology for
schizophrenia [[109]41]. Finally, BD and BMI shared 8 common associated
pathways (adjusted p[FET] = 4.59 × 10^− 3), where BD patients were
found with higher BMI and increased metabolic comorbidity in previous
studies [[110]42]. ALZ and CAD had 10 common associated pathways
(adjusted p[FET] = 5.94 × 10^− 3), consistent with the previous finding
that many cardiovascular risk factors are shown to increase the risk of
dementia and ALZ [[111]43]. Figure [112]6c illustrated the total number
of associated pathways for each trait.
Discussion
During the past decade, large-scale GWAS data for complex diseases and
traits have been growing at an unprecedented rate [[113]44]. While
conventional GWAS analyses apply genome-wide significance thresholds
(e.g., 5 × 10^− 8) to select genetic variants for further investigation
[[114]45], pathway-based analysis presents a complementary way to
explore GWAS data by examining groups of functional related genes for
their combined effects, and thus, it has the improved power and
interpretability [[115]14, [116]18, [117]19, [118]25, [119]35,
[120]46]. In this study, we take the advantage of pathway-based
analysis to study cross-trait associations. We developed a pipeline to
integrate pathway enrichment results from two independent methods using
large-scale GWAS summary statistics for 25 traits. The two methods
differ in their assumptions and individual strengths and thus, provided
complementary assessment of trait-associated pathways. Further boosted
by the Fisher’s combined probability test and trait-trait association
network, we examined shared pathways, contributing genes, and
cross-trait associations. Our results provided insights into the
functional association based on genetic variants and can be extended
when more GWAS summary data are made available in future.
Our primary goal was to use individual variants found to be associated
with multiple traits to identify shared biological pathways between
traits, as we hypothesized that similar or related pathways are likely
underlie the pathogenicity of related traits. In our results, we found
the immune-related traits shared many trait-associated pathways. This
finding was consistent with previous studies that immune-related traits
tended to have many genetic markers with common function in immunity
and inflammation [[121]3]. In addition, we observed a number of highly
relevant relationships between multiple complex traits by associated
pathway enrichment results, while most genes in the pathway did not
show significant overlap. For example, the anthropometric trait,
FN-BMD, was found to share pathways with traits from other groups such
as T1D [[122]47], T2D [[123]48], MDD [[124]38] and UC [[125]36].
Although such associations have been reported before, the underlying
mechanisms remain unknown. Through our analyses, we not only presented
further evidence at the pathway level to these associations, but also
revealed biological pathways in which the genetic variants converged.
For example, FN-BMD shared 27 pathways with T1D and 27 pathways with
T2D respectively; among them, 9 pathways were shared among all three
traits. Overall, we demonstrated that biologically relevant traits had
stronger sharing at the pathway level than at the gene level and our
pathway results would provide further interpretation of genetic
variants that drive the cross-trait associations.
Our work has several limitations. First, the current pathway resources
that are publically available and widely used in the field such as KEGG
unavoidably have limitations such as low resolution in defining
biological pathways, relatively few genes compared to genome-wide
datasets, overrepresentation of well-known pathways versus
underrepresentation of pathways attracting less common interests, and
lack of tissue and cell specificity, among others. All our analyses
were confined with these limitations. Second, we used 3 pathway
databases to increase the use of annotated pathways. However, many
pathways may be redundant when more pathways are included. We evaluated
and removed those highly redundant pathways (30% gene). This may
exclude lots informative pathways in our analyses. However, Fisher’s
Exact Test used in this study still with limitation as all associated
pathway are not completely independent. Therefore, it is still a
difficult task to balance the pathway independence and information.
Third, pathway-based analysis can only evaluate genetic variants
located in gene regions but ignore the majority of variants located in
the non-coding regions. As a result, there is information lost for
variants located at non-coding regions or genes without any pathway
information. Fourth, our work focused on common variants based on
genotyping arrays and imputation, with MAF > 0.05. Inferences from rare
variants may also provide insights into cross-trait association but
were not investigated in our work (e.g., for traits under strong
negative selection [[126]5, [127]49]). Despite these challenges, our
work provides a complementary way to utilize GWAS data to study
cross-trait associations at the pathway level, which is a promising
strategy to discover novel biological insights underlying complex
traits.
Conclusions
We presented an integrative framework to systematically investigate
cross-trait associations using GWAS summary statistics for 25 traits
using pathway-based analysis. Our results revealed trait-associated
pathways individually and in common, highlighting associations both
within and across trait groups. We observed a significant proportion of
pathways shared by the immune-related traits and anthropometric and
social traits, and a number of moderately significant correlations
within neurological/neuropsychiatric disorder and metabolic phenotypes.
Interestingly, more than half of the associations were cross trait
groups. We also constructed a novel trait-trait association network to
better examine the genetic association at the pathway level. These
results collectively shed lights on the molecular pathways underlying
complex traits, both related or unrelated traits. The results warrant
further investigation of function of critical nodes (genes) or
validated when more data become available.
Methods
Data collection
GWAS summary statistics were downloaded for 25 different traits from
International Genomics of Alzheimer’s Project (IGAP,
[128]http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.p
hp), Genetic Factors of Osteoporosis (GEFOS, [129]http://www.gefos.org)
consortium, The Genetic Investigation of ANthropometric Traits (GIANT,
[130]https://portals.broadinstitute.org/collaboration/giant/index.php/M
ain_Page), International Inflammatory Bowel Disease Genetics Consortium
(IIBDGC, [131]https://www.ibdgenetics.org/downloads.html),
Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC,
[132]https://www.magicinvestigators.org), Psychiatric Genomics
Consortium (PGC, [133]http://www.med.unc.edu/pgc), Reproductive
Genetics Consortium (ReproGen, [[134]50]) and Social Science Genetic
Association Consortium (SSGAC, [135]https://www.thessgac.org). These
traits span a wide range of phenotype measurements categorized into
four groups. We used the samples of European ancestry. For genotyped
SNPs, we removed those with minor allele frequency (MAF) < 0.05.
Wherever applicable, we mapped GWAS summary data to the hg19 human
assembly using liftOver software [[136]51].
Pathway enrichment analysis
In our analysis using PASCAL, we chose SNPs with nominal significance
of association (GWAS p-value < 0.01). This is a commonly used strategy
to find SNPs with weak or moderate association signals before they can
be integrated for further analysis at the pathway level [[137]19]. SNPs
were mapped to genes if they were located within a range of 50 kb
upstream or downstream of genes’ transcription start sites. PASCAL
employs a chi-squared method that takes into account of LD, gene
length, and SNP density [[138]12]. INRICH [[139]20] considers SNPs that
are associated with a trait at p-value < 1 × 10^− 4 (in our case, we
used p-value < 1 × 10^− 2) and generates independent genomic intervals
for the enrichment test. The Plink clumping function was employed,
taking the 1000 Genomes Project European panel as the reference and the
following parameters: r^2 = 0.2, maximum distance between a pair of
SNPs is 1 Mb. The aim to generate LD-independent intervals was to
ensure that SNPs located adjacent in the human genome are analyzed as
one independent unit as they might tag the same causal variants. INRICH
overlaps the independent intervals with the pre-defined pathways and
calculate an enrichment statistic E for each pathway, where E is the
number of intervals that overlap at least one target gene in the
pathway. To estimate the significance of E, a permutation test is
conducted to generate a null set of intervals which match the interval
size, overlapping gene, and SNP density to the original input intervals
[[140]52]. We chose to generate 1000 replicates to calculate the
empirical p-value for each pathway.
With the successful accomplishment of both PASCAL and INRICH, we had
two p-values for each pathway in each trait, testing the same
hypothesis that whether an investigated pathway is associated with the
trait of interest. Thus, we introduced the Fisher’s combined
probability test to calculate a combined p-value for each pathway in
each treat as follows:
[MATH:
χ2k2=−2∑i=
1klnpi :MATH]
, where χ^2 follows a chi-squared distribution with 2 k degrees. In our
case k = 2.
Reduce redundancy among pathway annotations
From an initial compiled set of 1077 pathways across three gene set
databases (KEGG, BioCarta, Reactome), we restricted downstream analyses
to 1032 pathways of size 10–200 genes. We further removed pathways that
showed substantial redundancy, i.e., any two pathways sharing more than
30% genes (# shared genes /max{# genes in pathway A, # genes in pathway
B}) were identified and the larger pathway would be retained. Finally,
a total of 355 pathways assumed as non-redundant were obtained and used
for follow up analysis, including 67 KEGG, 79 BioCarta and 209 Reactome
pathways.
Cross-trait association based on trait-associated pathways
To assess whether a pair of traits share pathways higher than chance
expectation, we first determined the pathways that were significantly
associated with each trait (p[combined] < 0.01), followed by Fisher’s
Exact Test. Specifically, for two traits a and b, we build a
dichotomous 2 × 2 contingency table as follows:
Trait b
Trait a ∣A ∩ B∣
[MATH: ∣A¯∩B∣ :MATH]
[MATH: ∣A∩B¯∣ :MATH]
[MATH: ∣A¯∩B¯∣
:MATH]
[141]Open in a new tab
Here, A denotes the set of pathways significantly associated with trait
a and B denotes the set of pathways significantly associated with trait
b. The format |•| indicates the number of records referred by •.
∣A ∩ B∣ denotes the number of pathways significantly associated with
both traits a and b,
[MATH: ∣A¯∩B∣ :MATH]
and
[MATH: ∣A∩B¯∣ :MATH]
denote the number of pathways showing association only in trait b or a,
respectively, and
[MATH: ∣A¯∩B¯∣
:MATH]
represents the number of pathways that are not associated with either
trait. We used Fisher’s Exact Test to estimate whether two traits
shared overrepresented pathways.
Hierarchical clustering was then conducted based on the FET results. We
used Spearman distance and ward linkage aggregation. Heatmaps were
created using the ComplexHeatmap package in R. In our analysis, we
consider a level of significance with adjusted p < 0.01 using the
Benjamini and Hochberg’s method [[142]22] to correct multiple tests.
Additional files
[143]Additional file 1:^ (152.2KB, xlsx)
Table S1. Results of pathway-based analysis of the 25 traits. (XLSX 152
kb)
[144]Additional file 2:^ (14.2KB, xlsx)
Table S2. Significantly associated genes in the most significant
pathway for each trait. (XLSX 14 kb)
[145]Additional file 3:^ (85.6KB, pdf)
Figure S1. Hierarchical clustering analysis for all associated pathways
between 25 traits. (PDF 85 kb)
[146]Additional file 4:^ (46.8KB, xlsx)
Table S3. Results of Fisher’s Exact Test for 25 traits. (XLSX 46 kb)
[147]Additional file 5:^ (539.4KB, pdf)
Figure S2. Heatmap showing the 20 most significant pathways for each
trait group. (PDF 539 kb)
Acknowledgements