Abstract

   Circulating metabolite levels may reflect the state of the human
   organism in health and disease, however, the genetic architecture of
   metabolites is not fully understood. We have performed a whole-genome
   sequencing association analysis of both common and rare variants in up
   to 11,840 multi-ethnic participants from five studies with up to 1666
   circulating metabolites. We have discovered 1985 novel
   variant-metabolite associations, and validated 761 locus-metabolite
   associations reported previously. Seventy-nine novel variant-metabolite
   associations have been replicated, including three genetic loci located
   on the X chromosome that have demonstrated its involvement in metabolic
   regulation. Gene-based analysis have provided further support for seven
   metabolite-replicated loci pairs and their biologically plausible
   genes. Among those novel replicated variant-metabolite pairs, follow-up
   analyses have revealed that 26 metabolites have colocalized with 21
   tissues, seven metabolite-disease outcome associations have been
   putatively causal, and 7 metabolites might be regulated by plasma
   protein levels. Our results have depicted the genetic contribution to
   circulating metabolite levels, providing additional insights into
   understanding human disease.

   Subject terms: Genome-wide association studies, Metabolomics,
   Predictive markers
     __________________________________________________________________

   Here, the authors perform a whole-genome sequencing association
   analysis of genetic variants in ≤11,840 multi-ethnic participants with
   ≤1666 circulating metabolites, discovering 1985 novel
   variant-metabolite associations and insights into human disease.

Introduction

   Circulating metabolite levels are highly heritable^[63]1, and
   positioned along the pathway between the genetic determinants and a
   wide variety of health outcomes. The latter include numerous Mendelian
   disorders, in which imbalanced blood or tissue metabolites levels are
   observed^[64]2–[65]5, as well as various complex diseases, for which
   metabolite patterns are being investigated^[66]6–[67]9. Most previous
   genetic studies of the human metabolome have focused on common variant
   analysis in European populations, predominantly using genome-wide
   association studies^[68]10–[69]15, with few studies investigating
   Hispanic^[70]16 and African-American^[71]17 participants. Inclusion of
   ethnically diverse populations may lead to genetic discovery in broader
   populations, and therefore, better understanding of disease^[72]18.
   Additionally, most previous studies focused on investigating autosomal
   chromosomes. Exploration of the X chromosome can further enrich our
   understanding of the genetic architecture of metabolites. Adding to the
   complexity, the number of measurable circulating metabolites has been
   growing^[73]19, while only a modest proportion of the metabolites,
   typically including several hundreds of traits^[74]20, have been
   explored in relation to genotypic data.

   In this investigation, we performed association analyses using
   whole-genome sequencing (WGS) to investigate the association of common
   and rare variants with 1666 circulating metabolites in multi-ethnic
   populations, using single variant and gene-centric analyses. We
   aggregated up to 11,840 adult participants of African, European, and
   Hispanic ancestries from five studies involved in the Trans-Omics for
   Precision Medicine (TOPMed) program for discovery analyses (full list
   of TOPMed authors is available in Supplementary
   Data [75]1)^[76]1,[77]21,[78]22. Our novel findings were further
   investigated using independent adult samples from TOPMed (up to 6763
   participants), and two publicly available datasets (up to 11,322
   participants), as replication. We also performed a gene network
   analysis, in which we integrated genome-wide associations with the
   human protein interactome to discover important interactions among
   metabolite-associated genes and their functions in biological pathways.

   The centralized analyses utilizing jointly called WGS and harmonized
   metabolite data enable us to rapidly detect common and rare variants
   with maximized statistical power. The associations discovered in the
   present investigation advance our knowledge on the genetic architecture
   of circulating metabolites, as well as provide context for the
   identification of further connections between metabolic processes and
   disease phenotypes.

Results

Study design

   In the discovery analyses we analyzed up to 15,660,619 common (Minor
   Allele Frequency [MAF] ≥5%), low-frequency (1% < MAF < 5%) and rare
   (MAF ≤ 1%) variants belonging to autosomal chromosomes and the X
   chromosome for association with 1666 rank-normalized circulating
   metabolites in up to 11,840 participants (mean age at 56.7 years old,
   57% women) from a pooled sample of 1843 African-American (AA), 5938
   European American (EA), and 4059 Hispanic (HIS) participants from the
   Atherosclerosis Risk in Communities study (ARIC), Hispanic Community
   Health Study/Study of Latinos (HCHS/SOL), Framingham Heart Study (FHS),
   Cardiovascular Health Study (CHS), and Multi-Ethnic Study of
   Atherosclerosis (MESA) (Methods). For replication analysis, we obtained
   summary statistics from up to five cohort studies (independent
   participants from FHS, Women’s Health Initiative [WHI], Jackson Heart
   Study [JHS], FENLAND, TwinsUK), including 2466 AA and 15,619
   individuals of European ancestry, for a total sample size of up to
   18,085 individuals (“Methods” section). The information on
   participating cohorts, as well as metabolite measurement methods and
   genotyping information is presented in Supplementary Data [79]2.
   Demographics of study participants, the biochemical name, pathway and
   missingness for each metabolite are summarized in Supplementary
   Data [80]3. The study design, applied statistical and functional
   analyses, and an overview of the known and novel findings are displayed
   in Fig. [81]1.

Fig. 1. Study design.

   [82]Fig. 1
   [83]Open in a new tab

   We performed single variant analysis of up to up to 15,660,619 variants
   with each of 1666 metabolites in up to 11,840 participants (Methods).
   Summary association statistics for variants in novel loci with
   P-value ≤ 3 × 10^−11 (Methods section and Supplementary Data [84]5)
   were obtained from five independent studies (up to 18,085
   participants). Only variants that were associated with a metabolite at
   P-value ≤ 1.02 × 10^−4 in the replication analyses and had concordant
   directions of effect across studies (“Methods” section) were considered
   replicated.

Single Variant Tests

   Overall, 150,468 single variant-metabolite associations reached the
   statistical significance threshold (P-value ≤ 3 × 10^−11); 2999
   associations were conditionally independent
   (P-value[conditional] ≤ 5 × 10^−8, “Methods” section), of which 1014
   pairs (602 variants, 520 metabolites, 159 loci) were known
   (Supplementary Data [85]4) and 1985 pairs (1906 variants, 230
   metabolites, 909 loci) were novel (with 708 loci reported for the first
   time for any metabolite, Supplementary Data [86]5). Inflation in our
   whole genome-wide single variant tests was well controlled with mean of
   genomic control lambda at 1.00 (standard deviation [SD] = 0.03,
   Supplementary Data [87]3). Consistent with our previous report^[88]16,
   rare and low-frequency variants (0.5% ≤ MAF < 5%) on average had 6.2
   times larger effect on metabolites levels compared to common variants.
   The mean effect was at 1.39 SD and 0.22 SD change per minor allele for
   rare and low-frequency, and common variants respectively (Supplementary
   Fig. [89]1). Likewise, around 63% of detected variants belonged to
   genes, harboring 9% exonic variants (Supplementary Fig. [90]1)^[91]16.

   Among 1985 novel independent associations identified, 488 statistically
   significant variant-metabolite association pairs were available for
   replication, and 79 pairs of 65 unique variants and 65 metabolites were
   successfully replicated (P-value ≤ 1.02 × 10^−4, with consistent
   direction of effect in both discovery and replication sets), with
   explained variances ranging from 0.3% to 17% (Supplementary
   Data [92]6). Novel replicated loci affect metabolites from eight super
   pathways, including lipid-related metabolites (46%) amino acids (30%),
   cofactors and vitamins (9%), nucleotides (9%), carbohydrates (4%),
   organic acids (4%), energy (2%), and xenobiotics (2%), with 13 loci
   affecting more than one metabolite. Overall, among 79 novel replicated
   findings, the signals are relatively consistent across three ancestries
   − 73 had the same direction of effect across all the analyzed
   ancestries (Supplementary Data [93]5[94]c). We attempted to extend
   those novel variant-metabolite pairs into two pediatric studies (1734
   Hispanic children, “Methods” section), however, only one association
   was validated (rs7458962 - methylated nucleoside 5-methyluridine pair,
   (Supplementary Data [95]7), suggesting limited generalizability, which
   may be in part due to the focus of Hispanic background and asthma
   condition for the pediatric populations.

   Genetic loci associated with metabolites identified on the X chromosome
   are sparse. We identified 18 novel loci on the X chromosome, 3 of which
   were successfully replicated. For example, the strongest associations
   for the novel replicated loci were detected on the chromosome X, for
   two amino acid-related metabolites involved in lysine metabolism
   (N-6-trimethyllysine - TMLHE, P-value = 9.89 × 10^−68; N6-acetyllysine
   - HDAC6, P-value = 9.27 × 10^−57). TMLHE encodes trimethyllysine
   dioxygenase, which converts trimethyllysine into hydroxytrimethyllysine
   in the carnitine biosynthesis pathway. HDAC6 encodes histone
   deacetylase 6, a protein implicated in deacetylation of lysine residues
   on the N-terminal part of the core histones^[96]23. The minor A allele
   of the missense variant, rs61735967 (MAF = 2.1%), in HDAC6, was
   associated with high levels of N6-acetyllysine, a risk factor for
   neurological deficits.

Gene-based rare variant analysis

   To explore the effect of an aggregation of rare variants in each of
   17,174 genes for 230 metabolites associated with novel loci
   (Fig. [97]1), we performed gene-centric analysis using STAAR-O, a newly
   developed method that provides powerful and robust rare variant
   association tests by dynamically incorporating multiple functional
   annotations (“Methods” section)^[98]24,[99]25. STAAR-O groups rare
   variants into multiple coding and non-coding masks for each gene,
   including putative loss of function ([pLOF], stop gain, stop loss and
   splice), missense, synonymous, promoter and enhancer masks.

   We detected 253 statistically significant (P-value ≤ 1.05 × 10^−9,
   accounting for 17,174 analyzed genes, 230 metabolites and 12
   categories) metabolite-gene-functional category associations. A total
   of 128 metabolite-gene association pairs (including 75 coding and 73
   non-coding genes), had 106 unique genes associated with 45 metabolites
   (Supplementary Data [100]8 and [101]9 and Supplementary Fig. [102]2a).
   Thirty-nine identified gene-metabolite pairs (58
   gene-metabolite-functional category associations) are located outside
   of novel or known loci identified using single variant analysis; 78
   gene-metabolite pairs were located within known loci, and 11 – within
   novel loci (including 8 replicated locus-metabolite associations).
   Three replicated variant-metabolite pairs were annotated to genes that
   were also statistically significant in gene-centric analyses with
   respective metabolites (guanidinoacetate - SLC25A45, deoxycarnitine -
   SLC25A45, and N-acetylputrescine - HDAC10).

   Gene-centric analysis implicated a biologically plausible gene ALPL,
   located ~55 Mb downstream of rs1697421, aggregation of missense
   variants in which was significantly associated
   (P-value = 3.87 × 10^−10) with glycerol 3-phosphate levels.
   Interestingly, ALPL encodes the tissue-nonspecific alkaline phosphatase
   protein – an enzyme involved in the dephosphorylation of several
   phosphorus-containing metabolites. Mutations in this gene have been
   linked to hypophosphatasia^[103]26,[104]27, a disorder characterized by
   loss of mineralization and joint pain.

Co-localization of metabolites with eQTLs

   To interpret the underlying biological activity beyond the identified
   replicated loci, we performed colocalization analysis with gene
   expression in GTEx V8 to investigate whether any of the 46 metabolite
   loci containing replicated variant-metabolite pairs also have effect on
   gene expression levels in various tissues (Supplementary Data [105]10).
   We identified that across 25 genetic loci, 29 variants (40
   variant-metabolite pairs), have evidence of colocalization with 40
   tissues (posterior probability, PPr > 0.6). For the majority of these
   loci (18 loci, 26 locus-metabolite associations), a single potential
   causal variant underlies both the expression of a single gene and the
   metabolite(s).

   Overall, nineteen replicated novel independent variants were
   colocalizing for the association with 26 metabolites and 26 gene eQTLs
   in 21 tissues (Fig. [106]2), suggesting that the expression of these
   genes may be the reason behind the variation of metabolite levels
   associated with these loci. Among 26 novel replicated variant-gene eQTL
   pairs, 27% were pertaining to 5 missense variants, and 13 - to 10
   intronic variants. Additionally, four intronic variants
   (rs67481496-ETFDH, rs17125278-PANK1, rs1077989-TMEM229B and chromosome
   6:160139865-SLC22A1), one synonymous variant (rs10405636-SSBP4) and one
   missense variant (rs1799958-ACADS) colocalize with expression of the
   gene to which they were annotated (with the most deleterious functional
   consequence).

Fig. 2. Novel replicated variant-metabolite pairs colocalizing for the
association with gene eQTLs.

   [107]Fig. 2
   [108]Open in a new tab

   The direction of the effect of minor allele on metabolite levels and
   gene expression is shown in the legend. At the bottom of the graph, in
   light gray, are the names of the metabolites. Above the names of
   metabolites are eQTL gene-tissue pairs. If both the effect of minor
   allele on metabolite levels and on gene expression is more than 0, such
   variant-metabolite-gene eQTL combinations are marked in yellow, and
   annotated as “Same Direction: Positive”. If the effect of minor allele
   on metabolite levels and on gene expression is less than zero, such
   variant-metabolite-gene eQTL combinations are marked in purple, and
   annotated as “Same Direction: Negative”. If the effect of minor allele
   on metabolite levels is less than zero and the effect of minor allele
   on gene expression is more than zero, or vice versa, such
   variant-metabolite-eQTL combinations are marked in gray, and annotated
   as “Opposite Direction”. Additionally, the following acronyms were used
   for tissues: BPBG brain putamen basal ganglia, BCH brain cerebellar
   hemisphere, EBV-TL - Cells EBV-transformed lymphocytes.

Gene network and pathway analysis

   To enhance the biological insight obtained from our findings, we
   performed gene network analysis and pathway enrichment analysis of the
   65 metabolites with statistically significant replicated findings,
   aiming to identify gene networks associated with each metabolite. The
   dense module search of GWAS (dmGWAS version 2.7) was used for our
   network-based analysis, where the input was gene-level weights, based
   on MAGMA association scores, and the human protein interactome,
   comprised of experimentally validated protein-protein interactions
   (PPIs), annotated in PathwayCommons^[109]28. We then performed
   enrichment analysis for gene sets from the top resulting network
   modules using the over-representation analyses (ORA, see Methods).

   Overall, 31 metabolite-gene set pair had significant association
   (P-value ≤ 2.70 × 10^−8 Bonferroni correction for 28,438 Biological
   Process terms in Gene Ontology annotations and 65 metabolites). Among
   those, 2, 6, and 4 metabolite-gene set pairs contained genes belonging
   to 2 known metabolite loci, 4 novel metabolite loci, and 3 novel
   replicated metabolite loci respectively. The latter included one
   metabolite-gene pair (N-acetylputrescine-HDAC10, Supplementary
   Data [110]11), which was also detected and replicated in single variant
   analysis (missense variant rs61748567 - N-acetylputrescine) and
   detected by the coding gene-centric analysis. Among them, several genes
   associated with N-acetylputrescine, were enriched in several GO
   Biological Process terms, including intracellular receptor signaling
   pathway (GO:0030522) and covalent chromatin modification (GO:0016569),
   with HDAC10 being a part of the latter pathway. HDAC10 is involved in
   deacetylation of polyamines, including N-acetylputrescine^[111]29,
   whereas other members of the HDAC family have been shown to act as
   deacylases as well as deacetylases^[112]30. Moreover, putrescine
   (N-acetylputrescine precursor) depletion was previously suggested to
   affect chromatin structure in brain tumor cells^[113]31. Therefore, our
   data provide additional biological mechanisms of the genetic and
   metabolomic engagement in the above processes.

Mendelian randomization

   To identify putatively causal relationships between various phenotypes
   in FinnGen and the 65 metabolites associated with replicated variants,
   we performed a series of MR analyses (Methods) using: (1) 1801
   phenotypic traits from FinnGen, and (2) summary statistics for 3283
   plasma proteins to elucidate the possible causal pathways (see
   Methods).

   Using summary statistics from FinnGen, 27 statistically significant
   (P-value ≤ 1.51 × 10^−7, accounting for 65 metabolites, 1801 traits and
   3283 pQTLs) metabolite-outcome association pairs were detected, where
   12 (5 metabolites, 6 FinnGen outcomes) had two Instrumental Variables
   (IV) available (Supplementary Data [114]12 and Fig. [115]3). Among 15
   associations with more than two instrumental variables (9 metabolites,
   7 FinnGen outcomes), seven associations (3 metabolites, 6 outcomes)
   remained nominally significant (P-value < 0.05) in MRPRESSO outlier
   test. For example, genetically regulated higher
   1-linoleoylglycerophosphoethanolamine levels demonstrate a putative
   causal effect on Type 2 Diabetes (OR [95%CI] = 0.82 [0.80–0.85]).
   Additionally, higher 1-stearoylglycerophosphoethanolamine levels show
   putative causal effects on disorders of choroid and retina (OR
   [95%CI] = 0.81 [0.78–0.84]), degeneration of macula and posterior pole
   (OR [95%CI] = 0.71 [0.67–0.76]), wet age-related macular degeneration
   (AMD, OR [95%CI] = 0.49 [0.44–0.55]), and age-related macular
   degeneration (OR [95%CI] = 0.56 [0.51–0.61]). Likewise, higher
   1-palmitoylglycerophosphoethanolamine (GPE) levels have a significant
   causal effect on age-related macular degeneration (OR [95%CI] = 0.60
   [0.55–0.66]). The latter two associations are due to several
   conditionally independent variants in or near ALDH1A2 and GCKR,
   associated with 1-palmitoyl-GPE and 1-stearoyl-GPE levels. ALDH1A2 is
   involved in retinoic acid synthesis^[116]32, and is known to be
   associated with AMD^[117]33. GCKR is a well-known gene associated with
   diabetes, and diabetes is a risk factor for AMD^[118]34. Wet AMD is
   accompanied by severe loss of photoreceptors and ganglion
   cells^[119]35. Metabolite 1-palmitoyl-GPE was reported to induce
   neurite outgrowth;^[120]36 therefore, 1-palmitoyl-GPE may play a
   protective role against the loss of ganglion cells in wet AMD.
   Moreover, 1-palmitoyl-GPE and 1-stearoyl-GPE both belong to saturated
   lysophosphatidylethanolamine species. Given the concordant OR, it is
   possible that saturated lysophosphatidylethanolamines in general might
   influence the age-related macular degeneration. However, functional
   investigations are needed to support these findings.

Fig. 3. MR results.

   [121]Fig. 3
   [122]Open in a new tab

   A. Metabolites effect on FinnGen phenotype traits. FinnGen phenotype
   traits are provided on the x-axis, metabolites are provided on the
   y-axis. The color indicates whether increase in metabolite level
   increases FinnGen phenotype odds (purple), or decreases FinnGen
   phenotype odds (blue). The number on the center of each square
   indicates the number of variants used to obtain each result. B. pQTL
   effect on metabolites. Metabolites are provided on the x-axis, pQTL are
   provided on the y-axis. The color indicates whether increase in pQTL
   levels increases metabolite levels (purple), or decreases metabolite
   levels (blue). The number on the center of each square indicates the
   number of variants used to obtain each result.

   To determine robustness of the identified metabolite–phenotype
   putatively causal associations, we performed additional MR analyses
   using a set of independent studies (UKBiobank, European Bioinformatic
   Institute [EBI] and BioBank Japan [BBJ]), matched by the outcome. Seven
   metabolite-outcome association pairs met stringent Bonferroni
   correction (P-value ≤ 0.05/11 = 4.55 × 10^−3) and had the same
   direction of effect as in metabolite FinnGen MR analyses (Supplementary
   Data [123]12).

   The interaction between metabolite and protein plays a critical role in
   controlling cellular homeostasis^[124]37. To identify possible causal
   pathways, we performed MR to identify putatively causal relationships
   between plasma proteins using summary statistics for 3283 protein
   quantitative trait loci (pQTLs) from the INTERVAL study^[125]38 and 65
   metabolites associated with the replicated genetic loci. We detected 52
   statistically significant pairs (P-value[IVW MR] < 4.93 × 10^−7,
   accounting for 65 metabolites and 1561 pQTLs), where twelve metabolites
   were affected by 44 proteins (Supplementary Data [126]13 and
   Fig. [127]3b). For example, genetically regulated increased
   N-acylethanolamine-hydrolyzing acid amidase (ASAHL) levels were causal
   of decreased N-oleoyltaurine levels. ASAHL plays role in N-acyl
   ethanolamines degradation^[128]39–[129]43, and has hydrolytic activity
   against the ceramides^[130]39. Therefore, our data suggests that ASAHL
   may also affect N-acyl amines. Metabolites may reversely affect protein
   levels, such as protein-metabolite interactions or post-translational
   modifications^[131]44,[132]45. We additionally tested the potential
   causal associations between metabolite and plasma proteins using the
   same analytical approach. There were eleven metabolites causally
   associated with 17 proteins with a total of 24 significant
   metabolite-pQTL associations pairs (with P-value[IVW
   MR] < 1.51 × 10^−7, accounting for 65 metabolites, 1801 traints and
   3283 pQTLs). For example, 1-palmitoyl-glycerophosphoethanolamine (GPE)
   levels were predictive of P5I11 levels, and 1-stearoyl-GPE levels -
   predictive of GGT2, P5I11, PSG5, and PKB beta levels (product of AKT2);
   Supplementary Data [133]14 and Supplementary Fig. [134]4.

Discussion

   We conducted a WGS study to detect genetic loci associated with 1666
   circulating metabolites in a multi-ethnic population, and identified 75
   novel replicated metabolite-genetic locus associations, with 22
   associations driven by nonsynonymous variants. A comprehensive
   gene-centric rare variant analysis was performed for a subset of
   metabolites, with 126 gene-metabolite pairs detected, showing
   associations between 45 metabolites and 105 genes. Using Mendelian
   Randomization, we showed that the levels of 13 metabolites were
   associated with the risk of 12 phenotype outcomes, including type 2
   diabetes and macular degeneration. Moreover, 16 metabolites were
   associated with 29 protein QTLs. Our study represents the first WGS of
   human metabolome in multi-ethnic population, which provides novel
   insights beyond previous GWAS.

   Previous metabolite genetic studies often restricted to single
   metabolomic platform^[135]1,[136]46, used other platform results for
   replication^[137]46, or combined cross-platform results using
   meta-analyses^[138]21. We demonstrated a contemporary approach to
   analyze cross-platform harmonized metabolite levels in pooled samples,
   which largely improved computational efficiency. Our results showed
   that this new approach well controlled genomic inflation, reproduced
   hundreds of known metabolite loci, and enabled novel gene
   identification via rare variant analyses within a modest sample set,
   revealing the advantage of joint analyses, specifically for large
   genomic initiatives, where multiple studies are involved. Importantly,
   we are the first to extend metabolite genetic association discovery
   into multi-ethnic populations. The diverse ancestral background
   promoted novel findings beyond studies focusing on single ancestral
   population^[139]1,[140]16,[141]47, even with tens of thousands of
   participants^[142]21. Furthermore, our findings provided additional
   insights into biological pathways by investigating the interacting
   effects between proteins and metabolites, where most past studies
   dedicated to illustrate the putative causal effect between metabolites
   and health outcomes.

   Although using pooled samples is considered computationally efficient,
   the variant-set test is intensive and costly for whole-genome analyses.
   We performed gene-centric rare variant analyses among 230 metabolites,
   which had significant common variant findings, as we considered those
   metabolites had relatively high heritability and used this opportunity
   to explore rare variants contribution to those metabolites. For the
   current analyses, more than 5500 jobs were run, including both single
   variant and gene-centered analyses, on the DNAnexus platform using the
   instance type “mem3_ssd1_v2_x32”, which provides 32 cores, 224GB of
   memory (> =7GB/core), and 640GB of solid-state drive storage
   (20GB/core)^[143]48. Future efforts are warranted to further explore
   rare-variant effects across all metabolites.

   To understand possible mechanisms underlying the replicated novel
   findings, we applied versatile analytic approaches, including
   colocalization and pathway analyses, which provided an additional level
   of detail for the identified loci. Using colocalization analyses, we
   identified 18 unique loci where the novel replicated variant
   colocalized with the eQTL for 26 unique genes in GTEx tissues,
   highlighting the biologically plausible genes. For example, the splice
   site intronic SLC22A1 variant (chromosome 6:160139865) is associated
   with increased levels of the lysine metabolism metabolite
   glutarylcarnitine (C5). Our colocalization analysis also showed that
   6:160139865 colocalizes with decreased SLC22A1 levels in the liver -
   the primary expression site of the latter protein^[144]49. SLC22A1
   encodes a plasma membrane transporter organic cation transporter 1
   (OCT1)^[145]49, which plays a role in regulating levels of
   acylcarnitines^[146]50 and isobutyrylcarnitine^[147]51. Therefore, our
   data suggests that SLC22A1 plays role in regulation of blood levels of
   glutarylcarnitine.

   Another replicated intronic ELL variant rs8109573 is colocalizing with
   decreased expression of ISYNA1 in ovaries and increased myoinositol
   levels. ISYNA1 is located ~68 kilo base-pairs (kbp) downstream of ELL,
   and encodes an inositol-3-phosphate synthase enzyme, which plays a key
   role in myoinositol synthesis pathway^[148]52. Intronic variant,
   rs67481496, located in ETFDH and ~7 kbp from the replicated variant
   rs17843966, is colocalizing with increased expression of ETFDH in heart
   tissues, liver, and skeletal muscle and decreased glutarylcarnitine
   (C5-DC) levels. The latter association is consistent with the
   association observed in patients with glutaric acidemia, caused by
   deleterious mutations in the ETFDH gene, which, among other metabolite
   changes, is accompanied by increased blood levels of glutarylcarnitine
   (C5-DC)^[149]53,[150]54.

   Pathway analysis provided additional biological insights. For example,
   top results of the pathway analysis identified biological functional
   terms of coagulation and the regulation of body fluid levels for the
   metabolite acylcarnitine linoleoylcarnitine (C18:2 carnitine), with 15
   overlapping genes between these pathway annotations (Supplementary
   Data [151]11 and Supplementary Fig. [152]3). None of the genes in the
   gene sets for those pathways belonged to ‘known’ or ‘novel’ loci
   identified by single variant analysis for this metabolite, and none
   were statistically significantly associated with linoleoylcarnitine in
   gene-centric analysis. Nevertheless, gene network analysis of the GWAS
   summary statistics allowed us to identify these important gene
   interactions. Previously, observational studies showed the involvement
   of acylcarnitines in blood coagulation. For example, Deguchi et al.
   showed that long-chain acylcarnitines, including linoleoylcarnitine,
   are lower in patients with venous thromboembolism, compared to
   age-matched controls (P-value = 0.02), and that linoleoylcarnitine
   possesses anticoagulant properties, possibly, due to the capability of
   acylcarnitines to bind with factor Xa^[153]55. Later, Zeleznik et al.
   showed that metabolites in the acylcarnitine pathway, including
   linoleoylcarnitine, are depleted in the intermediate/high-risk group of
   pulmonary embolism compared to the low-risk group^[154]56, further
   supporting the notion of acylcarnitines involvement in coagulation.
   However, no coagulation-related genetic loci were previously identified
   for linoleoylcarnitine, making our pathway analysis the first genetic
   evidence to link this metabolite to blood coagulation.

   The MR analyses have been adopted to various phenotypes to help
   identify the causal relationships, specifically the potential
   bi-directional associations between proteins and metabolites. For
   example, genetically regulated higher levels of CUZD1 contribute to
   decreased levels of several lipid-related metabolites, including C56:4
   TAG, C34:3 DAG, C20:3 CE, and 1-stearoyl-GPE. CUZD1 is located in the
   secretory granules in the pancreas and pancreatic secretions. Although
   it’s exact biological function is unknown, it is thought to play a role
   in immune response due to its involvement in inflammatory bowel disease
   (IBD)^[155]57, with anti-CUZD1 autoantibodies suggested as a marker of
   the IBDs^[156]58. At the same time, patients with IBDs may have altered
   lipid profiles, compared to healthy individuals^[157]59. Therefore, our
   data suggests that CUZD1 affects the levels of various lipid-related
   metabolites, although further functional studies are needed to explore
   these relationships further.

   Likewise, genetically high levels of N-acetylarginine decrease the
   levels of IGFBP-6. IGFBP-6 has cancer-protective properties, plays a
   role in the immune system^[158]60,[159]61 and in neuronal
   protection^[160]62. N-acetylarginine is a guanidino compound that is
   capable of inducing seizures in animal experiments;^[161]63 high levels
   of this compound present in argininemia that is characterized by
   neurological symptoms^[162]64.

   In this project, we applied a stringent Bonferroni correction to define
   significance for replication, and a modest amount (79 out of 488) of
   our novel association were replicated. However, most associations (304
   out of 488) showed the same direction of effects, providing supportive
   evidence of our findings, which are warranted for further
   investigation. Of note, our replication set included participants from
   several studies with a modest sample size, which may impact the
   replication due to possible heterogeneity across studies and lack of
   sufficient statistical power, specifically for low-frequency variants.

   Of note, most our novel findings are consistent across ancestries. For
   examples, associations of rs1697421- glycerol 3-phosphate, rs6440123 −
   1-stearoyl-2-oleoyl-GPE (18:0/18:1), rs68008113 - cerotoylcarnitine
   (C26), rs113680823 - arabitol/xylitol and rs5112 −
   1-stearoyl-2-arachidonoyl-GPE (18:0/20:4) had the same directions of
   effect in three ancestries, European, African and Hispanic Americans.
   High statistical significance (p-value < 1.85 × 10^−5 in each ancestral
   group) in European and Hispanic Americans were observed though there
   was no significance in African Americans. Ancestry-specific results
   (Supplementary Data [163]5[164]c) are provided to facilitate further
   dissection of potential differences across ancestral groups.

   In summary, we showed the feasibility of performing computationally
   efficient pooled analysis, using both metabolomics and WGS data, which
   can be applied for the future research projects. Additionally, this
   study provides further determination of the genetic architecture of
   circulating metabolites in a multi-ethnic population, using both common
   and rare variants, comprehensive functional annotation, and a
   systematic identification of potential causal relationships between the
   genes, metabolites, various phenotypes and plasma protein levels. Our
   results can be widely used in future studies to expand further our
   understanding of the biological processes in health and disease.

Methods

Genetic studies

   Five cohorts contributed to the discovery stage of the analysis (ARIC,
   FHS, CHS, MESA, and HCHS/SOL) with a total of 11,840 participants,
   including 5938 EA, 1843 AA, and 4059 HIS. Study-specific
   characteristics, metabolite measurement procedures, genetic sequencing
   information, and quality control for the studies are provided in
   Supplementary Data [165]2[166]A, while basic study characteristics are
   listed in Supplementary Data [167]3.

   For replication, summary association statistics were requested (for the
   novel variant-metabolite associations with P < 3 × 10^−11) from three
   cohorts: FHS (2969 EAs), WHI (1328 EA), and JHS (2466 AAs).
   Additionally, we obtained publicly available summary statistics from
   9363 European FENLAND participants^[168]21 and 1959 EUR TwinsUK
   participants^[169]1. In total, up to 18,085 individuals (including
   15,619 European ancestry participants and 2466 AA participants) were
   available for replication meta-analysis. Study-specific characteristics
   are provided in Supplementary Data [170]2[171]B.

   All the participating studies were approved by corresponding
   institutional review boards, and all participants provided written
   informed consent.

Metabolite measurements

   Details of the metabolites measurements are provided in Supplementary
   Data [172]2, while for the previously published studies, these can be
   found in the respective manuscripts^[173]1,[174]21. In brief, blood
   samples were collected in participating studies, processed and stored
   at −70 °C since collection. Overall, 1666 metabolites were measured by
   untargeted, gas and/or liquid chromatography-mass spectrometry-based
   quantification protocol (Supplementary Data [175]3)^[176]65,[177]66. In
   HCHS/SOL and ARIC, metabolites were measured by Metabolon Inc. (Durham,
   NC) platform. For CHS, FHS, JHS MESA, and WHI, metabolites were
   measured by Broad Institute.

Genotyping, quality control, and imputation

   Blood samples were sequenced on the Illumina HiSeq X; for MESA and FHS,
   sequencing was performed by the Broad Institute of MIT and Harvard; for
   CHS and HCHS/SOL - by the Baylor College of Medicine Human Genome
   Sequencing Center, while for ARIC - by both centers. Variants calling
   was completed using the GotCloud pipeline^[178]67.

   Quality control procedures have been described elsewhere^[179]67. In
   short, variant filtering was performed by calculating Mendelian
   consistency scores using known familial relatedness and duplicates, and
   by training a Support Vector Machine classifier between known variant
   sites (positive labels - SNPs polymorphic either in the 1000 Genomes
   Omni2.5 array or in HapMap 3.3, with additional evidence of being
   polymorphic in the sequenced samples) and Mendelian inconsistent
   variants (negative labels - having the Bayes Factor for Mendelian
   consistency <0.001; or if 10% or more of families or pairs of duplicate
   samples show Mendelian inconsistency within families or genotype
   discordance between duplicate samples). Additionally, excess
   heterozygosity filter was applied to variants with the Hardy–Weinberg
   disequilibrium P-value < 1 × 10^−6 in the direction of excess
   heterozygosity after accounting for population structure. Mendelian
   discordance filter was applied when ≥5% of families show Mendelian
   inconsistency or genotype discordance.

Statistical analysis

Single variant tests

   We applied a two-stage procedure for rank normalization in
   genotype-metabolite association analyses^[180]68. The fully adjusted
   two‐stage approach was chosen due to its ability to reduce excess Type
   I errors and to improve statistical power, as well as due to having a
   lower degree of inflation compared to approaches without
   rank-normalization^[181]68. It has been widely applied to large-scale
   GWAS studies for complex traits^[182]69–[183]71, including metabolomic
   measures in the mixed populations^[184]16. As in above studies, data
   preparation for the single variant analysis involved several steps.
   First, each of the 1666 metabolites were inverse rank normal
   transformed by study, race and batch. Second, we obtained the residuals
   using generalized linear mixed model adjusting for age, sex, race,
   study, and study variables (such as recruitment center), and the first
   11 principal components with random effects accounting for
   inter-individual correlation (due to either relatedness, shared
   household, or census block group). Third, above residuals were inverse
   normal transformed, and finally, these inverse transformed residuals
   were used in the genetic analyses again adjusting for all of the
   aforementioned covariates, along with estimated glomerular filtration
   rate (eGFR)^[185]72. For this study, we considered both autosomal and X
   chromosome variants. Overall, up to 15,660,619 variants (MAF ≥ 0.5%,
   N ≥ 200, minor allele count [MAC] ≥ 5) were analyzed with each
   metabolite. Analyses were performed in GENESIS^[186]73, using additive
   genetic models. Significance for single variant analysis was defined as
   two-sided P-value ≤ 3 × 10^−11 (accounting for ~1,000,000 independent
   variants and 1666 metabolites).

Conditional analysis

   Across the analyzed genome, we defined metabolite-associated genetic
   loci as containing all statistically significant variants within
   500 kbp from each other. To account for linkage disequilibrium, we
   added 500 kbp to each side of the region, and all the overlapping
   regions were merged. We identified 922 loci, containing 2614
   locus-metabolite pairs.

   For every locus-metabolite pair, we performed conditional analysis
   using GENESIS to identify the independent leading variants.
   Conditioning was performed step-wise. In each round, conditioning was
   performed on the variant(s) with the lowest P-value in the region.
   Variants that were both statistically significant in the primary
   analysis (P-value ≤ 3 × 10^−11) and genome-wide statistically
   significant (P-value[conditional] < 5 × 10^−8) in the conditional
   analysis were considered conditionally independent associations.

   We identified 2999 conditionally independent variant-metabolite
   associations (Supplementary Data [187]4–[188]5). A majority (2330) of
   the metabolite-genetic region pairs had one conditionally independent
   variant; 218 pairs had 2 conditionally independent variants; 41 pairs −
   3; 18 pairs − 4; and 3 pairs − 6 conditionally independent variants.
   For each statistically significant independent variant-metabolite
   association, we used R to calculate proportion of variance in
   corresponding metabolite explained by the variant^[189]74.

Annotation of the known and novel findings

   To annotate the identified 2999 independent variant-metabolite
   associations, we obtained reports from the Metabolomic GWAS
   Server^[190]12, TwinsUK study^[191]1, GWAS Catalogue, GRASP Search,
   previous reports from our group^[192]16,[193]75–[194]77 and performed
   manual search through published papers to detect known loci that
   overlap with our findings. If a variant from a variant-metabolite pair
   was previously associated with any of the metabolites in its
   sub-pathway (Supplementary Data [195]3), the variant-metabolite pair
   was considered known, otherwise the variant-metabolite pair was
   considered novel.

Replication analysis

   We performed inverse-variance weighted meta-analysis of single variant
   summary statistics obtained from five studies (Supplementary
   Data [196]5), using meta version 4.18-0 R package. Out of 1985 novel
   variant–metabolite associations, 488 variant-metabolite associations
   (107 metabolites, 458 unique variants) were available in at least one
   replication cohort. Significant replication was defined as: (1) had
   two-sided P-value ≤ 1.02 × 10^−4 = 0.05/488 in meta-analysis or in a
   single replication cohort (when association was available only in one
   cohort) and (2) had consistent direction of effect in both discovery
   and replication meta-analysis.

Generalization in pediatric populations

   We requested summary statistics for 1985 novel variant-metabolite
   associations from two children studies - Childhood Asthma Management
   Program (CAMP) and Genetic Epidemiology of Asthma in Costa Rica (CRA)
   (Supplementary Data [197]7), and obtained summary statistics for 51
   novel variant-metabolite pairs. For these associations, we also
   performed inverse-variance weighted meta-analysis between the two
   studies (CAMP and CRA). Significant associations were defined as: (1)
   had two-sided P-value ≤ 9.8 × 10^−4 = 0.05/51 in meta-analysis and (2)
   had consistent direction of effect in both discovery and meta-analysis.

Gene-centric rare variant analyses

   To test whether rare variants in aggregate affect metabolite
   regulation, we performed gene-centric rare variant analyses for 230
   metabolites associated with novel loci, using variant-set test for
   association using annotation information omnibus test (STAAR-O) in
   discovery dataset^[198]24, which boosts the power of rare variant
   association tests by incorporating multiple variant functional
   annotations. For each test, we included variants with MAF ≤ 1%
   (Supplementary Fig. [199]2b). To ensure the robustness of the results,
   gene-metabolite associations with the sample size of <5000 and
   cumulative MAC of <100 were excluded. Aggregation was based on each of
   the following five functional variant categories for the gene-centric
   coding genome - missense, synonymous, putative loss of function (stop
   gain, stop loss, and splice), disruptive missense, and combined
   putative loss of function and disruptive missense. For the gene-centric
   non-coding genome, aggregation was performed based on the following
   seven variant categories: downstream, enhancer variants overlaid with
   Cap Analysis of Gene Expression (CAGE) sites, promoter CAGE, enhancer
   variants overlaid with DNAse HyperSensitivity (DHS), promoter DHS,
   upstream and UTR^[200]25. Gene-metabolite associations with two-sided
   P-value ≤ 1.05 × 10^−9 (accounting for 17,174 analyzed genes, 230
   metabolites, and 12 categories), were considered significant.

Co-localization of metabolites with eQTLs

   We performed co-localization analysis with GTEx V8 eQTLs summary to
   investigate whether, for the 65 identified novel replicated genetic
   locus-metabolite associations, these genetic loci share candidate
   variants with the gene expression levels. Analysis was performed using
   HyPrColoc, which can identify subsets of traits colocalizing at
   distinct causal variants in the genomic locus. For each genetic locus
   (Supplementary Data [201]3[202]c), all metabolites associated with
   variants within the locus with evidence of replication were analyzed
   simultaneously. Variants represented in both discovery dataset and all
   49 tissues of GTEx V8 dataset were included^[203]78,[204]79 (prior
   structure:
   [MATH: <mi>p</mi> :MATH]
   = 0.0001,
   [MATH: <mi>γ</mi> :MATH]
   = 0.98, Supplementary Data [205]10). As colocalizing, we considered
   variants with posterior probability (PPr) > 0.6 for colocalization
   between metabolite(s) with gene eQTLs in tissue(s). We also performed
   sensitivity analyses for the co-localization results (including the
   metabolites and eQTLs detected in the primary co-localization
   analysis), to address the causal configuration of priors (Supplementary
   Data [206]10[207]b).

Gene network and pathway analysis

   Gene-level association scores were obtained for each of the 65
   metabolites, based on respective single variant summary statistics. The
   gene-level association analysis was performed by applying the MAGMA
   (Multi-marker Analysis of GenoMic Annotation) tool version 1.09a. MAGMA
   maps SNPs to genes during the “annotation step”, and then performs
   SNP-wise mean for each gene to obtain gene-level P-values during the
   “gene analysis step”^[208]80. In order to perform this gene-level
   association test, a mixed population linkage disequilibrium (LD)
   reference panel from the 1000 Genomes Project for individuals of
   American ancestry was used as input, and the default MAGMA parameters
   were applied. MAGMA employs a multiple regression model to assess the
   additive effects of single variant associations, while accounting for
   LD patterns. The 1000 Genomes Project LD reference panel of American
   ancestry, which includes individuals of multiple ethnicities, was used
   because it has been previously recommended as an appropriate reference
   for investigations of a mixed population^[209]81. The P-values were
   then transferred to Z-scores via the inverse normal distribution
   function. The calculated Z-scores were used as gene weights in our
   network-based analysis of GWAS signals. The dense module search of GWAS
   (dmGWAS version 2.7) tool was used to identify gene networks associated
   with each metabolite^[210]28. The dmGWAS method uses GWAS-based
   gene-level scores and a reference protein-protein interaction (PPI)
   network to identify gene network modules associated with a phenotype of
   interest. In this case, the reference PPI used was a collection from
   PathwayCommons, representing the human protein interactome, which
   included 39,240 annotations of experimentally validated PPIs^[211]82.
   Next, gene sets from the top 10 ranking network modules were extracted
   for each metabolite. The pathway enrichment analysis was performed for
   each of these gene sets by over-representation analyses (ORA,
   Supplementary Data [212]11). The ORA was performed by using the
   WebGestalt R package version 2019 with the Gene Ontology Biological
   Process term annotations for genome protein-coding genes. Default
   parameters were applied for ORA methods^[213]83.

Mendelian randomization

   We performed a MR analysis using the summary statistics for 1801 traits
   in 135,638 participants from FinnGen (R3 - public release of 16 June
   2020). For each of 65 metabolites associated with replicated variants,
   we used all conditionally independent variants. For each variant, we
   obtained a causal estimate as the ratio of the association of the
   variant with each of 1801 FinnGen traits.

   To determine robustness of the identified statistically significant
   metabolite - FinnGen outcome phenotype associations, we performed
   additional MR analyses using our summary statistics as exposure and an
   additional set of independent studies (UKBiobank, EBI, and BBJ) as
   outcomes. Thirty-three statistically significant metabolite-FinnGen
   outcome association pairs were matched to a comparable outcome obtained
   from one of the above datasets (Supplementary Data [214]12).

   We also used the Sun et al^[215]38. summary statistics for plasma
   proteins for metabolite-pQTL MR^[216]38. For above two MR analyses,
   associations with P-value[IVW MR] < 1.51 × 10^−7 (accounting for 65
   metabolites, 1801 traits and 3283 pQTLs) were considered statistically
   significant.

   We further performed MR for 1561 pQTL independent variants reported by
   Sun et al^[217]38. as IV, with each of 65 metabolites associated with
   replicated genetic loci as outcomes (significance threshold was set at
   P-value[IVW MR] < 4.93 × 10^−7, accounting for 65 metabolites and 1561
   pQTLs).

   For all of the above MR analyses, if exposure was associated with more
   than one variant, we performed a fixed effect inverse-variance
   weighting meta-analysis (IVW) using TwoSampleMR to obtain the overall
   estimates. Heterogeneity was assessed using the Q-statistic. Further,
   if the exposure was associated with more than two variants, we
   performed Egger MR using TwoSampleMR, as well as MR-PRESSO outlier test
   (to detect outlier IVs), using MR-PRESSO version 1.0^[218]84. Egger MR,
   although conservative, generates valid estimates even if not all the
   genetic instruments are valid, given that the Instrument Strength
   Independent of Direct Effect assumption holds^[219]85. Additionally,
   Egger MR intercept can help detect (unbalanced) pleiotropy. We obtained
   the F-statistic^[220]86 for the association of genetic variants with
   corresponding metabolites to assess instrument strength.

Reporting summary

   Further information on research design is available in the [221]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [222]Supplementary Information^ (488.4KB, pdf)
   [223]41467_2023_38800_MOESM2_ESM.pdf^ (148.8KB, pdf)

   Description of Additional Supplementary Files
   [224]Supplementary Data 1^ (32.5KB, xlsx)
   [225]Supplementary Data 2-14^ (1.7MB, xlsx)
   [226]Reporting Summary^ (2.8MB, pdf)

Acknowledgements