Abstract

   Variation in the blood metabolome is intimately related to human
   health. However, few details are known about the interplay between
   genetics and the microbiome in explaining this variation on a
   metabolite-by-metabolite level. Here, we perform analyses of variance
   for each of the 930 blood metabolites robustly detected across a cohort
   of 1,569 individuals with paired genomic and microbiome data, while
   controlling for a number of relevant covariates. We find that 595 (64%)
   of these blood metabolites are significantly associated with either
   host genetics or the gut microbiome, with 69% of these associations
   driven solely by the microbiome, 15% driven solely by genetics, and 16%
   under hybrid genome-microbiome control. Additionally, interaction
   effects, where a metabolite-microbe association is specific to a
   particular genetic background, are quite common, albeit with modest
   effect sizes. This knowledge will help to guide targeted interventions
   designed to alter the composition of the human blood metabolome.

Editor summary:

   Diener and Dai et al analyse blood metabolites from 1569 individuals
   and identify metabolites associated with the microbiome, host genetics,
   or under hybrid genetic-microbiome control

Introduction

   The human blood metabolome is shaped by a combination of intrinsic and
   extrinsic forces and constitutes the primary resource pool for human
   metabolism. While the composition of the plasma metabolite pool is
   strongly driven by diet, lifestyle, and the ecology of the gut
   microbiota, the fate of individual metabolites in the blood is often
   tightly regulated by host genetics ^[40]1.

   Genetic variants are known to alter the human blood metabolome in
   several disease-relevant contexts. For example, certain deleterious
   alleles impact cholesterol levels and contribute to
   hypercholesterolemia, while other well-known alleles drive
   phenylalanine accumulation and lead to phenylketonuria ^[41]2,[42]3.
   The majority of genetic variants associated with blood metabolite
   levels affect either enzymes or solute carriers, thus directly
   influencing an individual’s ability to produce, consume, secrete, or
   absorb small molecules ^[43]4,[44]5. Many of the currently identified
   metabolome-associated genetic variants are pleiotropic ^[45]6, which
   indicates a vast interconnectedness of the metabolome across different
   systems of the body.

   Recent studies have identified the human gut microbiome as a major
   determinant of blood metabolite variability ^[46]1,[47]7. In this prior
   work, host genetics and gut microbiome composition were found to
   associate with plasma metabolite levels in a largely orthogonal manner,
   which is supported by the observation that, for the most part, host
   genetics and the gut microbiota do not tend to associate strongly with
   one another ^[48]8, even though a number of clear examples of
   genome-microbiome associations have been shown ^[49]9.

   Despite these broad, multivariate regression results showing a global
   correspondence between genomics, the microbiome, and the metabolome,
   little is known about how this maps onto individual blood metabolite
   levels. While one might expect that the variation in metabolites
   produced by bacteria in the human gut is mostly governed by the
   microbiota and that metabolites specific to human metabolic processes
   are associated more with host genetics, there are examples, as in the
   case of microbe-host co-metabolites like conjugated secondary bile
   acids ^[50]10, where the story becomes more complicated.

   Intestinal signaling, such as the activation of FXR and TGR5 receptors
   in the human gut, can regulate glucose, insulin, cholesterol, and bile
   acid homeostasis ^[51]11,[52]12. Furthermore, many microbiome-derived
   metabolites are modified by hepatic enzymes and converted into a
   variety of conjugated compounds, such as hippurate or polyamines
   ^[53]13,[54]14. These multi-layer filters on microbe-host co-metabolism
   make it challenging to map blood metabolites to potential microbial
   precursors. For example, blood cholesterol levels are affected by host
   genetic variants, but they can also be regulated through intestinal
   signaling, involving the production of secondary bile acids by gut
   commensals, or through gut microbial cholesterol dehydrogenases that
   funnel host and dietary cholesterol into fecal coprostanol
   ^[55]15,[56]16. Thus, even though host genetics and the microbiome
   appear to have largely orthogonal effects on the blood metabolome as a
   whole, they nonetheless act on an overlapping set of metabolites,
   potentially explaining independent components of the variance of these
   compounds.

   Additionally, even in the absence of strong heritability of human gut
   commensals, there are key examples where genetics and the microbiome
   may interact in particular disease conditions, such as cystic fibrosis
   ^[57]17. This raises the question of whether there are instances where
   microbiome-metabolome associations are modulated by the genetic
   background of the host in a similar manner as other gene-environment
   interactions ^[58]18, where a genetic variant may augment or attenuate
   an individual’s risk of developing a particular disease when exposed to
   a specific environmental risk factor.

   Here, we studied variability in plasma metabolite abundances in a
   cohort of 1,569 healthy individuals from the U.S. We find extensive
   interplay between the genetic and gut microbial determinants of
   individual metabolite levels in the blood, which provides deep insights
   into the microbe-host co-metabolisms that govern the composition of the
   human blood metabolome.

Results

Metabolites associated with genetics and the microbiome

   We performed genome-metabolite and microbiome-metabolite association
   tests for a healthy cohort of 1,569 individuals with paired blood and
   fecal samples taken within 30 days of each other. The cohort consisted
   of individuals in the U.S., mostly from the Pacific West, that had
   enrolled in a scientific wellness program of the former company,
   Arivale, Inc., between 2016 and 2019. These individuals were generally
   healthy, between 18 and 87 years old, predominantly female (63.5%) and
   with a mean±SD body mass index (BMI) of 28.2±6.6. Metabolite abundances
   across the data set were first adjusted for common confounders and the
   residuals were used for further downstream analyses (927 metabolites
   associated with at least one confounding variable, FDR-corrected
   p<0.05, mean R^2=0.12±0.06). The cohort was split randomly into a
   training cohort of 1,000 individuals and a validation cohort of 569
   individuals. Microbiome-metabolite and genome-metabolite association
   analyses were performed within the training cohort and the resulting
   training cohort coefficients were fixed in the validation cohort models
   to generate out-of-sample R^2 estimates ([59]Fig. 1A).

Figure 1. Study design and cross-sectional variances in metabolites explained
by host genetics, gut bacterial genera, or both.

   Figure 1.
   [60]Open in a new tab

   (A) The study comprised a metabolome-genome-wide association analysis
   and a metabolome-microbiome-wide association analysis (regressions on
   centered log-ratio transformed bacterial genus-level abundances)
   performed in the Arivale cohort. Explained variance for a specific
   blood metabolite can be partitioned into host genetic and microbiome
   associated components. The cohort was split into a training cohort of
   1,000 individuals and a validation cohort of 569 individuals. (B)
   Fraction of variance (R^2) explained in the training cohort by host
   genetic or microbiome features across the 595 metabolites with
   significant associations with either the genome or the microbiome
   (FDR-corrected p<0.05). Specific sub-groups are annotated in the plot,
   including metabolites only associated with host genetics, metabolites
   only associated with gut genera, or metabolites significantly
   associated with both (i.e., hybrid). (C) Percentage of the 595
   significant metabolites associated only with host genetics, only with
   the microbiome, or with both (hybrid) in the training cohort. (D)
   Fraction of variance (R^2) explained in the validation cohort.
   Coefficients were fixed to those from the training cohort to estimate
   out-of-sample R^2 values. (E) Percentage of the 595 significant
   metabolites associated only with host genetics, only with the
   microbiome, or with both (hybrid) in the validation cohort. (F)
   Comparison of the R^2 values obtained in this study with the R^2 values
   obtained in Bar, Korem et al. ^[61]1 for 159 metabolites identified as
   significant in both studies. The solid line denotes an ordinary least
   squares regression line, and the error bands denotes the 95% confidence
   interval of the regression. P-values were obtained by a Pearson
   correlation test and “r” denotes the Pearson product-moment correlation
   coefficient.

   To identify associations between host genetics and individual
   circulating blood metabolite levels, we performed genome-wide
   association analyses on 7.68 million common variants (Minor Allele
   Frequency ≥ 1%) in the 1,000 individuals from the training cohort for
   each of the 930 blood metabolites that passed our prevalence threshold.
   A total of 182 metabolites of the 930 tested (19.6%) were associated
   with one or more of the 297 independent lead variants that passed the
   genome-wide significance threshold (p < 5.37 × 10^−11; see Fig. S1). Of
   these 297 variants, 79 were in intergenic regions and 218 variants
   mapped to 123 genes, including those associated with inherited
   metabolic disorders such as ACADS, ALMS1, and GCKR. Potential
   loss-of-function consequences were predicted for two variants: stop
   gained for rs183603441 in HYKK, and a splice donor variant for
   rs61825638 in KMO. Missense mutations were predicted for 25 variants,
   affecting 39 metabolites. However, the majority of the
   metabolite-associated variants found within genes were synonymous.

   In particular, lactosylceramide lactosyl-N-nervonoyl-sphingosine
   (d18:1/24:1) was associated with lead variants rs3752246 (P = 9.8 ×
   10^−14) and rs3752248 (P = 5.6 × 10^−13) located in ABCA7. rs3752246 is
   a missense variant and has been previously identified as a risk variant
   for late-onset Alzheimer’s disease ^[62]19,[63]20. Prior studies have
   identified altered levels of ceramides in the blood and brain of
   individuals with Alzheimer’s Disease, with the d18:1/24:1 ceramide
   showing a strong and robust association with Alzheimer’s Disease in a
   recent meta-analysis ^[64]21. Our results further underscore a possible
   shared genetic architecture underlying lactosyl-N-nervonoyl-sphingosine
   (d18:1/24:1) levels and late-onset Alzheimer’s disease.

   Associations with more than one metabolite were identified for 70 of
   the 297 significant lead variants (23.6%), revealing the extent of
   pleiotropy in our cohort. The effect sizes of pleiotropic variants are
   generally higher than non-pleiotropic variants (median effect size
   difference = 0.08, P = 0.04; two-sided Mann-Whitney U-test), while the
   distributions of minor allele frequencies were similar between the two
   types of variants. Overall, the 70 pleiotropic lead variants were
   associated with 92 metabolites (52.2% of all significant
   genetically-associated metabolites). Four variants (rs4149056,
   rs1047891, rs148982377, and rs45446698), of which rs4149056 and
   rs1047891 are missense variants, were each associated with more than 8
   metabolites. rs4149056 in SLCO1B1 (solute transporter in liver) was
   associated with 8 metabolites, including primary and secondary bile
   acids, conjugates of polyunsaturated fatty acids, and free fatty acids.
   rs1047891 in CPS1 (mitochondrial enzyme involved in the urea cycle) was
   found to be associated with 8 metabolites, many of which were
   conjugated to glycine and glutamine moieties and were highly
   correlated. rs148982377 in ZNF789 and rs45446698 in ZSCAN25
   (transcription factors) were associated with the same set of 9 steroid
   hormone metabolites, primarily conjugates of DHEA and androsterone.

   Analysis at the gene level reveals a further degree of pleiotropy as
   well as polygenicity. Using annotations from dbSNP, we identified
   associations between genes and metabolites. 182 metabolites were
   associated with at least one gene, with 123 significant genes
   identified in total. Of these, 35 genes (27.8%) were associated with
   more than one metabolite. Individual pleiotropic genes tend to be
   associated with metabolites with similar biochemical properties,
   providing insights into unidentified metabolites. For example, the gene
   cluster of UGT1A1, UGT1A3, UGT1A4, UGT1A5, UGT1A6, UGT1A7, UGT1A8,
   UGT1A9, and UGT1A10 was associated with 10 metabolites, including
   bilirubin, biliverdin, and five unidentified compounds. These genes
   encode UDP-glucuronosyltransferase (UGT) enzymes, which metabolize
   bilirubin ^[65]22. Consistent with our results, Metabolon later
   determined that these five unidentified compounds were degradation
   products of bilirubin. In addition to pleiotropy, we also observe
   instances of polygenic associations for several metabolites. For
   example, ethylmalonate, a breakdown product of butyrate, was
   significantly associated with six different genes: ACADS, SPPL3, CABP1,
   OASL, DYNLL1, and MLEC.

   In order to identify associations between the gut microbiome and
   circulating blood metabolite levels, we performed regressions using
   centered-log-ratio (CLR) transformed bacterial genus-level abundances
   as independent variables, while correcting for sex, age, sex-age
   interactions, BMI, and genetic kinship (i.e., the first 5 principal
   components of the genetic distance matrix). A majority of the tested
   metabolites had significant associations with at least one bacterial
   genus in the gut microbiome (508/930 = 54.6%, FDR-corrected p<0.05).
   Here, the average fraction of explained variance (R^2) was lower than
   genetic associations (mean R^2 of 0.04 and 0.11 for microbiome and host
   genetic features in the training set, respectively; 0.02 vs. 0.09 in
   the validation set, respectively). However, these R^2 values are not
   directly comparable across genomic and microbiome models due to the
   different significance thresholds used (Bonferroni for the GWAS and
   FDR-correction for the microbiome).

   The blood metabolites with the largest fraction of variance explained
   by microbial features were dominated by compounds involved in bacterial
   metabolism of aromatic and phenolic compounds, such as
   cinnamoylglycine, hydrocinnamate, and hippurate (R^2>0.25 in the
   training cohort and R^2>0.15 in the validation cohort). Catabolism of
   phenylalanine and phenylacetate to cinnamate and benzoate is exclusive
   to the microbiome and is conspicuously absent in germ-free mice
   ^[66]23,[67]24. Hippurate is formed in the liver by conjugating
   benzoate to a glycine and blood levels of hippurate have been
   positively associated with gut microbiome alpha-diversity ^[68]7,[69]25
   and with overall metabolic health ^[70]26.

   Additionally, we found several products of bacterial protein
   fermentation, such as p-cresol derivatives and the tryptophan breakdown
   product indolepropionate, associated with the gut microbiome
   ([71]Tables S1–[72]S2). Furthermore, we found that all secondary bile
   acids identified in plasma were significantly associated with the gut
   microbiome (FDR-corrected p<0.05), which is expected because primarily
   bile acids are deconjugated into secondary bile acids by bile salt
   hydrolases expressed by gut commensal bacteria ^[73]27. Here,
   isoursodeoxycholate had the largest fraction of variance explained by
   gut microbiome composition (R^2 of 0.36 and 0.3 in the training and
   validation cohorts, respectively).

   R^2 estimates from the current study showed good agreement with prior
   estimates from an Israeli cohort^[74]1 for microbiome-metabolite
   associations (Pearson r=0.5, [75]Fig. 1F) and excellent agreement for
   genome-metabolite associations (Pearson r=0.9, [76]Fig. 1F). The
   largest differences in microbiome-metabolite variances across cohorts
   were observed mostly for xanthine derivatives and may be due to
   different microbiome sequencing strategies, confounder-adjustments,
   different statistical modeling approaches, or differences between serum
   (Bar, Korem, et al.) and plasma (this study).

Additive contributions of genetic and microbial factors

   We identified a total of 595 out of 930 plasma metabolites (64%) that
   were associated with either genetic factors, microbial factors, or both
   ([77]Fig. 1A–[78]B). Most of these metabolites (508) were significantly
   associated with the microbiome. 16.3% and 11.4% of these metabolites
   showed “hybrid” associations in the training and validation cohorts,
   respectively, meaning they were significantly associated with both
   genetic and microbial factors (i.e., >1% of total explained variance
   explained by both sets of features, [79]Fig. 1B–[80]E). In particular,
   we found that about ⅓ − ½ of all metabolites with a significant genetic
   association also showed a microbial association (i.e., 82/161 and
   57/160 metabolites associated with genetic variants in the training and
   validation cohorts, respectively). Conversely, no more than ⅕ of all
   metabolites with a microbiome association also showed a hybrid
   association with host genetics across training and validation cohorts.
   Consistent with the average explained variances of metabolites reported
   above for genetics and the microbiome, hybrid metabolites with
   particularly large total explained variances (i.e., >20%) showed a
   tendency to have higher genetic R^2 values than microbial R^2 values
   (see [81]Fig. 1B,[82]D and [83]Fig. 2A). However, as mentioned above,
   these patterns should be interpreted with caution because of the
   different significance thresholds used for microbiome and genetic
   association tests.

Figure 2. Metabolites associated with both genetics and the microbiome.

   Figure 2.
   [84]Open in a new tab

   (A) The 20 metabolites with the highest total R^2 in the validation
   cohort that were significantly associated with both host genetics and
   the microbiome in the training cohort. Blue and green colored bars
   denote individual R^2 values in genetics-only (blue) and
   microbiome-only (green) regression models, respectively. Gray bars
   denote R^2 values from regression using joined genetics and microbiome
   data. (B) R^2 values obtained by either adding individual contributions
   of genetics and the microbiome (additive) or by performing a joint
   regression (joint) for all metabolites in the hybrid group with an
   identified standard. The difference between the two groups indicates a
   very small, but nonetheless significant, overlap in variance explained
   by genetics and the microbiome (n=82 metabolites in each group,
   two-sided Mann-Whitney test). However, the variances explained by host
   genetics and the microbiome were largely additive. Boxes denote the
   interquartile range (IQR) and the center line denotes the median.
   Whiskers extend up to the largest and smallest point within 1.5 IQRs.
   Stars denote significance (*** - p<0.001). Metabolite names ending in
   “*” or “**” indicate compounds for which a standard is not available,
   but Metabolon is confident in its identity.

   In order to test whether genetic and microbial factors contained
   redundant information, we compared a joint genetics-microbiome model to
   individual genetic and microbial models. If the overlap in variance
   explained between genetic and microbial feature sets was large, the
   joint model R^2 value would be substantially smaller than the sum of
   the individual model R^2 values (see [85]Fig. 2A). Although we found
   that the difference in R^2 between the joint model and the sum of the
   individual model R^2 values was statistically significant in the
   validation cohort across a very large number of models, the magnitude
   of this difference was extremely small (median difference in
   R^2<0.003), indicating that genetics and the microbiome explain nearly
   exclusive components of the variance for a given metabolite. This
   result is consistent with prior work showing that the variances in the
   blood metabolome explained by the microbiome and host genetics were
   largely orthogonal ^[86]1,[87]8. However, here we demonstrate that this
   is not only true globally, but at the individual metabolite level as
   well.

Genome-microbiome interaction effects influence metabolites

   Finally, we investigated whether the genetic background of the host
   could modulate microbiome-metabolite associations (see [88]Fig. 3A). To
   this end, we tested all genetic variant / bacterial genus / metabolite
   triplets across our full cohort (N=1,569), filtering for only those
   variant-metabolite pairs that were previously detected in the
   genome-wide associations studies (i.e., 23,126 triplets in total). We
   employed a strategy similar to gene-environment interaction studies,
   using the CLR transformed abundances of microbial genera as the
   environmental variable, while adjusting for the covariates mentioned
   above ^[89]28. 466 interaction effects were deemed significant under an
   FDR-corrected p-value cutoff of 0.05, involving 69 distinct
   metabolites. Though gene-microbiome interactions were quite common,
   they only explained very small fractions of plasma metabolite
   variability (mean R^2=0.005) across all genotypes. However,
   gene-microbiome-metabolite interactions often showed appreciable R^2 in
   the homozygous minor allele genotypes, where they explained up to 16%
   of the metabolite variances ([90]Fig. 4).

Figure 3. Gene-microbiome interactions.

   Figure 3.
   [91]Open in a new tab

   (A) Gene-microbiome interactions occur when the correlation between a
   gut bacterial genus and a blood metabolite is itself conditional on a
   specific allele pairing. (B) Pathway enrichment analysis for
   metabolites involved in significant gene-microbe interactions. (C)
   Bacterial family-level enrichment analysis for taxa involved in
   significant gene-microbe interactions. (D) Host gene enrichment
   analysis for genes involved in significant gene-microbe interactions.
   In (B-D) red circles denote prevalence across all tested features
   (i.e., background prevalence) and blue triangles denote prevalence in
   only those features with significant interaction terms. Stars denote
   significantly enriched features under a one-sided hypergeometric test
   with a Benjamini-Hochberg correction for multiple testing (* - p<0.05,
   ** - p<0.01, *** - p<0.001). (B-D) include data from the entire cohort
   (n=1,569).

Figure 4. Gene-microbiome interactions explain variation in blood metabolite
levels.

   Figure 4.
   [92]Open in a new tab

   Shown are six selected examples where the genomic background modulates
   the associations between bacterial genus-level abundances and
   metabolite levels. In (A-F) metabolite abundances were log-transformed
   and adjusted for common confounders described above. In (A-F) “ρ”
   denotes the Spearman’s Rank-Order correlation coefficient of the
   regression and “p” denotes the p-value under a Spearman’s correlation
   test. The solid line indicates an ordinary least squares regression
   line relating the transformed metabolite abundance with the transformed
   genus abundance within each group. The shaded area is the 95%
   confidence interval of the regression. Associations were corrected for
   sex, age, age^2, sex:age, and sex:age^2 interactions, BMI, microbiome
   vendor, metabolomics batch, and the first 5 principal components of
   genetic ancestry (N=1,569). Metabolite names ending in “*” or “**”
   indicate compounds for which a standard is not available, but Metabolon
   is confident in its identity.

   Gene-microbiome interactions were significantly enriched in metabolites
   involved in fatty acid metabolism, phenylalanine and tyrosine
   metabolism, pyrimidine metabolism, and glycine metabolism ([93]Fig. 3C,
   one-sided hypergeometric test, FDR-corrected p<0.05) and showed
   enrichment in several host genes, such as CPS1, AGXT2, and SLC2A9
   ([94]Fig. 3E, one-sided hypergeometric test FDR-corrected p<0.05).
   Additionally, we observed enrichment for interactions that involved the
   Ruminococcaceae family ([95]Fig. 3D). We observed several instances of
   clinically relevant metabolites that showed stronger microbe-metabolite
   associations in the homozygous minor allele genotype. For example, we
   saw interactions with metabolites involved in a variety of disease
   conditions, such as hepatic failure, mitohormesis, cancer, and
   inflammatory diseases ([96]Tables S3) ^[97]29–[98]31. In particular,
   N-acetyltyrosine levels were associated with Ruminococcaceae UCG-002
   and Bacteroides, but only in the homozygous minor allele genotypes for
   variants in the ALMS1 gene and in an unchracterized locus on chromosome
   2 ([99]Fig. 4A, [100]E). Glycine levels switched from a slight negative
   correlation with Ruminococcaceae UBA1819 in the homozygous major allele
   genotype for a locus in the CPS1 gene to a stronger positive
   correlation within the homozygous minor allele genotype ([101]Fig. 4B).
   As a final example, sphingomyelin levels were associated with
   Holdemania, but only in the homozygous minor allele genotype of a
   variant in the SYNE2 gene ([102]Fig. 4C). All of the selected
   interactions could also be observed in the unadjusted data (Fig. S2).

Genetics associate with conjugated secondary bile acids

   Having mapped plasma metabolite variability to genetic and microbial
   factors, we now asked whether the observed partitioning of variances
   explained may change along pathways that involve host-microbe
   co-metabolisms. To this end, we investigated the metabolism of
   secondary bile acids. Secondary bile acids are formed in the large
   intestine via microbial deconjugation of primary bile acids, reabsorbed
   into the bloodstream via the portal vein, and further metabolized in
   the liver ^[103]32. Thus, secondary bile acid levels in the bloodstream
   are influenced by both the microbiome and the host. We investigated
   whether individual bile acid species variances were predominantly
   explained by genetic factors, microbial factors, or both.

   While unconjugated secondary bile acids were exclusively associated
   with the gut microbiome, 4 out of 6 detected secondary bile acid
   conjugates showed strong genetic components of their variances in both
   the training and validation cohorts ([104]Fig. 5A). In particular,
   plasma deoxycholate and glycochenodeoxycholate abundances showed no
   significant genetic contribution, whereas more than 40% of the variance
   in the host-conjugated deoxycholate glucuronide and more than 20% of
   the variance in host-conjugated glycochenodexycholate glucuronide was
   explained by host genetics ([105]Fig. 5B). Other glucuronidated
   non-bile-acid compounds, such as p-cresol glucuronide, did not show
   this kind of pattern (see [106]Tables S1–[107]S2). Modifications like
   glucuronidation or sulfation usually occur in the liver and are used to
   mark metabolites for excretion in feces or urine ^[108]33,[109]34.
   These results indicate that the blood levels of conjugated secondary
   bile acids are under strong genetic control, while blood levels of
   unconjugated secondary bile acids are not.

Figure 5. Host genetic and gut microbial associations with secondary bile
acids.

   Figure 5.
   [110]Open in a new tab

   (A-C) Explained variances for unconjugated secondary bile acids and
   hepatic secondary bile acid conjugates in the training cohort (A), the
   validation cohort (B) and in Bar, Korem et. al.^[111]1 (C). (A-C) Gray
   lines connect deoxycholate and glycochenodeoxycholate with their
   glucuronidated forms and colored boxes denote unconjugated and
   conjugated secondary bile acids. (D) Associations between bacterial
   genera and secondary bile acids (both conjugated and unconjugated) in
   the training cohort. Only significant associations are shown
   (FDR-corrected and confounder-adjusted p<0.05, two-sided Pearson
   correlation test). Fill colors denote the Pearson product-moment
   correlation coefficients (see legend). (E) Colored cells denote
   associations that passed a genome-wide significance threshold in the
   GWAS and the color denotes the estimated coefficient in the GWAS
   regression. In (A-B, D-E) associations and variances were corrected for
   sex, age, age^2, sex:age, and sex:age^2 interactions, BMI, microbiome
   vendor, metabolomics batch, and the first 5 principal components of
   genetic ancestry. Metabolite names ending in “*” or “**” indicate
   compounds for which a standard is not available, but Metabolon is
   confident in its identity.

   Most hepatically modified secondary bile acids showed associations with
   both genetic and microbial features. One exception was
   glycochenodeoxycholate glucuronide, which was only associated with
   genetics and not with the microbiome, even though the unconjugated
   form, glycochenodeoxycholate, was only associated with the microbiome
   ([112]Fig. 5A). This result was consistent across the training and
   validation cohorts and in the Bar, Korem, et al. data set, where the
   same transition to genetic association after glucuronidation could be
   observed ([113]Fig. 5A–[114]C). The genomic variant rs4149056 was
   associated with 3 of the 6 modified secondary bile acids and has been
   shown previously to affect the abundance of bile acids in plasma
   ^[115]35 (see [116]Fig. 5E). This variant is located in the solute
   carrier protein SLCO1B1, which is expressed in the liver ^[117]36 and
   transports secondary bile acids, with a preference for sulfated bile
   acids and bile salts ^[118]37. Several other variants in SLCO1B1 were
   associated with secondary bile acid derivatives. The two detected
   glucuronidated secondary bile acids were associated with distinctly
   different sets of SNPs. Glycochenodeoxycholate glucuronide plasma
   abundances were mostly affected by variants in the SLCO1B3-SLC1B7 gene
   cluster, whereas we found several variants located on chromosome 4,
   many of which were in the TMPRSS11E serine protease-encoding gene (and
   a number of intergenic variants), that specifically affected
   deoxycholate glucuronide levels ([119]Fig. 5E).

   Furthermore, we also identified at least two significant
   genome-microbiome-metabolite interactions affecting the levels of
   deoxycholate glucuronide and glycochenodeoxycholate glucuronide. An
   uncharacterized variant on chromosome 4 was associated with a decreased
   level of deoxycholate glucuronide in the homozygous minor allele
   genotype, but this effect was mitigated in individuals with higher
   levels of Eggerthella in their gut microbiomes ([120]Fig. 4D).
   Furthermore, a variant in the SLCO1B1 gene showed a positive
   correlation with Ruminoclostrium only in the homozygous major allele
   genotype, although this ‘major’ allele was actually less common in our
   cohort ([121]Fig. 4E).

   Thus, whereas cross-sectional variations in unmodified secondary bile
   acid levels were not explained by host genetic variation, plasma
   abundances of several secondary bile acid conjugates were significantly
   associated with a diverse set of compound-specific genetic variants on
   chromosomes 4 and 12. Additionally, we observed that unmodified
   deoxycholate and host-conjugated deoxycholate glucuronide where
   associated with mutually exclusive sets of bacterial genera, despite
   sharing the same secondary bile acid backbone ([122]Fig. 5D).

Sphingosines and ceramides show heterogeneous associations

   We next asked whether hybrid associations would also be common in
   another disease-relevant class of blood metabolites: ceramides. High
   ceramide levels have been shown to be associated with insulin
   resistance, hypercholesterolemia, liver steatosis, and the formation of
   lipid rafts ^[123]38,[124]39. Furthermore, some classes of ceramides in
   the blood increase the risk for late-onset Alzheimer’s disease, as they
   are neurotoxic and induce apoptosis ^[125]40,[126]41. Ceramides are the
   simplest of the sphingolipids and are formed either by de novo
   synthesis from sphingosine or by hydrolysis of sphingomyelin molecules
   ^[127]42. While ceramides are rarely found in bacteria, many bacteria
   in the Bacteroidetes phylum can synthesize sphingolipids, which have
   been shown to be taken up and processed by human epithelial cells in
   vitro ^[128]43. Thus, we asked whether variation in ceramide and
   sphingosine derivatives levels in blood was significantly explained by
   microbial genera, host genetic factors, or a combination of both.

   We observed a large degree of heterogeneity in variance partitioning in
   ceramide and sphingosine molecules ([129]Fig. 6). Variance in
   sphingosine itself was only weakly explained by the composition of the
   gut microbiome (R^2=0.01), with Streptococcus as the only significant
   microbiome-related association. However, other intermediates in
   ceramide synthesis, such as sphinganine derivatives, showed stronger
   microbiome and genome associations (i.e. up to 5% and 2% of variance
   explained in training and validation cohorts, respectively; [130]Fig.
   6A–[131]B). Ceramides showed a broad range of explained variances.
   Whereas most ceramides were associated with the gut microbiome, a small
   subset had additional genetic components to their variances, such as
   lactosyl-N-nervonoyl-sphingosine and ceramide (d16:1/24:1, d18:1/22:1),
   both of which had more than 5% of their variation explained by genetic
   factors in the training and validation cohorts. Both of these lipid
   species are examples of very long-chain fatty-acyl sphingolipids. While
   shorter fatty acids (≤C18) are prevalent in the human diet ^[132]44,
   often serving as preferential substrates for ceramide synthesis by
   certain gut microbes, longer chain fatty acids, such as nervonic acid
   (24:1), are often the product of elongation by host enzymes before
   being incorporated into sphingolipids that are primarily found in brain
   tissue ^[133]45. This result was corroborated in Bar, Korem, et al.,
   where only short-chain sphingolipids were observed to be associated
   with the microbiome ^[134]1. In summary, we observed broad
   heterogeneity in the fraction of variance explained by the microbiome
   and host genetics, where a difference in the fatty acid chain length
   can distinguish circulating ceramides associated with gut microbial
   genera (i.e., shorter chains) from those associated with host genetics
   (i.e., longer chains).

Figure 6. Host genetic and gut microbial associations with sphingosine and
ceramides.

   Figure 6.
   [135]Open in a new tab

   (A-C) Explained variances for sphingosine, sphingosine intermediates,
   and ceramides in the training cohort (A), the validation cohort (B) and
   the study from Bar, Korem et. al. (C). (D) Associations between
   bacterial genera and sphingosines/ceramides in the training cohort.
   Only significant associations are shown (FDR-corrected and
   confounder-adjusted p<0.05, two-sided Pearson correlation test). Fill
   colors denote the Pearson product-moment correlation coefficients (see
   legend). (E) Colored cells denote associations that passed a
   genome-wide significance threshold in the GWAS and the color denotes
   the estimated coefficient in the GWAS regression. In (A-B, D-E)
   associations and variances were corrected for sex, age, age^2, sex:age,
   and sex:age^2 interactions, BMI, microbiome vendor, metabolomics batch,
   and the first 5 principal components of genetic ancestry. Metabolite
   names ending in “*” or “**” indicate compounds for which a standard is
   not available, but Metabolon is confident in its identity.

Discussion

   In this study, we partition the cross-sectional variance explained in
   individual blood metabolite levels into their host genetic and gut
   microbiome components, across a population of 1,569 generally healthy
   U.S. residents with paired genomic, microbiome, and blood metabolomic
   data. Prior studies have established that the plasma metabolome is
   intimately connected to host genetics as well as to the gut microbiome,
   and that overall genetic and microbial influences on the metabolome
   appear to be orthogonal. Similarly, we quantified the overlap between
   microbial and genetic variance components for a given metabolite and
   found that it was significant but very small (<0.03% of explained
   variance). This result has important implications for Mendelian
   randomization studies, which rely on this overlap (i.e.,
   non-additivity), and explains why previous studies required very large
   cohorts to detect very few putatively causal effects ^[136]46.

   Until now, it has been unclear as to whether or not genetics and the
   microbiome affect the levels of mutually exclusive sets of metabolites
   in human blood or act simultaneously on individual metabolites. Most of
   the detected blood metabolites (508/930) showed significant
   associations with the gut microbiome. We found that hybrid
   genome-microbiome models were common and affected about 10–16% of all
   metabolites associated with either host genetics or the microbiome.
   Putting a more exact estimate to this proportion will require larger
   studies than the one presented here because hybrid associations were
   enriched in metabolites with small microbiome variance components
   (R^2<0.02), many of which we were likely underpowered to detect (e.g.,
   as illustrated by the lower proportion of hybrid cases in the smaller
   validation cohort). We observed that between one third and one half of
   blood metabolites associated with host genetics included significant
   hybrid associations with the gut microbiome, while only one in five
   metabolites associated with the microbiome showed an additional hybrid
   genetic component to their variances. Thus, while both genetic and
   microbiome variation were important to explaining variation in the
   blood metabolome, the microbiome (and the myriad factors associated
   with variation in the microbiome, like diet and lifestyle) appears to
   be the dominant driving force behind variation in circulating
   metabolites.

   Additionally, we found some evidence for gene-microbiome interactions,
   where a particular metabolite-microbiome association was modulated by
   the genetic background of the host. In general, these interactions only
   explained small fractions of metabolite variance (i.e., <2%). However,
   these interactions often appear to result from relatively strong
   associations within the homozygous minor allele genotype, but we had
   limited numbers of these minor allele carriers to assess these
   associations robustly. That being said, if we look within the
   homozygous minor allele genotypes, we observed up to 16% of a
   metabolite’s variance explained by a given bacterial genus. In
   particular, N-acetyl-tyrosine, glycine, and sphingomyelin abundances
   all showed strong associations with a distinct set of gut bacterial
   genera, but only in the homozygous minor allele genotypes. Thus, it
   appears that genetically-determined deviations from a particular
   quantitative trait may be modulated, or even induced, by a particular
   microbial community composition in the gut. These kinds of
   genome-microbiome interaction effects could help guide the design of
   microbiome-targeted therapeutics that mitigate host genetic disease
   risk. For instance, we found that glycine abundance was modulated by 24
   gene-microbiome-metabolome interactions ([137]Fig. 4A and [138]Table
   S3), mostly involving variants in the CPS1 gene. Glycine-addiction is a
   hallmark of cancer proliferation, and restriction of glycine and serine
   can reduce tumor growth in murine xenograft and allograft models
   ^[139]29,[140]47. Thus, there may be benefit to microbiome-targeted
   interventions in individuals with the observed variants in the CPS1
   gene to lower circulating glycine levels. However, given the generally
   low prevalence of the minor allele and the high level of heterogeneity
   in gut bacterial community composition, targeted interventional studies
   will be required to validate these interactions in vivo.

   Prior studies of microbe-host co-metabolites derived from microbial
   precursors in blood, like hippurate or p-cresol sulfate (i.e.,
   hepatically modified forms of microbially-derived metabolites), hinted
   that the cross-sectional variance in these molecules might only be
   associated with the microbiome and not with host genetics
   ^[141]13,[142]25,[143]48. Here we show that those observations do not
   extrapolate to all microbe-host co-metabolites. Indeed, we found that
   much of the cross-sectional variation in many metabolites derived from
   gut microbial precursors was explained by host genetics. For example,
   while unconjugated secondary bile acids in the bloodstream are only
   associated with the gut microbiome, their glucuronidated or sulfated
   derivatives, formed in the liver, were often strongly influenced by
   host genetic variation. Most of the variance in deoxycholate
   glucuronide levels across the current study cohort could be explained
   by a combination of genetic and microbial factors, while unconjugated
   deoxycholate was solely associated with the microbiome. Deoxycholate
   and deoxycholate glucuronide both showed associations with the
   microbiome, but these associations were with different sets of
   bacterial genera. Glucuronidation facilitates excretion into urine and
   feces and prior work in rats has shown that glucuronidated bile acids
   are reabsorbed less efficiently than unmodified secondary bile acids in
   the intestine ^[144]49. The gut metagenome encodes a variety of
   β-glucuronidases ^[145]50,[146]51 which likely enable gut commensals to
   use these bile-acid-conjugated-glucuronides as a carbon source.
   Conversely, free secondary bile acids can be toxic to many bacterial
   taxa ^[147]52. Thus, bile acids can directly drive changes in the gut
   microbiome, either by acting as carbon sources that promote growth or
   as toxic compounds that inhibit growth, which may explain the very
   different subsets of gut genera associated with conjugated and
   unconjugated forms of the same secondary bile acid ([148]Fig. 5D).

   Another intriguing finding from our analyses was the variable
   association of certain plasma sphingolipids with either host genetics
   or the gut microbiome, depending on the fatty-acyl groups comprising
   each lipid species. Ceramides with a ≤C18 fatty-acyl group showed
   stronger correspondence to the gut microbiome, consistent with the high
   prevalence of these fatty acids in the diet ^[149]44 and their
   preferential incorporation into ceramides by certain gut taxa ^[150]53.
   On the other hand, ceramides with very long chain fatty-acyl groups
   (22:1,24:1), that are most abundant in brain tissue and often
   synthesized through elongation by host enzymes ^[151]45, showed a
   stronger correspondence with host genetics. Importantly, ceramides with
   different fatty-acyl chain lengths have been implicated in a number of
   human diseases, including Alzheimer’s disease, depression and mood
   disorders ^[152]54. Distinguishing between which ceramides are under
   the control of diet and the gut microbiome versus genetic
   predisposition may aid in the design of new and improved precision
   therapeutics.

   It should be noted that the current study only included individuals
   from the U.S. who were predominantly of European descent. While our
   results are consistent with results obtained from cohorts in Israel and
   Sweden ^[153]1,[154]55, future studies in more diverse populations will
   be required to see whether or not the reported observations replicate
   more broadly. Finally, while we included many highly relevant
   covariates in our regression analyses (i.e. age, sex, BMI, and genetic
   ancestry), many of the observed microbe-metabolite associations are
   likely confounded with lifestyle and dietary habits, which were not
   comprehensively tracked in the current study population and can
   strongly influence the composition of the gut microbiome.

   Overall, our analyses show that the plasma metabolome is influenced by
   the interplay between genetic and microbial factors, where the
   abundances of individual microbially-derived metabolites absorbed in
   the gut is often affected additively by both host genetic variation and
   by variation in the ecology of the gut. Furthermore, many
   microbe-metabolite associations are dependent upon host genetic
   background. These hybrid genome-microbiome-metabolite models provide
   unique insights into the forces underlying variation in the human blood
   metabolome and can suggest possible therapeutic strategies.
   Specifically, we suggest that disease-relevant blood metabolites
   strongly associated with the microbiome may be modifiable through
   dietary, probiotic, prebiotic or lifestyle interventions, whereas
   metabolites under strong genetic control may require pharmacological
   interventions that target host metabolic pathways. Understanding which
   of these circulating small molecules fall predominantly under host
   versus microbiome control will help to guide interventions designed to
   prevent and/or treat a range of diseases.

Methods

Institutional Review Board Approval for the Study

   Procedures for this study were reviewed and approved by the Western
   Institutional Review Board with the Institutional Review Board (IRB)
   study number 20170658 for the Institute for Systems Biology and 1178906
   for Arivale.

Cohort Description

   All study participants were subscribers of the Arivale Scientific
   Wellness program who provided consent and research authorization
   allowing the use of their anonymized, de-identified data in research.
   Arivale, Inc. (USA), was a consumer scientific wellness company that
   shut down in 2019. The Arivale program is described in detail in Zubair
   et. al. ^[155]56. In brief, participants signed up for a comprehensive
   deep phenotyping program coupled with personalized data-driven wellness
   coaching in order to improve overall health and wellness. Baseline
   blood draws were taken by trained phlebotomists at LabCorp or Quest
   service centers and paired with at-home fecal sampling (see
   methodological details below). The cohort contained 997 individuals
   with female sex assigned at birth and 572 individuals with male sex
   assigned at birth (self-reported). The average age of the cohort was
   49±11 years, and the average BMI was 28.2±6.6.

Metabolomics

   Plasma samples were transferred to Metabolon, Inc. (USA), for sample
   processing and data acquisition. Untargeted metabolomics was performed
   using Metabolon’s Global Metabolomics high-performance liquid
   chromatography (HPLC)–mass spectrometry (MS) platform across a total of
   11 batches. In brief, EDTA-plasma samples were thawed on ice and
   protein was removed using aqueous methanol extraction. The resulting
   samples were divided into five fractions (1 backup fraction) and placed
   briefly on TurboVap (Zymark) to remove organic solvents. Measurements
   were taken using Waters ACQUITY ultra-performance liquid chromatography
   (UPLC) and Thermo Scientific Q-Exactive high resolution/accuracy mass
   spectrometer interfaced with a heated electrospay ionization (HESI-II)
   source and an Orbitrap mass analyzer operated at 35,000 mass
   resolution. Each of the four sample fractions was reconstituted for
   measurement in one of four methods. One aliquot was measured under
   positive ion condition by elution from a C18 column (Waters UPLC BEH
   C18–2.1×100 mm, 1.7 μm) using water and methanol with 0.05%
   perfluoropentanoic acid and 0.1% formic acid. The second sample was
   eluted from the same column but using acetonitrile along with methanol
   and water to optimize for more hydrophobic compounds. The third aliquot
   was measured under negative ion optimized conditions and gradient
   eluted from a C18 column using methanol, water, and 6.5mM ammonium
   bicarbonate at pH 8. The last aliquot was measured using negative
   ionization after elution from a HILIC column (Waters UPLC BEH Amide
   2.1×150 mm, 1.7 μm) using water, acetonitrile, and 10mM ammonium
   formate at pH 10.8. MS analyses were run using MS and data-dependent
   MSn scans with dynamic exclusion with an approximate range of 70–1000
   m/z.

   Assay batch correction was performed by running a selected set of
   quality control plasma samples in every batch and normalizing the
   measured abundances to the quality control samples. Metabolon judges
   the confidence level of annotation on a three-tiered scale. The
   majority (>80%) of metabolites were annotated by a match to an internal
   standard of the metabolite run on the same HPLC/MS setup (no specific
   mark added to the name). Some metabolites were annotated only by a
   match to a previously published MS spectrum (marked with “*”), and some
   metabolites were annotated only by a match to a known chemical formula
   (marked with “**”). Metabolites not annotated on any of those tiers,
   but detected consistently in different biofluids, are denoted by a “X
   -” prefix followed by a unique ID to indicate unknown compounds.
   Metabolite not observed in at least 50% of the samples were removed,
   resulting in the final list of 930 metabolites.

Genome-wide association analysis

   DNA was extracted from whole blood samples by Covance (Redmond, WA)
   using a standardized protocol. Whole genome sequencing was performed by
   Wuxi, Inc. (Shanghai, China) in a CLIA-certified laboratory for 1,569
   individuals. Extracted libraries were sequenced using an HiSeq X
   sequencer (Illumina, San Diego, USA) with 150bp libraries and aiming
   for >30x coverage. Basecalling and conversion to raw FASTQ files was
   performed using the Illumina Basespace software. Raw reads were aligned
   to the hg19 human reference genome with BWA 0.7.12. The GATK
   HaplotypeCaller with GATK 3.3.0 was used to call individual variants.
   This included indel local realignment followed by base quality
   recalibration. This yielded a set of around 7.68M measured SNPs.

   To account for genetic structure and potential cryptic relatedness in
   the cohort, we used a linear mixed model to test for associations
   between genetic variants and metabolite levels. For each metabolite, we
   performed a genome-wide association study using FastGWA using the hg19
   genetic map and the metabolite confounder-adjusted residuals ^[156]57.
   Linkage disequilibrium scores were extracted for all variants in the
   study and used as an additional covariate in the genome-wide
   association study. Metabolome genome-wide significance was called using
   the Bonferroni corrected p-value threshold of 5.37e-11 (5e-8 /
   #metabolites) where 5e-8 is the commonly used threshold for genome-wide
   significance used in the human genomics field ^[157]58. Lead variants
   were mapped by two-step clumping strategy based on, linkage equilibrium
   (LD scores) where we first clumped all variants within 250kbp fragments
   that exceeded LD score cutoff of 0.8, followed by second clumping in
   250kbp fragments using an LD score cutoff of 0.3. This resulted in the
   final set of 297 independent lead variants.

Gut Microbiome Sequencing

   Fecal samples were collected and chemcially preserved using a
   proprietary at-home swab kit (OMNIgene Gut, DNAGenotek, USA). 250μL of
   homogenized stool from each sample were subsequently used for DNA
   extraction, using a MoBio PowerMag Soil DNA isolation kit (QIAGEN,
   Germany) and the KingFisher Flex instrument. DNA concentrations in each
   sample were quantified with a QuBit (ThermoFisher, USA) and purity was
   assessed by measuring the A260/A280 absorbance ratio. Amplification and
   library preparation were performed by external providers using custom
   and optimized protocols by either amplifying the V4 region
   (SecondGenome, USA) or the V3-V4 region (DNAGenotek, USA) of the 16S
   gene.

   Sequencing was performed using a MiSeq (Illumina, USA) using either a
   paired-end 250bp protocol (SecondGenome) or a paired-end 300bp protocol
   (DNAGenotek). Basecalling was performed using the Illumina Basespace
   platform which removed the added phiX reads and provided the final
   FASTQ files used for downstream analysis. Quality of the sequencing
   reads was assessed by manual inspection of the error rate across
   sequencing cycles and appropriate length cutoffs of 250bp for the
   forward reads and 230bp for the reverse reads were chosen based on the
   profiles. Reads with more than 2 expected errors under the Illumina
   error model were removed from the analysis along with reads containing
   ambiguous base calls (“N” nucleotides). More than 97% of reads passed
   those filters yielding a mean of around 200,000 reads per sample.

   Filtered and truncated reads were then used to infer amplicon sequence
   variants using DADA2 ^[158]59. Here error profiles were learned for
   each sequencing run separately. The resulting ASVs and respective
   counts were merged and chimeras were removed using the “consensus”
   strategy implemented in DADA2, which removed around 16% of all reads.
   Taxonomy assignment of ASVs was performed by using the naive Bayes
   classifier in DADA2 with the SILVA database (version 128). Species
   assignments were performed by exact match of the inferred ASVs with the
   reference 16S gene in SILVA, where possible. About 90% of reads could
   be classified down to the genus levels and 32% of reads could be
   classified down to the species level. The total set of samples was then
   filtered for those individuals that also had paired microbiome,
   metabolomics, and whole genome shotgun sequencing data available. Where
   several microbiome time points were available we used the one closest
   to the blood draw used for generating the metabolomics data. Low
   abundance taxa, with less than 10 reads on average or appearing in less
   than 50% of all samples, were removed from the data set prior to
   analysis.

Metabolite-microbiome associations and confounder adjustment

   Genus-level read counts were scaled using the centered log-ratio (CLR)
   transform and centered across the 1,569 samples. Metabolite abundances
   were log-transformed and centered which yielded good similarity to a
   normal distribution as judged by a QQ-plot.

   Transformed and scaled metabolite and bacterial genera abundances were
   then regressed against the set of confounders and the residuals were
   saved as new dependent variables. For the confounder adjustment we
   chose the most common intrinsic factors that would affect blood
   metabolite and fecal microbe abundances. Firstly, we corrected for sex,
   age, and BMI by including coefficients for biological sex, age, age^2,
   sex:age, and sex:age^2 interactions, and BMI. Additionally, we included
   covariates for common batch effects, like the season of the year the
   samples were obtained, the metabolomics batch (each sample in a batch
   was processed together), and the vendor for the microbiome sample (DNA
   Genotek or Second Genome). Finally, we included continuous measures of
   genetic ancestry as covariates in our analyses, given by the first 5
   principal components from an analysis of 100,000 linkage disequilibrium
   corrected SNPs (minor allele frequency > 5%) calculated with the PC-AiR
   and PC-Relate methods ^[159]60. The residuals were used for all further
   analyses as well as visualizations and analyses of variances.

   After this we performed individual linear regressions between all
   metabolite-genus pairs to find significant associations between
   individual metabolites and bacterial genera. P-values were obtained
   using F-tests and corrected for false discovery rate using the
   Benjamini and Hochberg method. We then retained the significant
   associations with an FDR-corrected p<0.05. We also performed the same
   analysis with alpha-diversity as the independent variable or with
   alpha-diversity as a covariate to the CLR-transformed bacterial genus
   abundance to verify that individual genus-level associations were not
   due to an association with alpha-diversity alone, which we did not
   observe to be the case.

Analysis of explained variances (R^2)

   Explained variances (R^2) for each metabolite in the training cohort
   were obtained by ordinary least squares models only containing features
   that were significantly associated with each metabolite individually in
   the training cohort. Ergo, we only include genus abundances for those
   genera with a FDR-corrected p<0.05 in the previous metabolite-microbe
   associations and only those single nucleotide variants that were
   identified as significant in the previous genome-wide association study
   for the particular metabolite. This was done to avoid inflation of the
   explained variances, as adding random variables to a regression model
   tends to increase the explained variance. We performed multilinear
   regressions with only microbiome features, only genetic features, and
   both genetic and microbiome features to quantify the variance of the
   transformed and confounder-adjusted metabolite abundances that was
   explained by each feature type individually and jointly. Consequently,
   it should be noted that the explained variances in this manuscript
   refer to the metabolite abundance variance after removing components of
   the variance explained by the covariates listed above. As such, these
   R^2 caclulations pose an upper bound of the explained variance of the
   raw metabolite abundance but are also independent of common
   confounders, such as age, sex, and BMI and thus should be more
   generalizable to other populations structures.

   Overlap between microbiome and genetic features was quantified as the
   difference between the sum of R^2 of the individual feature type models
   and the joint model R^2 (R^2[microbiome] + R^2[genetics] − R^2[joint]).
   In the case of complete independence of microbiome factors and
   genetics, this difference would be zero and would become positive if
   there is partial or complete overlap in the variances explained by the
   microbiome and host genetics.

   In the validation cohort we used the same regression models but as in
   the training cohort but fixed all coefficients to the ones from the
   training cohort except for the intercept term to account for different
   baseline levels of metabolites between the cohorts and avoid negative
   R^2 values. This bounds the validation R^2 values to [0, 1] where a
   metabolite not associated/validated with the identified features from
   the training cohort will intrinsically receive an R^2 of 0.

Genome-microbiome interactions

   For gene microbiome interactions, we performed ordinary least squares
   regression of the form
   [MATH:
   <msub><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><m
   sub><mrow><msub><mrow><msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></mr
   ow></msub><mo>+</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>1</mn></mrow
   ></msub><mi>b</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow><mrow></m
   row></msub><mo>+</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>2</mn></mro
   w></msub><msub><mrow><mi>s</mi></mrow><mrow><mi>k</mi></mrow></msub><mo
   >+</mo><mi>γ</mi><msub><mrow><mi>b</mi></mrow><mrow><mi>j</mi></mrow></
   msub><msub><mrow><mi>s</mi></mrow><mrow><mi>k</mi></mrow></msub><mo>+</
   mo><mi>δ</mi><msub><mrow><mi>c</mi><mo>+</mo><mi>ϵ</mi></mrow><mrow><mi
   >i</mi></mrow></msub><mo>,</mo> :MATH]

   where m[i] denotes the scaled abundance of metabolite i, b[j] the
   scaled abundance of the bacterial genus j, s[k] the ordinal versions of
   the allele on variant k (0 - major allele, 1 - heterozygous minor
   allele, 2 - homozygous major allele), c the vector of confounder
   covariates, and ε[i] a random normally distributed variable with an
   expectation of zero. Significance of the interaction was then evaluated
   by an F-test comparing the full model to a model with a fixed ɣ=0.
   P-values from all tested metabolite / variant / genus triplets were
   adjusted for false discovery rate using the Benjamini and Hochberg
   method and judged as significant with a FDR cutoff p<0.05. Even though
   individual metabolite and bacterial genus abundances had been adjusted
   for confounders previously, we still included them in this regression
   because the product b[j]*s[k] had not been adjusted for the same
   confounders. Also, to make our analysis computationally feasible, we
   performed the regressions only for those metabolites, bacterial genera,
   and genetic variants that had shown at least one significant
   interaction in the prior analyses. The full cohort was includes in the
   interaction analysis due to the low prevalence of the minor allele
   groups.

   Finally, post-hoc tests for microbe-metabolite interactions within
   sub-cohorts carrying a specific allele pairing were performed by
   Pearson tests on the product-moment correlation within each sub-cohort
   and for those triplets that had shown significant interaction effects.

Extended Data

Extended Data Figure 1. Genome-wide association study results.

   Extended Data Figure 1.
   [160]Open in a new tab

   Genome-wide association p-values for variants associated with at least
   one metabolite (only p < 1e-5 shown here). Red dashed line denotes
   genome-wide significance (p < 5.37e-11). Gray bars denote separation
   between chromosomes. N=1,000 (training cohort). The full list of
   p-values from the GWAS analysis can be found in [161]Table S4.

Extended Data Figure 2.

   Extended Data Figure 2.
   [162]Open in a new tab

   Gene-microbiome interactions explain variation in blood metabolite
   levels (unadjusted data). Shown are six selected examples where the
   genomic background modulates the associations between bacterial
   genus-level abundances and metabolite levels. In (A-F) metabolite
   abundances were log-transformed, but no adjustment for common
   confounders, was performed here. In (A-F) “ρ” denotes the Spearman’s
   Rank-Order correlation coefficient of the regression and “p” denotes
   the p-value under a two-sided Spearman’s correlation test. The solid
   line indicates an ordinary least squares regression line relating the
   transformed metabolite abundance with the transformed genus abundance
   within each group. The shaded area is the 95% confidence interval of
   the regression. Associations were corrected for sex, age, age^2,
   sex:age, and sex:age^2 interactions, BMI, microbiome vendor,
   metabolomics batch, and the first 5 principal components of genetic
   ancestry (N=1,569). Metabolite names ending in “*” or “**” indicate
   compounds for which a standard is not available, but Metabolon is
   confident in its identity.

Supplementary Material

   1845232_RS_File
   [163]NIHMS1845232-supplement-1845232_RS_File.pdf^ (1.3MB, pdf)
   1845232_Sup_Tab_1-5

   Table S1. Training cohort R^2 values for genetics and microbiome for
   the 595 metabolites with a significant association in the training
   cohort (N=1,000). Significant associations with bacterial genera were
   obtained by Pearson correlation tests between CLR-transformed genus
   abundances and log-transformed metabolite residuals corrected for
   common confounders (all FDR-corrected p<0.05). Significant associations
   with genomic variants were obtained by the FastGWA mixed-linear model
   with a Bonferroni correction across metabolites (p < 5e-8 /
   #metabolites).

   Table S2. Validation out-of-sample R^2 values for genetics and
   microbiome for the 595 metabolites with a significant association in
   the training cohort (N=569). Explained variances were calculated by
   using the fixed association coefficients from the training set models.

   Table S3. Genome-microbiome interactions for blood metabolite
   regression models with a significant interaction term (466
   interactions, N=1,569). P-values were obtained from a two-sided Wald
   test on the interaction term coefficient γ of an ordinary least squares
   regression model.

   Table S4. Significant results from the genome-wide association analysis
   in the training cohort (N=1,000) along with annotations for variants
   and metabolites. Significant associations with genomic variants were
   obtained by the FastGWA mixed-linear model using a sparse
   genetic-relationship matrix with a Bonferroni correction across
   metabolites (p < 5e-8 / #metabolites).

   Table S5. Enrichment tests for genome-microbiome associations. P-values
   were obtained from one-sided hypergeometric tests and corrected for
   multiple testing using the Benjamini-Hochberg correction.
   [164]NIHMS1845232-supplement-1845232_Sup_Tab_1-5.xlsx^ (4.7MB, xlsx)

Acknowledgments