Abstract

   While gut microbiome and host gene regulation independently contribute
   to gastrointestinal disorders, it is unclear how the two may interact
   to influence host pathophysiology. Here we developed a machine
   learning-based framework to jointly analyse paired host transcriptomic
   (n = 208) and gut microbiome (n = 208) profiles from colonic mucosal
   samples of patients with colorectal cancer, inflammatory bowel disease
   and irritable bowel syndrome. We identified associations between gut
   microbes and host genes that depict shared as well as disease-specific
   patterns. We found that a common set of host genes and pathways
   implicated in gastrointestinal inflammation, gut barrier protection and
   energy metabolism are associated with disease-specific gut microbes.
   Additionally, we also found that mucosal gut microbes that have been
   implicated in all three diseases, such as Streptococcus, are associated
   with different host pathways in each disease, suggesting that similar
   microbes can affect host pathophysiology in a disease-specific manner
   through regulation of different host genes. Our framework can be
   applied to other diseases for the identification of host
   gene–microbiome associations that may influence disease outcomes.

   Subject terms: Gene expression, Microbial communities, Genomics,
   Machine learning, Microbiome
     __________________________________________________________________

   A machine learning framework for integrating multi-omic
   high-dimensional datasets identified disease-specific and shared host
   gene–microbiome associations across three gastrointestinal diseases.

Main

   The human gut microbiome plays a critical role in modulating human
   health and disease. Variations in the composition of the human gut
   microbiome have been associated with a wide variety of chronic
   diseases, including colorectal cancer (CRC), inflammatory bowel disease
   (IBD) and irritable bowel syndrome (IBS). For example, previous studies
   have reported an increase in the abundance of Fusobacterium nucleatum
   and Parvimonas in CRC^[40]1,[41]2, reduced abundance of
   Faecalibacterium prausnitzii and enrichment of enterotoxigenic
   Bacteroides fragilis in CRC and IBD^[42]3–[43]5, and overrepresentation
   of Enterobacteriaceae and Streptococcus in IBD and IBS^[44]6–[45]8. In
   addition to the gut microbiome, dysregulation of host gene expression
   and pathways have also been implicated in these diseases. Researchers
   have reported disruption of Notch and WNT signalling pathways in
   CRC^[46]9,[47]10, activation of toll-like receptors (for example, TLR4)
   that induce NF-κB and TNF-α signalling pathways in IBD^[48]11,[49]12,
   and dysregulation of immune response and intestinal antibacterial gene
   expression in IBS^[50]8,[51]13. While host transcription and gut
   microbiome have separately been identified as contributing factors to
   these gastrointestinal (GI) diseases, it is unclear how the two may
   associate to influence host pathophysiology^[52]14.

   Studies in model organisms have demonstrated that the modulation of
   host gene expression by the gut microbiome is a potential mechanism by
   which microbes can affect host physiology^[53]15–[54]18. For example,
   in zebrafish, the gut microbiome negatively regulates the transcription
   factor hepatocyte nuclear factor 4, leading to host gene expression
   profiles associated with human IBD^[55]18. In mice, the gut microbiota
   can alter host epigenetic programming to modulate intestinal gene
   expression involved in immune and metabolic processes^[56]16,[57]17.
   Additionally, recent in vitro cell culture experiments have shown that
   specific gut microbes can modify the gene expression in interacting
   human colonic epithelial cells^[58]19. Given the evidence for crosstalk
   between the gut microbiome and host gene regulation, characterizing the
   interplay between the two factors is critical for unravelling their
   role in the pathogenesis of human intestinal diseases.

   A few recent studies have investigated associations between the host
   transcriptome and gut microbiome in specific human gut disorders,
   including IBD, CRC and IBS. For example, studies examining
   microbiome–host gene relationships in IBD have identified mucosal
   microbiome associations with host transcripts enriched for
   immunoinflammatory pathways^[59]20–[60]22. While investigating
   longitudinal host–microbiome dynamics in IBD, Lloyd-Price et al.^[61]22
   identified associations between expression of chemokine genes,
   including CXCL6 and DUOX2, and abundance of gut microbes, including
   Streptococcus and Ruminococcaceae. Studies investigating the role of
   host gene–microbiome associations in CRC have found correlations
   between the abundance of pathogenic mucosal bacteria and expression of
   host genes implicated in gastrointestinal inflammation and
   tumorigenesis^[62]23,[63]24. In IBS, host genes implicated in gut
   barrier function and peptidoglycan binding, such as KIFC3 and PGLYRP1,
   are associated with microbial abundance of Peptostreptococcaceae and
   Intestinibacter^[64]8. While these studies have revealed important
   insights about host gene–microbiome crosstalk in GI diseases, they are
   limited in several aspects. For example, most studies have examined
   associations between a limited subset of host genes and gut microbes;
   for instance, by focusing only on differentially expressed
   genes^[65]21,[66]22,[67]24, genes associated with immune
   functions^[68]13,[69]23 or select microbes representing bacterial
   clusters or co-abundance groups^[70]20,[71]23, thus characterizing only
   a subset of potential associations. In addition, the identification of
   host gene–microbe associations is based on testing for pairwise
   correlation between every host gene and microbe using Spearman or
   Pearson correlation, thus ignoring the inherent multivariate properties
   of these datasets^[72]21,[73]22,[74]24. Additionally, most studies
   focus on examining associations in a single disease at a time; hence,
   common and unique patterns of host–microbiome associations across
   multiple disease states remain poorly characterized.

   Here we comprehensively characterized associations between mucosal gene
   expression and microbiome composition in patients with colorectal
   cancer, inflammatory bowel disease and irritable bowel syndrome—three
   GI disorders in which both host gene regulation and gut microbiome have
   been implicated as contributing
   factors^[75]1,[76]6,[77]8,[78]10,[79]13. We developed and applied a
   machine learning framework that overcomes typical challenges in
   multi-omic integrations, including high-dimensionality, sparsity and
   multicollinearity, to identify biologically meaningful associations
   between gut microbes and host genes and pathways in each disease. We
   leveraged our framework to characterize disease-specific and shared
   host gene–microbiome associations across the three diseases that may
   facilitate insights into the molecular mechanisms underlying
   pathophysiology of these gastrointestinal diseases.

Results

Integrating host gene expression and gut microbiome abundance

   To study host–microbiome relationship across diseases, we used host
   gene expression (RNA-seq) data and gut microbiome abundance (16S rRNA
   sequencing) data generated from colonic mucosal biopsies obtained from
   patients with CRC, IBD and IBS (Fig. [80]1a). For each individual in
   our study, we obtained a pair of samples—a microbiome sample and a host
   gene expression sample. In total, across the three disease cohorts, our
   study included 208 such pairs of microbiome and host gene expression
   samples (416 samples in total; Supplementary Table [81]1). All
   datasets, except the host gene expression (RNA-seq) data for CRC, have
   been previously published as individual
   studies^[82]3,[83]8,[84]22,[85]25. Detailed information on disease
   cohorts, samples, sequencing, quality control and data processing is
   available in Methods.

Fig. 1. Integrating host gene expression and gut microbiome abundance in CRC,
IBD and IBS.

   [86]Fig. 1
   [87]Open in a new tab

   a, Overview of study design: colonic biopsy samples were collected from
   individuals in each disease cohort and for each sample, paired host
   transcriptomic (RNA-seq) data and gut microbiome abundance (16S rRNA)
   data were generated. The paired host transcriptomic and gut microbiome
   data were integrated using a machine learning-based framework to
   characterize associations between gut microbiota and host genes and
   pathways across the three diseases (left to right) (see Methods for
   details on the integration framework and mathematical notations). b,
   Procrustes analysis showing overall association between variation in
   host gene expression and gut microbiome composition in CRC, IBD and IBS
   (left to right). We used Aitchison’s distance for host gene expression
   data (circles) and Bray-Curtis distance for gut microbiome data
   (triangles). Panel a was created using [88]BioRender.com.

   Previous studies have identified host gene–microbiome associations in
   human gut disorders, including CRC, IBD and IBS^[89]8,[90]22,[91]23.
   Thus, one might expect intestinal gene expression patterns and
   microbiome composition to be broadly correlated in these diseases. To
   test for such an overall association between host gene expression and
   gut microbiome composition, we performed Procrustes analysis using
   paired data for each disease cohort. Our analysis showed significant
   correspondence between host gene expression variation and gut
   microbiome composition across subjects in CRC (Monte Carlo P
   value = 0.0001). However, Procrustes agreement is not significant in
   IBD (Monte Carlo P value = 0.1) and IBS (Monte Carlo P value = 0.42)
   (Fig. [92]1b and Methods). These results were verified using a Mantel
   test (Methods). This lack of significant overall correspondence between
   host transcriptome and gut microbiome across diseases might suggest
   that, rather than an overall association between the two, it is
   probable that only a subset of gut microbes is associated with a subset
   of host genes at the colonic epithelium^[93]15,[94]17. Hence, we need
   integration approaches to characterize such host gene–microbiome
   associations.

   To this end, we developed a machine learning-based multi-omic
   integration framework using sparse canonical correlation analysis
   (sparse CCA)^[95]26,[96]27 and lasso penalized regression^[97]28. We
   applied this approach to data from CRC, IBD and IBS for a comprehensive
   characterization of potentially biologically meaningful associations
   between gut microbiota and host genes and pathways across the three
   diseases (Fig. [98]1, Methods and Extended Data Fig. [99]1).

Extended Data Fig. 1. Overview of host gene-microbiome integration pipeline.

   [100]Extended Data Fig. 1
   [101]Open in a new tab

   Steps for integrating gut microbiome abundance and host gene expression
   data using sparse CCA and lasso approaches.

Shared host pathways are associated with disease-specific gut microbes

   We hypothesized that host genes and gut microbial taxa involved in
   common biological functions would act in a coordinated fashion, and
   hence would have correlated expression and abundance patterns. To
   investigate this, we used sparse CCA to characterize group-level
   association between host transcriptome and gut microbiome in each of
   the three diseases^[102]26,[103]27. We fit the sparse CCA model for
   each dataset to identify subsets of significantly correlated host genes
   and gut microbes, known as components (Methods and Supplementary Tables
   [104]2–[105]4). We then performed pathway enrichment analysis on the
   set of host genes in each significant component to determine host
   pathways that associate with gut microbes in a disease. We identified
   ‘shared’ pathways, namely host pathways for which gene expression
   correlates with gut microbes across disease cohorts, and
   ‘disease-specific’ pathways, namely host pathways for which gene
   expression correlates with gut microbes in only one of the three
   disease cohorts (Fig. [106]2a, Fisher’s exact test, Benjamini-Hochberg
   FDR < 0.1; Supplementary Table [107]5). For simplicity, we focused on
   the top five most significant shared and disease-specific pathways
   (Fig. [108]2a). We found three pathways shared across CRC, IBD and IBS
   that are known to regulate gastrointestinal tract inflammation, and gut
   barrier protection and repair. For example, oxidative phosphorylation,
   which is the process of energy metabolism in the mitochondria, is known
   to be dysregulated in IBD and CRC, and contributes to tumorigenesis and
   drug resistance in CRC^[109]29–[110]32. We also found overlapping host
   pathways between disease pairs (see CRC & IBD, CRC & IBS, and IBD & IBS
   in Fig. [111]2a), including immunoregulatory pathways and cell-surface
   receptors such as integrin pathway, cell and focal adhesion, and
   proteasome.

Fig. 2. Shared immunoregulatory and metabolic host pathways associate with
disease-specific gut microbes across human diseases.

   [112]Fig. 2
   [113]Open in a new tab

   a, Host pathways enriched for sparse CCA gene sets associated with gut
   microbiome composition across diseases (FDR < 0.1). Dot size represents
   the significance of enrichment for each pathway, and dot colour denotes
   the disease cohort in which this pathway is significantly associated
   with microbiome composition. b, Association between microbial taxa in
   CRC, IBD and IBS (top to bottom) and host genes in the RAC1 pathway, a
   pathway for which host gene expression correlates with gut microbes
   across disease cohorts (a shared host pathway). The size of circles and
   triangles represents the absolute value of sparse CCA coefficients of
   genes and microbes, respectively. Microbial taxa belonging to a common
   taxonomic order are shown as overlapping triangles. Genes that are
   common between pathways or components across at least two disease
   cohorts are shown in grey. c, Association between the set of host genes
   in disease-specific host pathways (that is, host pathways for which
   gene expression correlates with gut microbes in only one of the disease
   cohorts) and groups of gut bacteria in CRC, IBD and IBS (top to
   bottom). d, A common set of host genes (grey circles) that are
   associated with disease-specific sets of microbes. These host genes are
   enriched for immunoregulatory and acute inflammatory response pathways.

   In addition, we identified 102 disease-specific host pathways that are
   associated with gut microbes, including 52 CRC-specific, 25
   IBD-specific and 25 IBS-specific pathways in our study cohorts
   (Supplementary Table [114]5 and Fig. [115]2a). While IBD-specific host
   pathways include A6B1/A6B4 integrin pathway and integrin beta-1 pathway
   that regulate leucocyte recruitment in GI inflammation^[116]33,
   IBS-specific pathways include immune response pathways, including B
   cell receptor signalling pathway, and ribosome pathway.

   To better understand the host gene–microbe associations that underlie
   common associations, we focused on the RAC1 pathway, where host gene
   expression is associated with microbiome composition in CRC, IBD and
   IBS. The RAC1 pathway is known to regulate immune response and
   intestinal mucosal repair, and has previously been implicated in IBD
   and CRC^[117]34,[118]35 (Fig. [119]2b). As expected, we observed some
   overlapping host genes for this shared pathway across the three
   diseases. However, the microbial taxa they are correlated with are
   disease-specific. In CRC, the RAC1 pathway is associated with oral
   bacterial taxa such as Streptococcus, Synergistales and GN02, where
   Streptococcus species are known to be associated with colorectal
   carcinogenesis^[120]36,[121]37. In IBD, the RAC1 host pathway is
   associated with microbial taxa previously implicated in IBD, including
   Granulicatella^[122]38,[123]39, and Clostridium sensu stricto 1, a
   microbe associated with chronic enteropathy similar to IBD^[124]40. In
   IBS, this pathway is associated with bacteria such as Bacteroides
   massiliensis that has been shown to be prevalent in colitis^[125]41,
   and Bifidobacterium and Odoribacter that are known to be depleted in
   IBS^[126]42,[127]43.

   To investigate disease-specific associations, we considered unique host
   pathways for which host gene expression correlates with gut microbes
   only in one of the three diseases (Fig. [128]2c). For example, the
   Syndecan-1 pathway, which we found to be associated with gut microbial
   taxa only in CRC, has been previously shown to regulate the tumorigenic
   activity of cancer cells^[129]44,[130]45. Host gene expression in this
   pathway is associated with microbial taxa such as Parvimonas and
   Bacteroides fragilis that are known to promote intestinal
   carcinogenesis and are considered biomarkers of
   CRC^[131]1,[132]46,[133]47. The integrin-1 pathway, a disease-specific
   host pathway in IBD, is found to be associated with
   Peptostreptococcaceae, Intestinibacter and
   Phascolarctobacterium—microbial taxa that have been implicated in IBD
   by previous studies^[134]48–[135]50. To assess similarities in host
   gene components across diseases, we identified a set of host genes that
   are common between components across the three diseases, and we found
   that these genes are enriched for immune response pathways in gut
   epithelium, including vascular endothelial growth factor (VEGF),
   complementation and coagulation cascades, and cytokine–cytokine
   receptor association (Fig. [136]2d, Fisher’s exact test,
   Benjamini-Hochberg FDR < 0.1). While this set of host genes is
   associated with disease-specific groups of microbes, we also found
   overlapping microbes between IBD and IBS, such as Peptostreptococcaceae
   and Intestinibacter—taxa that are found in high abundance in
   gastrointestinal inflammation^[137]48,[138]50.

Gut microbes are associated with individual host genes and pathways

   Previous studies have shown that specific microbial taxa can regulate
   expression of individual host genes^[139]15,[140]19. Therefore, we
   explored associations between individual host genes and gut microbes in
   each disease. To do so, we used lasso penalized regression models to
   identify specific gut microbial taxa whose abundance is associated with
   the expression of a host gene^[141]28. We fit these models in a
   gene-wise manner, using the expression for each host gene as the
   response variable and the abundances of gut microbial taxa as
   predictors. We then applied stability selection to identify robust
   associations (Methods). Using this approach, we found 755, 1,295 and
   441 significant and stability-selected host gene–taxa associations in
   CRC, IBD and IBS, respectively (Fig. [142]3, FDR < 0.1). These
   represent associations between 745 host genes and 120 gut microbes in
   CRC (Supplementary Table [143]6), between 1,246 host genes and 56 gut
   microbes in IBD (Supplementary Table [144]7), and between 436 host
   genes and 102 gut microbes in IBS (Supplementary Table [145]8 and Fig.
   [146]3a). Examples of specific host gene–microbe associations can be
   found in Extended Data Fig. [147]2. Overall, we observed
   disease-specific patterns in host gene–taxa associations.

Fig. 3. Specific gut microbes are associated with individual host genes and
pathways in each disease.

   [148]Fig. 3
   [149]Open in a new tab

   a, Heatmap showing the overall pattern of correlation between
   significant and stability-selected host genes (rows) and gut microbial
   taxa (columns) identified by the lasso model in CRC, IBD and IBS
   (FDR < 0.1). b, Host pathways enriched among genes that are correlated
   with specific gut microbes in CRC (purple), IBD (green) and IBS
   (yellow) (FDR < 0.1). c–e, Networks showing specific gut microbes
   correlated with specific host genes enriched for disease-specific host
   pathways in CRC (c), IBD (d) and IBS (e). Triangular nodes represent
   gut microbes, circular nodes represent host genes and pathways. Edge
   colour represents positive (blue) or negative (red) association, and
   edge width represents strength of association (Spearman rho). Grey
   edges represent host gene–pathway associations.

Extended Data Fig. 2. Examples of specific associations between gut microbial
taxa and host genes in colorectal cancer (CRC), Inflammatory Bowel Disease
(IBD), and Irritable Bowel Syndrome (IBS).

   [150]Extended Data Fig. 2
   [151]Open in a new tab

   The top row shows specific associations for CRC (n = 44), the middle
   row shows the specific associations for IBD (n = 56), and the bottom
   row shows specific associations for IBS (n = 29). The x-axis represents
   normalized abundance of microbial taxa, and the y-axis represents
   normalized expression of the host gene. The black line represents the
   line of best fit using a linear model and the grey shaded area
   represents 95% confidence interval. Desparsified lasso was used to
   obtain 95% confidence intervals (CI) and p-values (pval) for the
   association. P-values were adjusted for multiple comparisons using
   Benjamini-Hochberg (FDR) method (padj). CRC: colorectal cancer, UC:
   Ulcerative colitis, CD: Crohn’s Diseases, IBS-C: Irritable bowel
   syndrome - constipation, IBS-D: Irritable bowel syndrome - diarrhoea.

   To characterize the biological functions represented by the host genes
   that associate with specific gut microbes, we applied enrichment
   analysis to the set of gut microbiota-associated host genes in each
   disease (Methods). This is complementary to our group-level approach
   (Fig. [152]2) in that these host pathways are enriched among individual
   host gene–microbe pairs. We identified 18 host pathways that are unique
   to each disease, including 4 CRC-specific, 9 IBD-specific and 5
   IBS-specific pathways that associate with unique gut bacteria (Fig.
   [153]3b, Fisher’s exact test, Benjamini-Hochberg FDR < 0.1;
   Supplementary Table [154]9 and Methods). The host pathways enriched
   among CRC-specific associations are known to modulate tumour growth,
   progression and metastasis in CRC, such as interleukin-10 signalling,
   signalling by NOTCH1 in cancer, and regulation of MECP2 expression and
   activity^[155]51–[156]53. The host pathways we identified as enriched
   among IBD-specific associations are known to be responsible for
   maintenance of gastric mucosa integrity, inflammatory response and host
   defence against invading pathogens, such as thrombin signalling through
   proteinase activated receptors (PARs) and glucagon type ligand
   receptors^[157]54,[158]55. For IBS-specific associations, the enriched
   host pathways identified here have been shown to regulate homoeostasis
   of intestinal tissue and proinflammatory mechanisms in IBS, such as
   sumoylation of DNA damage response and repair proteins, and arachidonic
   acid metabolism^[159]56,[160]57.

   To characterize the potential mechanism of host gene–microbe
   associations, we further investigated the gut microbial taxa associated
   with host genes in these pathways (Fig. [161]3c–e). In CRC, we found
   that Anaerolineae and TM7—oral microbes that also inhabit the human
   gastrointestinal tract, and are known to promote oral and colorectal
   tumorigenesis^[162]58–[163]60—are negatively correlated with host genes
   enriched in the tumour-promoting interleukin-10 signalling pathway,
   such CXCL8 and IL1RN (Fig. [164]3c and Extended Data Fig. [165]2).
   CXCL8 is known to be overexpressed in CRC, and IL1RN is centrally
   involved in immune and inflammatory response, and its polymorphisms are
   implicated in colorectal carcinogenesis^[166]61,[167]62. Other host
   genes in interleukin-10 signalling, such as CCR2 and FPR1, are
   positively correlated with Bacteroidales (Fig. [168]3c and Extended
   Data Fig. [169]2). CCR2 and FPR1 are overexpressed in colorectal
   tumours, while Bacteroidales are enriched in CRC and associated with
   tumorigenesis^[170]63–[171]65.

   We observed that Peptostreptococcaceae, which is prevalent in patients
   with IBD^[172]50, is associated with multiple host genes and pathways
   in IBD (Fig. [173]3d). For example, its abundance is positively
   correlated with the expression of host genes MAPK3 and VIPR1, involved
   in thrombin signalling through proteinase activated receptors (PARs)
   and glucagon-type ligand receptors pathways, respectively. MAPK3 is
   known to play a role in the progression and development of IBD, and
   VIPR1 is overexpressed in inflamed mucosa^[174]66,[175]67. In
   IBS-specific associations, we found the levels of Prevotella, which is
   known to be overrepresented in individuals with loose stool, to be
   negatively associated with expression of SMC5, which is involved in the
   sumoylation pathway^[176]68,[177]69 (Fig. [178]3e). We also found the
   expression of PLA2G4A, a host gene that plays an important role in
   arachidonic acid metabolism and modulates gut epithelial
   homoeostasis^[179]70, to be positively correlated with the abundance of
   Bacteroides massiliensis in IBS, B. massiliensis being known to be
   prevalent in patients with gut malignancies^[180]41 (Fig. [181]3e).
   Taken together, these findings demonstrate that associations between
   specific gut microbial taxa and specific host genes and pathways vary
   by disease state.

Disease-specific gut microbe–host gene crosstalk

   To understand how gut microbes may associate with specific host genes
   across diseases, we explored the overlaps between host gene–microbe
   associations in CRC, IBD and IBS (Fig. [182]4a, lasso regression,
   Benjamini-Hochberg FDR < 0.1; Supplementary Table [183]10). We
   identified ‘shared’ gut microbes, namely gut microbes that associate
   with host genes in at least two diseases, and visualized their networks
   of association with host genes across diseases. We found three gut
   microbes, Peptostreptococcaceae, Streptococcus and Staphylococcus,
   whose abundance is correlated with host gene expression in all three
   diseases in our study cohorts (Fig. [184]4a, Network 1). Previous
   studies have revealed that Peptostreptococcaceae and Streptococcus spp.
   are found at elevated levels in CRC, IBD and
   IBS^[185]8,[186]37,[187]50,[188]71–[189]74. While traditionally
   considered nasal- or skin-associated bacteria, Staphylococcus spp. also
   colonize the human gastrointestinal tract and include opportunistic
   pathogens that can cause acute intestinal infections in patients with
   CRC and IBD, and are associated with increased risk of IBS and
   CRC^[190]74–[191]77. We found that the abundance of
   Peptostreptococcaceae is positively correlated with the expression of
   host genes PYGB and NCK2 in IBD, whereas it is negatively correlated
   with the expression of host gene HAS2 in IBS. PYGB and NCK2 are both
   upregulated in IBD, where PYGB is known to regulate the WNT/β-catenin
   pathway, and NCK2 is involved in integrin and epidermal growth factor
   receptor signalling^[192]78–[193]80. In contrast, HAS2 is known to have
   a protective effect on the colonic epithelium through regulation of
   intestinal homoeostasis and inflammation^[194]81,[195]82. In CRC, we
   found that the abundance of Peptostreptococcaceae is negatively
   associated with the expression of GAB1, a host gene for which
   overexpression stimulates tumour growth in colon cancer cells^[196]83.
   Streptococcus also shows a disease-specific pattern of association with
   host gene expression in our study cohort. In CRC, its abundance is
   correlated with the expression of RIPK4, which regulates WNT signalling
   and the NF-κB pathway, and is upregulated in several cancer types,
   including colon cancer^[197]84,[198]85. Similarly, in IBS,
   Streptococcus abundance is correlated with the expression of DPEP2,
   which is known to modulate macrophage inflammatory response^[199]86
   (Fig. [200]4a, Network 1).

Fig. 4. Disease-specific gut microbe–host gene crosstalk.

   [201]Fig. 4
   [202]Open in a new tab

   a, Associations for ‘shared’ gut microbes, namely microbes that are
   associated with host genes in at least two diseases. Centre: Venn
   diagram showing overlap between gut microbes associated with host genes
   in CRC, IBD and IBS. Counter-clockwise, left to right: networks showing
   host gene–microbe associations for gut microbes shared across
   associations in CRC, IBD and IBS (Network 1), CRC and IBS (Network 2),
   IBD and IBS (Network 3), and CRC and IBD (Network 4). b, Associations
   for ‘shared’ host genes, that is, genes that are associated with
   microbes in at least two diseases. Centre: Venn diagram showing overlap
   between host genes associated with gut microbes in CRC, IBD and IBS.
   Counter-clockwise, left to right: networks showing host gene–microbe
   associations for host genes shared across associations in CRC, IBD and
   IBS (Network 1), CRC and IBS (Network 2), IBD and IBS (Network 3), and
   CRC and IBD (Network 4). Circular nodes represent host genes,
   triangular nodes represent gut microbes. Coloured nodes represent
   specific diseases (purple, CRC; green, IBD; yellow, IBS), grey nodes
   represent gut microbes (a) and host genes (b) shared between
   associations across diseases. Edge colour represents positive (blue) or
   negative (red) association, and edge width represents strength of
   association (Spearman rho). All associations were determined at
   FDR < 0.1.

   Next, we visualized the networks of host gene–microbe associations for
   gut microbes that are associated with host genes in two diseases (Fig.
   [203]4a, Networks 2–4, lasso regression, Benjamini-Hochberg FDR < 0.1;
   Supplementary Table [204]10). We found 20 microbes for which abundance
   is associated with the expression of host genes in at least two
   diseases. Notably, the abundance of Blautia, a butyrate-producing
   beneficial microbe, is found to be negatively correlated with the
   expression of RIPK3 in both CRC and IBD (Fig. [205]4a, Network 4;
   Extended Data Fig. [206]3). RIPK3 promotes intestinal inflammation in
   IBD, and colon tumorigenesis^[207]87,[208]88. Interestingly, in CRC,
   Blautia is also associated with ZBP1 (Fig. [209]4a, Network 4), a host
   gene that recruits RIPK3 to induce NF-κB activation, and regulates
   innate immune response to mediate host defence against tumours and
   pathogens^[210]89,[211]90.

Extended Data Fig. 3. A common host gene-microbe association between
colorectal cancer (CRC) and Inflammatory Bowel Disease (IBD).

   [212]Extended Data Fig. 3
   [213]Open in a new tab

   The association for CRC (n = 44) is shown on the left, and for IBD
   (n = 56) is shown on the right. The black line represents the line of
   best fit using a linear model and the grey shaded area represents 95%
   confidence interval. Desparsified lasso was used to obtain 95%
   confidence intervals (CI) and p-values (pval) for the association.
   P-values were adjusted for multiple comparisons using
   Benjamini-Hochberg (FDR) method (padj). CRC: colorectal cancer, UC:
   Ulcerative colitis, CD: Crohn’s Diseases.

   Conversely, to explore how the same host genes may associate with
   different gut microbes across all diseases, we identified host genes
   that are associated with gut microbes in at least two diseases, and
   visualized their networks of association across diseases (Fig. [214]4b,
   lasso regression, FDR < 0.1; Supplementary Table [215]11). We
   identified 5 such host genes that associate with 4 gut microbes in CRC,
   5 gut microbes in IBS and 4 gut microbes in IBD (Fig. [216]4b, Network
   1; Supplementary Table [217]11). Of note, the expression of PINK1—a
   host gene that regulates mitochondrial homoeostasis and activates
   PI3-kinase/AKT signalling, contributing to intestinal inflammation in
   IBD and tumorigenesis^[218]32,[219]91—is associated with the abundance
   of Collinsella in CRC, Peptostreptococcaceae in IBD and Blautia in IBS.
   Previous studies have found that Collinsella is increased in abundance
   in CRC^[220]92, whereas Blautia has been found to be both positively
   and negatively correlated with IBS symptoms^[221]8,[222]43.

   In addition, we identified 135 host genes for which expression is
   associated with abundance of microbial taxa in at least two of the
   three diseases, and visualized the network of the most significant
   associations (Fig. [223]4b, Networks 2–4, lasso regression, FDR < 0.1;
   Supplementary Table [224]11). We found that the host genes whose
   expression is correlated with gut microbes in both CRC and IBD are
   enriched for pathways involved in immune response, including natural
   killer cell mediated toxicity, Leishmania infection and leucocyte
   transendothelial migration (Fig. [225]4b, Network 4, Fisher’s exact
   test, Benjamini-Hochberg FDR < 0.1). These host genes and the microbial
   taxa they associate with have been previously implicated in CRC and
   IBD. For example, expression of Annexin A1 or ANXA1—a host gene known
   to regulate intestinal mucosal injury and repair, and found to be
   dysregulated in CRC and IBD^[226]93,[227]94—is positively correlated
   with Bacteroidales in CRC, while negatively correlated with
   Peptostreptococcaceae in IBD (Fig. [228]4b, Network 4). TLR4—a host
   gene known to modulate inflammatory response in intestinal epithelium
   through recognition of bacterial lipopolysaccharide^[229]95, and
   previously implicated in IBD and CRC^[230]96,[231]97—is found to be
   associated with an oral microbe GN02 in CRC^[232]98, whereas in IBD it
   associates with Acidaminococcaceae—a gut microbe found to be increased
   in abundance in patients with Crohn’s disease^[233]99 (Fig. [234]4b,
   Network 4). Overall, our analysis shows that gut microbial taxa and
   host genes that are shared between associations across diseases depict
   disease-specific host–microbe crosstalk, thus suggesting that the
   mechanism of host gene–microbiome association might be specific to the
   disease.

Discussion

   While gut microbial communities and host gene expression have
   separately been implicated with human health and disease, the role of
   the association between gut microbes and host gene regulation in the
   pathogenesis of human gastrointestinal diseases remains largely
   unknown. Using a machine learning-based multi-omic integration
   framework, here we found both common and disease-specific interplay
   between gut microbes and host gene regulation that may contribute to
   the underlying pathophysiology of GI disorders, including CRC, IBD and
   IBS.

   Previous studies have found common microbial signatures across CRC, IBD
   and IBS. For example, all three diseases exhibit an overrepresentation
   of Peptostreptococcaceae and Streptococcus
   spp.^[235]8,[236]37,[237]50,[238]72. In addition, both CRC and IBD
   microbiomes are denoted by a loss of butyrate-producing gut bacteria,
   including Blautia, and an enrichment of enterotoxigenic Bacteroides
   fragilis^[239]5,[240]47,[241]72. In contrast to these microbiome
   similarities, host gene regulation shows distinct alterations across
   the three GI disorders; for example, unique antibacterial gene
   expression profile and disruption of the purine salvage pathway are
   specific to IBS, deregulation of proinflammatory IL-23/IL-17 signalling
   is unique to IBD, and prominent activation of oncogenic pathways such
   as Notch and WNT signalling is a hallmark of
   CRC^[242]8,[243]13,[244]100,[245]101. Here we found that the same
   disease-related gut microbes can associate with different host genes
   and pathways in each disease. Thus, it is compelling to hypothesize
   that although diseases can be characterized by similar microbial
   perturbations, these microbes can impact disease-specific
   pathophysiological processes through association with different host
   genes in each disease. For example, we found that in CRC, Streptococcus
   is correlated with the expression of host genes that regulate WNT
   signalling and the NF-κB pathway, whereas in IBS Streptococcus is
   correlated with host genes that modulate macrophage inflammatory
   response, thus suggesting that this gut microbe may perturb distinct
   host pathways in CRC and IBS. Of course, since our results are based on
   correlational analysis, it is challenging to assess directionality.
   While it is possible that these disease-specific associations have a
   role in disease pathogenesis, it is also possible that the
   disease-transformed colonic mucosa renders it more conducive to the
   same microbial taxa.

   We also identified a common set of host genes and pathways that are
   associated with gut microbiome composition in all three diseases. These
   included pathways that regulate gastrointestinal inflammation, immune
   response and energy metabolism, and have been previously implicated in
   these diseases^[246]30,[247]102,[248]103. Our analysis shows that these
   common host genes and pathways correlate with disease-specific gut
   microbes in CRC, IBD and IBS. For example, we found that the expression
   of host gene PINK1, which regulates the PI3-kinase/AKT signalling
   pathway^[249]91, is associated with the abundance of Collinsella in
   CRC, Peptostreptococcaceae in IBD, and Blautia in IBS in our study.
   This suggests that in some cases, distinct gut microbes may modulate
   host genes and pathways that are commonly dysregulated across different
   gut pathologies. At the same time, we also found disease-specific host
   gene–microbe associations. For example, in CRC, the Syndecan-1 pathway,
   a host pathway that modulates tumour growth and progression, is
   correlated with microbial taxa such as Parvimonas and Bacteroides
   fragilis that are known to promote intestinal
   carcinogenesis^[250]44,[251]46,[252]47. These associations are not
   found in IBD or IBS, and are unique to CRC in our study cohort. The
   disease-specific pattern of host gene–microbe crosstalk suggests that
   gut microbes, either through direct interaction with host cells or
   through indirect interaction (for example, via production of specific
   metabolites), may regulate host gene expression differently in specific
   disease contexts.

   Our study has several limitations. While we report the potential role
   of host gene–microbiome associations in the pathophysiology of GI
   disorders, our study identifies correlations, and we cannot directly
   infer causality from these results. Given the challenges associated
   with studying causal mechanisms in humans, future studies using cell
   culture or animal models would be useful in elucidating the causal role
   and directionality of associations between the gut microbiome and host
   gene regulation in these diseases^[253]104. Additionally, our analysis
   focused only on the taxonomic composition of the microbiome, and hence
   we could not characterize associations involving microbial genes and
   pathways. Lastly, there are several host and environmental variables
   that could potentially influence the microbiome and/or host gene
   expression, including age differences, sampling locations, diet, host
   genetics, treatment and medication history, which are not available
   across our disease cohorts. Thus, these factors are potential
   confounders that might influence our results.

   Overall, our work demonstrates the power of integrating gut microbiome
   and host gene expression data to provide insights into their combined
   role in GI diseases, including CRC, IBD and IBS. Taken together, our
   results indicate that GI disorders are characterized by a complex
   network of associations between microbes and host genes. Although these
   associations can be disease-specific, we find cases where the same
   microbial taxon is associated with different host genes in each
   disease, and vice-versa: cases where the same host pathway is
   associated with different microbes in each disease. Although much
   effort in microbiome research has been directed towards identifying
   specific microbial taxa that are responsible for the pathogenesis of
   disease, our findings indicate that it is critical to incorporate host
   genomics data, as it can provide invaluable information on the
   potential mechanisms through which microbes can affect health. Our
   results represent an important step towards characterizing the
   association between gut microbiome and host gene regulation, and
   understanding the contribution of the microbiome to disease aetiology.

Methods

Overall study design, samples and data

   Overall, our study included 208 paired microbiome (16S rRNA) and host
   gene expression (RNA-seq) samples, which include 88 pairs of samples in
   the CRC cohort (44 tumour and 44 patient-matched normal)^[254]3, 78
   pairs of samples in the IBD cohort (56 patients and 22
   controls)^[255]22,[256]25 and 42 pairs of samples in the IBS cohort (29
   patients and 13 controls)^[257]8 (Supplementary Tables [258]1 and
   [259]12–[260]17). All datasets, except the host gene expression data
   for CRC, have been previously published as individual
   studies^[261]3,[262]8,[263]22,[264]25. The original studies obtained
   written informed consent from study participants in each cohort.
   Details on randomization and blinding during data collection can be
   found in publications describing the original studies. No statistical
   methods were used to pre-determine sample sizes, but our sample sizes
   are similar to those reported in the previous
   publications^[265]3,[266]8,[267]22,[268]25. Below, we describe in
   detail the sample collection, sequencing and quality control for the
   CRC cohort host RNA-seq data, and summarize sample collection and data
   processing and acquisition for other datasets.

CRC samples and data

   We used 88 pairs of gut microbiome and host gene expression samples
   from 44 patients, with primary tumour and normal tissue samples taken
   from each individual. The individuals in this cohort included 23
   females and 21 males, with an average age of 65 years (median: 67,
   range: 17–91). Patient samples were characterized and described by
   Burns et al.^[269]3. Briefly, these de-identified samples were obtained
   from the University of Minnesota Biological Materials Procurement
   Network (Bionet). Tissue pairs were resected concurrently, rinsed with
   sterile water, flash frozen in liquid nitrogen and characterized by
   staff pathologists. Detailed cohort characteristics for this dataset
   are included in Supplementary Table [270]1.

Host RNA-seq sequencing, alignment and quality control

   Total RNA was extracted using a previously established
   protocol^[271]3,[272]105. Approximately 100 mg of flash-frozen tissue
   per sample were lysed by placing the tissue in 1 ml of Qiazol lysis
   reagent (Qiagen) and sonicating in a 65 °C water bath for 1–2 h.
   Nucleic acids were purified from the lysates using the Qiagen AllPrep
   DNA/RNA mini kit (Qiagen), quantified using a Nanodrop 2000
   spectrophotometer (Thermo Fisher) and submitted for RNA sequencing to
   the University of Minnesota Genomics Center. Total eukaryotic RNA
   isolates were quantified using a fluorimetric RiboGreen assay, and once
   the samples passed the initial QC step (≥1 μg and RNA integrity
   number ≥ 8), they were converted to Illumina sequencing libraries using
   Illumina’s TruSeq stranded total RNA library prep (for details, see
   [273]www.illumina.com). Truseq libraries were hybridized to a
   paired-end flow cell and individual fragments were clonally amplified
   by bridge amplification on the Illumina cBot. Once clustering was
   complete, the flow cell was loaded onto the HiSeq 2500 and sequenced
   using Illumina’s SBS chemistry. Base call (.bcl) files for each cycle
   of sequencing were generated by Illumina Real Time Analysis (RTA)
   software. Primary analysis and index de-multiplexing were performed
   using Illumina’s bcl2fastq v2.20.0.422, which outputs the demultiplexed
   FASTQ files.

   A quality check of raw sequence FASTQ files was performed using FastQC
   software (version 0.11.5)^[274]106. Quality trimming was performed to
   remove sequence adaptors and low-quality bases using Trimmomatic with
   3 bp sliding window trimming from the 3’ end requiring minimum Q16
   (phred33)^[275]107. FastQC was run on the resulting trimmed files to
   ensure good quality of sequences. The paired-end reads were mapped to
   NCBI v38 H. sapiens reference genome using HISAT2^[276]108, resulting
   in an average alignment rate of 87.11% overall for 88 samples. We
   obtained a range of read counts between 14,365,657 and 31,530,487
   aligned reads per sample, with an average of 22,475,688.2 and
   22,697,605.5 aligned reads per sample. SAMtools was used for sorting
   and indexing the aligned bam files. After alignment, the ‘Subread’
   package (version 1.4.6) within the ‘featureCounts’ programme was used
   to generate the transcript abundance file^[277]109 (Extended Data Fig.
   [278]4).

Extended Data Fig. 4. Quality control and transcript quantification of host
RNA-seq data for colorectal cancer samples.

   [279]Extended Data Fig. 4
   [280]Open in a new tab

   A. Average phred score for forward (R1) and reverse (R2) reads output
   by FastQC. B. Distribution of host gene expression (log10(expression)
   value) for each sample quantified by Subread package.

16S rRNA sequencing data

   The microbiome dataset used in this study was generated and published
   previously^[281]3. Briefly, total DNA was isolated from the
   flash-frozen tissue samples and their associated microbiomes by
   adapting an established nucleic acid extraction protocol^[282]105. DNA
   isolated from colon samples was quantified by quantitative PCR (qPCR),
   and the V5–V6 regions of the 16S rRNA gene were PCR amplified and
   sequenced using the Illumina MiSeq (v3 Kit) with 2 × 250 bp paired-end
   protocol. The forward and reverse reads were merged and trimmed using
   USEARCH v7^[283]110. The merged and filtered reads were used to pick
   operational taxonomic units (OTUs) with QIIME v.1.7.0^[284]111. We used
   the unnormalized and unfiltered OTU table in tab-delimited format,
   representing mucosal microbiome data from 44 tumour and 44
   patient-matched colon tissue samples. We describe the steps for
   preprocessing microbiome data for integration analysis below.

IBD samples and data

   We used previously generated and published host gene expression
   (RNA-seq) and mucosal gut microbiome (16S rRNA) data for the IBD cohort
   generated as part of the HMP2 project^[285]22,[286]25. These include
   data from colonic biopsy samples collected from 78 individuals,
   including 56 individuals with IBD and 22 individuals without IBD
   (‘non-IBD’ in HMP2). Out of 56 IBD patients, 34 patients had Crohn’s
   disease (CD) and 22 patients had ulcerative colitis (UC). The
   individuals in this cohort included 38 females and 40 males. Age at the
   time of sample collection is not reported in the metadata file
   available for this cohort ([287]http://ibdmdb.org). The detailed
   protocol for sample collection and processing is described at
   [288]http://ibdmdb.org/protocols. Briefly, biopsies were collected
   during the initial screening colonoscopy and stored in RNAlater for
   molecular data generation (host and microbial, stored at –20 °C). DNA
   and RNA were extracted from RNAlater-preserved biopsies using the
   AllPrep DNA/RNA universal kit from Qiagen. For microbiome profiling,
   bacterial genomic DNA was extracted from the total mass of the biopsied
   specimens using the MoBIO PowerLyzer tissue and cells DNA isolation
   kit. The 16S rDNA V4 region was PCR amplified and sequenced in the
   Illumina MiSeq platform using the 2 × 250 bp paired-end protocol. Read
   pairs were demultiplexed and merged using USEARCH v7.0.1090 and
   clustered into OTUs using the UPARSE algorithm^[289]110,[290]112. For
   host gene expression data, mRNA was extracted from biopsy samples,
   followed by RNA-seq library preparation using a variant of Illumina
   TruSeq stranded mRNA sample preparation kit. Sequencing was performed
   according to the manufacturer’s protocols using either the HiSeq 2000
   or HiSeq 2500 with 101 bp paired-end reads. Data were analysed using
   the Broad Picard Pipeline.

   Detailed cohort characteristics for this dataset are included in
   Supplementary Table [291]1. We downloaded metadata, host RNA-seq data
   and microbiome data for these samples from [292]http://ibdmdb.org in
   July 2018. We downloaded the unnormalized and unfiltered OTU table and
   host transcript read counts files in tab-delimited format. We describe
   the filtering and preprocessing steps for host gene expression and
   microbiome data below.

IBS samples and data

   We used previously generated and published host gene expression
   (RNA-seq) and mucosal gut microbiome (16S rRNA) data for the IBS
   cohort^[293]8. These include data from colonic biopsy samples collected
   from 42 individuals, including 29 individuals with IBS and 13 healthy
   individuals (non-IBS). The individuals in this cohort included 31
   females and 11 males, with an average age of 38 years (median: 35,
   range: 20–63). These samples were collected at Mayo Clinic Rochester
   and are described in detail by Mars et al.^[294]8. Briefly, for the
   microbiome data, DNA was extracted from biopsy sections using the
   QIAGEN PowerSoil kit (QIAGEN). The V4 region of the 16S rRNA gene was
   amplified, followed by paired-end 2 × 250 bp sequencing on an Illumina
   MiSeq. Trimming of adaptors, quality control and merging of reads were
   performed using Shi7^[295]113. Amplicon sequences were aligned to the
   16S rRNA genes using BURST^[296]114. For host gene expression data,
   mRNA was extracted from biopsy samples, followed by RNA-Seq library
   preparation using the Illumina TruSeq RNA Library Prep Kit v2.
   Sequencing was performed on an Illumina HiSeq 2000 with 101 bp
   paired-end reads. Gene expression counts were obtained using the
   MAP-RSeq v.2.0.0 that consists of alignment with TopHat 2.0.12 against
   the human hg19 genome build, and gene counts with the Subread package
   1.4.4^[297]115–[298]117.

   Detailed cohort characteristics for this dataset are included in
   Supplementary Table [299]1. We obtained the unnormalized and unfiltered
   OTU table and host transcript read count files in tab-delimited format
   via personal communication with authors of the paper^[300]8. For some
   individuals, samples were collected at two timepoints. For these cases,
   we averaged the gene expression levels and microbiome abundance
   measurements across the two timepoints. This is supported by a recent
   study showing that ‘omics’ methods are more accurate when using
   averages over multiple sampling timepoints^[301]118. We describe the
   filtering and preprocessing steps for host gene expression and
   microbiome data below.

Preprocessing host gene expression data

   For host gene expression data for each disease cohort, we used the
   ‘biomaRt’ R package (version 2.37.4) to only keep data for
   protein-coding genes^[302]119. We filtered out lowly expressed genes to
   retain genes that are expressed in at least half of the samples in each
   disease cohort. We performed variance stabilizing transformation using
   the R package ‘DESeq2’ (version 1.14.1) on the filtered gene expression
   read count data^[303]120. We filtered out genes with low variance,
   using 25% quantile of variance across samples in each disease cohort as
   cut-off. Performing these steps for RNA-seq data for each disease
   cohort separately resulted in a unique host gene expression matrix per
   disease for downstream analysis, including 12,513 genes in the CRC
   dataset, 11,985 genes in IBD dataset and 12,429 genes in IBS dataset.

Preprocessing microbiome data

   We performed the following steps for microbiome data from each disease
   cohort separately. First, sequences that were classified as either
   having originated from Archaea, chloroplasts, known contaminants
   originating from laboratory reagents or kits, and soil- or
   water-associated environmental contaminants were removed from the OTU
   table as described earlier^[304]121. Next, we summarized the OTU table
   at the species (if present), genus, family, order, class and phylum
   taxonomic levels, and performed prevalence and abundance-based
   filtering to retain taxa found at 0.001 relative abundance in at least
   10% of the samples.

   To allow for identification of associations at any taxonomic level
   without repeating the analyses at each taxonomic rank, we combined
   summarized taxa matrices at different ranks into a combined taxa
   matrix. This approach could potentially lead to multi-counting of reads
   within a taxonomic group, leading to addition of correlated features in
   the taxa abundance matrix. This issue was mitigated by our penalization
   approach using lasso with stability selection (see Methods: ‘Lasso
   regression analysis’ and ‘Stability selection for the lasso model’).
   Specifically, instead of picking multiple correlated microbial taxa
   from a given taxonomic clade, this approach only selects the microbial
   taxon out of a group of correlated taxa for which the abundance is most
   robustly associated with the expression of a host
   gene^[305]28,[306]122. This approach allowed us to identify signals
   found at any taxonomic level and avoid missing potentially relevant
   associations by limiting the analysis to a single taxonomic level. At
   the same time, given the large number of features in high-dimensional
   datasets such as gene expression and microbiome data, our approach
   circumvents the computationally intensive analysis that would be
   required if each taxonomic level was analysed separately.

   To account for compositionality effects in microbiome datasets, we
   tested two different approaches for performing centered log ratio (CLR)
   transformation on our taxonomic data for each disease: (1) we
   concatenated the summarized taxa matrices (count data) into a combined
   taxa matrix, and then applied CLR transform on the combined matrix, (2)
   we CLR-transformed each taxon rank, and then concatenated them into a
   combined matrix. We verified whether these two transformation
   approaches were correlated with each other. To do so, we compared the
   taxa abundance profiles resulting from the two transformation
   approaches, and found that the two profiles are significantly
   correlated (P value < 0.05), with an average Pearson’s correlation of
   0.92 and an average Spearman correlation of 0.87 across samples in a
   dataset (see Extended Data Fig. [307]5 for example correlations between
   the taxa profiles resulting from the two transformations on a few
   randomly selected samples from the CRC microbiome dataset). This
   concordance between the taxa profiles resulting from the two
   transformation approaches implies that the transformation approach is
   unlikely to impact downstream results. The first approach generated a
   taxa profile that is compositionally coherent and has a uniform
   transformation across taxa in a dataset, whereas the second approach
   resulted in a taxa profile with multimodal distribution (corresponding
   to composition at each taxonomic rank) that might bias the variable
   selection by lasso approach. Hence, we adopted the first approach for
   transforming our taxonomic data.

Extended Data Fig. 5. Scatterplots showing correlation between taxa profiles
generated from two transformation approaches across three random samples from
colorectal cancer cohort.

   [308]Extended Data Fig. 5
   [309]Open in a new tab

   The x-axis shows taxa profile resulting from one approach, where we
   summarize taxa ranks, combine summarized rank matrices, and CLR
   transform the combined matrix (CLR_taxa_combined), and the y-axis
   represents taxa profiles resulting from the second approach, where we
   summarize taxa ranks, CLR transform each taxa rank, and combine the
   CLR-transformed taxa ranks (CLR_taxa_ranks). The black line represents
   the line of best fit using a linear model. Spearman’s rho and
   associated two-tailed p-values are shown.

   As a result, we obtained a taxonomic abundance matrix for each disease
   cohort, which included 235 taxa in the CRC dataset, 121 taxa in the IBD
   dataset and 238 taxa in the IBS dataset. We observed that the number of
   unique taxonomic groups found in the IBD dataset was 40% of those found
   in the CRC and IBS datasets, thus implying that the IBD dataset has
   lower bacterial diversity than the other two disease datasets. This
   observation is consistent with previous studies that have shown reduced
   bacterial diversity in gut mucosal microbiome in patients with IBD
   compared with individuals without this disease^[310]123–[311]125,
   including the HMP2 study that generated and described the IBD dataset
   used in our study^[312]22,[313]25. In addition, previous studies that
   generated and characterized the CRC and IBS datasets used in our work
   have reported increased microbial diversity in these
   conditions^[314]3,[315]8.

Integrating host gene expression and gut microbiome data across diseases

   Our study includes three different disease cohorts with disparate
   protocols for sample collection, handling, preparation and sequencing.
   Previous studies have shown that differences in data generation
   protocols, including sample collection, storage, DNA/mRNA extraction,
   PCR amplification and sequencing can lead to potential batch effects
   regardless of the data processing pipeline used^[316]126–[317]128.
   Studies have shown that even when using the same data curation
   pipeline, biases in the data generation pipeline and batch effects
   still influence the assignment of taxonomic composition and gene
   expression profiles^[318]127,[319]129. Statistical approaches to
   correct for batch effects have been proposed for gene expression and
   microbiome datasets; however, most of these approaches are relevant to
   testing differences between cases and controls, and do not apply to
   integrative analyses^[320]126,[321]128.

   Previous studies have also shown that different clustering approaches,
   such as operational taxonomic units (OTUs), zero-radius OTUs (zOTU) and
   amplicon sequence variants (ASVs), and specific pipeline settings may
   not have major influence on taxonomic classification compared with
   experimental factors such as choice of sequencing primer^[322]130. To
   check this in our dataset, we re-analysed a few samples from some of
   our disease dataset using the DADA2 pipeline^[323]131, which uses ASV
   clustering, and compared results to the OTU clustering that we applied
   to process the data used in our study. We found that the estimated taxa
   profiles are correlated between the two approaches at different
   taxonomic levels. For example, in a few CRC samples, we found that taxa
   profiles obtained using OTU clustering and ASV clustering are
   significantly correlated at different taxonomic levels, including at
   the phylum (Spearman rho = 0.89, P value = 0.0068), class (Spearman
   rho = 0.68, P value = 0.029) and genus (Spearman rho = 0.6, P
   value = 0.088) levels. Despite these fairly correlated taxonomic
   profiles, it is hard to assess the overall bias due to differences in
   data processing on the downstream analyses. Additionally, it is
   difficult to disentangle and compare the bias due to differences in
   data processing pipelines with those due to differences in experimental
   factors, as the latter cannot be quantified in our study. These
   combined differences could potentially influence the assignment of
   taxonomic composition and gene expression profiles across disease
   cohorts, which may in turn impact downstream integration analyses.
   While it is difficult to fully eliminate these biases, we tried to
   minimize the overall batch effect across disease cohorts in our study
   by adopting a meta-analysis approach, where we performed our
   integration analysis and compared disease to control samples within
   each cohort separately, and combined the results across cohorts at the
   last analysis step. While meta-analysis approaches have disadvantages,
   such as reduced statistical power, they have been extensively used to
   minimize batch effects when integrating genomic data from multiple
   studies^[324]128, and have recently proven useful in microbiome
   studies^[325]132,[326]133.

   Another potential issue is the difference in sample size across disease
   cohorts and between the case and control groups within each cohort.
   These sample size differences can result in differences in statistical
   power when applying sparse CCA and lasso regression. This can impact
   the number of host gene–microbe associations and pathways identified in
   each disease cohort, and the number of overlapping associations and
   pathways identified across cohorts. We attempted to minimize this
   effect by applying a differential enrichment analysis that is more
   robust to different levels of statistical power due to sample size.
   Additionally, our analysis focused only on the taxonomic composition of
   the microbiome, hence we could not characterize associations involving
   microbial genes and pathways. We note that it might be challenging to
   generalize some of the results found here, as microbiome profiles may
   vary across individuals and diseases. Hence, investigating the
   functional repertoire of microbial changes and its association with
   host genomic data would be a promising future direction.

Procrustes analysis and Mantel test

   To assess overall correspondence between host gene regulation and gut
   microbiome composition in CRC, IBD and IBS, we performed Procrustes
   analysis in R using the ‘vegan’ package (version 2.4-5)^[327]134. For
   each disease cohort, we used Aitchison’s distance on host gene
   expression data and Bray-Curtis distance on gut microbiome data as
   input to the Procrustes analysis^[328]135. The significance of rotation
   agreement was obtained using the ‘protest()’ function with 9,999
   permutations. We also applied a Mantel test to verify overall
   correlation between dissimilarity matrices of host gene expression
   (Aitchison’s distance) and gut microbiome abundance (Bray-Curtis
   distance) in each disease cohort using the vegan package (version
   2.4-5) in R. We also used the Mantel test to verify the overall
   correspondence between paired data for each disease cohort, and found a
   similar pattern of significance of agreement as Procrustes analysis for
   CRC (P value = 0.0026), IBD (P value = 0.2597) and IBS (P
   value = 0.9525). Significance was tested using 9,999 permutations.

Overview of integration framework

   We developed a machine learning framework for integrating multi-omic
   high-dimensional datasets, such as host gene expression and gut
   microbiome abundance. Our integration approach has two parts: (1)
   sparse CCA^[329]26,[330]27 for identifying groups of host genes that
   associate with groups of gut microbial taxa to characterize
   pathway-level associations and (2) lasso penalized regression^[331]28
   for identifying specific associations between individual host genes and
   gut microbial taxa (Fig. [332]1a and Extended Data Fig. [333]1). We
   describe both methods in detail below. We applied our integration
   analysis to paired host gene expression data and gut microbiome data
   for each disease cohort separately to avoid any potential batch
   effects. For each disease cohort dataset, we conducted the integration
   analysis separately for the patient data (that is, CRC, IBD and IBS)
   and corresponding control data (non-CRC, non-IBD and non-IBS,
   respectively), and considered only associations that were found in
   patients but not in controls. We standardized and normalized the data
   in all host gene expression and microbiome datasets before the
   application of statistical methods to satisfy the distribution
   requirements of the statistical models.

Sparse CCA

   We used sparse CCA to identify group-level correlations between paired
   host gene expression and gut microbiome data in each disease cohort.
   CCA identifies linear projection of two sets of observations into
   shared latent space that maximizes correlation between the two
   datasets^[334]136. Sparse CCA is adapted from CCA for high-dimensional
   settings to incorporate feature selection by utilizing L1 or lasso
   penalty in CCA^[335]26. The objective function of sparse CCA can be
   expressed as follows:
   [MATH: <mrow><msub><mrow><mi
   mathvariant="normal">maximize</mi></mrow><mrow><mi>u</mi><mo>,</mo><mi>
   v</mi></mrow></msub><msup><mrow><mi>u</mi></mrow><mrow><mi>T</mi></mrow
   ></msup><msup><mrow><mi>X</mi></mrow><mrow><mi>T</mi></mrow></msup><mi>
   Y</mi><mi>v</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">subject</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">to</mi><mspace
   width="0.25em"></mspace><msup><mrow><mi>u</mi></mrow><mrow><mi>T</mi></
   mrow></msup><msup><mrow><mi>X</mi></mrow><mrow><mi>T</mi></mrow></msup>
   <mi>X</mi><mi>u</mi><mo>≤</mo><mn>1</mn><mo>,</mo><mspace
   width="0.25em"></mspace><msup><mrow><mi>v</mi></mrow><mrow><mi>T</mi></
   mrow></msup><msup><mrow><mi>Y</mi></mrow><mrow><mi>T</mi></mrow></msup>
   <mi>Y</mi><mi>v</mi><mo>≤</mo><mn>1</mn><mo>,</mo><mspace
   width="0.25em"></mspace><mo>∣</mo><mo>∣</mo><mi>u</mi><mo>∣</mo><msub><
   mrow><mo>∣</mo></mrow><mrow><mn>1</mn></mrow></msub><mo>≤</mo><msub><mr
   ow><mi>λ</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>,</mo><mo>∣</mo><
   mo>∣</mo><mi>v</mi><mo>∣</mo><msub><mrow><mo>∣</mo></mrow><mrow><mn>1</
   mn></mrow></msub><mo>≤</mo><msub><mrow><mi>λ</mi></mrow><mrow><mn>2</mn
   ></mrow></msub></mrow> :MATH]

   where, X and Y denote two data matrices with the same number of samples
   but different number of features (representing gut microbiome taxonomic
   composition data and host gene expression data, respectively); u and v
   are canonical loading vectors of X and Y, respectively; λ[1] and λ[2]
   control lasso penalties of u and v, respectively; Τ denotes the
   transpose of a matrix.

   For each disease cohort, we separately applied sparse CCA using the R
   (version 3.3.3) package ‘PMA’ (version 1.1), with gut microbiome
   taxonomic composition and host gene expression as two sets of variables
   to be correlated^[336]137. Below, we describe details on hyperparameter
   tuning, fitting sparse CCA models, enrichment analysis and
   visualization of sparse CCA output.

Hyperparameter tuning and fitting for sparse CCA model

   We performed hyperparameter tuning to identify the sparsity penalty
   parameters for gut microbiome abundance (λ[1]) and host gene expression
   (λ[2]) data. Since the permutation search provided in the PMA package
   only performs a one-dimensional search in the tuning parameter space,
   we implemented a grid-search approach using leave-one-out
   cross-validation in R (version 3.3.3) for hyperparameter tuning. We
   selected penalty parameters that had the highest correlation under
   cross-validation. Using this approach, we identified λ[1] as 0.15 and
   λ[2] as 0.2 for CRC data, λ[1] as 0.177 and λ[2] as 0.333 for IBD data,
   and λ[1] as 0.4 and λ[2] as 0.1 for IBS data.

   After identifying sparsity parameters, we fit the sparse CCA model to
   obtain subsets of correlated host genes and gut microbes, known as
   components. Each sparse CCA component includes non-zero weights (or
   canonical loadings) on gut microbes, and non-zero weights on a subset
   of host genes correlated with those gut microbes to capture joint
   variation in the two sets of observations. We computed the first 10
   sparse CCA components for each disease cohort, performing a separate
   computation for case and control samples. Sparse CCA components were
   computed iteratively, informed by previously computed components, thus
   resulting in uncorrelated components^[337]138.

Significance of correlation for sparse CCA components

   We computed the significance of each pair of canonical variables (or a
   component) using the leave-one-out cross-validation approach in R
   (version 3.3.3). For a given component, we first used the penalty
   parameters determined above to compute the sparse CCA output, with one
   sample held out. We then computed the scores for the held-out sample,
   that is, we computed
   [MATH: <mrow><mi
   mathvariant="normal">score</mi><msub><mrow><mi>X</mi></mrow><mrow><mi>i
   </mi></mrow></msub><mo>=</mo><msub><mrow><mi>X</mi></mrow><mrow><mi>i</
   mi></mrow></msub><msub><mrow><mi>u</mi></mrow><mrow><mo>−</mo><mi>i</mi
   ></mrow></msub></mrow> :MATH]
   and
   [MATH: <mrow><mi
   mathvariant="normal">score</mi><msub><mrow><mi>Y</mi></mrow><mrow><mi>i
   </mi></mrow></msub><mo>=</mo><msub><mrow><mi>Y</mi></mrow><mrow><mi>i</
   mi></mrow></msub><msub><mrow><mi>v</mi></mrow><mrow><mo>−</mo><mi>i</mi
   ></mrow></msub></mrow> :MATH]
   , where i is the held-out sample, X[i] and Y[i] denote the values for
   the ith sample of the input data matrices X and Y, and u[−i] and v[−i]
   are the canonical loadings estimated from the sparse CCA computation
   without the ith sample. We repeated this n times, where n is the total
   number of samples in the data, to obtain the vector of held-out scores.
   To assess the true strength of association and its significance, we
   used ‘cor.test()’ on the scores computed for the held-out samples. We
   corrected the P values for multiple hypothesis testing using the
   Benjamini-Hochberg (FDR) method within each disease cohort, and
   determined significant components at FDR < 0.1. Note that here we are
   testing for significance at the component level, that is, for a group
   of host genes correlated with a group of gut microbes, rather than the
   significance of the individual features selected, which depends on the
   level of sparsity penalization.

   Using this approach, we identified 7 significant components in CRC,
   with an average of 828 host genes and 8 gut microbes; 4 significant
   components in IBD, with an average of 2,095 host genes and 6 gut
   microbes; and 6 significant components in IBS, with an average of 577
   host genes and 61 gut microbes (FDR < 0.1, Supplementary Tables
   [338]2–[339]4).

Enrichment analysis for sparse CCA

   To characterize host pathways enriched for the set of host genes
   associated with microbes in each component, we implemented an
   enrichment analysis in R (version 3.3.3). We implemented Fisher’s exact
   test to assess pathway enrichment, where we used the set of host genes
   input to the sparse CCA analysis as background genes, and the set of
   host genes in a component as the genes of interest. We used the Kyoto
   Encyclopedia of Genes and Genomes (KEGG) and Pathway Interaction
   Database (PID) gene sets from the MsigDB canonical pathways
   collection^[340]139,[341]140. To avoid pathways that are too large to
   provide any specific biological insights or too small to provide
   adequate statistical power, we excluded any pathway with either (1)
   fewer than 25 genes, (2) more than 300 genes or (3) fewer than 5 genes
   that overlapped between the genes of interest and the pathway. We
   combined the set of enriched host pathways for all significant
   components for a given disease dataset, corrected for multiple
   hypothesis testing within each disease cohort using the
   Benjamini-Hochberg (FDR) approach and determined significant host
   pathways at FDR < 0.1. For a given pathway, we assigned the component
   that gave the highest significance for this pathway. This analysis was
   performed separately for case and control data for each disease. To
   identify case-specific host pathways, we used a two-part approach: (1)
   we identified pathways that were only significantly enriched in cases
   (FDR < 0.1) but not in controls and (2) we identified pathways that
   were significant in both the cases and controls at FDR < 0.1, and
   performed differential enrichment analysis for pathways in cases versus
   controls as described below. We retained pathways that were
   significantly enriched at FDR < 0.1 and differentially enriched in
   cases versus controls at FDR < 0.2. Finally, we combined the pathways
   from parts (1) and (2) to obtain case-specific pathways at FDR < 0.1.

Differential enrichment analysis of pathways

   We performed differential enrichment of pathways in cases versus
   controls by implementing a comparative log odds-ratio approach in
   R^[342]141,[343]142. To do so, we first computed the z-score for the
   odds ratio for the ith pathway in cases:
   [MATH: <mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi
   mathvariant="normal">case</mi></mrow></msub><mo>=</mo><mi
   mathvariant="normal">log</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ</m
   i></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>/</m
   o><mi
   mathvariant="normal">SE</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ</mi
   ></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>,</mo
   ></mrow> :MATH]

   where δ[i] is the odds-ratio for the ith pathway in cases, and
   [MATH: <mrow><mi mathvariant="normal">SE</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mi>δ</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub></mrow></mfenced></mrow> :MATH]
   is the standard error for the ith pathway in cases, which is computed
   using the four elements, n[1] to n[4], of the 2 × 2 contingency table
   used in the enrichment analysis for the ith pathway as follows:
   [MATH: <mrow><mi
   mathvariant="normal">SE</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ</mi
   ></mrow><mrow><mi>i</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>=</mo
   ><msqrt><mrow><mn>1</mn><mo>/</mo><msub><mrow><mi>n</mi></mrow><mrow><m
   n>1</mn></mrow></msub><mo>+</mo><mn>1</mn><mo>/</mo><msub><mrow><mi>n</
   mi></mrow><mrow><mn>2</mn></mrow></msub><mo>+</mo><mn>1</mn><mo>/</mo><
   msub><mrow><mi>n</mi></mrow><mrow><mn>3</mn></mrow></msub><mo>+</mo><mn
   >1</mn><mo>/</mo><msub><mrow><mi>n</mi></mrow><mrow><mn>4</mn></mrow></
   msub></mrow></msqrt><mo>.</mo></mrow> :MATH]

   Similarly, we computed
   [MATH: <msub><mrow><mi>z</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi
   mathvariant="normal">ctrl</mi></mrow></msub> :MATH]
   for the same pathway in the controls. Next, we computed a comparative
   log odds-ratio for the ith pathway overlapping between cases and
   controls as follows:
   [MATH:
   <mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>i</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi mathvariant="normal">case</mi><mo>−</mo><mi
   mathvariant="normal">ctrl</mi></mrow></msub><mo>=</mo><mfrac><mrow><mi
   mathvariant="normal">log</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ</m
   i></mrow><mrow><mi>i</mi><mo>,</mo><mi
   mathvariant="normal">case</mi></mrow></msub></mrow><mo>)</mo></mrow><mo
   >−</mo><mi
   mathvariant="normal">log</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ</m
   i></mrow><mrow><mi>i</mi><mo>,</mo><mi
   mathvariant="normal">ctrl</mi></mrow></msub></mrow><mo>)</mo></mrow></m
   row><mrow><mi
   mathvariant="normal">SE</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>δ</mi
   ></mrow><mrow><mi>i</mi><mo>,</mo><mi
   mathvariant="normal">case,ctrl</mi></mrow></msub></mrow><mo>)</mo></mro
   w></mrow></mfrac><mo>.</mo></mrow> :MATH]

   The greater the value of
   [MATH: <msub><mrow><mi>z</mi></mrow><mrow><mi>i</mi><mo>,</mo><mi
   mathvariant="normal">case</mi><mo>−</mo><mi
   mathvariant="normal">ctrl</mi></mrow></msub> :MATH]
   , the greater the odds that a pathway is differentially enriched in
   case versus control than by chance. P values were inferred assuming
   normal approximations, and corrected for multiple hypothesis testing
   using the Benjamini-Hochberg (FDR) approach.

Visualizing disease-specific and shared host pathways and components from
sparse CCA

   To determine ‘shared’ host pathways (that is, host pathways for which
   gene expression correlates with gut microbes across all three disease
   cohorts) and ‘disease-specific’ host pathways (that is, host pathways
   for which gene expression correlates with gut microbes in only one of
   the three disease cohorts), we computed overlaps between significant
   case-specific host pathways determined above across the three disease
   cohorts. Given the overlap across the curated gene sets from MsigDB, we
   controlled for redundancy across pathways for visualization purposes.
   To do this, we identified similar pathways on the basis of their
   relative overlap in terms of the set of genes using an overlap
   coefficient. The overlap coefficient between two pathways is defined as
   the number of common genes between the pathways divided by the number
   of genes in the pathway with fewer genes. Specifically, the overlap
   coefficient is represented as follows:
   [MATH: <mrow><mi
   mathvariant="normal">overlap</mi><mrow><mo>(</mo><mrow><mi>X</mi><mo>,<
   /mo><mi>Y</mi></mrow><mo>)</mo></mrow><mo>=</mo><mfrac><mrow><mfenced
   close="∣"
   open="∣"><mrow><mi>X</mi><mo>∩</mo><mi>Y</mi></mrow></mfenced></mrow><m
   row><mi mathvariant="normal">min</mi><mfenced
   close="∣"><mrow><mrow><mo>(</mo><mrow><mo>∣</mo><mi>X</mi><mo>∣</mo><mo
   >,</mo><mo>∣</mo><mi>Y</mi><mo>∣</mo></mrow><mo>)</mo></mrow></mrow></m
   fenced></mrow></mfrac></mrow> :MATH]

   For the top 15 most significant host pathways (FDR < 0.1) discovered
   for each shared and disease-specific set (Supplementary Table [344]18),
   we computed pairwise similarity between pathways as overlap
   coefficients and used a maximum allowed similarity score of 0.5 as a
   cut-off. Using the pairs of pathways that satisfied the cut-off, we
   computed the connected components to identify clusters of overlapping
   pathways. For visualization purposes, we selected a representative
   pathway from each connected component, prioritizing the pathway with
   the highest number of genes (Fig. [345]2a and Supplementary Table
   [346]4). We visualized host pathway enrichment using the R package
   ‘ggplot2’ (version 3.2.1).

   For visualizing components corresponding to selected host pathways or
   common host genes across diseases (Fig. [347]2b–d), we ordered host
   genes and taxa by their absolute coefficients in the component, and
   selected the top 10 host genes and taxa for representation. If multiple
   taxa originating from the same lineage occurred in a component, we
   selected the one with the highest coefficient to reduce redundancy,
   thus representing the taxa with the largest contribution from a given
   lineage. The size of host genes and gut microbial taxa were scaled by
   the absolute value of their corresponding coefficients in a given
   component. All sparse CCA components were visualized using Cytoscape
   (version 3.5.1)^[348]143.

Lasso regression analysis

   We used lasso penalized regression to identify specific associations
   between individual host genes and gut microbial taxa within each
   disease cohort^[349]28. We implemented a gene-wise model using
   expression for each host gene as response and abundances of microbiome
   taxa as predictors, to identify microbial taxa that are correlated with
   a host gene. An ordinary least squares (OLS) regression is not suitable
   for this task since OLS results in unstable solutions under
   high-dimensional settings or when
   [MATH: <mrow><mi>p</mi><mo>></mo><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mo>></mo><mi>n</mi></mrow> :MATH]
   , that is, the number of predictors p is much higher than the number of
   samples n. Additionally, we expected the abundance of only a few
   microbial taxa to correlate with the expression of each host gene. To
   address this, we used lasso regression, which is similar to
   multivariate OLS, except that it uses shrinkage or regularization to
   perform variable selection, thus picking only a few taxa that associate
   with a host gene’s expression.

   To account for other factors that can impact host gene expression or
   microbiome composition, each model also included covariates in the
   predictor matrix (that is, microbiome abundance table) for gender (male
   or female), disease subtype for IBD (Crohn’s disease or ulcerative
   colitis), disease-subtype for IBS (constipation (IBS-C) or diarrhoea
   (IBS-D)).

   The lasso model estimates the lasso regression coefficient
   [MATH: <mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   by minimizing the following:
   [MATH: <mrow><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>n</mi></mrow></munderover><msup><mrow><mfenced
   close=")"
   open="("><mrow><msub><mrow><mi>y</mi></mrow><mrow><mi>i</mi></mrow></ms
   ub><mo>−</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></mrow></msub
   ><mo>−</mo><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>p</mi></mrow></munderover><msub><mrow><mi>β</m
   i></mrow><mrow><mi>j</mi></mrow></msub><msub><mrow><mi>x</mi></mrow><mr
   ow><mi>i</mi><mi>j</mi></mrow></msub></mrow></mfenced></mrow><mrow><mn>
   2</mn></mrow></msup><mo>+</mo><mi>λ</mi><munderover accent="false"
   accentunder="false"><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><m
   n>1</mn></mrow><mrow><mi>p</mi></mrow></munderover><mfenced close="∣"
   open="∣"><mrow><mi>β</mi></mrow></mfenced><mo>,</mo></mrow> :MATH]

   where n is the number of samples, p is the number of predictors (taxa
   and other covariates),
   [MATH:
   <mrow><mn>1</mn><mo>≤</mo><mi>i</mi><mo>≤</mo><mi>n</mi><mo>,</mo><mspa
   ce
   width="0.25em"></mspace><mn>1</mn><mo>≤</mo><mi>j</mi><mo>≤</mo><mi>p</
   mi></mrow> :MATH]
   , y is the response (host gene expression), x is the predictor (taxa
   abundance and other covariates) and λ is the tuning parameter,
   [MATH: <mrow><mi>λ</mi><mo>≥</mo><mn>0</mn></mrow> :MATH]
   .

   In addition to minimizing the residual sum of squares (first term in
   the equation), lasso minimizes the l[1] norm of the coefficients
   (second term in the equation), which has an effect of forcing some of
   the coefficients to zero as the value of the tuning parameter λ
   increases. Thus, lasso performs feature or variable selection that
   leads to sparse models.

   We implemented a lasso regression framework using the R (version 3.3.3)
   package glmnet (version 2.0-13), which uses cyclical coordinate descent
   to compute a regularization path^[350]144. Our framework executes a
   lasso regression for each host gene’s expression as response and
   abundances of microbial taxa and values of other covariates as
   predictors. We used leave-one-out cross-validation to estimate the
   tuning parameter λ, which was used to fit the final model on a given
   disease dataset.

   We then performed inference for the lasso model using a regularized
   projection approach known as desparsified lasso. The desparsified lasso
   uses the asymptotic normality of a bias-corrected version of the lasso
   estimator to obtain 95% confidence intervals and P values for the
   coefficient of each predictor (microbe) associated with a given host
   gene^[351]145. We used the R package ‘hdi’ (version 0.1-7) that
   implements the desparsified lasso approach for estimation of confidence
   intervals and hypothesis testing in high-dimensional and sparse
   settings^[352]145,[353]146. We then corrected for multiple hypothesis
   testing using the Benjamini-Hochberg (FDR) method.

Stability selection for the lasso model

   Since the lasso model is sensitive to small variations of the predictor
   variable, we used stability selection to pick out robust microbes
   associated with a host gene^[354]122. Stability selection is a
   resampling-based method that can be combined with different variable
   selection procedures in high-dimensional settings, including lasso.
   Briefly, stability selection with lasso proceeded as follows:

   Step 1. Select a random subset of the data.

   Step 2. Fit the lasso model with a randomly perturbed penalty term in
   the neighbourhood of the ‘best’ penalty λ. Record the set of selected
   variables (microbes).

   Step 3. Repeat steps (1) and (2) K times.

   Step 4. Compute the frequency of selection f[i] per variable (microbe)
   across all trials.

   Step 5. Select the variables (microbes) that are selected with a
   frequency of at least f[thr], a pre-specified threshold value. Thus, we
   selected a set of stable variables (microbes) such that
   [MATH:
   <mrow><msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>≥<
   /mo><msub><mrow><mi>f</mi></mrow><mrow><mi
   mathvariant="normal">thr</mi></mrow></msub></mrow> :MATH]
   .

   The overall idea is that, if the same variables (microbes) are
   repeatedly selected when the parameters are perturbed, then they are
   robust variables. Stability selection also controls for family-wise
   error rate, thus controlling for false positives in addition to the FDR
   approach mentioned above^[355]122. In our analysis, we used the R
   package ‘stabs’ (version 0.6-3) to perform stability
   selection^[356]147. Specifically, we used the following parameters in
   the process described above: in Step 1, a random subset of size n/2 of
   data was selected, where n is total number of samples; in Step 3,
   K = 100; and in Step 5, f[thr] = 0.6, that is, a predictor (microbe)
   selected in at least 60% of the fitted models is considered stable. The
   choice of these parameters are in accordance with the proposal of
   stability selection by Meinshausen & Bühlmann^[357]122.

   Finally, we performed an intersection between associations identified
   by stability selection here and associations identified at FDR < 0.1 by
   the lasso model described above. We filtered out any significant and
   stability-selected host gene–gender and host gene–disease subtype
   (applicable to IBD and IBS) associations from the output to retain
   significant and stability-selected host gene–microbe associations at
   FDR < 0.1.

Parallel execution of lasso analysis on supercomputing nodes

   We leveraged supercomputing nodes to implement a parallel processing
   framework to allow scalable computation for the high-dimensional
   datasets. We implemented a parallel framework for executing the
   gene-wise lasso analysis, where we parallelized execution of lasso
   models on host genes across multiple nodes and cores on a compute
   cluster from Minnesota Supercomputing Institute. Our framework scalably
   computes gene-wise models for over 12,000 host genes, where each model
   includes hundreds of microbial taxa as covariates. We used job arrays
   to parallelize our analysis on multiple nodes on the cluster.
   Additionally, we used the R packages ‘doParallel’ (version 1.0.15) and
   ‘foreach’ (version 1.4.7) to run parallel processes on multiple cores
   of each compute node. Our parallel processing framework using 5 compute
   nodes took an average of 5 h 30 min per disease cohort.

Enrichment analysis for lasso output

   To characterize biological functions for the host genes that were found
   to be associated with specific gut microbes in a disease cohort by the
   lasso framework, we implemented an enrichment analysis in R (version
   3.3.3) using Fisher’s exact test. We used the set of expressed genes
   input to the lasso analysis as the background genes, and the set of
   host genes associated with gut microbes in patient samples as genes of
   interest. We used the KEGG, PID and REACTOME gene sets from the MsigDB
   canonical pathways collection^[358]139,[359]140. To avoid pathways that
   were too large to provide any specific biological insights or too small
   to provide adequate statistical power, we excluded from our analysis
   any pathways with more than 85 genes, fewer than 10 genes, or fewer
   than 5 genes that overlapped between the pathway and the genes of
   interest. Out of 1,881 host pathways, these criteria filtered out 297
   pathways for being too large, 299 pathways for being too small and an
   average of 1,186 pathways for not having sufficient overlap with the
   genes of interest, resulting in an average of 99 pathways that were
   tested for enrichment in each disease cohort. We are aware that this
   approach may filter out some potential pathways of interest from our
   analysis. However, even if included, these pathways are unlikely to
   yield significant associations due to lack of statistical power. The P
   values obtained from Fisher’s exact test were adjusted for multiple
   testing using the Benjamini-Hochberg (FDR) approach.

Comparing case versus control associations and pathways

   Using lasso regression and stability selection, we identified
   associations for cases and controls with non-zero lasso regression
   coefficients for each disease cohort. To identify associations that
   were found only in cases but not in controls within a disease cohort,
   we checked for any potential overlap between case and control
   associations, without subsetting associations using any P value or FDR
   cut-off. We found no overlapping host gene–microbe associations between
   cases and controls in any disease cohort, which could be driven by
   underlying biological differences between case and control conditions
   within each cohort. In addition, downsampling the cases to match the
   controls also did not yield any overlapping associations. Thus, we
   conclude that the designation of host gene–microbe associations as
   specific to cases and controls is robust to the significance cut-off
   and sample size differences in our study cohort.

   Another potential approach to identify case-specific associations would
   be to use an interaction term between the independent variable (that
   is, microbial taxa) and disease status in the lasso model, and
   determine associations that are significant in cases but not in
   controls. Incorporating such an interaction within the lasso framework
   is not straightforward, and various approaches have been proposed (for
   an overview, see Lim & Hastie^[360]148). However, these approaches do
   not explicitly allow for different sparsity structures in cases and
   controls (that is, where an effect is present for cases but not for
   controls, or present for controls but not for cases), which is crucial
   to our interpretation and subsequent analyses. Moreover, this approach
   is computationally challenging for our implementation for the following
   reason: for each disease cohort, we fit over 12,300 gene-wise lasso
   models on average, where each model uses the expression for a host gene
   as response and the abundance of about 200 microbial taxa on average as
   predictors/independent variables. Adding an interaction term between
   each predictor (that is, microbial taxon) and disease status will lead
   to doubling the number of predictors in each gene’s model, resulting in
   about 400 predictors per model, and including over
   12,300 × 200 = 2,460,000 additional terms for assessment of model fit
   per disease cohort. Hence, we did not include interaction terms in our
   model.

   Using enrichment analysis for host genes associated with specific gut
   microbes, we identified host pathways for cases and controls in each
   disease cohort. To account for case versus control comparison at the
   pathway level, we performed the following analysis: (1) from the
   results of our enrichment analysis, we retained the set of host
   pathways that were tested for enrichment in both cases and controls
   within each disease cohort without using any P value or FDR cut-off and
   (2) we tested for differential enrichment of these pathways between
   case and control groups using a comparative log odds-ratio approach
   (described above). For IBD and IBS cohorts, we found no pathways
   enriched in both case and control groups in step (1) above, implying
   that host pathways enriched in cases are indeed case-specific in these
   cohorts. In CRC, we found two host pathways that were common in both
   cases and controls in step (1); however, they were not differentially
   enriched in cases versus controls at FDR < 0.1 (step 2), implying that
   they were not necessarily specific to cases. Hence, we filtered out
   these two pathways from consideration for case-specific pathways.

   In addition to accounting for overlaps between cases and controls at
   the association level and at the pathway level, we also accounted for
   any overlaps between host genes that were associated with microbes in
   cases and controls. We only used case-specific host genes, that is,
   host genes that were found to be associated with microbes in cases but
   not in controls, to perform enrichment analysis to determine
   case-specific pathways in each disease cohort. Using this approach, we
   identified 18 host pathways that are unique to each disease, including
   4 CRC-specific, 9 IBD-specific and 5 IBS-specific pathways that
   associate with unique gut bacteria (FDR < 0.1, Supplementary Table
   [361]9).

   To identify any loss of function in disease (that is, associations that
   are found in controls but not in cases), we performed analysis at
   pathway level to determine functional trends in control associations
   compared to case associations. To do so, we first determined any host
   pathways that were found enriched only in controls but not in cases at
   FDR < 0.1. In IBD and IBS, no pathways were found to be enriched in
   controls at FDR < 0.1, so we could not identify any functional trends
   in control associations for these two disease cohorts. In CRC, top 10
   control-specific pathways at FDR < 0.1 are related to transcriptional
   regulation, rRNA expression, DNA methylation and other such general
   cellular function categories that did not provide any useful insight on
   loss of function in disease. Additionally, as mentioned above, we did
   not find any pathways that were differentially enriched in controls vs
   cases in any disease cohort. Hence, given that we did not find any
   specific functional trends in the control associations in CRC, IBD and
   IBS cohorts, we only focused on case-specific associations and pathways
   across diseases in this study.

Identification and visualization of taxa and genes that are shared or
distinct across associations in diseases

   To visualize association patterns for gut microbes and host genes that
   are shared between associations across diseases, we examined common
   microbes/genes between association-pairs across disease cohorts. Given
   the difference between the number of host genes and gut microbial taxa
   identified across diseases, we first determined overlaps in host genes
   and microbial taxa in input datasets and gene–taxa associations
   identified across disease cohorts. To do this, we used a two-step
   process:
    1. To examine common and distinct features in input datasets, we
       considered the host genes and taxa that were used as input to the
       lasso integration pipeline, and calculated the overlap in these
       host genes and taxa across cohorts (Supplementary Tables [362]19
       and [363]20). We computed the pairwise overlap as an overlap
       coefficient, which is a measure of similarity between two sets and
       is defined here as the number of common genes (or taxa) between the
       two disease datasets divided by the number of genes (or taxa) in
       the dataset with fewer genes (or taxa). Overall, we found that an
       average of 84% of host genes are commonly expressed between
       diseases (Supplementary Table [364]19), while the remaining 16% of
       host genes might be expressed in one disease cohort, but not in the
       other two. On the other hand, only about 37% of input gut microbial
       taxa overlapped between diseases (Supplementary Table [365]20),
       indicating that a majority of taxa were specific to each disease
       cohort. This is consistent with previous research that have shown
       high dissimilarity between disease-associated microbial
       communities^[366]132,[367]149.
    2. Next, we calculated the overlap between the set of host genes found
       associated with gut microbes only in one of the disease cohorts
       (that is, disease-specific genes in our identified associations)
       and the set of host genes used as input in the other two diseases
       (Supplementary Table [368]21). Similarly, we calculated the overlap
       between gut microbial taxa found associated with host genes only in
       one disease (that is, disease-specific taxa in our identified
       associations) and the set of taxa used as input in the other two
       diseases (Supplementary Table [369]22). Overall, we found similar
       trends as described above. We found that an average of 80% of
       disease-specific host genes in the identified associations were
       also included as input in the other two diseases (Supplementary
       Table [370]21). The remaining 20% of disease-specific host genes
       were probably not expressed in the other two conditions. On the
       contrary, we found that on average, only 12% of disease-specific
       taxa in the identified associations were included as input taxa in
       the other two diseases (Supplementary Table [371]22). This is in
       line with our observation of disease-specific trends for gut
       microbial taxa used as input to the integration pipeline, as
       described above.

   Given these patterns, we considered all the host genes and taxa that
   were identified after preprocessing the input dataset, which allowed us
   to identify disease-specific as well as overlapping host gene/taxa
   between associations across diseases.

   To visualize associations for gut microbes that are associated with
   host gene expression in multiple diseases (Fig. [372]4a), we identified
   common microbes between all possible overlaps between diseases (Fig.
   [373]4a, Networks 1–4), host genes that associate with these common
   microbes in each disease, and all associations involving these microbes
   and genes in each disease (FDR < 0.1). Next, we grouped gene–taxa
   associations identified per disease by shared taxa, sorted them by FDR
   value and picked the top gene–taxa association per shared taxon until
   we obtained at most 10 associations per disease (FDR < 0.1,
   Supplementary Table [374]10).

   Similarly, for visualizing associations for host genes that are
   associated with gut microbes in multiple diseases (Fig. [375]4b), we
   identified shared host genes between all possible overlaps between
   diseases (Fig. [376]4b, Networks 1–4), and host gene–taxa associations
   per disease for these host genes shared across diseases. We sorted the
   associations by FDR-adjusted p-values, i.e. q-values (ordered first by
   q-value in CRC associations, followed by q-value in IBD associations
   and finally, by q-value in IBS associations, depending on the
   overlapping set under consideration). We picked the top 10 genes from
   this merged output and identified at most the top 10 associations
   involving these genes in each disease for the overlapping set under
   consideration (FDR < 0.1, Supplementary Table [377]11). Since lasso
   gives biased estimates of the coefficients, we used Spearman
   correlation coefficient (rho) to depict strength of association for
   visualizing host gene–taxa associations. All the associations in Fig.
   [378]4 were visualized using Cytoscape v3.5.1, where shared features
   are in grey and disease-specific features are in disease-specific
   colours^[379]143.

Notes on our approach and comparison to previous studies

   While the disease cohorts used in our study have been previously
   published^[380]3,[381]8,[382]25,[383]28, to the best of our knowledge,
   our study appears to be the first to perform a comprehensive
   characterization of host gene–microbiome crosstalk within and across
   these disease cohorts. In the previous study that described the CRC
   cohort, Burns et al.^[384]3 characterized the tumour-associated
   microbiome, and how it varies in composition compared to the microbiome
   of adjacent matched normal colon tissue. For example, they reported
   loss in abundance of multiple taxa within the order Bacteroidales in
   tumour-associated microbiota compared with normal samples. Here we
   found that Bacteroidales is associated with host genes CCR2 and FPR1,
   which are part of the tumour-associated IL-10 signalling pathway. The
   previous study that described the IBD cohort compared IBD versus
   non-IBD samples, and found differences in transcriptional activity and
   abundance among taxa belonging to class Clostridia, and dysregulation
   of immune-related host pathways in disease state^[385]25,[386]28. Our
   integrative analysis for the same cohort revealed associations between
   immunoinflammatory pathways and members within class Clostridia,
   including Peptostreptococcaceae and Clostridium sensu stricto 1. In the
   original study describing the IBS cohort, Mars et al.^[387]8 found
   overrepresentation of Streptococcus species in patients with IBS
   compared with healthy individuals, and identified associations between
   faecal microbes, such as Peptostreptococcaceae, with host genes
   implicated in peptidoglycan binding. Our analyses revealed several
   important associations between closely interacting tissue-adherent
   microbiome and host genes and pathways in IBS, including associations
   between Streptococcus and host genes that modulate macrophage
   inflammatory response, and between Peptostreptococcaceae and host
   pathways that regulate intestinal homoeostasis and inflammation.

   An important contribution of our work is a machine learning-based
   integrative framework for characterization of host gene–microbe
   associations across human diseases. Although few recent studies have
   investigated associations between host transcriptome and gut microbiome
   in human gut disorders, our analysis uses a unique analytical technique
   that has several advantages^[388]21–[389]24. First, as opposed to
   analyses that rely on calculating pairwise correlations between
   features (for example, Dayama et al.^[390]24), our approach does not
   require restricting the data to a predetermined subset of taxa or genes
   of interest. In addition, compared to Procrustes analysis, which is
   commonly used for finding overall correspondence between paired
   datasets, our approach does not only detect overall association, but
   can also find specific associations between gut microbial taxa and host
   genes (using lasso) and pathways (using sparse CCA), shedding light on
   potential biological mechanisms of association. Furthermore, our
   approach can be applied to other types of multi-omic dataset, including
   microbial metabolomic and metagenomic data^[391]8. Lastly, our analysis
   incorporates data from several diseases, identifying commonalities
   across conditions as well as disease-specific patterns.

   Our study uncovered key insights at the systems level; for example, we
   found that gut microbes that have been associated with all three
   diseases, such as Streptococcus, associate with different host genes in
   each disease, suggesting that the same microbial taxon can contribute
   to different health outcomes by potentially regulating the expression
   of different host genes in the colon. We also identified numerous
   specific hypotheses in the form of disease-specific associations; for
   example, we found that: Bacteroidales is associated with host genes
   CCR2 and FPR1 in the IL-10 signalling host pathway in colorectal
   cancer; Peptostreptococcaceae is associated with MAPK3 and VIPR1 that
   are part of G protein-coupled receptors pathways in inflammatory bowel
   disease; and Bacteroides massiliensis is associated with the host gene
   PLA2G4A, a member of the prostaglandin biosynthesis pathway, in
   irritable bowel syndrome.

Ethics statement

   For the colorectal cancer cohort, all research conformed to the
   Helsinki Declaration and was approved by the University of Minnesota
   Institutional Review Board, protocol 1310E44403. For the inflammatory
   bowel disease and irritable bowel syndrome cohorts, ethical approval is
   described in their respective publications^[392]8,[393]22,[394]25.

Reporting Summary

   Further information on research design is available in the [395]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [396]Reporting Summary^ (2.1MB, pdf)
   [397]Supplementary Tables^ (21.6MB, zip)

   Supplementary tables.
   [398]Peer Review File^ (2.6MB, pdf)

Acknowledgements