Abstract

   The commensal microbiome is known to influence a variety of host
   phenotypes. Microbiome profiling followed by differential abundance
   analysis has been established as an effective approach to study the
   mechanisms of host-microbiome interactions. However, it is challenging
   to interpret the collective functions of the resultant microbe-sets due
   to the lack of well-organized functional characterization of the
   commensal microbiome. We developed microbe-set enrichment analysis (MSEA) to
   enable the functional interpretation of microbe-sets by examining the
   statistical significance of their overlaps with annotated groups of
   microbes that share common attributes such as biological function or
   phylogenetic similarity. We then constructed microbe-set libraries by
   querying PubMed to find microbe-mammalian gene associations and by
   parsing the Disbiome database to find microbe-disease associations. To
   demonstrate the utility of MSEA, we carried out three case studies
   using publicly available curated knowledge resources and microbiome
   profiling datasets focusing on human diseases. We found that MSEA not
   only yields findings consistent with the original studies, but also
   recovers insights about disease mechanisms that are supported by the
   literature. Overall, MSEA is a useful knowledge-based computational
   approach to interpret the functions of microbes, which can be
   integrated with microbiome profiling pipelines to help reveal the
   underlying mechanism of host-microbiome interactions.

   Subject terms: Data integration, Data mining, Literature mining

Introduction

   With advances in sequencing technology and growing interest in human
   microbiota, microbiome profiling datasets are accumulating rapidly.
   Standard microbiome data analysis pipelines primarily aim to identify
   individual microbial taxa, or microbial communities with differential
   abundance between healthy and diseased hosts. Then, genomic and/or
   metabolic strategies are used to characterize individual microbial taxa
   to help interpret their mechanisms in the pathogenesis of many complex
   human diseases^[32]1,[33]2. The host-microbiome interactions are
   conveyed either by alteration of sets of microbes or by their
   collective functions.

   Microbes are able to affect host phenotypes through modulation of gene
   expression^[34]3 or cell signaling in relevant host cells/tissues.
   However, the regulatory mechanisms of how microbiomes influence host
   physiology are not clear. Some studies demonstrated such
   host-microbiome interactions could be achieved via microbial
   metabolites. For instance, the host immune system has been shown to be
   modulated by the gut microbiome via microbial metabolites^[35]4,[36]5.
   As a component of the Human Functional Genomics Project (HFGP),
   Schirmer et al.^[37]4 found correlation between gut microbial features
   and production of various types of cytokines in a cohort of 500 healthy
   adults from the Netherlands. Next, they experimentally validated that
   two microbial metabolites, tryptophol and palmitoleic acid, are able to
   modulate the production of IFNγ and TNFα, respectively, in peripheral
   blood mononuclear cells. An in-depth investigation^[38]5 identified a
   microbe-derived metabolite, ascorbate, as a selective inhibitor of
   activated CD4+ effector T cells, including IL-17A-, IL-4-, and
   IFNγ-producing cells. However, these mechanistic studies are
   resource-intensive and often prone to empirical biases.

   Although knowledge about the differential abundance of human microbiome
   species between healthy and diseased states is accumulating with the
   surge of microbiome profiling studies, our understanding of how the
   microbiome influences human phenotypes is still limited because of
   the complexity of host-microbe interactions. There is an urgent need
   for bioinformatics tools that leverage curated and structured knowledge
   to guide experimental studies. Functional enrichment analysis for
   gene-centric data, such as transcriptomics and proteomics, helps
   interpret sets of differentially expressed genes through prior
   knowledge about gene functions^[39]6. Similarly, it would be enormously
   useful to organize the knowledge about the effects of microbes on the
   host to aid the functional interpretation of microbiome
   datasets/signatures. Microbes can be grouped into microbe-sets based on
   shared attributes. Themed collections of such microbe-sets can be
   organized into a microbe-set library as a representation of knowledge.
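
   As a minimal sketch of this representation, a microbe-set library can
   be modeled as a mapping from an attribute term (for example a gene or
   a disease) to the set of microbes that share it. The function name
   build_library and the example pairs below are hypothetical and serve
   only as an illustration; they are not taken from the msea package.

```python
from collections import defaultdict


def build_library(associations):
    """Group microbes into sets keyed by a shared attribute (term).

    `associations` is an iterable of (microbe, term) pairs.
    """
    library = defaultdict(set)
    for microbe, term in associations:
        library[term].add(microbe)
    return dict(library)


# Hypothetical (microbe, gene) pairs, for illustration only.
pairs = [
    ("Escherichia coli", "TNF"),
    ("Helicobacter pylori", "TNF"),
    ("Helicobacter pylori", "IL10"),
    ("Lactobacillus", "IL10"),
]
library = build_library(pairs)
for term in sorted(library):
    print(term, sorted(library[term]))
```

   Keeping each microbe-set as a Python set makes the overlap with an
   input microbe-set a single intersection operation.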

   Recently, an increasing number of such resources have been established.
   Disbiome^[40]7 emerged as the first database cataloging microbial
   composition differences in diseases; it covers 190 human diseases and
   800 microbial organisms across 674 published studies. There are also
   databases categorizing microbes based on genomic^[41]8, protein
   family^[42]9 and taxonomic information^[43]10. In addition, the
   research community established databases documenting different
   functional aspects of microbes including pathogenesis (e.g.
   EuPathDB^[44]11), transport and metabolism (e.g. TCDB^[45]12) and
   signal transduction and gene regulation (e.g. MiST^[46]13). These
   databases are valuable for deciphering the molecular mechanisms of how
   microbes influence host phenotypes. However, the cumulative knowledge
   on mechanistic studies of microbes and diseases is often scattered
   across the literature.

   In this study, we developed microbe-set enrichment analysis (MSEA), a
   novel computational approach for interpreting microbe-sets using themed
   collections of functionally annotated microbe-sets representing prior
   knowledge. We demonstrated the utility of the MSEA methodology by
   carrying out three case studies using publicly available curated
   knowledge resources and microbiome profiling datasets focusing on human
   diseases. We found that MSEA not only yields findings consistent with
   the original studies, but also uncovers insights about disease
   mechanisms that are supported by the literature. To disseminate our
   method to the microbiome research community, we developed a Python
   package “msea” to enable investigators to adopt this analytical
   approach (available at [47]https://pypi.org/project/msea/).

Results

Construction of a microbe-set library from PubMed literature

   Enrichment analysis is designed to infer the collective functions for a
   set of microbes instead of individual ones by identifying microbe-sets
   sharing common attributes with the input microbe-set. To perform MSEA,
   we first created a microbe-set library from PubMed literature as the
   background knowledge representation (Fig. [48]1). Since we aim to study
   the host-microbiome interactions to investigate how gut microbial
   organisms affect host phenotypes via the expression of host genes, we
   grouped microbes based on their literature-documented associations with
   mammalian genes. The microbe-gene associations were defined as
   significant co-occurrence across millions of PubMed abstracts. To
   create this comprehensive collection of literature-based microbe-gene
   associations, we first parsed the taxonomy information from
   Greengenes^[49]14 to get 1085 microbial genus and species names across
   the Bacteria and Archaea kingdoms. These names were then used as search
   terms to query PubMed via Geneshot^[50]15. Of the 1085 queries, 970
   microbial names returned at least one PubMed hit, for a total of
   978,217 PubMed abstract hits. Within these abstracts, mentions of 8865
   distinct mammalian genes were recognized and mapped to HUGO Gene
   Nomenclature Committee (HGNC) gene symbols by the named-entity
   recognition (NER) tool Tagger^[51]16. We next computed the Jaccard index
   to quantify the strength of association between each microbe and
   mammalian gene, and filtered out weak associations that could have been
   observed by chance. The filtering led to 42,944 associations covering
   752 microbes and 2045 mammalian genes.
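
   The filtering step can be sketched as follows, assuming that each
   microbe and each gene is represented by the set of PubMed IDs of the
   abstracts that mention it, and that the Jaccard index is computed on
   those two sets. The PubMed IDs and the cutoff value below are made up
   for illustration; the cutoff actually used is not stated in this
   section.

```python
def jaccard_index(a, b):
    """Jaccard index |A & B| / |A | B| of two collections of IDs."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_associations(microbe_pmids, gene_pmids, threshold):
    """Keep (microbe, gene) pairs whose Jaccard index reaches `threshold`."""
    kept = {}
    for microbe, m_ids in microbe_pmids.items():
        for gene, g_ids in gene_pmids.items():
            score = jaccard_index(m_ids, g_ids)
            if score >= threshold:
                kept[(microbe, gene)] = score
    return kept


# Hypothetical PubMed ID sets, for illustration only.
microbe_pmids = {
    "Helicobacter pylori": {1, 2, 3, 4},
    "Bacillus": {5, 6},
}
gene_pmids = {
    "TNF": {2, 3, 4, 7},
    "GAPDH": {6, 8, 9, 10, 11},
}

# The cutoff of 0.5 is an arbitrary value chosen for this toy example.
kept = filter_associations(microbe_pmids, gene_pmids, threshold=0.5)
print(kept)
```

   In this toy example only the Helicobacter pylori and TNF pair survives,
   because the single abstract shared by Bacillus and GAPDH is small
   relative to the union of their abstracts.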

Figure 1.

   Chart showing the workflow of the construction of microbe-set library
   and application of MSEA.

   As expected, mammalian genes that are most frequently associated with
   microbial entities are related to immunity and inflammatory responses,
   such as genes encoding cytokines including TNF, IL10 and IL6, as well
   as genes involved in innate immune responses such as Toll-like
   receptors (TLRs) and innate immune signal transduction adaptor MYD88
   (Table [54]1).

Table 1.

   Mammalian genes with most microbe-gene associations from PubMed
   literature.
Mammalian gene                                         HGNC symbol Microbe count
Tumor Necrosis Factor                                  TNF         401
Interleukin 10                                         IL10        278
Toll Like Receptor 4                                   TLR4        263
Toll Like Receptor 2                                   TLR2        238
Fos Proto-Oncogene                                     FOS         213
Angiotensin I Converting Enzyme                        ACE         209
C-Reactive Protein                                     CRP         203
Caspase 3                                              CASP3       182
Myeloid Differentiation Primary Response 88            MYD88       182
Glyceraldehyde-3-Phosphate Dehydrogenase               GAPDH       168
Interleukin 6                                          IL6         160
Prostaglandin-Endoperoxide Synthase 2                  PTGS2       157
Forkhead Box P3                                        FOXP3       155
C–C Motif Chemokine Ligand 2                           CCL2        144
CD86 Antigen                                           CD86        139
Nucleotide Binding Oligomerization Domain Containing 2 NOD2        139
Caspase 1                                              CASP1       132
Toll Like Receptor 9                                   TLR9        128
CD40 Antigen                                           CD40        125
Intercellular Adhesion Molecule 1                      ICAM1       124

   Interestingly, genes without apparent roles in immunity, such as the
   proto-oncogene FOS and the apoptosis-related cysteine protease CASP3,
   also have many microbial associations. However, FOS is a central
   transcriptional regulator of the innate immune system^[56]17, and
   CASP3, although it functions canonically in apoptosis, is also involved
   in the inflammatory response and B-cell activation^[57]18. We also found
   that the microbial genera and species with the most mammalian gene
   associations include well-characterized species used as model
   organisms (e.g. Escherichia coli and Saccharomyces cerevisiae), highly
   common commensal bacteria (e.g. Staphylococcus aureus and
   Lactobacillus) and certain well-known pathogens (e.g. Salmonella
   enterica, Pseudomonas aeruginosa and Helicobacter pylori) (Table
   [58]2).

Table 2.

   Microbial genus and species with most mammalian gene associations from
   PubMed publications.
   Microbe                      Gene count
   Escherichia                  959
   Escherichia coli             957
   Enterobacteriaceae bacterium 952
   Streptococcus sp.            858
   Streptococcus                858
   Pseudomonas                  799
   Staphylococcus               796
   Staphylococcus aureus        788
   Bacillus                     785
   Aerococcus viridans          757
   Mycobacterium                739
   Epsilonproteobacteria        705
   Pseudomonas aeruginosa       688
   Helicobacter pylori          665
   Salmonella enterica          649
   Saccharomyces                616
   Saccharomyces cerevisiae     614
   Lactobacillus                606
   Clostridium                  594
   Alphaproteobacteria          544

   To globally assess the quality of the microbe-gene associations
   constructed from PubMed abstracts, we intersected the microbe-gene
   associations with an independent and objective knowledge resource, the
   taxonomy for microbes from Greengenes^[60]14. The assumption behind this
   assessment is that sets of microbes associated with the same mammalian
   genes are more likely than random sets to be enriched in certain
   taxonomic clades. We reduced the dimensionality of the microbe-set
   library of microbe-gene associations using t-Distributed Stochastic
   Neighbor Embedding (t-SNE)^[61]19 to derive an embedding for microbial
   genera and species based on their spectra of potential functional
   associations with mammalian genes (Fig. [62]2). By overlaying the phylum
   information onto the t-SNE embedding, we observed several clusters of
   microbes belonging to the same phylum, including Firmicutes and
   Proteobacteria. These results support our approach, showing that
   automated curation of microbe-gene associations from the literature is
   able to recapitulate, to some extent, phylogenetic similarities among
   microbes. The resultant microbe-set library also lays the foundation
   for the subsequent case studies of MSEA.

Figure 2.

   t-SNE visualization of the normalized microbe-gene co-mentioning matrix
   derived from PubMed queries for the microbes from the following four
   phyla: Firmicutes, Proteobacteria, Bacteroidetes, and Actinobacteria.
   The t-SNE was applied to the TF-IDF normalized (see “[65]Methods”)
   microbe-gene co-mentioning matrix to calculate the 2-D coordinates for
   individual microbial genera or species. Each dot in the scatter plot
   represents a microbial genus or species and is colored by its phylum
   based on the Greengenes taxonomy.
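
   The TF-IDF normalization mentioned in the caption can be sketched as
   below. This is a common textbook variant (co-mention counts normalized
   per microbe, multiplied by the log of the inverse fraction of microbes
   co-mentioned with each gene); the exact variant used in the study is
   described in the Methods and may differ. The function name and the
   counts are hypothetical.

```python
import math


def tfidf(counts):
    """Weight a nested dict {microbe: {gene: co-mention count}} by TF-IDF.

    TF  = count / total count for the microbe
    IDF = log(number of microbes / number of microbes mentioning the gene)
    """
    n_microbes = len(counts)

    # Number of microbes that are co-mentioned with each gene.
    doc_freq = {}
    for gene_counts in counts.values():
        for gene, c in gene_counts.items():
            if c > 0:
                doc_freq[gene] = doc_freq.get(gene, 0) + 1

    weighted = {}
    for microbe, gene_counts in counts.items():
        total = sum(gene_counts.values())
        weighted[microbe] = {
            gene: (c / total) * math.log(n_microbes / doc_freq[gene])
            for gene, c in gene_counts.items()
            if c > 0
        }
    return weighted


# Hypothetical co-mention counts, for illustration only.
counts = {
    "Escherichia coli": {"TNF": 2, "IL10": 2},
    "Bacillus": {"TNF": 4},
}
weights = tfidf(counts)
print(weights)
```

   A gene co-mentioned with every microbe (TNF in this toy example) gets a
   weight of zero, so ubiquitous genes do not dominate the embedding. The
   weighted matrix could then be passed to any t-SNE implementation, for
   example the TSNE class in scikit-learn, to obtain the 2-D coordinates.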

   To demonstrate the use cases and effectiveness of our newly devised
   MSEA methodology, we carried out the following three case studies with
   real-world microbiome datasets from diverse biological contexts.

Case study 1: MSEA between disease-centric microbe-sets and gene-centric
microbe-sets

   First, we set up a case study for MSEA to examine whether microbe-sets
   can be used as an intermediary to connect mammalian genes and diseases.
   The rationale behind the case study is that many microbes are observed
   to associate with a variety of human diseases, and we argue that some of
   those links could be due to their ability to regulate certain mammalian
   genes that are implicated in the diseases. Therefore, one would expect to
   find known gene-disease associations via MSEA between disease-centric
   microbe-sets and gene-centric microbe-sets. To construct a microbe-set
   library associating human diseases with microbes, we used the Disbiome
   database^[66]7, a resource for microbiome composition differences in
   diseases curated from case–control studies. As the background, we used
   the microbe-set library of microbe-mammalian gene associations from
   the literature.
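
   A minimal sketch of such a comparison is given below. It assumes a
   one-sided Fisher's exact (hypergeometric) test for the overlap and a
   Benjamini-Hochberg correction for the q-values, which are common
   choices in enrichment analysis; this section does not confirm that
   these are the exact procedures used by MSEA. The microbe names
   (m1 to m10), the gene names, the universe size and all function names
   are hypothetical.

```python
from math import comb


def overlap_p_value(k, n_query, n_term, n_universe):
    """Right-tail hypergeometric probability P(overlap >= k)."""
    total = comb(n_universe, n_query)
    tail = sum(
        comb(n_term, i) * comb(n_universe - n_term, n_query - i)
        for i in range(k, min(n_query, n_term) + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as `p_values`."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def enrich(query, library, universe_size):
    """Test one microbe-set against every microbe-set in `library`."""
    query = set(query)
    rows = []
    for term, members in library.items():
        a = len(query & members)          # shared microbes
        b = len(query) - a                # in query only
        c = len(members) - a              # in library set only
        d = universe_size - a - b - c     # in neither
        rows.append({
            "term": term,
            "shared": sorted(query & members),
            "odds_ratio": (a * d) / (b * c) if b * c else float("inf"),
            "p_value": overlap_p_value(a, len(query), len(members), universe_size),
        })
    for row, q in zip(rows, benjamini_hochberg([r["p_value"] for r in rows])):
        row["q_value"] = q
    return sorted(rows, key=lambda r: r["p_value"])


# Hypothetical disease-centric microbe-set and gene-centric library.
disease_set = {"m1", "m2", "m3"}
gene_library = {
    "GENE_A": {"m1", "m2", "m3", "m4"},
    "GENE_B": {"m3", "m8", "m9"},
    "GENE_C": {"m6", "m7"},
}
results = enrich(disease_set, gene_library, universe_size=10)
for row in results:
    print(row["term"], row["shared"], row["p_value"], row["q_value"])
```

   Running the same function once per disease-centric microbe-set yields
   a ranked list of candidate gene-disease links, each accompanied by the
   shared microbes that drive the overlap.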

   MSEA found several interesting microbe-mediated disease-gene
   associations (Table [67]3). For example, microbes with differential
   abundance in non-alcoholic fatty liver disease (NAFLD) are
   significantly enriched for the SREBF1 and LPL genes via literature-based
   associations (Table [68]3). SREBF1, which encodes a sterol regulatory
   element binding transcription factor, is a known regulator of
   cholesterol and fatty acid synthesis in the liver^[69]20.
   Overexpression of SREBF1 was also shown to cause NAFLD in mice^[70]21.
   Additionally, lipoprotein lipase (LPL) has a well-characterized role
   in the pathophysiology of NAFLD: lipoprotein metabolism is a central
   pathway for hepatocellular lipid homeostasis^[71]22,[72]23, and, more
   recently, the up-regulation of LPL in hepatic stellate cells has
   been demonstrated to exacerbate liver fibrosis in non-alcoholic
   steatohepatitis (NASH)^[73]24, which can be considered a subtype of
   NAFLD. Hence, the roles of the overlapping microbes that are associated
   with SREBF1 and LPL and also exhibit abnormal abundance in NAFLD merit
   further investigation.

Table 3.

   Top enriched gene-disease connections via microbes identified using
   MSEA.
   Gene Disease Odds ratio P-value q-value Combined score (see
   “[74]Methods”) Supporting references Shared microbes