Abstract

The commensal microbiome is known to influence a variety of host phenotypes. Microbiome profiling followed by differential abundance analysis has been established as an effective approach to study the mechanisms of host-microbiome interactions. However, it is challenging to interpret the collective functions of the resultant microbe-sets because the commensal microbiome lacks a well-organized functional characterization. We developed microbe-set enrichment analysis (MSEA) to enable the functional interpretation of microbe-sets by examining the statistical significance of their overlaps with annotated groups of microbes that share common attributes such as biological function or phylogenetic similarity. We then constructed microbe-set libraries by querying PubMed to find microbe-mammalian gene associations and by parsing the Disbiome database to find microbe-disease associations. To demonstrate the utility of MSEA, we carried out three case studies using publicly available curated knowledge resources and microbiome profiling datasets focusing on human diseases. We found that MSEA not only yields findings consistent with the original studies but also recovers insights about disease mechanisms that are supported by the literature. Overall, MSEA is a useful knowledge-based computational approach to interpret the functions of microbes, which can be integrated with microbiome profiling pipelines to help reveal the underlying mechanisms of host-microbiome interactions.

Subject terms: Data integration, Data mining, Literature mining

Introduction

With the advance in sequencing technology and growing interest in human microbiota, microbiome profiling datasets are accumulating rapidly. Standard microbiome data analysis pipelines primarily aim to identify individual microbial taxa or microbial communities with differential abundance between healthy and diseased hosts.
Then, genomic and/or metabolic strategies are used to characterize individual microbial taxa to help interpret their mechanisms in the pathogenesis of many complex human diseases^1,2. Host-microbiome interactions are conveyed either by alteration of sets of microbes or by their collective functions. Microbes are able to affect host phenotypes through modulation of gene expression^3 or cell signaling in relevant host cells/tissues. However, the regulatory mechanisms by which microbiomes influence host physiology are not clear. Some studies have demonstrated that such host-microbiome interactions can be achieved via microbial metabolites. For instance, the host immune system has been shown to be modulated by the gut microbiome via microbial metabolites^4,5. As a component of the Human Functional Genomics Project (HFGP), Schirmer et al.^4 found correlations between gut microbial features and production of various types of cytokines in a cohort of 500 healthy adults from the Netherlands. Next, they experimentally validated that two microbial metabolites, tryptophol and palmitoleic acid, are able to modulate the production of IFNγ and TNFα, respectively, in peripheral blood mononuclear cells. An in-depth investigation^5 identified a microbe-derived metabolite, ascorbate, as a selective inhibitor of activated CD4+ effector T cells, including IL-17A-, IL-4-, and IFNγ-producing cells. However, these mechanistic studies are resource intensive and often prone to empirical biases. Although knowledge about the differential abundance of human microbiome species between healthy and diseased states is accumulating with the surge of microbiome profiling studies, our understanding of how the microbiome influences human phenotypes is still limited because of the complexity of host-microbe interactions. There is an urgent need for bioinformatics tools that leverage curated and structured knowledge to guide experimental studies.
Functional enrichment analysis for gene-centric data, such as transcriptomics and proteomics, helps interpret sets of differentially expressed genes through prior knowledge about gene functions^6. Similarly, it would be enormously useful to organize the knowledge about the effects of microbes on the host to aid the functional interpretation of microbiome datasets/signatures. Microbes can be grouped into microbe-sets based on shared attributes. Themed collections of such microbe-sets can be organized into a microbe-set library as a representation of knowledge. Recently, an increasing number of such resources have been established. Disbiome^7 emerged as the first database cataloging microbial composition differences in diseases, covering 190 human diseases and 800 microbial organisms across 674 published studies. There are also databases categorizing microbes based on genomic^8, protein family^9 and taxonomic information^10. In addition, the research community has established databases documenting different functional aspects of microbes, including pathogenesis (e.g. EuPathDB^11), transport and metabolism (e.g. TCDB^12) and signal transduction and gene regulation (e.g. MiST^13). These databases are valuable for deciphering the molecular mechanisms by which microbes influence host phenotypes. However, the cumulative knowledge from mechanistic studies of microbes and diseases is often scattered across the literature. In this study, we developed microbe-set enrichment analysis (MSEA), a novel computational approach for interpreting microbe-sets using themed collections of functionally annotated microbe-sets representing prior knowledge. We demonstrated the utility of the MSEA methodology by carrying out three case studies using publicly available curated knowledge resources and microbiome profiling datasets focusing on human diseases.
We found that MSEA not only yields findings consistent with the original studies but also uncovers insights about disease mechanisms that are supported by the literature. To disseminate our method to the microbiome research community, we developed a Python package “msea” to enable investigators to adopt this analytical approach (available at https://pypi.org/project/msea/).

Results

Construction of a microbe-set library from PubMed literature

Enrichment analysis is designed to infer the collective functions of a set of microbes, rather than of individual ones, by identifying microbe-sets that share common attributes with the input microbe-set. To perform MSEA, we first created a microbe-set library from PubMed literature as the background knowledge representation (Fig. 1). Since we aim to study host-microbiome interactions to investigate how gut microbial organisms affect host phenotypes via the expression of host genes, we grouped microbes based on their literature-documented associations with mammalian genes. The microbe-gene associations were defined as significant co-occurrence across millions of PubMed abstracts. To create this comprehensive collection of literature-based microbe-gene associations, we first parsed the taxonomy information from Greengenes^14 to get 1085 microbial genus and species names across the Bacteria and Archaea kingdoms. These names were then used as search terms to query PubMed via Geneshot^15. Of the 1085 queries, 970 microbial names returned at least one PubMed hit, for a total of 978,217 PubMed abstract hits. Within these abstracts, mentions of 8865 distinct mammalian genes were recognized and mapped to HUGO Gene Nomenclature Committee (HGNC) gene symbols by the named-entity recognition (NER) tool Tagger^16. We next computed the Jaccard index to quantify the association strength between microbes and mammalian genes and to filter out weak associations that were observed by chance.
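The filtering step can be illustrated with a minimal sketch. It assumes that the Jaccard index is computed between the set of abstracts mentioning a microbe and the set mentioning a gene; the function names, the toy abstract identifiers and the threshold value are hypothetical and are not taken from the msea package or the study.

```python
def jaccard_index(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_associations(microbe_to_abstracts, gene_to_abstracts, threshold):
    """Keep (microbe, gene) pairs whose abstract sets overlap with Jaccard >= threshold.

    The threshold is a hypothetical parameter chosen for illustration only.
    """
    kept = {}
    for microbe, m_abstracts in microbe_to_abstracts.items():
        for gene, g_abstracts in gene_to_abstracts.items():
            score = jaccard_index(m_abstracts, g_abstracts)
            # Pairs that never co-occur are dropped regardless of the threshold.
            if score > 0 and score >= threshold:
                kept[(microbe, gene)] = score
    return kept


# Toy data: the abstract identifiers below are made up for illustration.
microbe_to_abstracts = {
    "microbe_a": {1, 2, 3, 4},
    "microbe_b": {5, 6, 7, 8, 9},
}
gene_to_abstracts = {
    "GENE1": {2, 3, 4, 10},
    "GENE2": {9, 11, 12, 13, 14, 15},
}
associations = filter_associations(
    microbe_to_abstracts, gene_to_abstracts, threshold=0.2
)
```

In this toy example only the pair (microbe_a, GENE1) survives: it shares three of five distinct abstracts, whereas (microbe_b, GENE2) shares one of ten and falls below the assumed cutoff.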
The filtering led to 42,944 associations covering 752 microbes and 2045 mammalian genes.

Figure 1. Workflow for constructing the microbe-set library and applying MSEA.

As expected, the mammalian genes most frequently associated with microbial entities are related to immunity and inflammatory responses, such as genes encoding cytokines including TNF, IL10 and IL6, as well as genes involved in innate immune responses such as Toll-like receptors (TLRs) and the innate immune signal transduction adaptor MYD88 (Table 1).

Table 1. Mammalian genes with the most microbe-gene associations from PubMed literature.

| Mammalian gene | HGNC symbol | Microbe count |
|---|---|---|
| Tumor Necrosis Factor | TNF | 401 |
| Interleukin 10 | IL10 | 278 |
| Toll Like Receptor 4 | TLR4 | 263 |
| Toll Like Receptor 2 | TLR2 | 238 |
| Fos Proto-Oncogene | FOS | 213 |
| Angiotensin I Converting Enzyme | ACE | 209 |
| C-Reactive Protein | CRP | 203 |
| Caspase 3 | CASP3 | 182 |
| Myeloid Differentiation Primary Response 88 | MYD88 | 182 |
| Glyceraldehyde-3-Phosphate Dehydrogenase | GAPDH | 168 |
| Interleukin 6 | IL6 | 160 |
| Prostaglandin-Endoperoxide Synthase 2 | PTGS2 | 157 |
| Forkhead Box P3 | FOXP3 | 155 |
| C–C Motif Chemokine Ligand 2 | CCL2 | 144 |
| CD86 Antigen | CD86 | 139 |
| Nucleotide Binding Oligomerization Domain Containing 2 | NOD2 | 139 |
| Caspase 1 | CASP1 | 132 |
| Toll Like Receptor 9 | TLR9 | 128 |
| CD40 Antigen | CD40 | 125 |
| Intercellular Adhesion Molecule 1 | ICAM1 | 124 |

Interestingly, genes without apparent roles in immunity, such as the proto-oncogene FOS and the apoptosis-related cysteine protease CASP3, also have many microbial associations. FOS, however, is a central transcriptional regulator of the innate immune system^17. Although CASP3 canonically functions in apoptosis, it is also involved in the inflammatory response and B-cell activation^18. We also found that the microbial genera and species with the most mammalian gene associations include well-characterized microbial species used as model organisms (e.g.
Escherichia coli and Saccharomyces cerevisiae), highly common commensal bacteria (e.g. Staphylococcus aureus and Lactobacillus) and certain well-known pathogens (e.g. Salmonella enterica, Pseudomonas aeruginosa and Helicobacter pylori) (Table 2).

Table 2. Microbial genera and species with the most mammalian gene associations from PubMed publications.

| Microbe | Gene count |
|---|---|
| Escherichia | 959 |
| Escherichia coli | 957 |
| Enterobacteriaceae bacterium | 952 |
| Streptococcus sp. | 858 |
| Streptococcus | 858 |
| Pseudomonas | 799 |
| Staphylococcus | 796 |
| Staphylococcus aureus | 788 |
| Bacillus | 785 |
| Aerococcus viridans | 757 |
| Mycobacterium | 739 |
| Epsilonproteobacteria | 705 |
| Pseudomonas aeruginosa | 688 |
| Helicobacter pylori | 665 |
| Salmonella enterica | 649 |
| Saccharomyces | 616 |
| Saccharomyces cerevisiae | 614 |
| Lactobacillus | 606 |
| Clostridium | 594 |
| Alphaproteobacteria | 544 |

To globally assess the quality of the microbe-gene associations constructed from PubMed abstracts, we intersected the microbe-gene associations with an independent and objective knowledge resource, the taxonomy for microbes from Greengenes^14. The assumption behind this assessment is that sets of microbes associated with the same mammalian gene are more likely to be enriched in certain taxonomic clades than expected at random. We reduced the dimensionality of the microbe-set library of microbe-gene associations using t-Distributed Stochastic Neighbor Embedding (t-SNE)^19 to derive an embedding for microbial genera and species based on their potential functional association spectrum of mammalian genes (Fig. 2). By overlaying the phylum information onto the t-SNE embedding, we observed several clusters in which the microbes belong to the same phylum, including clusters of Firmicutes and Proteobacteria. These results indicate that our approach of automatically curating microbe-gene associations from the literature is able to recapitulate, to some extent, phylogenetic similarities among microbes. The resultant microbe-set library also lays the foundation for the subsequent case studies of MSEA.
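To make concrete how such a library can be used to test the overlap between an input microbe-set and each annotated microbe-set, the following is a minimal sketch. It assumes a one-sided Fisher's exact test (the hypergeometric upper tail), which is a common choice for overlap significance but is an assumption here, as are the universe size, the function names and the toy library; none of this is the API of the msea package.

```python
from math import comb


def overlap_p_value(query, annotated, universe_size):
    """One-sided hypergeometric (Fisher's exact) p-value for a set overlap.

    Probability of observing at least the actual overlap when len(query)
    microbes are drawn at random from universe_size microbes, of which
    len(annotated) carry the annotation.
    """
    query, annotated = set(query), set(annotated)
    k = len(query & annotated)
    n, m = len(query), len(annotated)
    total = comb(universe_size, n)
    tail = sum(
        comb(m, i) * comb(universe_size - m, n - i)
        for i in range(k, min(n, m) + 1)
    )
    return tail / total


def enrich(query, library, universe_size):
    """Rank each annotated set in the library by overlap p-value, smallest first."""
    results = []
    for term, annotated in library.items():
        shared = sorted(set(query) & set(annotated))
        p = overlap_p_value(query, annotated, universe_size)
        results.append((term, p, shared))
    return sorted(results, key=lambda r: r[1])


# Toy library: terms and microbe names are placeholders, not real associations.
library = {
    "GENE1": {"m1", "m2", "m3", "m4", "m5"},
    "GENE2": {"m6", "m7", "m8"},
}
query = {"m1", "m2", "m3", "m6"}
results = enrich(query, library, universe_size=20)
```

With these toy inputs, GENE1 ranks first because three of its five microbes appear in the four-member query set. In practice the p-values would still need correction for multiple testing across all terms in the library.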
Figure 2. t-SNE visualization of the normalized microbe-gene co-mentioning matrix derived from PubMed queries for the microbes from the following four phyla: Firmicutes, Proteobacteria, Bacteroidetes, and Actinobacteria. t-SNE was applied to the TF-IDF normalized (see “Methods”) microbe-gene co-mentioning matrix to calculate the 2-D coordinates for individual microbial genera or species. Each dot in the scatter plot represents a microbial genus or species and is colored by its phylum based on the Greengenes taxonomy.

To demonstrate the use cases and effectiveness of the MSEA methodology, we carried out the following three case studies with real-world microbiome datasets from diverse biological contexts.

Case study 1: MSEA between disease-centric microbe-sets and gene-centric microbe-sets

First, we set up a case study for MSEA to examine whether microbe-sets can be used as an intermediary to connect mammalian genes and diseases. The rationale behind the case study is that many microbes are observed to associate with a variety of human diseases, and we argue that some of those links could be due to their ability to regulate certain mammalian genes that are implicated in the diseases. Therefore, one would expect to find known gene-disease associations via MSEA between disease-centric microbe-sets and gene-centric microbe-sets. To construct a microbe-set library associating human diseases with microbes, we used the Disbiome database^7, a resource for microbiome composition differences in diseases curated from case–control studies. As the background, we used the microbe-set library of microbe-mammalian gene associations from the literature. MSEA found several interesting microbe-mediated disease-gene associations (Table 3).
For example, microbes with differential abundance in non-alcoholic fatty liver disease (NAFLD) are significantly enriched for the SREBF1 and LPL genes via literature-based associations (Table 3). SREBF1, which encodes a sterol regulatory element binding transcription factor, is a known regulator of cholesterol and fatty acid synthesis in the liver^20. Overexpression of SREBF1 was also shown to cause NAFLD in mice^21. Additionally, lipoprotein lipase (LPL) has a well-characterized role in the pathophysiology of NAFLD: lipoprotein metabolism is the central pathway for hepatocellular lipid homeostasis^22,23; more recently, the up-regulation of LPL in hepatic stellate cells has also been demonstrated to exacerbate liver fibrosis in non-alcoholic steatohepatitis (NASH)^24, which can be considered a subtype of NAFLD. Hence, the roles of the overlapping microbes that are associated with SREBF1 and LPL and also exhibit abnormal abundance in NAFLD merit further investigation.

Table 3. Top enriched gene-disease connections via microbes identified using MSEA.

| Gene | Disease | Odds ratio | P-value | q-value | Combined score (see “Methods”) | Supporting references | Shared microbes |