Abstract
The rapid development of high-throughput sequencing techniques provides
an unprecedented opportunity to generate biological insights into
microbiome-related diseases. However, the relationships among microbes,
metabolites and human microenvironment are extremely complex, making
data analysis challenging. Here, we present NMFGOT, which is a
versatile toolkit for the integrative analysis of microbiome and
metabolome data from the same samples. NMFGOT is an unsupervised
learning framework based on nonnegative matrix factorization with graph
regularized optimal transport, where it utilizes the optimal transport
plan to measure the probability distance between microbiome samples,
which better dealt with the nonlinear high-order interactions among
microbial taxa and metabolites. Moreover, it also includes a spatial
regularization term to preserve the spatial consistency of samples in
the embedding space across different data modalities. We implemented
NMFGOT in several multi-omics microbiome datasets from multiple
cohorts. The experimental results showed that NMFGOT consistently
performed well compared with several recently published multi-omics
integrating methods. Moreover, NMFGOT also facilitates downstream
biological analysis, including pathway enrichment analysis and
disease-specific metabolite-microbe association analysis. Using NMFGOT,
we identified the significantly and stable metabolite-microbe
associations in GC and ESRD diseases, which improves our understanding
for the mechanisms of human complex diseases.
Subject terms: Microbial communities, Metagenomics
Introduction
With the rapid evolution of multi-omics microbiome technologies, it
becomes easy to collect heterogeneous data to explore biological
questions and generate biological hypothesis. 16S ribosomal RNA gene
amplicon and whole metagenomic shotgun sequencing (WMGS)^[26]1 are used
to detect the taxonomic composition. Simultaneously, high-throughput
untargeted and targeted technologies can also be used to estimate the
abundance of metabolites in the same sample^[27]2. The paired
metagenomic and metabolomic profiles from the same samples provide an
unprecedented opportunity to explore biological mechanisms and
cross-omics feature links in microbiome-related diseases.
Increasing evidences have shown that the microbial community in the
human body widely involves in metabolic activities and plays an
important role in host health and diseases^[28]3–[29]5. The microbial
metabolic activities break down some indigestible carbohydrates and
synthesize vitamins that are beneficial to host^[30]6. The bacterial
metabolites promote gut homeostasis, and may also lead to
gastrointestinal and systemic diseases^[31]7. However, findings from a
single data view or modality usually ignore the complementary and
compatible information from other views. For example, host genes and
gut microbiota would act in a coordinated way when they are involved in
common biological functions^[32]8. The metabolites generated in some
bacterial taxa from the cockroach gut have antagonistic activity
against certain pathogens^[33]9. These studies provided important
evidences of metabolite-microbe interactions, but a knowledge gap still
largely remains for entirety character the landscape of
metabolite-microbe associations. This gap stems from the incompleteness
of data, the limited characterization of bacterial genes, the metabolic
“dark matter” (yet uncharacterized metabolites) and so on. Hence, there
is an essential need for computational approaches and tools that can
effectively integrate microbiome and metabolome data to comprehensively
identify the underlying associations and patterns in the data.
Recently, a few multivariate integration methods have been developed to
analyze multi-omics data obtained from the same sample^[34]4,[35]10,
including SNF^[36]11, DIABLO^[37]12, and MVCPM^[38]13, but these
methods were not initially designed for microbiome-metabolome joint
analysis. The integration methods for paired metagenomic and
metabolomic profiles mainly contain CCA-based framework and its
variants: SCCA^[39]14,[40]15 and DCCAE^[41]16. SCCA and DCCAE assume
that the projections of two sets of observations are lineally
correlated and try to establish confident microbe-metabolite
associations. DCCAE introduced two autoencoders and minimized the
combination of CCA objective and the reconstruction error of the
autoencoders. SNF and MVCPM are graph-based multi-omics integration
methods, and assume that there exists a consensus sample similarity
network obtained by iteratively fusing operation. However, these
assumptions may not be realistic, because the interplay between
bacteria, metabolites, and other factors in the complex ecosystem,
renders it difficult to character the landscape of metabolite-microbe
interactions with linear mapping.
In this work, we propose NMFGOT for the integrative analysis of
multi-omics data, where microbiome and metabolome data are parallelly
profiled from the same sample. NMFGOT is a versatile toolkit that
enables clustering of the samples and facilitates downstream biological
analysis, including pathway enrichment analysis and metabolite-microbe
association analysis. NMFGOT is a novel unsupervised learning framework
based on nonnegative matrix factorization with graph regularized
optimal transport. Unlike CCA-based methods^[42]12,[43]14,[44]16,[45]17
which assume linear projection of two sets of observations, NMFGOT uses
the optimal transport plan to measure the probability distance between
microbiome samples, which better deals with the nonlinear high-order
interactions between microbial taxa and metabolites. NMFGOT not only
integrates the complementary information from different data
modalities, but also includes a spatial regularization term to preserve
the spatial consistency of samples in the embedding space. Through
analyzing on three microbiome-related multi-omics datasets from
different tissues (including gastric and gut) and diseases (including
gastric cancer, end-stage renal disease, and inflammatory bowel
disease), we show that NMFGOT is effective in distinguishing sample
types: NMFGOT achieves superior performance in clustering and
visualization. The factors for metabolome obtained in NMFGOT provide
rich biological significances: they are enriched for disease-specific
biological pathways, which are directly related to disease development.
NMFGOT also includes a model that infers disease-specific
metabolite-microbe associations, which is based on lasso penalized
regression model^[46]18 and stability selection approach^[47]19. An
overview of NMFGOT is shown in Fig. [48]1.
Fig. 1. An illustration of NMFGOT.
[49]Fig. 1
[50]Open in a new tab
NMFGOT is designed for joint analysis of microbiome and metabolome data
obtained from the same sample. a The microbial abundance and metabolite
abundance profile matrices
[MATH: X1 :MATH]
and
[MATH: X2 :MATH]
are the inputs of NMFGOT (row represents microbial taxa or metabolite,
and column represents sample). b By iteration fusion, NMFGOT learns the
microbe loading matrix
[MATH: W1 :MATH]
, the metabolite loading matrix
[MATH: W2 :MATH]
, the clustering indicator matrices
[MATH: H1 :MATH]
and
[MATH: H2 :MATH]
, and the sample-sample similarity matrix
[MATH: S :MATH]
. c The enrichment analysis provides biological insights of the top
metabolites identified in the matrix
[MATH: W2 :MATH]
. d Lasso penalized regression and stability selection are performed to
identify significant and stable metabolite-microbe pairs, which provide
an avenue to understand complicated disease mechanisms. e The
sample-sample similarity matrix
[MATH: S :MATH]
facilitates visualization and data clustering.
Results
NMFGOT achieves good clustering performance on multiple disease-specific
microbiome-metabolome datasets
We evaluated the performance of NMFGOT on three microbiome-metabolome
datasets: Gastric cancer (GC) dataset, End-stage renal disease (ESRD)
dataset and Inflammatory bowel disease (IBD) dataset, where microbial
and metabolite abundance are simultaneously profiled for the same
samples.
We compared NMFGOT with several baselines and state-of-the-art methods
for microbiome-metabolome data integration, including SCCA^[51]14,
SNF^[52]20, MVCPM^[53]21 and a deep learning algorithm DCCAE^[54]16.
For SCCA, we utilized grid search to determine the optimal
regularization parameters and implemented k-means clustering on the
canonical variables for microbiome and metabolome data. For SNF, we
evaluated its performance with its default parameters and clustering
method. For MVCPM, k-means was implemented on the low-dimensional
representations of samples. DCCAE consists of two autoencoders and
minimizes the combination of canonical correlation objective and the
reconstruction error of the autoencoders. We implemented DCCAE with the
defaulted parameters and used k-means on the low-dimensional
representations of samples. For NMFGOT, the k-nearest neighbor (KNN)
graph with k = 20 was first established, then Louvain clustering was
implemented on the k-nearest neighbor graph to obtain the final
clustering assignment.
The clustering performance assessed by AC, ARI and silhouette score are
presented in Fig. [55]2. As shown in Fig. [56]2, our proposed NMFGOT
algorithm achieves the best performance among all the methods on the
three datasets in terms of AC and ARI. For the average silhouette
coefficient criterion, SCCA also performs well in IBD data. The numeric
values of the clustering performance metrics are presented in
Supplementary Table [57]1.
Fig. 2. Assessment of clustering performance.
[58]Fig. 2
[59]Open in a new tab
a Evaluation of the clustering performance in terms of AC. b Evaluation
of the clustering performance in terms of the average silhouette score.
Silhouette score measures the clustering quality by comparing the
distances between each sample and its assigned cluster and its
neighboring cluster. Finally, the average silhouette score was
presented above. c Evaluation of the clustering performance in terms of
ARI.
To further validate the effectiveness and efficiency of NMFGOT, we also
conducted comparison experiments with a simplified variant of model
(5): in NMFGL, we replace Laplacian graph L[opt] with traditional graph
(based on similarity computation, such as Gaussian kernel function).
The performance of NMFGL was not as good as model (5), which suggests
that it is beneficial to model microbial data using optimal transport
distance (Supplementary Table [60]2). Ablation studies were also
implemented to investigate the roles of the spatial regularization
technique and optimal transport, where we set β and γ equal to 0 in
turn. The experimental results show that NMFGOT performs well in most
cases. More details were presented in Supplementary Table [61]3.
NMFGOT facilitates data visualization
We next implement UMAP visualization^[62]22 to further evaluate their
performance. For SNF and NMFGOT, the learned sample similarity matrix S
were used as input of UMAP. For other methods, the low-dimensional
representation matrices were used to perform UMAP. The results are
presented in Fig. [63]3. As Fig. [64]3 has shown, NMFGOT identified
more clear cluster structure on GC (Fig. [65]3a) and ESRD data (Fig.
[66]3b).
Fig. 3. Comparison of the visualization results.
[67]Fig. 3
[68]Open in a new tab
SCCA, SNF, MVCPM, DCCAE and NMFGOT are compared by implementing UMAP on
two multi-omics datasets (microbiome + metabolome). a GC data. b IBD
data. Samples are colored based on the true labels provided in their
original publications.
Interesting, on GC datasets NMFGOT identifies what appears to be a
transitional state from gastrectomy to healthy patients (left bottom of
Fig. [69]3a). The similar situation can be also found in IBD (Fig.
[70]3b, sample nodes in box; Supplementary Fig. [71]1).
NMFGOT identifies host pathways associated with disease-specific metabolites
on different datasets
The factors obtained from NMFGOT provide rich biological insights, and
are easy to interpret. Specifically, for metabolite profile data, we
selected the top 50 metabolites with large magnitudes in each column of
the metabolite loading matrix
[MATH: W2 :MATH]
, and implemented pathway enrichment analysis by using
MetaboAnalyst^[72]23. The enriched metabolite sets for metabolome data
agree with the biological function of the underlying sample types that
the factors represent (Supplementary Table [73]4). In GC data, factor 2
corresponds to GC type, which is inferred by inspecting the sample type
label of the samples with large values in
[MATH: H2 :MATH]
. The second column in the metabolite loading matrix
[MATH: W2 :MATH]
is enriched for “valine, leucine and isoleucine biosynthesis”
(log10(q-value) = −14.80), “Glutathione metabolism”
(log10(q-value) = −7.70), and “Phenylalanine, tyrosine and tryptophan
biosynthesis” (log10(q-value) = −5.67). The enriched pathways are
consistent with the previous studies^[74]24–[75]26. In ESRD data,
factor 1 corresponds ESRD sample type. The first column in the
metabolite loading matrix
[MATH: W2 :MATH]
is enriched for “Pyruvate metabolism” (log10(q-value) = −6.68),
“Glycolysis / Gluconeogenesis” (log10(q-value) = −6.68), These results
are also consistent with the previous studies^[76]27,[77]28 (:
Supplementary Table [78]4).
We further validated the metabolite markers by retrieving related
literatures from PubMed. The results were presented in Supplementary
Table [79]5-[80]6. Most of enriched metabolites included in these
pathways were found to be related to GC and ESRD.
Fig. [81]4 shows the top enriched metabolite sets in each of these two
diseases.
Fig. 4. Enrichment analysis for GC dataset and ESRD dataset.
[82]Fig. 4
[83]Open in a new tab
Enrichment analysis for GC data (a) and End-stage renal disease ESRD
data (b). The enriched metabolite sets are sorted by p-value
(FDR < 0.5).
To summarize, NMFGOT facilitates the identification of host pathways
associated with disease-specific metabolites. The enrichment analyses
for the metabolite loading matrix obtained from NMFGOT provide rich and
consistent biological insights on the identified sample types.
NMFGOT identifies disease-specific metabolite-microbe association
Microbial taxa and metabolites involved in common biological functions
may act in a coordinated manner. Based on this hypothesis, we used the
factors obtained from NMFGOT to character molecular-level associations
between microbiome and metabolome in each of these three diseases. More
specifically, by inspecting the sample labels with large values in
[MATH: Hi :MATH]
, sample types can be assigned to the factors in NMFGOT. We selected
top 100 microbial taxa for microbial abundance data and top 50
metabolites for metabolite abundance data, and then computed the
spearman correlation coefficients between microbial taxa and
metabolites. Fig. [84]5a represents the overall pattern of correlation
between microbial taxa and metabolites identified by NMFGOT in GC, ESRD
and IBD (p-value < 0.05).
Fig. 5. Associations between specific microbes and individual metabolites in
each disease.
[85]Fig. 5
[86]Open in a new tab
a Heatmap showing the overall pattern of correlation between top
microbial taxa (rows) and metabolites (columns) in factors identified
by NMFGOT for GC, ESRD and IBD (p-value < 0.05). b Correlation network
for significantly and stability-selected metabolites and microbial taxa
in GC and ESRD. The red nodes represent metabolites, the blue nodes
represent microbes. Edges represent metabolite-microbe associations.
The red line indicates positive correlation, and green line indicates
negative correlation.
Inferring disease-specific metabolite-microbe associations may be
valuable for understanding the mechanisms of complex diseases and
functions of microbes. Next, we further explored associations between
metabolites and microbes based on factors obtained from NMFGOT in GC
and ERSD. To do so, we firstly used lasso penalized regression model to
identify specific microbes whose abundances are associated with the
abundance of certain metabolite^[87]18. Specifically, we used the
abundances of the microbes as predictors and the abundance of
metabolite as response variable to fit the model. Then, stability
selection was applied to select robust associations^[88]19. Finally, an
intersection between associations identified by the lasso penalized
regression model and stability selection above was performed to retain
significant and stability-selected metabolite-microbe associations.
Using this way, we found 40 and 54 robust metabolite-microbe
associations in GC and ESRD, respectively (Fig. [89]5b). In GC data,
these associations consist of 17 microbes and 35metabolites, 33
microbes and 35 metabolites in ESRD data.
Next, we also implemented the additional experiments on holdout dataset
to validate the effectiveness and reproducibility of some markers.
Specifically, we first split ESRD data into two parts, the one is used
to train (50%), the other is used as holdout dataset. Then, on the
train data we implemented NMFGOT algorithm and detected the significant
and stability-selected metabolite-microbe associations. Meanwhile, we
also implemented the same operation on the holdout dataset to identify
the significant and stability-selected metabolite-microbe associations.
Finally, we compared the results obtained from these two datasets and
obtained the shared metabolite-microbe associations. 16 significant and
stability-selected metabolite-microbe associations were supported by
these two parts of data (Supplementary Table [90]7).
Taken together, these findings demonstrate the effectiveness of NMFGOT
in identifying the latent metabolite-microbe associations.
Discussion
Advances in multi-omics sequencing technologies provide an
unprecedented opportunity to explore metabolite-microbe associations
and understand the mechanism in human complex diseases. To this end, we
proposed NMFGOT, which integrates microbiome and metabolome data from
the same samples. CCA based methods used in multi-omics data analysis
(including SCCA and DCCAE) typically assume that the projections of two
sets of observations are lineally correlated. Unlike these approaches,
NMFGOT assumes that there exist complicated nonlinear interactions
among microbial taxa, metabolites, and human gut environment, and uses
OT plan to encode the complex relationships between samples. We
demonstrate through three multi-omics microbiome datasets that NMFGOT
consistently performs well when benchmarked with several recently
published multi-omics integrating methods. NMFGOT also takes advantage
of OT, which characters the probability distance between samples, as
well as integrates spatial regularizations to preserve the spatial
consistency of samples. Moreover, NMFGOT facilitates downstream
biological analysis, including pathway enrichment analysis and
disease-specific metabolite-microbe association analysis. With NMFGOT,
we identified significantly and stable metabolite-microbe associations
in GC and ESRD diseases. The results further show disease-specific
microbial taxa can regulate synthesis of host metabolites.
In the whole experiments, we set the number of factors in NMFGOT equal
to the number of sample types:
[MATH: k=2 :MATH]
in each of the three datasets. Experimental results show that NMFGOT
achieves the best performance. we also tested the robustness of each
method on the values of k, where we varied the values of k in the range
{2,3,4,5,6} on different datasets. However, we found that the
clustering performance of these methods evaluated by silhouette score
are sensitivities for these three datasets, which is likely because
microbiome multi-omics data tends to have high level of noise. For
datasets with multiple groups of samples, NMFGOT also performs well in
terms of silhouette score. The experimental results are also presented
in Supplementary Fig. [91]2.
We also implemented the side-by-side comparison experiments in which
single modality compositional data (microbial abundance data or
metabolite abundance data) and multi-omics data are used to test the
effectiveness of NMFGOT. The experimental results show that NMFGOT has
better performance on AC, ARI, and silhouette score metrics
(Supplementary Table [92]8). We further analyzed a colorectal cancer
data, where microbial abundance profile and metabolite abundance data
are simultaneously profiled in the same samples^[93]4. The 70 percent
of samples are used to train, and 30 percent of samples are used as
validation. The experimental results showed that the samples were
reasonably separated by NMFGOT, and it obtained the high average
silhouette score of samples (0.9310). The results were presented in
Supplementary Fig. [94]3.
In the future, we will extend NMFGOT to integrate more molecular
modalities to capture complementary biological insights into complex
mechanisms underlying multi-omics crosstalk^[95]15. In addition,
extending NMFGOT to analyze gene expression data may also be another
interesting direction: gene abundance can be considered as a modality
and using the gene loading and microbe loading matrices in NMFGOT to
analyze gene-microbe association.
Methods
Datasets description and data preprocessing
Gastric cancer (GC) data^[96]29: This dataset used in this manuscript
was downloaded from
([97]https://github.com/borenstein-lab/microbiome-metabolome-curated-da
ta)^[98]30. 96 faecal samples from 54 healthy individuals and 42
patients with gastrectomy for gastric cancer were collected. Shotgun
metagenomic sequencing and targeted metabolites quantification for the
same faecal samples are parallelly profiled.
End-stage renal disease (ESRD) data^[99]28: This data downloaded from
the literature^[100]28 includes 287 samples from 223 haemodialysis
patients with ESRD and 69 healthy volunteers. Microbial abundance and
metabolite abundance are simultaneously profiled by shotgun
metagenomics sequencing and a headspace solid phase microextraction–gas
chromatography-MS (GC-MS) method, respectively.
Inflammatory bowel disease (IBD) data^[101]3: The dataset were taken
from a published study of patients with IBD. It includes 121 samples
with IBD and 34 controls. Microbial abundance and metabolite abundance
data are parallelly profiled for the same sample. The statistical
information of datasets is presented in Supplementary Table [102]9.
The original microbial abundance data were normalized such that the
total counts of all species in each sample equal to 1. The metabolite
abundance data were log-transformed with a pseudo-count of 1. For the
microbial abundance data, taxa appear in less than 3 samples were
removed. For the metabolite abundance data, metabolites appear in less
than 2 samples were removed.
Overview of NMF
Given a nonnegative data matrix
[MATH:
X∈R+
mo>p×n
:MATH]
, NMF factorizes
[MATH: X :MATH]
into two low-rank matrices
[MATH:
W∈R+
mo>p×k
:MATH]
and
[MATH:
H∈R+
mo>k×n
:MATH]
, where p is the number of features, n is the number of observations
and
[MATH:
k≪min(p,n) :MATH]
is the number of factors^[103]31. The objective function of NMF is
written as follows.
[MATH:
minW,H<
/mi>≥0||X
−WH||F2, :MATH]
1
where
[MATH: W :MATH]
is basis matrix,
[MATH: H :MATH]
is coefficient matrix and can be used as clustering indicator.
[MATH:
∣∣⋅∣∣F :MATH]
indicates the Frobenius norm of a matrix.
Optimal transport - Earth mover’s distance
Optimal transport plan has been successfully applied to some fields,
including cell-cell communication^[104]32,[105]33, domain
adaptation^[106]34,[107]35 and single-cell multi-omics data
integration^[108]36. Given a cost matrix
[MATH:
M∈Rd
×d :MATH]
, probability vectors
[MATH: r :MATH]
and
[MATH: c :MATH]
belong to the simplex
[MATH:
∑d
≔x∈R+d:xT1d=1 :MATH]
, where
[MATH: 1d
:MATH]
is the
[MATH: d :MATH]
dimensional vector with all its elements to be 1 s, the optimal
transport plan aims to find a transport matrix
[MATH: P :MATH]
that maps
[MATH: r :MATH]
to
[MATH: c :MATH]
. The optimal transport problem can be defined as follows^[109]37.
[MATH:
minP
P,M,s.t.P∈Ur,c, :MATH]
2
Here,
[MATH: ⋅,⋅ :MATH]
is Frobenius dot-product,
[MATH:
U(r,c) :MATH]
denotes the transport polytope for
[MATH: r :MATH]
and
[MATH: c :MATH]
.
[MATH:
U(r,c) :MATH]
can be defined as the follows.
[MATH: Ur,c:=P∈R+d×d∣P1d=r,PT1d=c. :MATH]
3
The optimal transport distance between
[MATH: r :MATH]
and
[MATH: c :MATH]
is defined as follows.
[MATH:
dM
r,c=minP∈
Ur,c
mrow>P,M. :MATH]
4
In this paper,
[MATH:
dM
r,c
mrow> :MATH]
is used to compute the probability distance between samples from
different conditions.
Multi-view learning with graph regularized optimal transport plan
To dissect heterogeneity of samples from both microbiome abundance and
metabolite abundance layers, we introduce an unsupervised learning
framework, named nonnegative matrix factorization with graph
regularized optimal transport (NMFGOT). Considering a multi-view
dataset which consists of microbial abundance profile matrix
[MATH: X1∈R+l×
n :MATH]
(l microbial species in n samples) and metabolite profile matrix
[MATH: X2∈R+m×
n :MATH]
(m metabolites in n samples), the objective function of NMFGOT is
defined as follows:
[MATH: min<
msup>Wi,H
i,S
J=∑i=12Xi−WiH(i)
mrow>F2+α∑i=12S−H<
msup>iTHiF2
+βtr(<
mrow>HiTHi(1n×n−HjTHj))+φS1−1F2+γ∑i=12tr
HiLopt(i)H<
/mi>iT :MATH]
5
[MATH: s.t.Wi,Hi,S
,α,β,φ,<
/mo>γ≥0. :MATH]
where
[MATH: Wi :MATH]
,
[MATH:
H(i) :MATH]
indicate the basis matrix and coefficient matrix for the ith data
modality, respectively.
[MATH: Lopt(i)∈Rn×n :MATH]
is the Laplacian matrix for ith data modality. In this manuscript, we
used the optimal transport distance defined in subsection 2.2 to
compute
[MATH: Lopt :MATH]
, i.e.,
[MATH: Lopt=D−A :MATH]
,
[MATH: Dii=∑<
mrow>jAij :MATH]
, where D is a diagonal matrix, A is obtained via a Gaussian kernel
function based optimal transport distance. In manifold learning,
Laplacian graph is usually used to capture the high-order geometrical
structure relationships in original data^[110]38–[111]40.
[MATH:
S∈R+
mo>n×n
:MATH]
represents the learned sample-sample similarity matrix.
[MATH: 1 :MATH]
is a column vector with all its elements to be 1 s.
[MATH:
1n×n :MATH]
represents a matrix of 1 s.
[MATH: φ :MATH]
is a parameter that is used to control the strength of the constraint
[MATH: S1−1 :MATH]
.
[MATH: α :MATH]
and
[MATH: γ :MATH]
are graph regularization parameters.
[MATH: β :MATH]
is spatial regularization parameter that is used to control the
strength of spatial embeddings consistence, and is set
[MATH: β=0.1 :MATH]
for all datasets. We will discuss how to select
[MATH: α :MATH]
and
[MATH: γ :MATH]
parameters in the following section.
In the objective function of NMFGOT (Eq. [112]5), the first term,
[MATH:
∑i=1
mn>2∣∣Xi−WiH(i)
mrow>∣∣F2 :MATH]
is the NMF loss function for microbial abundance data and metabolite
abundance data. The second term,
[MATH:
∑i=1
mn>2∣∣S−HiTHi∣∣
F2 :MATH]
, is a consensus graph fusion strategy which aims to learn a sample
similarity matrix S. Through iteratively training, generated kernel
[MATH: HiTHi :MATH]
from each data modality was regularized towards a consensus graph S. In
the third term of the objective function,
[MATH: tr(HiTHi(1n×n−HjTHj))
:MATH]
, we adopt a spatial regularization technique to preserve the spatial
consistency of samples. For low-dimensional sample representation
matrices
[MATH: H1 :MATH]
and
[MATH: H2 :MATH]
obtained from microbial abundance and metabolite abundance data, we
assume that samples that are spatially distant in the one embedding
space, should be also pushed further in the other embedding space.
Meanwhile, this strategy also introduces more flexibility and allows
for specificity across different molecular modalities. The fourth term,
[MATH: ||S1−1||
:MATH]
, encourages each row in S to have summation close to 1.
The first four terms in Eq. [113]5 can learn the low-dimensional
representation matrices
[MATH: Wi :MATH]
and
[MATH:
H(i) :MATH]
for multi-omics microbiome data, but they may lose the high-dimensional
geometrical structure information in the original data
space^[114]38,[115]40. To solve this problem, we add the fifth term
[MATH: ∑i=12tr
(H(i)
Lopt(i)H(i
)T) :MATH]
in the object of NMFGOT. The Laplacian graph is established based on
the optimal transport distance between samples (see subsection 2.3),
and it can well capture the intrinsic geometry structure of feature
spaces in unsupervised learning environment^[116]41.
The details of constructing Laplacian graph
[MATH: Lopt :MATH]
are presented as follows:
We first used the optimal transport distance described above to
construct the Laplacian matrix
[MATH: Lopt :MATH]
. Given the optimal transport distance matrix
[MATH:
D(i) :MATH]
obtained from the ith compositional profile, the sample-sample
similarity matrix
[MATH:
A(i) :MATH]
are defined as follows:
[MATH: Ajl(i)
=exp−D<
mrow>jl(i)
μ
(σjl(i)
)2
, :MATH]
6
[MATH: σjl(i)
=Ejl+meanEj,N<
mi>j
+meanEk,N<
mi>l
3. :MATH]
7
where
[MATH: μ :MATH]
is a parameter that can be empirically set. In this study, we set
[MATH: μ=0.5 :MATH]
for all datasets.
[MATH: σjl(i)
:MATH]
is the bandwidth parameter that can be used to eliminate the scaling
problem.
[MATH: Ejl :MATH]
denotes the squared Euclidean distance between sample
[MATH: j :MATH]
and
[MATH: l :MATH]
.
[MATH: Nj
:MATH]
is the set of nearest neighbors of the
[MATH: jth :MATH]
sample where
[MATH: Nj=20 :MATH]
.
[MATH:
mean(E(j,N
j)) :MATH]
denotes the average of the squared Euclidean distances between the
[MATH: jth :MATH]
sample and its neighbors.
Then, the Laplacian matrix
[MATH: Lopt :MATH]
cab be defined as follows:
[MATH: Lopt(i)=Dopt(i)−Ai. :MATH]
8
Here,
[MATH: Dopt(i) :MATH]
is a diagonal matrix with entries
[MATH: Dopt(i)jj=
∑l=1nAjl(i)
:MATH]
.
We note that canonical correlations analysis (CCA) is also used to
integrate multi-omics microbiome data^[117]8. Our proposed NMFGOT
framework differs from CCA or its variants (sparse CCA) in the
following three aspects. 1) CCA-based methods assume that there exists
linear projection of two sets of observations, and maximize correlation
between these two data modalities. However, these methods do not
consider the complicated nonlinear relationships among microbial
species^[118]42,[119]43. In NMFGOT, we used optimal transport distance
to measure the relationships between microbial samples, which better
dealt with the high-order interactions (more than two species) among
microbial taxa or metabolites. 2) NMFGOT includes a spatial
regularization term to preserve the spatial consistency of samples in
the embedding space across different data modalities, and to some
extent it tolerates modality specificity. Obviously, spatial
regularization leads to better clustering solutions and
interpretability: the elements in
[MATH: H1 :MATH]
will tend to be consistent with the ones in
[MATH: H2 :MATH]
. 3) NMFGOT utilizes optimal transport Laplacian to encode the
geometrical structure relationships of microbial samples in the
original data space, and enhances the representation ability of
low-dimensional sample factor matrices.
The optimization algorithm for NMFGOT
We used the alternative iteration algorithm to solve the optimization
problem of NMFGOT model. The updating rules for
[MATH: W1 :MATH]
,
[MATH: W2, :MATH]
[MATH: H1 :MATH]
,
[MATH: H2 :MATH]
and
[MATH: S :MATH]
can be obtained as follows.
[MATH: Wij1⟵Wij1X1H1TijW1H1H1Tij. :MATH]
9
[MATH: Wij2⟵Wij2X2H2TijW2H2H2Tij. :MATH]
10
[MATH: Hij1⟵Hij1W1TX1+α
H1ST+γ<
mrow>H1A1+β
H1H2TH2ijW1TW1H1+α
H1H1TH1+γ
H1D1ij. :MATH]
11
[MATH: Hij2⟵Hij2W2TX2+α
H2ST+γ<
mrow>H2A2+β
H2H1TH1ijW2TW2H2+α
H2H2TH2+γ
H2D2ij. :MATH]
12
[MATH: Sij⟵S
mi>ijα∑l=12HlTHl+2
ηeeTij
2αS+ηeeTS<
/mi>ij. :MATH]
13
Selection of parameters α,
[MATH: φ :MATH]
and γ
In NMFGOT, there are three parameters α,
[MATH: φ :MATH]
and γ that need to be tuned, and they are determined as the following.
The optimization problems
[MATH:
minW1,H1≥0
||X1−W1H1T||F2
mn> :MATH]
,
[MATH:
minW2,H2≥0
||X2−W2H2T||F2
mn> :MATH]
were first solved by NNDSVD^[120]44 and obtain the solutions
[MATH: Wˆ1,Hˆ1,Wˆ2 :MATH]
and
[MATH: Hˆ2 :MATH]
. Then,
[MATH: Sˆ
:MATH]
can be obtained using SNF^[121]20, and set α,
[MATH: φ :MATH]
and γ as follows.
[MATH: α=X1−W^1H^1TF2+X2−W^2H^2TF2/5∑i=12S^−H^iH^iTF2φ=X1−W^1H^1TF2+X2−W^2H^2TF2/1000S^1−1F2,andγ=X1−W^1H^1TF2+X2−W^2H^2TF2/100∑i=12tr
H^iTLiH^i :MATH]
14
The sensitive analyses of
[MATH: α :MATH]
and
[MATH: γ :MATH]
are presented in Supplementary Fig. [122]4.
Evaluation metrics
Accuracy(AC), adjusted rand index (ARI)^[123]45 and silhouette
coefficient^[124]46 are used to assess the performance of the
clustering methods. For unlabeled dataset, we use an unsupervised
metric, silhouette coefficient^[125]47,[126]48, to evaluate the
clustering performance of each method. High silhouette coefficient
score indicates that the sample is close to other samples in the same
cluster, and distant from samples in other clusters. The average value
of silhouette scores is used as the final evaluation.
The robustness analysis of NMFGOT on the hyperparameters
To test the robustness of NMFGOT on the hyperparameters, we varied α
and γ in the range
[MATH: α*/
mo>10,α*/
mo>5,α*/
mo>2,α*,
mo>2α*,
mo>5α*,
mo>10α*,
mo> :MATH]
and
[MATH: γ*/
mo>10,γ*/
mo>5,γ*/
mo>2,γ*,
mo>2γ*,
mo>5γ*,
mo>10γ*
:MATH]
, respectively. Here
[MATH: α* :MATH]
and
[MATH: γ* :MATH]
are the hyperparameters chosen by the rules described in the main text.
The results are presented in Supplementary Fig. [127]4. For IBD
dataset, the silhouette score is relatively stable when the
hyperparameters vary. For GC dataset, the AC is stable when the
hyperparameters vary.
Extension of NMFGOT to unseen or holdout data
we extended NMFGOT to analyze the unseen or holdout data. Given the
unseen or holdout data
[MATH: Xˆ
:MATH]
, the transformation objective function can be defined as follows:
[MATH: minH^i,Š
J=∑i=12X^i−WiH^(i)F2+α∑i=12Š−H^iHiTH^iHiF2+βŠ1−1F2, :MATH]
15
[MATH: s.t.H^(i)
,Š,α,β≥0
:MATH]
where
[MATH: H^(i)
:MATH]
represents the low-dimensional representation of unseen data,
[MATH: Sˇ=
SˆAT
AS :MATH]
is the fused similarity matrix.
[MATH: Sˆ
:MATH]
is the similarity matrix between the unseen data, and
[MATH: A :MATH]
is the similarity matrix between the unseen data and the training data.
We used the consensus graph fusion,
[MATH:
∑i=1
mn>2∣∣Š−[H^i∣∣
Hi]T[H^i∣∣
Hi]∣∣F2<
/mn> :MATH]
, to transform the unseen data into the latent space.
We analyzed a colorectal cancer data, where microbial abundance profile
and metabolite abundance data are simultaneously profiled in the same
samples. The experimental results were presented in Supplementary Fig.
[128]3.
Lasso regression and stability selection approach for metabolite-microbe
associations
After obtaining the factors in our NMFGOT model, the lasso penalized
regression model was used to identify the association between a
metabolite and a set of microbial taxa^[129]18:
[MATH: ∑i=1nyi−β0−∑j=1pβjxij
2+λ∑j=1pβj, :MATH]
16
where
[MATH: n :MATH]
is the number of samples,
[MATH: p :MATH]
is the number of microbial taxa.
[MATH: y :MATH]
is the response variable (metabolite abundance),
[MATH: x :MATH]
is the predictor (taxa abundance).
[MATH: λ :MATH]
is the parameter that controls the sparseness of models.
Due to the sensitive of the lasso model, we also used stability
selection approach to choose robust microbial taxa associated with a
metabolite^[130]8,[131]19. In this manuscript, we used stability
selection with lasso to select stable microbes. The process is
demonstrated as follows:
Step 1. Select a random subset of the analyzed data. We used top 100
microbes with large entries in the column of the microbe loading matrix
[MATH: W1 :MATH]
and top 50 metabolites with large entries in the column of the
metabolite loading matrix
[MATH: W2 :MATH]
.
Step 2. Fit the lasso model with
[MATH: λˆ
:MATH]
that is about the best penalty
[MATH: λ :MATH]
, and record the set of selected microbes.
Step 3. Repeat steps 1 and 2 t times.
Step 4. Compute the frequency
[MATH: fi
:MATH]
of each microbe that was selected across all trials.
Step 5. Pick out the microbes with its
[MATH:
fi≥<
/mo>fthr :MATH]
.
[MATH: fthr :MATH]
is a prespecified threshold.
In this manuscript, we set the size of a random subset as
[MATH:
n/2
:MATH]
data,
[MATH: t=100 :MATH]
and
[MATH: fthr=0.6 :MATH]
.
Supplementary information
[132]Supplementary material^ (486.8KB, pdf)
Acknowledgements