Abstract
Motivation: Untargeted metabolomics comprehensively characterizes small
molecules and elucidates activities of biochemical pathways within a
biological sample. Despite computational advances, interpreting
collected measurements and determining their biological role remains a
challenge. Results: To interpret measurements, we present an
inference-based approach, termed Probabilistic modeling for Untargeted
Metabolomics Analysis (PUMA). Our approach captures metabolomics
measurements and the biological network for the biological sample under
study in a generative model and uses stochastic sampling to compute
posterior probability distributions. PUMA predicts the likelihood of
pathways being active, and then derives probabilistic annotations,
which assign chemical identities to measurements. Unlike prior pathway
analysis tools that analyze differentially active pathways, PUMA
defines a pathway as active if the likelihood that the path generated
the observed measurements is above a particular (user-defined)
threshold. Due to the lack of “ground truth” metabolomics datasets,
where all measurements are annotated and pathway activities are known,
PUMA is validated on synthetic datasets that are designed to mimic
cellular processes. PUMA, on average, outperforms pathway enrichment
analysis by 8%. PUMA is applied to two case studies. PUMA suggests many
biological meaningful pathways as active. Annotation results were in
agreement to those obtained using other tools that utilize additional
information in the form of spectral signatures. Importantly, PUMA
annotates many measurements, suggesting 23 chemical identities for
metabolites that were previously only identified as isomers, and a
significant number of additional putative annotations over spectral
database lookups. For an experimentally validated 50-compound dataset,
annotations using PUMA yielded 0.833 precision and 0.676 recall.
Keywords: machine learning, inference, untargeted metabolomics,
biological network, metabolic model
1. Introduction
Analyzing cellular responses to perturbations such as drug treatments
and genetic modifications promises to elucidate cellular metabolism,
leading to improved outcomes in personalized medicine and synthetic
biology. Metabolomics has emerged as the new ‘omics’, providing a
readout of cellular activity that is most predictive of phenotype.
Metabolomics, so far, has played a critical role in advancing
applications spanning biomarker discovery [[30]1], drug discovery and
development [[31]2], plant biology [[32]3], nutrition [[33]4], and
environmental health [[34]5]. Importantly, the advent of untargeted
metabolomics to measure molecular masses and spectral signatures of
thousands of small molecule metabolites for a biological sample allows
unprecedented opportunities to characterize the phenotype.
The success of untargeted metabolomics in providing insight into
cellular behavior, however, hinges on solving two problems. Metabolite
annotation concerns associating measured masses with their chemical
identities. This problem is challenging, as a particular mass may be
associated with multiple chemical formulas (e.g., there are 21,988
known molecules associated with C[20]H[24]N[2]O[3]). There are several
techniques for annotating measurements. Database lookups rely on
comparing the measured spectral signature against
experimentally-generated fragmentation patterns cataloged in reference
spectral databases (e.g., METLIN [[35]6], HMDB [[36]7], MassBank
[[37]8], NIST [[38]9]). Database coverage, however, is limited as
catalogued spectral signatures are obtained experimentally.
Alternatively, computational methods that either mimic the ionization
and fragmentation process or utilize machine learning techniques (e.g.,
MetFrag [[39]10], Fragment Identificator (FiD) [[40]11], CFM-ID
[[41]12], CSI:FingerID [[42]13]) score the measured spectra against the
predicted spectra of molecules in a candidate set. The chemical
identity associated with the highest scoring signature(s) is then
assigned to the measured spectra. Other annotation techniques exploit
the biological context of the measurements. For example, iMet [[43]14]
and BioCAn [[44]15] exploit data about local neighborhoods within the
network graphs to improve annotation, while EMMF uses the biological
context to engineer a candidate set based on enzyme promiscuity
[[45]16].
The second problem, pathway enrichment analysis, concerns interpreting
measurements within their biological context to study coordinated
changes arising in response to cellular perturbations.
Overrepresentation Analysis (ORA) tools (e.g., MESA [[46]17],
MetaboAnalyst [[47]18], MPEA [[48]19]) employ statistical testing
(e.g., Fisher’s exact test) to determine if a pathway is enriched in
measured metabolites to a degree different than expected by chance when
compared to other cellular pathways or those in a reference sample
[[49]20]. Pathway enrichment techniques can be broadly classified in
two categories. Topological analysis (TA) computes the observed
metabolites’ centrality and connectivity, metrics that reflect the
importance of a metabolite in the turnover of molecules through a
pathway or network (e.g., MetaboAnalyst [[50]18] and IMPaLA [[51]21]).
Metabolite annotation and pathway enrichment have traditionally been
solved as two independent problems, where pathway enrichment assumes
that the chemical identity of each measured mass is known a priori. In
general, pathway analysis techniques, therefore, do not adequately
address issues related to uncertainty in metabolite annotation. One
exception is Mummichog, a set of statistical algorithms that predict
functional activity directly from measurements considered significant
when compared to those in a reference sample [[52]22].
We present a novel inference-based probabilistic approach,
Probabilistic modeling for Untargeted Metabolomics Analysis (PUMA), for
interpreting metabolomics measurements. One input to PUMA is the set of
measurements that are already processed through metabolomics, e.g.,
MZmine [[53]23], XCMS [[54]24], CAMERA [[55]25]. Another input is a set
of pathways, each consisting of enzymatic reactions and their metabolic
products, that are specific to the sample under study. Such pathways
can be readily assembled from databases such as KEGG or MetaCyc or
others. Using these data, PUMA first calculates the likelihood of
activity of metabolic pathways within a biological sample. PUMA then
utilizes these predictions to derive probabilistic assignment of
measurements to candidate chemical identities. PUMA utilizes inference
and approximates posteriors using Gibbs sampling, a Markov Chain Monte
Carlo (MCMC) sampling technique [[56]26]. Although inference is a
well-known machine learning technique, there were several challenges in
developing PUMA including: (1) identifying a suitable generative model
that represents the underlying biological process, (2) expressing
complex relationships using probability distributions, (3) speeding the
inference procedure with complex mathematical marginalization and
vectorization, (4) identifying best model parameters, and (5)
validating model against the ground truth. Herein, we describe how PUMA
addresses such challenges. PUMA is then applied to two data sets
collected for Chinese Hamster Ovary (CHO) cells [[57]15] and human
urinary samples [[58]27]. Predicted pathway activities are analyzed for
biological significance and compared against activity predictions
obtained through statistical pathway enrichment analysis. We compare
PUMA annotations against those already established by prior analysis of
these datasets. For the CHO cell test case, metabolite annotations
obtained using PUMA are compared to those published prior [[59]15],
including annotations using METLIN [[60]6], HMDB [[61]7], and BioCAn
[[62]15]. For the human urinary samples, PUMA annotations are compared
to published annotations [[63]27] obtained using spectral databases and
experimental validation.
2. Methods
2.1. Motivating Example
A small example is provided to illustrate challenges in mapping
measurements to metabolites and pathways, and to show inference’s
ability to address these issues. [64]Figure 1 presents a snippet of a
network that shows two pathways (ovals), Pathway 1 and Pathway 2.
Metabolites with known chemical identities associated (circles) are
either associated with one pathway (red circle) or more than one
pathway (blue circles). Measurements (squares) correspond to masses
that can be associated with one particular metabolite (red square) or
multiple metabolites (blue squares). Not all metabolites within a
sample are measured due to either instrument limitations or because
they are simply not present in the sample due to biological or
environmental factors. Some metabolites are thus not associated with
any measurements (white circles), and maybe associated with one or more
pathways.
Figure 1.
[65]Figure 1
[66]Open in a new tab
Illustrative example of uncertainty when mapping measurements to
metabolites and pathways. Pathways (ovals) are associated with
metabolites (circles), which in turn are associated with measurements
(square). White circles represent non-measured metabolite with
membership in one or more pathways. Blue circles represent measured
metabolites that have multiple-pathway memberships (multiple-pathway
membership is assumed but not shown for j[3] and j[4]). The red circle
represents a metabolite that has membership in only one pathway.
Measurement w[5] uniquely maps to j[13], which uniquely maps to Pathway
2, while all other measurements map to multiple metabolites, as shown
by solid or dotted lines.
There are two types of uncertainties in interpreting measurements from
untargeted metabolomics. One type of uncertainty relates to assignment
of metabolites to pathways (circles to ovals, [67]Figure 1). For
example, measurement w[3] is assigned to metabolite j[5]. Because j[5]
is a metabolite common to both Pathways 1 and 2, there is an
uncertainty in assignment of the metabolite to the pathways: j[5] can
be the product of activity in either Pathway 1, Pathway 2, or both. The
other uncertainty relates to assignment of masses to metabolites, when
a mass can map to multiple metabolites (squares to circles, [68]Figure
1). Measurement w[4] can be attributed to one or two metabolites, j[6]
and j[7], both sharing the same mass. The uncertainty in assigning w[4]
to metabolites j[6] and j[7] manifests in further uncertainty. If w[4]
is associated with j[6], then it contributes to the activity of Pathway
1 (and/or other pathways with which j[6] is associated), while, if w[4]
is associated with j[7], then it clearly ought to contribute to the
activity of Pathways 2 (and/or other pathways with which j[7] is
associated). Not all measurements contribute to these uncertainties.
For example, measurement w[5] is unique to metabolite j[13]. In turn,
j[13] is unique to Pathway 2. Some measurements (such as w[5]) clearly
contribute more significantly than others (such as w[3] and w[4]) in
determining pathway activities.
Computing pathway activities using an enrichment ratio can be
misleading, because it does not take into account the uncertainty in
attributing measurements to metabolites and pathways. The enrichment
ratio for Pathway 1 can be computed as the ratio of four
putatively-measured metabolites divided by six total metabolites in the
pathway. While this enrichment ratio seems high, there is little
confidence that Pathway 1 is active since all measured metabolites form
this pathway could be due to active pathways other than Pathway 1.
Pathway 2 has an enrichment ratio equal to 3 divided by 8. The
significance or importance of this ratio is unclear. Inference will
conclude that Pathway 2 is active with high probability, as it includes
a measured metabolite that cannot be attributed to the activity of any
other pathway. In contrast to enrichment methods, our inference-based
technique considers uncertainties in measurement-metabolite and
metabolite-pathway relationships when computing the likelihood of
pathway activities. A pathway is considered active, if the likelihood
that it generated the observed measurements is above a particular
threshold. When we analyze our test cases, we will assume a threshold
of 0.5. A user of PUMA may decide to use this threshold or select a
more suitable threshold above which pathways are deemed active.
2.2. Generative Model
To determine pathway activities, an untargeted metabolomics workflow
([69]Figure 2A) begins with collecting measurements, followed by
metabolite annotation using annotation tools (e.g., database look ups
or annotation tools) and then applying pathway analysis tools (e.g.,
ORA or TA) to determine the pathway activities. A pathway is assumed
active when biological and environmental factors lead to the production
of some or all of its metabolic products. In some cases, metabolite
annotation is skipped, and statistical pathway activity is computed
directly from measurements. In contrast, our inference-based approach
utilizes a generative model ([70]Figure 2B) that mimics biological
processes inherent to the sample under study. Our presumed biological
process assumes that when pathways are active, they cause the presence
of some its metabolites, which in turn results in observations of
masses through untargeted metabolomics.
Figure 2.
[71]Figure 2
[72]Open in a new tab
Comparison of a workflow to collect and interpret observations (A), and
a generative model that captures a biological process (B).
PUMA first constructs a graphical model [[73]28] that captures the
complex relation among pathway activities, metabolites, and
measurements in a single integrated model. The model produces values
that are observed (measured), as well as hidden variables of interest,
which cannot be directly observed but rather inferred from those values
that can be observed. In our case, the observations correspond to mass
measurements collected through untargeted metabolomics. The hidden
variables are pathway activities and the presence of a metabolite in a
biological sample.
Our generative model assumes the following biological process: one or
more pathways are active. An active pathway causes the presence of some
of its metabolites, which in turn results in observations of masses
through untargeted metabolomics data collection. The generative model
also assumes that the mass spectrometer is not biased towards
particular measurements or classes of molecules. The lack of
observations regarding the presence of masses therefore contributes
some evidence regarding the corresponding pathway activity. The
generative model is parameterized with prior information, or prior
probabilities, about the behavior of the biological process. Here, we
provide priors on each step in the biological process: for pathway
activities, on pathways generating their metabolites, and metabolites
mapping to mass measurements. We assume that the biological sample has
a metabolic model with
[MATH: I :MATH]
pathways,
[MATH: J :MATH]
metabolites, and
[MATH: K :MATH]
unique metabolite masses. A metabolite may have membership in one or
more pathways. PUMA assumes prior knowledge of adducts and in-source
fragmentation and utilizes the adjusted measured masses when mapping
the measurements to model metabolites. An (adjusted) measured mass may
be associated with one or more masses of the model metabolites. Masses
of the model metabolites are mapped to discretized bins, where each bin
is centered at a unique mass value and allows for a mass tolerance of
±15 ppm. Each model metabolite is assigned to a single bin that is
centered closest to the metabolite’s mass. A binary vector
[MATH: w :MATH]
has
[MATH: K :MATH]
entries and indicates putative mass observations of metabolites in the
model. An entry of 1 for
[MATH: wk
:MATH]
in vector
[MATH: w :MATH]
indicates the observation (measurement) of at least one metabolite in
the k^th bin while a 0 indicates no observation for any metabolite in
that bin.
Let
[MATH:
a=(a<
mi>i:i=1,…
,I) :MATH]
denote the status of
[MATH: I :MATH]
pathways in the biological sample, so
[MATH: a :MATH]
is a vector of binary random variables, where a value of 1 indicates
that the corresponding pathway is active and 0 indicates inactivity. We
assume that the
[MATH: ai
:MATH]
random variables are independent, with a
[MATH: Bernoulli(λ :MATH]
) prior:
[MATH:
p(ai<
/msub>=1)=λ<
mo>, i=1, …, I :MATH]
(1)
For simplicity in defining our model, we assume that
[MATH: λ :MATH]
is a model parameter and set it to a constant. As an alternative, we
can give it a Beta prior.
Matrix
[MATH: Z :MATH]
is defined with
[MATH: I :MATH]
rows and
[MATH: J :MATH]
columns. Each entry
[MATH:
zij :MATH]
corresponds to the activeness of metabolite
[MATH: j :MATH]
in pathway
[MATH: i, :MATH]
where a value of 1 indicates metabolite
[MATH: j :MATH]
is active due to pathway
[MATH: i :MATH]
and a value of 0 indicates that metabolite
[MATH: j :MATH]
is not produced by pathway
[MATH: i :MATH]
. If a metabolite
[MATH: j :MATH]
is on a pathway
[MATH: i :MATH]
, then the metabolite is produced according to the following
probability:
[MATH:
p(zi
j=1|a<
/mi>i=1)=<
/mo>μ, p(zij=1
|ai=0<
mo>)=0 :MATH]
(2)
Otherwise,
[MATH:
p(zi
j=1|a<
/mi>i)=0 :MATH]
when
[MATH: j :MATH]
is not on
[MATH: i :MATH]
. For simplicity, we assume that all metabolites are equally likely to
be generated with probability
[MATH: μ :MATH]
within an active pathway. Vector
[MATH: m :MATH]
collapses the matrix
[MATH: Z :MATH]
into a binary vector with
[MATH: J :MATH]
elements, indicating the activeness of a metabolite due to whichever
pathway:
[MATH:
mj=[
∑izij>0] :MATH]
(3)
Here
[MATH:
[⋅]
:MATH]
gives 1 when the condition inside is true or 0 otherwise.
As not all masses can be captured using the mass spectrometer, its
observed accuracy is defined using parameter
[MATH: γ :MATH]
. Let
[MATH: Jk
:MATH]
define the group of metabolites that have masses in the
[MATH: k :MATH]
-th bin, then:
[MATH:
p(wk<
/msub>=0|mJ
mi>k)=(1−γ)∑j∈
Jkmj :MATH]
(4)
This probability means that every metabolite present in the biological
sample has a chance
[MATH: γ :MATH]
to be detected. In the case when all metabolites in
[MATH: Jk
:MATH]
are not observed (
[MATH: ∑j∈
Jkmj=0 :MATH]
), then mass
[MATH: k :MATH]
will not be observed. The detection of a metabolite is independent of
the detection of others in the sample. No two groups,
[MATH: Jk
:MATH]
and
[MATH:
Jk’ :MATH]
, intersects because a metabolite has only one mass. The described
model is described using the plate representation [[74]29] ([75]Figure
3). The model presents the joint probability distribution of random
variables
[MATH: a :MATH]
,
[MATH: z :MATH]
,
[MATH: m :MATH]
and
[MATH: w :MATH]
defined as:
[MATH:
p(a,z
mi>,m,w)
=p(a;λ) p(z|a;μ)
p(m|z) p(w|m;γ) :MATH]
(5)
Figure 3.
[76]Figure 3
[77]Open in a new tab
Graphical representation of the generative model. To avoid representing
all
[MATH: I :MATH]
pathways,
[MATH: J :MATH]
metabolites, and
[MATH: K :MATH]
masses in the graph we use the ‘plate’ notation and draw one
representative node per variable and enclosing these variables in a
plate (rectangular box). The number of instances of each enclosed
variable is indicated by the fixed constant in the lower right corner
of the box. Random variables of the model (a, z, m, w) are shown in
white circles. The variable m has a deterministic relationship with Z.
The shaded circle, labelled w, represents an observed random variable.
[MATH:
μ, λ, γ :MATH]
are parameters to the model.
2.3. Inference
Using the probabilistic model, we infer pathway activities and
metabolite presence from mass measurements. Specifically, we calculate
the following probabilities. For each pathway
[MATH: i :MATH]
in the biological sample we calculate
[MATH:
p(ai<
/msub>|w) :MATH]
, the posterior probability of pathway
[MATH: i :MATH]
being active given evidence in mass measurements. PUMA utilizes Gibbs
sampling to perform Bayesian inference [[78]26] to approximate the
posterior probabilities of pathway activities conditioned on the
measurements. We then infer the presence of metabolites by calculating
the posterior
[MATH:
p(mj<
/msub>|w) :MATH]
for all
[MATH: j :MATH]
. We use the latter probabilities to rank a candidate set of
metabolites for each mass measurement, where a candidate set provides
one or more suggestion of chemical identities that have the same mass,
within an error margin, as the observed one.
2.3.1. Inferring Pathway Activities
Gibbs sampling is employed to perform Bayesian inference to approximate
[MATH:
p(a|w
mi>) :MATH]
, the posterior probability of pathway activities conditioned on the
measurements. Naively sampling random variables a and Z, is time
consuming. To speed the Gibbs sampler, we marginalize hidden variables
[MATH: Z :MATH]
. From the Bayesian formula:
[MATH:
p(a|w
mi>)=p(w|a)p(a)/p(<
mi>w) :MATH]
(6)
Gibbs sampling is convenient in that there is no need to compute the
denominator
[MATH:
p(w) :MATH]
to draw samples from the posterior
[MATH:
p(a|w
mi>) :MATH]
. We only need to focus the computation of
[MATH:
p(w|a
mi>) :MATH]
and
[MATH:
p(a) :MATH]
, where the latter was already assumed to have a Bernoulli
distribution. Below we show how to compute
[MATH:
p(w|a
mi>) :MATH]
. We point out that
[MATH:
p(w|a
mi>) :MATH]
decomposes as follows:
[MATH:
p(w|a
mi>)=∏kp(wk|a<
/mrow>) :MATH]
(7)
This is because metabolites in separate
[MATH: Jk
:MATH]
groups are independent given
[MATH: a :MATH]
, so do masses that are computed within these groups. Then we focus on
the calculation of
[MATH:
p(wk<
/msub>|a) :MATH]
. Let
[MATH:
ϕj(a
) :MATH]
be the probability that at least one pathway in the biological sample
generates metabolite
[MATH: mj
:MATH]
. That is,
[MATH:
ϕj(a
)=p(m<
/mi>j=1|a) :MATH]
. The detailed calculation of
[MATH:
ϕj(a
) :MATH]
is provided in [79]Supplementary File S1, the calculation of
[MATH:
ϕj(a
) :MATH]
is:
[MATH:
ϕj(a
)=1−(<
/mo>1−μ)
nj
:MATH]
(8)
with
[MATH: nj
:MATH]
being the number of active pathways that
[MATH: mj
:MATH]
is on. Probability
[MATH:
p(wk<
/msub>|a) :MATH]
is then computed as follows:
[MATH:
p(wk
|a)={1−Πj∈Jk[<
mn>1−γϕj]wk=1Πj∈Jk[<
mn>1−γϕj]wk=0 :MATH]
(9)
The expression
[MATH:
1−γϕj :MATH]
, a number between 0 and 1, represents the likelihood that the mass
spectrometer did not measure the activity of metabolite
[MATH: mj
:MATH]
. Combining
[MATH:
p(wk<
/msub>|a) :MATH]
with the Bernoulli prior
[MATH:
p(a),<
/mo> :MATH]
we have the joint probability p
[MATH:
(w,a
) :MATH]
, which is sufficient for running the sampler and getting samples from
the posterior. If
[MATH: λ :MATH]
has a Beta prior, then we will sample a and
[MATH: λ :MATH]
together from
[MATH:
p(λ)p<
/mi>(a|λ)p(w|a
) :MATH]
.
2.3.2. Inferring Metabolite Annotations
With samples drawn from
[MATH:
p(a|w
mi>) :MATH]
, we approximate
[MATH:
p(mj<
/msub>|w) :MATH]
, the posterior probability distribution of metabolite
[MATH: j :MATH]
being present in the biological sample. Instead of running the Gibbs
sampling procedure again, we use previously collected samples of
[MATH: a :MATH]
from
[MATH: p(a|w) :MATH]
to estimate the probability
[MATH:
p(mj<
/msub>|w) :MATH]
. Let
[MATH:
S={a
mo>∈samples of p(a|w)} :MATH]
be a set of samples from the distribution
[MATH: p(a|w) :MATH]
, then:
[MATH:
p(mj<
/msub>|w)=Σap(
mo>mj,a|w)=Σap(
mo>mj|a,w)p(
a|w)≈
1|S|<
mo> Σa∈S p(mj|a, w
) :MATH]
(10)
The probability
[MATH:
p(mj<
/msub>|a, w) :MATH]
has efficient computation. Let
[MATH: kj
:MATH]
denote the entry of
[MATH: w :MATH]
corresponding to metabolite
[MATH: j :MATH]
, and let
[MATH:
\kj
:MATH]
denote other entries in
[MATH: w :MATH]
. Then:
[MATH:
p(mj<
/msub>|a, w)=p(<
mi>mj, w|a
)p(w|a)=p(mj, wk
j|a)p(w\kj|a
mrow>)p(
w|a) :MATH]
(11)
Here we use the fact that
[MATH: mj
:MATH]
and
[MATH:
wkj :MATH]
are independent of other mass observations when
[MATH: a :MATH]
is given. With this relation, we have:
[MATH:
p(mj<
/msub>=1|a, w
)=p(mj=1,
wkj<
/mrow>|a)<
mi>p(mj=
mo>1, wk
j|a)+p(mj=0, w<
msub>kj|a) :MATH]
(12)
Here, the terms that are constants to
[MATH: mj
:MATH]
are canceled. Finally, we can compute
[MATH:
p(mj<
/msub>,wkj
|a) :MATH]
by marginalizing over all
[MATH:
mj,
:MATH]
for
[MATH:
j′≠j :MATH]
and
[MATH:
j′∈J
k :MATH]
:
[MATH: p(
mj, w
kj|a)=ΣmJ<
/mi>k\j p(mj,
mJk\<
mi>j, w
kj|a) <
/mo> =ΣmJ<
/mi>k\j p(wkj|mj
mi> , mJk\j)<
/mo>p(mj
msub>, mJk\j|a<
mo>) :MATH]
(13)
We decompose the above formulation into two terms for managing
calculations. These two terms,
[MATH:
p(wk<
/msub>|mJk) :MATH]
and
[MATH:
p(mj<
/msub>, mJk
mi>\j|a
) :MATH]
are further derived and re-expressed in the [80]Supplementary Material,
to yield the following:
[MATH:
p(mj<
/msub>, wkj
mi>|a)={(1
−ϕj)(∏j′
∈Jk, <
msup>j′≠j(1−γϕ
mi>j′))mj=0, wkj
msub>=0
(1−ϕj)(1−∏j′
∈Jk, <
msup>j′≠j(1−γϕ
mi>j′))mj=0, wkj
msub>=1
ϕj(1−γ<
/mi>)∏j′
∈Jk, <
msup>j′≠j(1−γϕ
mi>j′)mj=1, wkj
msub>=0
ϕj(1−
(1−γ)<
mstyle displaystyle="true">∏j′
∈Jk, <
msup>j′≠j(1−γϕ
mi>j′))mj=1, wkj
msub>=1 :MATH]
(14)
We use these equations to calculate the probabilities
[MATH:
p(mj<
/msub>=1, wkj|a
) :MATH]
and
[MATH:
p(mj<
/msub>=0, wkj|a
) :MATH]
. By normalizing the two terms to have a sum of 1, we get the posterior
of metabolite annotations. The derived probabilities are used as a
scoring metric to rank a candidate set for each mass measurement.
Details on the derivation and implementation of metabolite annotation
are provided in [81]Supplementary File S1.
2.4. Implementation and Parameter Initialization
We implemented PUMA using PyMC3 [[82]30], a probabilistic programming
framework that allows for automatic Bayesian inference on user-defined
models. In the implementation, we assume that
[MATH: λ :MATH]
has a
[MATH: β :MATH]
prior with parameters
[MATH:
α=β=1
mrow> :MATH]
. We sample both random variables
[MATH: a :MATH]
and
[MATH: λ :MATH]
. To draw samples from a posterior distribution, PyMC3 utilizes a
Markov Chain Monte Carlo (MCMC) sampling technique [[83]31]. The
generative model was derived from the metabolic model for each of our
case studies. The observed accuracy of the mass spec,
[MATH: γ, :MATH]
is assumed to be 0.9. Each entry in μ is assumed to be 0.5 if
metabolite j exists on pathway
[MATH: i :MATH]
. T, the number of samples to draw from the model, is a variable that
can be set in PyMC3. The sampler was run multiple times with T values
equal to 500, 1000, and 1500. We assumed 100 burn-in samples. For all
reported runs, increasing the number of drawn samples altered the
computed probabilities slightly but did not affect the list of active
pathways, based on a 0.5 activity threshold. Drawing 1000 samples was
used as a default.
3. Results
3.1. Model Validation
To give confidence in the performance of PUMA, it is desirable to
validate the generative models against a “ground truth” dataset, where
all measured metabolites are annotated and there is sufficient
experimental evidence to allow attributing measured metabolites to
specific pathways. Predictions by PUMA can then be compared against
this ground truth. Despite several databases that catalogue various
metabolomics datasets (e.g., MetaboLights [[84]32], Metabolomics
Workbench [[85]33]), there are currently no untargeted metabolomics
sets that are 100% annotated. Further, there are no datasets that allow
attributing metabolites to specific pathways through experimental work.
We therefore design datasets that serve as “ground truth” datasets when
validating pathway analysis tools. The datasets are generated to mimic
biological processes where genes within pathways work in concert and
result in enzymatic activities that produce metabolites [[86]34]. These
metabolites are then observed via mass spectrometry. As central
metabolism and network topology is conserved across many organisms
[[87]35], we generated the synthetic datasets for a representative
organism, the CHO cell, a popular biological sample that is discussed
herein as a case study. Its cellular pathways and their metabolites are
used as the basis for generating pathway and metabolite activities.
To model a variety of plausible cellular activity, several synthetic
datasets are generated. The portion of active pathways and the portion
of active metabolites are varied for each dataset. A random portion
(0.3, 0.5, and 0.7) of pathways are assumed active, and a random
portion (0.05, 0.10, 0.15, 0.20, 0.25, 0.50, 0.75) of metabolites
within each active pathway are generated. For each portion of active
pathways and for each portion of active metabolites, 100 metabolomics
datasets reflecting the masses of the active metabolites were
generated. The observed accuracy
[MATH: γ :MATH]
was set to 1.
We compared the likelihood of pathway activities computed by PUMA
against the enrichment ratios for the synthetic datasets. The
enrichment ratio for a particular pathway is defined as the ratio of
measured masses that map to metabolites within the pathway to its size.
We use AUC, the area under the Receiver Operating Characteristic (ROC)
curve to report the results. A ROC curve plots TPR (True Positive Rate)
vs. FPR (False Positive Rate) at different classification thresholds.
The AUC considers the performance of a classifier across all possible
classification thresholds. Lowering the threshold for classifying a
pathway as active results in classifying more pathways as active, thus
increasing both false positives and true positives. The AUC effectively
reports on the probability that the method ranks an active pathway that
is selected randomly more highly than an inactive pathway that is
selected randomly. The ROC curves are plotted for PUMA and for
enrichment ratios ([88]Figure 4). The AUCs are consistently higher for
PUMA than for the enrichment ratios, with the exception of the case
when pathway activities are low (0.3) and only 0.1 of metabolites
within each pathway are produced. An increase in the number of
measurements while the pathway activity is fixed provides PUMA with
more evidence and in general results in PUMA performance improvements
that are more pronounced than for enrichment analysis. The lowest AUC
for PUMA occurs with the lowest pathway activity and lowest number of
generated metabolites (AUC = 0.69), while the lowest AUC for the
enrichment ratio analysis occurs with the highest portion of active
pathways, and smallest number of generated metabolites (AUC = 0.50).
Importantly, PUMA outperformed the enrichment ratio, on average, by 8%,
with average AUCs of 0.81 and 0.73 for PUMA and enrichment ratios,
respectively.
Figure 4.
[89]Figure 4
[90]Open in a new tab
AUC for ROC curves for the synthetic datasets under different
assumptions regarding pathway activity and metabolite generation.
(A,C,E) PUMA. (B,D,F) Enrichment ratio.
We then applied PUMA to each dataset and averaged PUMA’s precision,
recall and accuracy on identifying the presumed active pathways. At a
pathway activity of 0.3 ([91]Figure S1A), as we have more observed
metabolites, recall increases because PUMA has more evidence in terms
of observations to recover the correct pathway activities. Precision,
PUMA’s ability to label true positives correctly, is greater than 0.71,
regardless of the active fraction of metabolites. Accuracy improves
with increased active metabolites due to the corresponding increase in
PUMA’s ability to identify true positives. This trend holds for other
assumptions about pathway activities ([92]Figures S2A and S3A).
We investigate how uncertainty in metabolite annotation impacts
inference regarding pathway activity. Before running PUMA, each mass
measurement is attributed to a presumed active metabolite, thus
removing annotation uncertainty. Results ([93]Figures S1B, S2B, and
S3B) show a similar trend to those in [94]Figures S1A, S2A, and S3A. A
similar trend holds when each measured mas is randomly assigned a
metabolite amongst model metabolites with the same mass as a measured
mass ([95]Figures S1C, S2C, and S3C). This result emphasizes that
computing pathway activities without the explicit step of performing
metabolite annotation via spectral databases or annotation tools is a
profitable approach. PUMA can, therefore, be used to accelerate the
process of pathway activity analysis by direct use of mass measurements
and bypassing metabolite annotation using spectral databases.
We further investigated the robustness of the model to its parameters.
While prior runs assumed that the probability of observing a metabolite
due to a particular pathway activity was 0.5, we varied the
corresponding model parameter
[MATH: μ :MATH]
to 0.25 and to 0.75 and re-ran PUMA. The results ([96]Figure S4) show
that inference is dominated by other aspects of the model and that
inference is robust to this model parameter.
3.2. Case Study: Chinese Hamster Ovary (CHO) Cell
We apply PUMA to LC-MS (Liquid-Chromatography Mass Spectrometry)
metabolomics data for CHO cell cultures belonging to a low growth cell
line [[97]15] ([98]Supplementary Table S1). LC-MS data was collected
under three different combinations of liquid chromatography methods and
positive or negative ionization modes. The authors processed the data
using the CAMERA tool and the dataset was limited to masses
corresponding to ions formed through protonation or deprotonation. When
combined, the data provides a more comprehensive characterization of
the sample in the form of 8711 measurements. The metabolic model for
the CHO cell was extracted from KEGG [[99]36], based on metabolites and
pathways for the cricetulus griseus (Chinese hamster) under organism
code cge. The model had 86 pathways, 1534 metabolites, and 722 unique
mass measurements. The m/z of the precursor ions in the measurement
datasets were adjusted based on the ionization mode by adding or
subtracting the mass of one proton. Adjusted measurements ± a 15 ppm
are then used to initialize the observation vector
[MATH: w :MATH]
for each dataset, as described in [100]Section 2.2. This model and
LC-MS dataset were used prior to evaluate BioCAn [[101]15], a tool that
aggregates results from spectra databases, annotation tools and network
connectivity. The BioCAn evaluation thus provided putative identities
in the form of KEGG identities, and also provided annotations using
METLIN and HMDB. We, therefore, compare metabolite annotations with
those provided for METLIN, HMDB, and BioCAn.
3.2.1. Probabilities of Pathway Activities
Detailed results for each dataset and for the combined data set is
provided in [102]Supplementary Table S1. A pathway is considered active
if
[MATH:
p(ai<
/msub>|w) :MATH]
is equal to or greater than 0.5. As mass observations differ from one
set of measurements to another, the predicted activity differs among
the datasets. A detailed discussion of the results for the individual
datasets is provided in [103]Supplementary File S1. The rest of the CHO
cell analysis provided here is based on the combined dataset.
Many of the 42 pathways identified active by PUMA are biologically
relevant. The biological activity of most pathways such as TCA cycle,
essential for energy metabolism, Biotin (vitamin B7) metabolism, amino
acid synthesis, and many others, is expected. However, the activity of
some pathways, including caffeine and drug pathways, is biologically
unlikely active in the CHO cell samples. Based on our experiments using
the synthetic datasets, we expect some PUMA predictions to be false.
Pathway activities predicted by PUMA are contrasted against pathway
enrichment ratios ([104]Figure 5). Pathways are labeled as
statistically enriched based on statistical significance of their
ratios using Fisher’s Exact Test (FET). The null hypothesis is that
there is no difference between the enrichment ratios of pathways in the
sample. A p-value equal to or less than 0.05 is considered significant.
Eight pathways are designated statistically enriched. These pathways
are galactose metabolism, fatty acid degradation, purine metabolism,
N-glycan biosynthesis, amino sugar and nucleotide sugar metabolism,
glycosaminoglycan degradation, glycerophospholipid metabolism, lipoic
acid metabolism. Among them, six pathways were predicted by PUMA to be
active with probability equal to 1 while the N-Glycan biosynthesis
pathway had a 0.53 likelihood of being active. Fatty acid degradation
is predicted to be inactive. There were many pathways that had low
enrichment ratios and low PUMA-predicted activity.
Figure 5.
[105]Figure 5
[106]Open in a new tab
Probability of pathway activities as computed by PUMA vs. enrichment
ratios for CHO cell. Each data point is marked as either statistically
enriched (red) or non-statistically enriched (blue) based on a Fisher’s
Exact Test p-values of 0.05.
While there was consensus in some cases, there were also differences.
PUMA designates some pathways as active despite low enrichment ratios.
For example, the enrichment ratios of the TCA cycle, fatty acid
biosynthesis, ubiquinone and terpenoid-quinone biosynthesis are 0.15,
0.29, and 0.13, respectively. Meanwhile, PUMA predicted these pathways
active with a likelihood of 1. There are three pathways with enrichment
ratio equal to 0.5. Of them, one pathway, biotin metabolism, is
assigned active by PUMA with probability 1.0. The biotin metabolism
pathway has a measured mass that is unique and cannot be generated by
other pathways. However, the other two pathways, both glycosphingolipid
biosynthesis pathways, are predicted active with probability less than
0.5 (0.47 and 0.48). The reason was as follows: the observed mass
measurements in the glycosphingolipid biosynthesis pathways could be
mapped to galactose metabolism and glycosaminoglycan degradation
pathways that are associated with a unique measurement that cannot be
attributed to any other pathway in model (similar to the case of w[5]
in our illustrative example [107]Figure 1). As the result, the
glycosphingolipid biosynthesis pathways were assigned probabilities
less than 0.5, while the pathways with the unique measurements are
predicted active with high probability.
3.2.2. Probabilities of Metabolite Annotations
A particular measurement was associated with a model metabolite if its
mass matched the measured mass within the bin tolerance. Each
measurement therefore may be assigned zero, one or more possible
annotation. Probabilities of each metabolite being present in the
sample as inferred by PUMA are used to score and rank the putative
annotations. Here, the top ranked metabolite(s) for each mass is
considered as the PUMA candidate set.
We assess the accuracy of PUMA annotations by comparing the level of
agreement of PUMA annotations with those using two other techniques,
spectral database searches and BioCAn ([108]Figure 6). Spectral
signatures collected for the CHO Cell were looked up in METLIN and HMDB
as was previously reported [[109]15]. Out of 411 mass measurements, 85
were matched using their spectral signatures either in HMDB or METLIN.
Each such measurement had one or more chemical identities assigned to
the spectral signature. Here, the highest scoring metabolite(s) for
each measurement using METLIN and HMDB formed the spectral database
candidate set.
Figure 6.
[110]Figure 6
[111]Open in a new tab
Metabolite annotations attained with PUMA against those identified by:
(A) searching spectral databases, HMDB and METLIN, and (B) BioCAn. The
blue slice in each pie represents “agreement”. The orange and gray
slices represent “semi-agreement” and “disagreement” respectively.
Finally, the yellow slice represents the number of mass measurements
that could only be annotated by PUMA.
For each measurement, the PUMA candidate set was compared against the
spectral database candidate set. The comparison leads to four different
scenarios: “agreement”, “semi-agreement”, “disagreement”, and “only
PUMA”. An “agreement” scenario is where the PUMA candidate set exactly
matches the spectral database candidate set. Such agreement occurs in
60 cases. A “semi-agreement” is when the spectral database candidate
set is a subset of the PUMA candidate set. That is, unlike the
“agreement” scenario, there is not complete consensus regarding the top
candidate set, and hence the “semi-agreement” label. There are 15 cases
of “semi-agreement”. A “disagreement” scenario is when the spectral
database candidate set does not overlap with the PUMA candidate set.
There were 10 such cases of “disagreement”. In 7 such cases, the
spectral database candidate set is the second likely putative
annotation identified by PUMA. These putative annotations, which were
not part of the PUMA candidate set, had high PUMA activity scores that
were close in value to scores of the metabolite(s) in the PUMA
candidate set. In the remaining 3 cases, however, the spectral database
candidate is assigned a low score by PUMA. Clearly PUMA and METLIN/HMDB
disagree in regard to these three cases. An “only PUMA” scenario is
when the spectral database candidate set was empty, but the PUMA
candidate set was not. There were 326 such cases. This large number of
“only PUMA” cases reflects the low coverage of spectral databases.
PUMA annotations are compared against those obtained using BioCAn
[[112]15]. BioCAn aggregates results from spectral database searches
and in silico fragmentation tools. Further, BioCAn estimates the
confidence in annotation not only based on consensus scoring but also
based on the presence of metabolites that are connected to the mass
measurement through substrate-product relationships. BioCAn identifies
338 out of 411 mass measurements that are annotated by PUMA. The top
ranked metabolite(s) for each mass as annotated by BioCAn is considered
the “BioCAn candidate set”. We analyze various scenarios as we did when
comparing against the spectral database candidate set. Again, there
were four scenarios: “agreement”, “semi-agreement”, “disagreement”, and
“only PUMA”. The definitions of these scenarios are similar to the ones
provided for the METLIN/HMDB comparison, but against the BioCAn
candidate set instead of the spectral database candidate set. There are
255 cases of “agreement”, 46 cases of “semi-agreement”, 37 cases of
“disagreement”, and 73 “only PUMA” annotations. The disagreements fell
into two categories. In 17 out of 37 cases, there was disagreement
regarding the top candidate, where PUMA ranked BioCAn’s candidate as
second best. There were genuine disagreements in 20 cases where the
annotation by BioCAn was assigned a low score by PUMA.
To validate annotations, BioCAn experimentally validated 50 of their
predicted annotations against chemical standards. A subset of 37
compounds were confirmed present based on the standards. Thus, BioCAn’s
precision on this 50-compound dataset is 0.740 (37/50). PUMA correctly
calculated the likelihood for 33 of the 50 compounds in the sample: 25
were true positives, and 8 were true negatives. PUMA miscalculated the
likelihood for 17 annotations: 5 were false positives and 12 were false
negatives. PUMA’s precision for this 50-compound dataset is 0.833, thus
achieving a significant additional 0.093 improvement in precision over
BioCAn. Recall for PUMA for this 50-compound dataset was 0.676.
In summary, comparing PUMA annotations against those obtained through
spectral database and BioCAn shows significant levels of agreement.
METLIN, HMDB and BioCAn incorporate spectra signatures during
annotation while PUMA relies solely on pathway organization and mass
measurements. Importantly, for the CHO cell, PUMA increased annotation
by 383% over spectral databases and by 21% over BioCAn.
3.2.3. Evaluation of PUMA in Overcoming Uncertainty in Annotation
As a measured mass may be attributed to more than one metabolite, there
is uncertainty inherent in mapping measurements to metabolites. Matrix
[MATH: τ :MATH]
, with J rows and K columns, maps metabolites in the model to their
corresponding masses. We investigate if PUMA benefits from reducing the
mapping uncertainty by adjusting the
[MATH: τ :MATH]
matrix prior to running PUMA. In the case of the CHO cell, the
annotations obtained via spectral lookups in the METLIN and HMDB
databases are used to assign identities to some of the measured masses
prior to running PUMA. To capture this knowledge, matrix
[MATH: τ :MATH]
is modified to map each mass k to precisely a single metabolite j,
which is identified based on the METLIN and HMDB annotations. Column
entries other than
[MATH:
τ i,
mo>j :MATH]
are set to zero, indicating that mass k uniquely maps to metabolite
[MATH: j. :MATH]
Using the updated
[MATH: τ :MATH]
, PUMA calculated posteriors for pathway activities. There was a slight
change in predicted posteriors (average increase of 0.003) compared to
those obtained using the original
[MATH: τ :MATH]
matrix. The change however does not alter posterior probabilities
sufficiently to modify the list of active pathways. We repeated the
analysis but incorporated the annotation data available from BioCAn
instead of that obtained through spectral databases. The change in
[MATH: τ :MATH]
caused a slight change in predicted posteriors (an average of 0.001 per
pathway) compared to those obtained using the original
[MATH: τ :MATH]
matrix. The one significant change was for pathway Phenylalanine
metabolism where pathway activity changed from 0.03 to 1.0. The
phenylalanine metabolism pathway is responsible for producing tyrosine.
This finding suggests that substantial additional annotations, as
provided in the form of added annotations by BioCAn over the use of
spectral databases, are required to inform inference in regard to
pathway activities.
3.3. Case Study: Human Urinary Sample
We apply PUMA to untargeted metabolomics datasets collected for human
urinary samples analyzed by Roux et al. [[113]27]. Detailed annotations
using KEGG identities, if available, are provided for 384 metabolites
based on careful analysis of 659 and 825 annotated ions in the positive
and negative modes, respectively. As PUMA utilizes post-processed
metabolomics data, we utilized the 384 mass measurements, which were
already adjusted for ionization, adducts, and in-source fragment ions,
as input to PUMA. Byproducts in the urine may be attributed to human
metabolic pathways. We, therefore, modeled pathways responsible for
these byproducts using the human metabolic model from MetaCyc
[[114]37]. The model had 275 pathways, 716 metabolites, and 565 unique
masses. To evaluate PUMA without prior knowledge of metabolite
annotations, each of the 384 mass measurements were mapped to all model
metabolites that were within ±15 ppm, as described in [115]Section 2.2.
Only 123 out of the Roux et al. measured masses matched to metabolites
in the model.
3.3.1. Probabilities of Pathway Activities
PUMA designated 41 pathways as active in human urinary sample
([116]Supplementary Table S1). We investigate how inference results
compare with pathway enrichment ratios ([117]Figure 7). Of the 41
pathways designated to be active using PUMA, six pathways (tRNA
charging, 4-hydroxyproline degradation I, histidine degradation VI,
lysine degradation II, purine ribonucleosides degradation to
ribose-1-phosphate, nicotine degradation III) are statistically
enriched. As in the CHO cell cases, there were cases of agreement and
disagreement. There are several pathways were PUMA predicts low
activity, while enrichment assumes a high enrichment ratio, including
alanine biosynthesis II, glutamate degradation II, aspartate
biosynthesis, arginine degradation VI, and alanine degradation III. The
probabilities for these pathways are 0.26, 0.22, 0.17, 0.31, and 0.25,
respectively, while the corresponding enrichment ratios are 1.0, 0.57,
0.75, 0.6, and 1.0. Many measurements assigned to these pathways,
however, are not unique as they can generated due to activity of other
pathways.
Figure 7.
[118]Figure 7
[119]Open in a new tab
Probability of pathway activities as computed by PUMA vs. enrichment
ratios for the human urine sample. Each data point is marked as either
statistically enriched (red) or non-statistically enriched (blue) based
on a Fisher’s Exact Test p-values of 0.05.
3.3.2. Probabilities of Metabolite Annotations
The PUMA probabilities for each metabolite being present in the sample
are used to score and rank metabolites. Only the top ranked
metabolite(s) for each mass are considered as the PUMA candidate set.
We compared our annotation against those provided by Roux et al.
[[120]27] ([121]Figure 8). These annotations were either identified by
matching at least two of their physicochemical parameters to those in a
reference standard or annotated through spectral database lookups
(HMDB). Some measurements were annotated as isomers, without
identifying the precise chemical molecular identity. Each measurement
was assigned the best possible match. We refer to this match as the
Roux candidate. When comparing the PUMA candidate set to the Roux
candidate, there were four scenarios: “agreement”, “clarification”,
“disagreement”, and “model incompleteness issues”. An “agreement”
scenario is where the PUMA candidate set exactly matches the Roux
candidate. Such agreement occurs in 85 cases. A “clarification”
scenario is when the PUMA candidate set provided a specific chemical
annotation while the Roux candidate annotated the measurement as an
isomer. There were 23 cases of “clarification”. A “disagreement”
scenario is when the Roux candidate was available in the model but was
not predicted as match by PUMA. There was one case of “disagreement”. A
“model incompleteness issues” scenario is when the Roux candidate was
not in the model, reflecting that PUMA provides the best match within
the scope of model metabolites. There were 14 “model incompleteness
issue”. More comprehensive metabolic model could address such issues.
Figure 8.
[122]Figure 8
[123]Open in a new tab
Metabolite annotations attained with PUMA against those identified by
Roux et al. The blue slice represents “agreement”. The orange slice
represents “clarification”. The gray slice represents “disagreement”
and the yellow slice represents “model incompleteness issue”.
3.4. Model Convergence, Complexity, and Runtimes
The time and space complexity in sampling the model is O(
[MATH: T :MATH]
×
[MATH: I×J :MATH]
). The runtime for drawing 1000 samples for pathway activity prediction
and metabolite annotation for the CHO cell dataset were 231 and 0.5 s,
respectively. The corresponding runtimes for the Human Urinary case
study were 280 and 0.4 s, respectively. The runs were performed on a
Dell PowerEdge R815 server with 64 cores (4× AMD Opteron 6380
processors) and 128 GB of RAM, running at 2.5 GHz.
4. Discussion and Conclusions
We presented in this paper PUMA, a probabilistic approach to interpret
mass measurements collected through untargeted metabolomics. PUMA first
uses inference to determine pathway activities. While prior works
focused on computing pathway enrichment in the context of comparing one
sample against the other, here, we define a pathway as active based on
its likelihood of being responsible for the presence of one or more
metabolomics measurements. In determining activity, PUMA reasons about
the complex relationships between the measurements as well as known
pathway as defined through the underlying biochemical networks. In
doing so, levels of uncertainty in mapping measurements to metabolites
and pathways are significantly reduced. Moreover, a clearer view of the
likelihood of pathway activity levels emerges when compared to simple
enrichment analysis. PUMA then utilizes the likelihood of pathway
activities to compute the posterior probability distribution of
metabolites being present in the sample. The mathematical formulation
for computing pathway activities and metabolite annotation
probabilities was essential in achieving fast runtimes. For each test
case, PUMA’s runtime for calculating pathway activities was under five
minutes. The runtime for calculating metabolite annotation was less
than a second. A naïve implementation of pathway activity inference
would have caused significantly longer runtimes.
PUMA was validated using synthetic datasets, generated to compensate
for the lack of “ground truth” metabolomics datasets where all
metabolites and pathway activities are known. As expected, additional
measurements consistently improve PUMA recall scores. This study
provided two strong results. First, fixing identities of the
measurements prior to computing pathway activities provide limited
improvement in PUMA performance, thus emphasizing that bypassing
metabolite identification prior to computing pathway activity is a
valid approach for determining pathway activities. Second, using AUC
for the ROC curve as a metric, the pathway activity likelihoods
computed using PUMA, on average, outperform the pathway enrichment
analysis by an average of 8%.
PUMA was applied to two case studies, the CHO test case and the human
urine test case. In regard to pathway analysis, PUMA identifies
pathways that have a high likelihood of being active but have
statistically low enrichment ratios, and pathways with low likelihood
of being active yet with statistically high enrichment ratios. PUMA,
therefore, offers a perspective on pathway activity that is distinctly
different from that offered by statistical enrichment.
In regard to metabolite annotations, PUMA results had high agreement to
prior annotations. This high level of agreement occurs despite the fact
that PUMA does not utilize additional information in form of spectra
signatures, as employed by other techniques. In the case of the CHO
cell test case, PUMA increased the percentage of mass annotation by
383% over spectral lookups and by 21% over BioCAn. PUMA analysis of a
50-compound dataset that was experimentally verified by BioCAn, yielded
0.676 recall and 0.833 precision, which provides a significant 0.093
improvement in precision over BioCAn. For the human urine test case,
PUMA showed agreement in annotating 85 metabolites that were annotated
before using database looks ups. PUMA also suggested new putative
identities for measurements that were previously identified only as
isomers. Agreements demonstrated for both test cases against prior
experimental annotations (those provided by BioCAn [[124]15] and by
Roux et al. [[125]27]) provide strong evidence for PUMA’s annotation
capabilities.
The approach of predicting differential functional activity directly
from spectral features without a priori metabolite annotation was
previously shown effective using Mummichog [[126]22]. PUMA utilizes all
measurements to compute the likelihood of pathway activities that gave
rise to the measurements. PUMA considers a biological sample under a
certain condition, while Mummichog utilizes differentially expressed
measurements to determine differently observed pathways/modules. PUMA
uses the likelihood of pathway activities to derive metabolite
annotations. Analyzing both the synthetic data sets and the CHO cell
test case, PUMA confirms that the organization of metabolic networks
can resolve the ambiguity in metabolite annotation to a large extent,
as previously noted when using Mummichog.
PUMA is based on inference but differs from other inference-based
methods. ProbMetab [[127]38] uses a probabilistic method [[128]39] to
assign empirical formulas to measured spectra given potential formulas.
The method proposed by Jeong et al. constructs a generative model to
infer the likelihood of a metabolite in the sample and the correctness
of matching the measurement to a candidate metabolite within a spectral
database based on measured spectra’s similarity to that of the proposed
candidate and to other competing spectra in the database [[129]40]. The
competing spectra, however, may not be relevant to the sample. Del
Carratore et al. uses evidence in the form of isotope patterns, adduct
relationship and biochemical connections to infer metabolite
annotations [[130]41]. ZODIAC [[131]42] also utilizes inference to
re-rank molecular formula candidates suggested by SIRIUS [[132]43].
The presented method herein provides a novel, efficient,
inference-based pathway analysis technique. PUMA can be enhanced in
multiple ways. PUMA assumes that pathway activities are independent.
Under this assumption, observed (present or absent in the sample)
metabolites are not independent because they share the same set of
hidden random variables. This relationship is evident in the graphical
representation of the generative model ([133]Figure 3). However, PUMA
does not consider biochemical dependencies between substrates and
products. The generative model proposed herein can be strengthened by
considering such relationships. Further, modeling instrument bias and
sensitivity to metabolite concentrations can enhance PUMA, as well as
other annotation tools. PUMA, like Mummichog, BioCAn, and other
enrichment analysis tools depend on genome-scale metabolic models.
However, such models are often incomplete, yielding incomplete analyses
as evidenced in the scenario of “model incompleteness issues” when
applying PUMA to the human urine sample. Integrating metabolomics data
with genomics data can potentially yield improved metabolic models
[[134]44]. PUMA, like many other tools, can be enhanced by
investigating instrument bias and sensitivity to metabolite
concentrations. Herein, synthetic datasets served an important role in
validating PUMA’s performance. The synthetic datasets were generated
using the CHO cell as a model organism and assumed several scenarios
with a wide range of pathway activity and observations. It is possible
to make alternate assumptions regarding the data generation process and
further adjust the generation assumptions to account for potential
biases in collecting the measurements.
Acknowledgments