Abstract
Turnover numbers characterize a key property of enzymes, and their
usage in constraint-based metabolic modeling is expected to increase
the prediction accuracy of diverse cellular phenotypes. In vivo
turnover numbers can be obtained by integrating reaction rate and
enzyme abundance measurements from individual experiments. Yet, their
contribution to improving predictions of condition-specific cellular
phenotypes remains elusive. Here, we show that available in vitro and
in vivo turnover numbers lead to poor prediction of condition-specific
growth rates with protein-constrained models of Escherichia coli and
Saccharomyces cerevisiae, particularly when protein abundances are
considered. We demonstrate that correction of turnover numbers by
simultaneous consideration of proteomics and physiological data leads
to improved predictions of condition-specific growth rates. Moreover,
the obtained estimates are more precise than corresponding in vitro
turnover numbers. Therefore, our approach provides the means to correct
turnover numbers and paves the way towards cataloguing kcatomes of
other organisms.
Subject terms: Biochemical reaction networks, Computational models,
Metabolic engineering, Bioinformatics
__________________________________________________________________
The construction of protein-constrained genome-scale metabolic models
depends on the integration of organism-specific enzyme turnover
numbers. Here, the authors show that correction of turnover numbers by
simultaneous consideration of proteomics and physiological data leads
to improved predictions of condition-specific growth rates.
Introduction
Genome-scale metabolic models (GEMs) together with advances in
constrained-based modeling have led to an improved understanding of how
cellular resources are used to fulfill different cellular
tasks^[30]1–[31]3. Recent advances are largely propelled by the
development of protein-constrained GEMs (pcGEMs) in which the catalytic
capacities of individual enzymes are linked to the allocation of enzyme
abundances^[32]4. Such models have led to more accurate predictions of
maximum specific growth rates on different carbon sources^[33]5–[34]7,
flux distributions^[35]7, and other complex phenotypes^[36]8 in
Escherichia coli and Saccharomyces cerevisiae. However, the development
of pcGEMs critically depends on the integration of organism-specific
enzyme turnover numbers,
[MATH: kcat :MATH]
, comprising the kcatome of an organism^[37]9.
Measuring the kcatome of an organism based on in vitro characterization
is limited due to the impossibility to purify specific enzymes, lack of
availability of substrates, and knowledge of required cofactors, such
that their relevance for studies of in vivo phenotypes remains
questionable^[38]10,[39]11. Proxies for in vivo turnover numbers also
termed maximal apparent catalytic rates, can be estimated by combining
constraint-based approaches for flux prediction with measurements of
protein abundance under different growth conditions or genetic
modifications^[40]12–[41]14. The results from this approach, which
entails ranking condition-specific estimates that use individual data
sets, have shown that the proxies for in vivo turnover numbers
generally concur with in vitro
[MATH: kcat :MATH]
values in E. coli^[42]12. However, applications with data from S.
cerevisiae^[43]15 and A. thaliana^[44]16 indicated that these proxies
for in vivo turnover numbers do not reflect in vitro measurements.
Another approach to estimate the kcatome relies exclusively on the
machine and deep learning methods that use a variety of features of
enzymes (e.g. network-based, structure-based, and
biochemical)^[45]17–[46]19, resulting in predictive models that can
explain up to 70% of the variance in turnover numbers obtained in
vitro.
The estimates of turnover numbers are integrated into metabolic models
by different constraint-based approaches that have been grouped into
coarse-grained (e.g. MOMENT^[47]5, sMOMENT^[48]20, eMOMENT^[49]21, and
GECKO^[50]7, which all result in the same feasible space in protein
limited growth scenarios) and fine-grained (e.g. resource balance
analysis^[51]1 and ME-models^[52]2,[53]3). Of these, GECKO^[54]7 has
been adopted in several recent studies due to the elegantly structured
formulation of the protein constraints. In addition, GECKO allows for
the integration of protein contents and correction factors that account
for the mass fraction of enzymes (
[MATH: f :MATH]
) included in the model as well as the average in vivo saturation (
[MATH: σ :MATH]
) of all enzymes, facilitating the development of condition-specific
models. While data-driven estimation of in vivo turnover numbers
improves the coverage of
[MATH: kcat :MATH]
values in pcGEMs, the available estimates usually lead to
over-constrained models when using the allocation of total protein
mass, not considered in flux balance analysis (FBA)^[55]22,[56]23.
Here, we propose PRESTO (for protein-abundance-based correction of
turnover numbers), a scalable constraint-based approach to correct
turnover numbers by matching predictions from pcGEMs with measurements
of cellular phenotypes— simultaneously—over multiple conditions. As a
constraint-based approach, PRESTO facilitates the investigation of the
variability of the proposed corrections. We show that predictions of
growth by pcGEMs of S. cerevisiae with turnover numbers corrected by
PRESTO are more accurate than those based on the models that include
[MATH: kcat :MATH]
values corrected based on a contending heuristic that relies on enzyme
control coefficients^[57]22. We also demonstrate that the same
conclusions hold when enzyme abundances are integrated into the E. coli
pcGEM using PRESTO. Therefore, PRESTO paves the way to broaden the
applicability of pcGEMs for organisms with biotechnological
applications and to arrive at genotype-specific estimates of the
kcatome.
Results
Protein-abundance-based correction of turnover numbers
For a given data set of protein abundances over a set of conditions,
the enzymes with turnover numbers in a pcGEM can be partitioned into
three groups. For instance, a data set of protein abundances that was
recently used to estimate in vivo turnover numbers in S.
cerevisiae^[58]15 includes 45%, 41%, and 14% measured overall, at least
one (but not all), and none of the 27 used conditions, respectively.
Therefore, there is then different data support for correcting the
[MATH: kcat :MATH]
values of these classes of proteins. PRESTO relies on solving a linear
program that minimizes a weighted linear combination of the average
relative error for predicted specific growth rates and the correction
of the initial turnover numbers integrated into the pcGEM (Fig. [59]1,
see the “Methods” section). It further employs K-fold cross-validation
(here, K = 3) with 10 repetitions while ensuring a steady state and
integrating protein constraints for proteins measured overall
conditions (Fig. [60]1, see the “Methods” section). The training set of
conditions is used to generate a single set of corrected in vitro
[MATH: kcat :MATH]
values, by using the respective in vivo protein abundances. The
resulting corrected
[MATH: kcat :MATH]
values are in turn used to determine the relative error of the
predicted specific growth rate for each condition in the test set using
flux balance analysis with the pcGEM, while only constraining the total
protein content and measured uptake rates. The relative error of the
predicted specific growth rate along with the sum of introduced
corrections is lastly used to select the value for the tuning parameter
[MATH: λ :MATH]
in the objective function of PRESTO, as done in machine learning
approaches that rely on regularization.
Fig. 1. Schematic overview of the PRESTO approach for k[cat] correction.
[61]Fig. 1
[62]Open in a new tab
The approach uses a GECKO-formatted pcGEM^[63]7 containing turnover
numbers from BRENDA^[64]48. Using available data from n experimental
conditions,
[MATH: n :MATH]
condition-specific models are generated using nutrient uptake rates and
protein contents. PRESTO then uses data on abundances for the enzymes
measured across the
[MATH: n :MATH]
investigated conditions and solves a linear program that minimizes a
weighted sum of two objectives, the relative error to measured specific
growth rates and the sum of positive
[MATH: kcat :MATH]
corrections, δ. The optimal weighting factor, λ, which modulates the
trade-off between the two objectives, is then determined by
cross-validation (Tr: training set; Ts: test set), choosing the
parameter, which is associated with the lowest average relative error.
Using the optimal value for λ, PRESTO combines all models for the
experimental conditions to find a
[MATH: kcat :MATH]
correction for each enzyme with measured abundance. Last, the precision
of δ values is assessed by variability analysis as well as by sampling
and corrected
[MATH: kcat :MATH]
values are validated by comparing them to values obtained from other
approaches.
PRESTO outperforms a contending heuristic in S. cerevisiae
To determine the performance of PRESTO and compare it to that of
contending heuristics, we used a data set comprising protein abundances
and exchange fluxes from 27 diverse conditions, as supported by the
principal component analysis (Supplementary Fig. [65]1). Application of
PRESTO with a pcGEM of S. cerevisiae with initial in vitro turnover
numbers obtained from BRENDA resulted in a mean relative error of 0.68
from the cross-validation procedure, yielding a correction of on
average 213 turnover numbers (Supplementary Fig. [66]2a). For the S.
cerevisiae pcGEM, we found a value of
[MATH:
10−7
:MATH]
for the parameter λ in the PRESTO objective provides the optimal
trade-off between both the relative error and the sum of introduced
corrections (see the “Methods” section). Moreover, we observed a high
overlap between the sets of proteins with corrected turnover numbers in
the cross-validation (average Jaccard distance of 0.07 (Supplementary
Fig. [67]2b, c)), suggesting that the integrated data from different
conditions point to a specific subset of enzymes that need to be
corrected to improve performance of growth prediction.
Unlike PRESTO, GECKO implements a heuristic for the correction of
turnover numbers that are based on the objective control coefficient
calculated for each protein in a given condition (Supplementary
Fig. [68]3)^[69]22. The control coefficient of a protein is determined
by increasing the turnover number by 1000-fold and scoring the effect
on the predicted specific growth rate. The proteins are then ranked in
decreasing order of their control coefficients, and the turnover number
of the first enzyme in the list is changed to the maximum value found
in BRENDA for this enzyme across all organisms. This procedure is
repeated with the remaining enzymes until the pcGEM predicts a growth
rate that is at most 10% smaller than the measured specific growth rate
for that condition or no additional
[MATH: kcat :MATH]
value that strongly constrains the solution can be found (Supplementary
Fig. [70]3). This leads to condition-specific sets of corrected
[MATH: kcat :MATH]
with large intersections or full containment over a considered order of
conditions (Supplementary Fig. [71]4a).
In contrast to this procedure, PRESTO corrects at once the turnover
numbers of multiple enzymes that are measured in all investigated
conditions by simultaneously leveraging the data from the different
conditions, considerably reducing runtime and the number of solved
problems. As a result, rather than deriving condition-specific
corrected
[MATH: kcat :MATH]
values, which are difficult to use in making predictions for unseen
scenarios or for building large-scale kinetic metabolic
models^[72]24,[73]25, PRESTO results in a single set of corrected
[MATH: kcat :MATH]
values.
We compared the performance of PRESTO with the heuristic implemented in
GECKO in three modeling scenarios that consider: (i) only
condition-specific total protein content, (ii) both total protein
content and uptake constraints, and (iii) additional constraints from
abundances of enzymes measured in all conditions (Fig. [74]2). For
corrections of turnover number from PRESTO, we observed that the
relative error spans the range from 0.15 to 0.88 in the least
constrained scenario (i) (Fig. [75]2a) and from 0.69 to 0.98 in the
most constrained scenario (iii) (Fig. [76]2c). In contrast, the
relative error with the corrections of turnover numbers from the GECKO
heuristic is in the range from 0.96 to 1.00 in scenario (iii)
(Fig. [77]2c). In addition, in scenario (iii), the median relative
error in the case of the GECKO heuristic for each condition is larger
than the relative error of the PRESTO predicted specific growth rate
(Fig. [78]2c). We observed that predictions from FBA, considering
enzyme abundances, without a constraint on the total protein content,
led to an average relative error of 0.70 with
[MATH: kcat :MATH]
values corrected according to PRESTO and 0.99 with
[MATH: kcat :MATH]
values corrected according to GECKO (Supplementary Table [79]1).
Fig. 2. Comparison of predicted growth of S. cerevisiae from pcGEMs with
k[cat] corrections from GECKO and PRESTO.
[80]Fig. 2
[81]Open in a new tab
Condition-specific pcGEMs with corrected
[MATH: kcat :MATH]
values generated by the GECKO heuristic were used to predict the
specific growth rate for each condition (n = 27, a, b). The boxplots
indicate the distribution of the relative error resulting from each set
of condition-specific corrected
[MATH: kcat :MATH]
values obtained from the GECKO heuristic. Relative prediction error
from each set is indicated by a circle. The red diamonds show the
relative error of the predicted specific growth rate from the PRESTO
model
[MATH:
(λ=10−7)
:MATH]
by using the single set of corrected
[MATH: kcat :MATH]
values in the respective pcGEM. a Only the measured total protein pool
was used to constrain the solution and condition-specific uptake rates
were bounded by 1000
[MATH: mmolhgDW :MATH]
; b measured uptake rates were also considered; c abundances of enzymes
measured in all conditions were used as additional constraints. The
compared pcGEMs in each condition (n = 19) used the same respective
biomass reaction, GAM,
[MATH: σ :MATH]
, and
[MATH: Ptot :MATH]
values (see the “Methods” section). L: Lahtvee et al.^[82]34, D: Di
Bartolomeo et al.^[83]37, Y20: Yu et al.^[84]35, Y21: Yu et al.^[85]36.
Middle line and boxes in the box charts in panels a–c indicate the
median and 25th and 75th percentiles, respectively. Outlier values
(circles outside the whisker range) are more than 1.5× the
interquartile range away from the top or bottom of the box, and
whiskers connect the lower or upper quartiles with the non-outlier
minimum or maximum. Source data are provided as a Source Data file.
We also performed a sensitivity analysis by investigating a smaller
value, of
[MATH:
10−10
, :MATH]
for the weighting factor λ used in the PRESTO objective. We found that
when the weighting factor is
[MATH:
10−10
:MATH]
(at which the total corrections of the initial
[MATH: kcat :MATH]
values plateaus), the relative errors from PRESTO cross-validation can
be further decreased to 0.69 considering the constraint on the total
protein content, with no effects on the other findings (Supplementary
Fig. [86]2a). We also note that the relative error lies in the range
from 0.35 to 0.80 over the considered weighting factors in the range
from
[MATH:
10−14
:MATH]
to
[MATH:
10−1
:MATH]
. Together, these results demonstrated that
[MATH: kcat :MATH]
values corrected according to PRESTO provide better model performance
than the values obtained by the contending heuristic in the case of S.
cerevisiae in the scenarios where all available data are integrated
into the model constraints.
PRESTO provides precise corrections of turnover numbers
In the following, we investigated the precision of the corrected
[MATH: kcat :MATH]
values from the application of PRESTO to data and a pcGEM model of S.
cerevisiae. To this end, we determined the range that the correction of
the
[MATH: kcat :MATH]
value of each enzyme can take while fixing the relative error in
specific growth rate and total corrections from the optimum of PRESTO
(see the “Methods” section). Moreover, we complemented this analysis by
sampling corrected
[MATH: kcat :MATH]
values that achieve the optimum of PRESTO with two values of the
weighting factor λ of
[MATH:
10−7
:MATH]
and
[MATH:
10−10
:MATH]
.
In the case of the corrected
[MATH: kcat :MATH]
values for S. cerevisiae with a weighting factor of 10^−7, we found
that the
[MATH: kcat :MATH]
values with the largest corrections are more precisely determined
(Supplementary Fig. [87]5). In addition, the sampled corrections per
enzyme show an average Euclidean distance to the respective mean of
4.88 s^−1, indicating that the corrected values are more precise than
the values in BRENDA, exhibiting an average Euclidean distance of
27.54 s^−1 to the mean per EC number (Supplementary Fig. [88]6).
Importantly, while
[MATH: kcat :MATH]
values with smaller correction showed larger variability, the 25 and 75
percentiles of the sampled corrections for 42 enzymes are concentrated
around those resulting from PRESTO. Repeating the analysis with a
weighting factor of 10^−10 showed that the larger total corrections of
the initial
[MATH: kcat :MATH]
values resulted in also larger variability for the corrections over all
[MATH: kcat :MATH]
(Supplementary Fig. [89]7). Here, too, for 62 enzymes the 25 and 75
percentiles of the sampled corrections are concentrated around those
resulting from PRESTO. Therefore, we concluded that the corrections
from PRESTO are precise and can be used in downstream analyses.
Pathways enrichment for corrected turnover numbers
In pcGEMs generated by the GECKO toolbox^[90]7, turnover numbers are
assigned to each of the enzymes in the GEM using a fuzzy matching
algorithm. It takes into account the organism, substrate, and EC number
of a BRENDA entry. When we investigated the magnitude of the turnover
number correction dependent on the quality of the match between BRENDA
entry and the corresponding enzyme, we found that
[MATH: kcat :MATH]
values measured in S. cerevisiae were associated with smaller
corrections than those from other organisms (Supplementary
Fig. [91]8a).
To check which metabolic processes are more likely to require
correction of in vitro
[MATH: kcat :MATH]
values, we next conducted an enrichment analysis based on the KEGG
pathway terms linked to corrected
[MATH: kcat :MATH]
values (see the “Methods” section). The most prominent pathway in this
analysis was the synthesis of secondary metabolites, particularly the
synthesis of cofactors and terpenoids (Fig. [92]3a). However, several
terms linked to central carbon metabolisms, such as the tricarboxylic
acid cycle and oxidative phosphorylation, were also significantly
enriched. Interestingly, amino acid synthesis was the only term linked
to nitrogen metabolism that came up in this analysis, although many
pathways of nitrogen metabolism were among the tested terms. This
analysis suggested that particularly in vitro turnover numbers in
carbon metabolism need to be corrected, due to the underestimation of
in vitro assays.
Fig. 3. Comparison of enzymes with corrected k[cat] values by both GECKO and
PRESTO.
[93]Fig. 3
[94]Open in a new tab
a KEGG Pathway terms significantly enriched in the set of enzymes
corrected by PRESTO
[MATH:
(λ=10−7)
:MATH]
in the S. cerevisiae pcGEM. The x-axis gives the number of corrected
enzymes linked to the given term. The one-sided p-values were
calculated using the hypergeometric density distribution and corrected
for multiple hypothesis testing using the Benjamini–Hochberg
procedure^[95]45. b Venn diagram showing the overlap of enzymes whose
[MATH: kcat :MATH]
values were manually corrected^[96]7 (“Manual”), automatically
corrected by the GECKO heuristic in any of the conditions (“GECKO”), or
corrected by PRESTO (“PRESTO”). c Log-transformed
[MATH: kcat :MATH]
values corrected using both the GECKO heuristic and PRESTO are not
associated (Spearman correlation coefficient of 0.166, p-value = 0.45).
Source data are provided as a Source Data file.
Comparison of turnover number corrections from GECKO
Next, we aimed to identify the extent to which the corrected
[MATH: kcat :MATH]
values differ between PRESTO and the GECKO approach. To this end, we
determined the intersection of enzymes with
[MATH: kcat :MATH]
values corrected manually^[97]7, by PRESTO, and by the GECKO heuristic.
For this comparison, we considered the union of all condition-specific
corrected
[MATH: kcat :MATH]
values from the GECKO approach. With the weighing factor
[MATH:
λ=10−7
:MATH]
, PRESTO adapted the
[MATH: kcat :MATH]
values of 48% of enzymes corrected by the GECKO heuristic (Fig. [98]3b,
Supplementary Data [99]1). We did not find a significant Spearman
correlation (
[MATH: ρS=0.17 :MATH]
,
[MATH: P=0.45 :MATH]
) between the log-transformed
[MATH: kcat :MATH]
values in this intersection (Fig. [100]3c), owing to the different
principles employed in the two procedures. To determine the pathways
that comprise enzymes whose turnover number are corrected by GECKO and
PRESTO, we next repeated the pathway enrichment analysis for the
enzymes in the overlap. Among the significant terms, like in PRESTO, we
again found 2-Oxocarboxylic acid, amino acid, and secondary metabolism
to be enriched (Fig. [101]3a, S9). However, the more specific pathway
terms were associated with pathways that are part of carbohydrate
metabolism and aromatic amino acid metabolism corrected by both
approaches (Supplementary Fig. [102]9, Supplementary Data [103]2). In
addition, the intersection between enzymes with manually corrected
values and those corrected by the GECKO heuristic is higher than with
PRESTO. This is expected since the manual curation is partly aimed at
correcting the most constraining turnover numbers^[104]7.
We also compared the
[MATH: kcat :MATH]
values adjusted by GECKO against estimates of in vivo
[MATH: kcat :MATH]
values obtained by parsimonious FBA (pFBA) using the same proteomics
data^[105]15 (Supplementary Fig. [106]10a, b). We confirmed the low
correspondence (
[MATH: ρS=0.23 :MATH]
) between the
[MATH: kcat :MATH]
values obtained from BRENDA, included in the GECKO model without manual
modifications, and the in vivo
[MATH: kcat :MATH]
estimates. As expected, the correspondence of the estimated in vivo
[MATH: kcat :MATH]
values to the turnover numbers corrected based on PRESTO was higher (
[MATH: ρS=0.34 :MATH]
). To investigate how these estimates perform as model parameters, we
also generated a pcGEM in which BRENDA values were substituted by in
vivo
[MATH: kcat :MATH]
values from pFBA^[107]15, whenever available. In scenarios without
enzyme abundance values, this model performed worse than that including
the
[MATH: kcat :MATH]
values corrected by PRESTO as well as the model combining the maximum
of all condition-specific GECKO corrections (Supplementary
Fig. [108]11a, b). In the enzyme abundance-constrained scenario, the
model with in vivo turnover numbers estimated by pFBA performed
slightly better than GECKO but still only achieved a minimum relative
error of 0.93, which is larger than 0.69 resulting from PRESTO
(Supplementary Fig. [109]11c). These results demonstrated the value of
PRESTO in combining the genome-scale coverage of BRENDA with in vivo
proteomics chemostat measurements to obtain less biased estimates of
[MATH: kcat :MATH]
values.
PRESTO with protein-constrained model of E. coli metabolism
To demonstrate the applicability of PRESTO across species, we deployed
it with a pcGEM of E. coli (eciML1515)^[110]22,[111]26. To this end, we
used a large dataset comprising 31 different growth
conditions^[112]12,[113]27–[114]29. Due to the lack of data on nutrient
exchange rates, the same GAM value (i.e., 75.55
[MATH: mmolgDWh :MATH]
) was used across all conditions. Similarly, we used the same value for
total protein content since condition-specific measurements were not
available (see the “Methods” section).
By applying three-fold cross-validation, we found the optimal value for
the λ parameter to be
[MATH:
10−5
:MATH]
(Supplementary Fig. [115]12a). This value was associated with an
average relative error of 1.95 (average over all λ: 3.32) and 73
corrected turnover numbers, while on average 156
[MATH: kcat :MATH]
values were corrected across all explored values for λ. On average, the
Jaccard distance between cross-validation folds was 0.13 (Supplementary
Fig. [116]12b), while the average Jaccard distance between unique sets
of enzymes with corrected turnover numbers for each λ parameter was
three-fold larger (0.4, Supplementary Fig. [117]12c). Thus, the
corrected
[MATH: kcat :MATH]
values among cross-validation folds for each λ are more similar
(maximum Jaccard distance of 0.29). Moreover, the union of the set of
enzymes with corrected
[MATH: kcat :MATH]
values can remain similar over a range of chosen λ parameters up to
four orders of magnitude (Supplementary Fig. [118]12c), demonstrating
the robustness of the method.
The performance of PRESTO was assessed and compared to GECKO using
scenarios (i) and (iii) since no condition-specific uptake rates were
available. With default uptake rates, the relative error for predicted
growth ranged between 0.01 and 8.56 in the less constrained scenario
(i) (Fig. [119]4a). Further, we obtained relative errors between 0.01
and 0.88 for the more constrained scenario (iii), when using the
[MATH: kcat :MATH]
values corrected by PRESTO (Fig. [120]4b). In contrast, when using the
[MATH: kcat :MATH]
values from the GECKO approach, the relative error was in the range
between 0.01 and the 4.89 for scenario (i) and between 0.89 and 0.99
for scenario (iii). In this scenario, too, we observed that the
relative error using
[MATH: kcat :MATH]
values corrected by GECKO was consistently larger than the relative
error resulting from the single set of corrected
[MATH: kcat :MATH]
values obtained by PRESTO (Fig. [121]4b).
Fig. 4. Comparison of predicted growth of E. coli from pcGEMs with k[cat]
corrections from GECKO and PRESTO.
[122]Fig. 4
[123]Open in a new tab
Condition-specific pcGEMs with corrected
[MATH: kcat :MATH]
values generated by the GECKO heuristic were used to predict the
specific growth rate for each condition (a: n = 31, b: n = 27). The
boxplots indicate the distribution of the relative error resulting from
each set of condition-specific corrected
[MATH: kcat :MATH]
values obtained from the GECKO heuristic. Relative prediction error
from each set is indicated by a circle. The red diamonds show the
relative errors of predicted specific growth rates from the PRESTO
model
[MATH:
(λ=10−5)
:MATH]
by using the single set of corrected
[MATH: kcat :MATH]
values in the respective pcGEM. a Only the measured total protein pool
was used to constrain the solution and condition-specific uptake rates
were bounded by 1000
[MATH: mmolhgDW :MATH]
; b abundances of enzymes measured in all conditions were used as
additional constraints. Missing data points originate from the
infeasibility of the respective models. The compared pcGEMs in each
condition used the same respective biomass coefficients, GAM
[MATH: σ :MATH]
, and
[MATH: Ptot :MATH]
values (see the “Methods” section). P: Peebo et al.^[124]28, V:
Valgepea et al.^[125]27, S: Schmidt et al.^[126]29. Middle line and
boxes in the box charts in panels a and b indicate the median and 25th
and 75th percentiles, respectively. Outlier values (circles outside the
whisker range) are more than 1.5× the interquartile range away from the
top or bottom of the box, and whiskers connect the lower or upper
quartiles with the non-outlier minimum or maximum. Source data are
provided as a Source Data file.
Since we observed high relative errors in the less-constrained scenario
(Fig. [127]4a), we added a second step to PRESTO, which introduces
negative corrections that lead to the same relative errors with
consideration of proteomics data. This is an optional step that a user
can choose to perform in addition to the positive corrections (i.e.
relaxation of turnover numbers), introduced in the first step
(Supplementary Method [128]1). As a result, we found 170 negative
corrections that reduced the relative error in scenario (i)
(Supplementary Fig. [129]13).
We do not perform a simultaneous search for positive and negative
corrections because negative corrections can only reduce the relative
error when the current
[MATH: kcat :MATH]
values lead to an overprediction of growth, which is not the case when
considering proteomics data. Therefore, no negative corrections are
found if the absolute value of introduced positive and negative
corrections are to be considered in a single step.
Importantly, the aim of PRESTO is to correct turnover numbers, which
represent upper limits on the catalytic efficiency of enzymes.
Therefore, we can assume that in vitro turnover numbers that lead to
underprediction of specific growth rates when paired with protein
abundance data are too low. However, an overprediction of specific
growth rates in the same scenario can be caused by thermodynamic,
temperature effects, or in-vivo-specific effects. Thus, a reduction of
in vitro turnover numbers results in average apparent catalytic rates
for the considered conditions, rather than corrected turnover numbers.
Considering the models with positive
[MATH: kcat :MATH]
corrections, the sum of corrections reached a plateau at
[MATH:
10−11
:MATH]
for the weighting factor
[MATH: λ :MATH]
in the PRESTO objective. We found that the relative cross-validation
error at this value was 5.26, which is 2.7-fold larger than the
relative error obtained using the optimal λ. Hence, allowing for more
and larger corrections in PRESTO leads to a decrease in the overall
relative error within the PRESTO program at the cost of highly biased
parameters. The predictions with the highly biased parameters are worse
in the test conditions and result in a larger specific growth rate when
no enzyme abundance constraints are enforced. This observation is in
line with the small number of corrections introduced by the GECKO
approach, where only the pool constraint is considered. We conclude
that the prediction performance of the eciML1515 model was improved by
using turnover numbers corrected by PRESTO only when enzyme abundances
are integrated.
To assess the precision of the introduced
[MATH: kcat :MATH]
corrections, we performed variability analysis and sampling (see the
“Methods” section) of the introduced corrections to the initial
[MATH: kcat :MATH]
values for two values of the weighting factor
[MATH: λ :MATH]
, namely
[MATH:
10−5
:MATH]
and
[MATH:
10−11
:MATH]
. We observed that the 25th and 75th percentiles enclose a narrow
interval around the values resulting from PRESTO (Supplementary
Fig. [130]14) and are thus not evenly distributed across the respective
interval determined by the variability analysis. We further noted that
here, the predictions of smaller δ are generally more precise than the
large corrections (
[MATH:
δ≥p50 :MATH]
), which span ~2 orders of magnitude (small δ (
λ=10−1
1 :MATH]
; moreover, this value guarantees more precise estimates (Supplementary
Fig. [132]17). In conclusion, the application of PRESTO is not limited
to a single species but presents a versatile tool for the correction of
turnover numbers across species.
In contrast to the observations made in S. cerevisiae we found that a
model parameterized with in vivo turnover numbers estimated by
pFBA^[133]12 outperformed both PRESTO and GECKO in the modeling
scenario where no enzyme abundance constraints are taken into account
(Supplementary Fig. [134]11d). This is due to the fact, that pFBA, in
contrast to PRESTO and GECKO, can generate estimates lower than the in
vitro
[MATH: kcat :MATH]
values, in turn leading to more accurate predictions. However, in the
scenario with enzyme abundance constraints, PRESTO predicts specific
growth rates closer to the experimental observation in 87% of the
conditions (Supplementary Fig. [135]11e). Thus, in this scenario the
integration of information from different modeling conditions achieved
in PRESTO serves to obtain
[MATH: kcat :MATH]
value that performs better than the pFBA approach applied by^[136]12.
Finally, we compare the resulting flux distributions and predicted
protein abundances by models that are parameterized with
[MATH: kcat :MATH]
values that were corrected using either GECKO or PRESTO. Overall, the
feasible ranges (
[MATH:
vmax−
vmin :MATH]
) for both approaches resulted in Pearson correlation coefficients of
0.985 across all conditions (Supplementary Method [137]2). The
difference between both flux distributions is manifested in a smaller
interquartile range in feasible ranges with PRESTO (Supplementary
Fig. [138]15). More specifically, there are fewer reactions with highly
constrained flux after introducing
[MATH: kcat :MATH]
corrections with PRESTO compared to GECKO. Moreover, we used the models
that were corrected using GECKO and PRESTO to predict protein
abundances (Supplementary Method [139]3), which were then compared to
the measured proteomics data using Spearman correlation. Since PRESTO
only considers abundances of proteins that were measured across all
considered conditions, we computed the correlations for (1) the set of
proteins that are measured across all conditions and (2) all protein
abundances per condition. In the first scenario, the correlation was
higher after correction with PRESTO in 70% of conditions, compared to
the median correlation with GECKO and outperformed all GECKO pcGEMs in
30% of conditions (Supplementary Fig. [140]16a). When all measured
proteins were considered, PRESTO only performed better in 54% of
conditions (Supplementary Fig. [141]16b). Similar to the predicted
specific growth rate we also observed better performance for PRESTO
models that were subjected to an additional
[MATH: kcat :MATH]
down correction step (Supplementary Method [142]1). However, we note
that the reduced
[MATH: kcat :MATH]
cannot strictly be considered condition-independent
[MATH: kcat :MATH]
values because there may exist physiological states where these enzymes
may achieve the efficiency given by the original
[MATH: kcat :MATH]
value (see the “Discussion” section). Since PRESTO considers protein
abundances for the correction, which is not the case for GECKO, we
expected to find the increased prediction performance with PRESTO
compared to GECKO; however, we still observe low overall predictability
of protein abundances using the resulting models. Recently, a more
sophisticated protein abundance prediction approach using pcGEMs was
introduced that can be used to predict more reliable values and might
further be improved by considering corrections introduced by
PRESTO^[143]30.
Interestingly, in contrast to S. cerevisiae, we did not observe larger
corrections by PRESTO for organism-unspecific in vitro
[MATH: kcat :MATH]
values (Supplementary Fig. [144]8b). For the condition-specific sets of
[MATH: kcat :MATH]
corrections introduced by GECKO we observe that all smaller sets were
proper subsets of larger sets (Supplementary Fig. [145]4b) The low
number of corrections introduced by GECKO leads to an overlap of only
three (75%) enzymes whose
[MATH: kcat :MATH]
values were also corrected by PRESTO (Supplementary Data [146]3,
Supplementary Fig. [147]18a). These three enzymes catalyze reactions in
three distinct metabolic pathways: Phosphoribosylformylglycinamidine
synthase acts in the synthesis of purines, while serine
acetyltransferase and NADP dependent Ketol-acid reductoisomerase are
involved in the synthesis of sulfur amino acids and hydrophic amino
acids, respectively (Supplementary Data [148]4). The pathway enrichment
analysis for all PRESTO corrections at
[MATH:
λ=10−5
:MATH]
, indeed also identified amino acid and secondary metabolite synthesis
as significantly enriched terms among the enzymes with corrected
turnover numbers (Supplementary Fig. [149]18b). These results argue for
a systematic underestimation of in vivo turnover numbers in the
pathways in in vitro experiments, irrespective of the investigated
organism. However, the lower-order KEGG pathway terms enriched in E.
coli do not overlap with the ones found in S. cerevisiae. Here, fatty
acid metabolism and the synthesis of hydrophobic amino acids are among
the pathways requiring the correction of turnover numbers.
Robustness of turnover number corrections
All of the approaches for estimation of in vivo turnover numbers rely
on predicted (or estimated) fluxes and protein abundances from multiple
conditions^[150]12,[151]14,[152]15, but have not investigated the
robustness of the estimates to the number of conditions used.
Therefore, next, we investigate the difference in the sets of enzymes
with corrected turnover numbers and the concordance of their
corrections when ten randomly sampled subcollections of M experimental
conditions (M = 3, 5, 10, 15) were used instead of all experiments. The
differences and concordance were quantified with respect to the
estimates obtained by considering data from all available experiments
using the Jaccard index and the Pearson correlation coefficient,
respectively. In the case of S. cerevisiae, we found that the smallest
Jaccard difference over 200 scenarios was 0.36, while for E. coli this
was 0.41 (Supplementary Figs. [153]19 and [154]20). In addition, the
Pearson correlation coefficient between the (log-transformed) corrected
turnover numbers with consideration of all versus a subcollection of M
experiments in S. cerevisiae ranged from 0.99 to 1.00 (for M = 15) to
0.11 and 1.00 (for M = 3) (Supplementary Fig. [155]19). Repeating the
analysis in the case of E. coli, we found that that the Pearson
correlation coefficient ranged from 0.15 to 1.00 (for M = 15) to 0.14
and 1.00 (for M = 3) (Supplementary Fig. [156]20). This is in line with
the expectation that the corrections stabilize with an increasing
number of experiments. Altogether, these findings pointed out the
robustness of turnover number corrections derived from PRESTO with the
number of available experiments.
Discussion
Characterization of enzyme parameters that can inform models of
reaction rates is key to expanding and further propelling the usage of
metabolic models in diverse biotechnological applications. While the
generation of pcGEMs has facilitated the integration of more
biophysically relevant constraints, it necessitates access to estimates
of turnover numbers as key enzyme parameters. We assessed the bias in
the available in vitro and in vivo estimates of turnover numbers as the
discrepancy between measured and predicted growth in the ultimate
validation scenario when they are combined with constraints from
protein abundances. We use the modeling scenario that considers
measured protein abundances as the ultimate validation scenario not
only for the prediction of metabolic fluxes but also for the prediction
of specific growth rates as it contains considerably more biochemically
relevant constraints. Indeed for this scenario, we showed that
condition-specific growth rates cannot be reliably predicted with
pcGEMs of S. cerevisiae and E. coli when available in vitro
(Figs. [157]2c and [158]4b) and in vivo estimates (Supplementary
Fig. [159]11c, e) of turnover numbers are used. GECKO resolves this
issue by flexibilities measured protein abundances without considering
physiological information during the procedure. To overcome this
limitation, we developed PRESTO, which corrects turnover numbers and
facilitates the integration of enzyme abundance constraints.
In contrast to PRESTO, GECKO uses measured total protein content from a
single condition to achieve specific growth rates in the process of
correcting the turnover numbers. As a result, the corrected turnover
numbers vary between different experiments. Like all existing
approaches for the estimation of in vivo turnover numbers based on
GEMs, we integrated protein abundance data directly to correct turnover
numbers. Following this strategy in PRESTO is further justified by the
observation that the turnover numbers included in pcGEMs are often
neither from the same enzyme (i.e., EC number), substrate, nor
organism. While in vivo turnover number estimates can be adjusted by
considering recently proposed Bayesian statistical learning^[160]18,
this approach has not considered protein abundance information from
proteomics measurements.
We employed PRESTO with the largest data set of these measurements
available to date for S. cerevisiae^[161]15 and E. coli^[162]12.
Through a series of comparative analyses, we demonstrated that the
corrections of turnover numbers from PRESTO ultimately increase the
prediction accuracy of condition-specific growth for the two organisms
when enzyme abundance data are integrated into the corresponding
pcGEMs.
Since PRESTO generates a condition-independent
[MATH: kcat :MATH]
set it is bound to correct parameters that lead to the underprediction
of biological fluxes. The same reasoning is applied when obtaining in
vivo
[MATH: kcat :MATH]
estimates from pFBA by taking the maximum of apparent catalytic rates,
and overall conditions^[163]12,[164]15. Nevertheless, we included an
optional step, that additionally allows for the reduction of in vitro
[MATH: kcat :MATH]
values, which can be considered average apparent catalytic rate
estimates and result in better performance in the E. coli experiments
(Supplementary Figs. [165]13 and [166]16). Moreover, we showed that in
vivo turnover number proxies, obtained by the ranking of
condition-specific estimates that use proteomics and fluxomics data,
are more highly (but modestly) correlated to estimates from PRESTO than
to in vitro turnover numbers. Owing to the constraint-based formulation
of PRESTO, we also determined the precision of the correction of
turnover numbers. Previous studies have shown that even for the
well-studied model organism Saccharomyces cerevisiae, only 52% of
enzyme turnover numbers in the pcGEM can be obtained from
organism-specific in vitro measurements^[167]22. Using organism
unspecific
[MATH: kcat :MATH]
values for parameterization and correction of pcGEMs, as done in the
GECKO pipeline, assumes that enzyme kinetic properties are comparable
within one EC number class^[168]31,[169]32. However, we did not
identify clear differences between EC classes, down to the second
digit, when considering the distribution of
[MATH: kcat :MATH]
similarities within EC classes (Supplementary Fig. [170]21). Indeed, it
has been reported that EC class plays only a minor role in the
prediction of turnover numbers^[171]19 and show stronger similarity
with concordant GO categories^[172]33. Interestingly, our findings show
that the turnover number corrections obtained from PRESTO are more
precise than EC class-based corrections. (Supplementary Fig. [173]6).
Together, these findings demonstrated PRESTO can be readily used to
decrease the bias of turnover numbers. This paves the way for employing
the outcome of PRESTO and future extensions toward effectively
predicting the kcatome from available protein sequences.
Methods
Experimental data
For S. cerevisiae, we made use of a dataset gathered by^[174]15 from
four different studies^[175]34–[176]37, which included protein
abundance data (
[MATH: mmolgDW−1
:MATH]
) as well as measured growth or dilution rates (
[MATH: h−1 :MATH]
) and nutrient exchange fluxes (
[MATH: mmolgDW−1
h−1 :MATH]
). Exchange fluxes missing in certain conditions were set to
[MATH: 1000mmolgDW−1
h−1 :MATH]
if the nutrient was present in the used culture media. We further
extended this data set by total protein content measurements (
[MATH: ggDW−1
:MATH]
) from the original studies. For subsequent analyses, we used the
maximum abundance of each protein over all replicates per experimental
condition. Similarly, we used the average value for specific growth
rates and nutrient exchange rates. Since no measurement of total
protein content was available for the two conditions evaluated in the
Di Bartolomeo study^[177]37, we used the maximum protein content
measured across the remaining conditions for these conditions (i.e.,
[MATH: 0.67g/gDW :MATH]
). Moreover, we excluded three temperature stress conditions (i.e.,
Lahtvee2017_Temp33, Lahtvee2017_Temp36, Lahtvee2017_Temp38) from the
analysis since the temperature can have a large effect on the catalytic
activity of an enzyme. Gene names in the proteomics dataset were
translated to UniProt identifiers using the batch retrieval service of
the UniProt REST API^[178]38.
For E. coli, we used a dataset comprising 31 experimental conditions,
which was gathered by Davidi and colleagues and augmented by Xu et
al.^[179]12,[180]14 from three publications^[181]27–[182]29. Here, too,
we used the maximum protein abundance over all replicates (in
[MATH: mmolgDW−1
:MATH]
). Due to the absence of total protein content measurements in two of
the original studies, we relied on the maximum protein content measured
in the Valgepea study (i.e.,
[MATH: 0.61g/gDW :MATH]
) to be used for all conditions. Since precise data on nutrient uptake
rates were only given for a few conditions, we assigned a default upper
bound of
[MATH: 1000mmolgDW−1
h−1 :MATH]
to all nutrients contained in the minimal medium (Supplementary
Table [183]2) with additional carbon sources as specified. Gene
identifiers were translated to UniProt similar as for S. cerevisiae.
Model preparation
The proposed approach aims at parsimonious correction of turnover
values in genome-scale enzyme-constraint metabolic models using
measured protein abundances. Therefore, it is important to consider the
differential association between enzymes and reactions, i.e., isozymes,
enzyme complexes, and promiscuous enzymes. We decided to use the GECKO
formalism^[184]7, which deals with these problems elegantly by directly
encoding the required information in the stoichiometric matrix. The
genome-scale metabolic models for S. cerevisiae (YeastGEM v.8.5.0) and
E. coli (iML1515) were obtained from the yeast-GEM and ecModels GitHub
repository, respectively^[185]22,[186]39
[[187]https://github.com/SysBioChalmers; accessed on 22.08.2021]. For
subsequent steps, functions of the COBRA v3.0 toolbox^[188]40 and
GECKO2.0 toolbox^[189]22 were employed, of which several functions were
adapted for our purposes.
To arrive at raw protein-constrained models for both organisms, the
GECKO2.0 model enhancement pipeline was adapted to allow the
[MATH: kcat :MATH]
correction procedure to be omitted. Moreover, any manual corrections of
turnover numbers were excluded from model generation. In the process of
adapting the raw pcGEM to the respective experimental conditions for
both organisms, the GAM value per condition was fitted using the
scaleBioMass function of GECKO2.0, based solely on the
condition-specific nutrient exchange rates, and returning the minimum (
[MATH: 9mmolgDWh :MATH]
) or maximum (161
[MATH: mmolgDWh :MATH]
) interval boundary if reached (only S. cerevisiae). Furthermore, we
omitted enzyme abundances, which were not measured across all
experiments as the approach proposed here is only applicable for
enzymes in the set with measured abundances (
[MATH: M :MATH]
).
PRESTO approach
In the design of PRESTO, we modified the enzyme mass-balance
constraints of the augmented stoichiometric matrix, created by GECKO,
from
[MATH:
−1k
catij
msubsup>vj+ei=0 :MATH]
1
to inequality constraints that use the measured protein abundance
directly. The variable
[MATH: e :MATH]
denotes the predicted protein abundance in the pcGEM, while
[MATH: E :MATH]
represents the vector of measured enzyme abundances. Further, we assume
a single turnover number per enzyme
[MATH: i :MATH]
over all catalyzed reactions (
[MATH: kcatimin=<
mi>argminjkcatij
msubsup> :MATH]
):
[MATH:
∀r∈R∑i∈GPRrvr≤kcatimin⋅ei. :MATH]
2
GPR stands for gene–protein-reaction rule that associates reactions (
[MATH: R :MATH]
) with underlying genes and proteins. The variable
[MATH: E :MATH]
denotes the measured protein abundance in
[MATH: mmolgDW−1
:MATH]
. We justify making the assumption for Eq. ([190]2) based on our
observation that most enzymes in the S. cerevisiae model are associated
with no more than four reactions (Supplementary Fig. [191]22a, c).
Further, the vast majority of enzymes are assigned a single unique
turnover number even though they catalyze multiple reactions
(Supplementary Fig. [192]22b, d).
We then introduced a correction factor δ, which is added to each
[MATH: kcat :MATH]
if the protein abundances for the underlying enzyme were available:
[MATH:
∀r∈R∑i∈GPRrvr≤kcatimin+δiEi. :MATH]
3
To find a biologically relevant minimal set of adaptations with respect
to the sum of δ, we minimized the weighted sum of the average absolute
relative errors, ω, between measured (
[MATH: μexp
:MATH]
) and predicted specific growth rates (
[MATH: vbio :MATH]
) overall experimental conditions
[MATH: C :MATH]
, and the average δ:
[MATH:
minv,δ,ω1<
mfenced close="∣"
open="∣">C
∑j∈Cωj+λM
∑i∈Mδi.
:MATH]
4
Finally, the linear programming formulation of the
[MATH: kcat :MATH]
correction in PRESTO is the following:
[MATH:
minv,δ,ω1<
mfenced close="∣"
open="∣">C
∑j∈Cωj+λM
∑i∈Mδi :MATH]
subject to
[MATH: Nvj=0,∀j∈C :MATH]
5
[MATH: ∑i∈GPRrvrj≤kcat,imin
mi>+δi<
/mi>Ei<
mrow>j,∀r∈R,
mo>i∈M,∀j∈C :MATH]
[MATH:
vminj≤vj≤vmaxj;∀
j∈C :MATH]
6
[MATH:
δi≤ε−1⋅kcat,imin
mi>,∀i∈M :MATH]
7
[MATH: kcat,imin
mi>+δi<
/mi>≤Kmax<
/mi>,∀i∈M :MATH]
8
[MATH: vbioj⋅ωj≥μexpj−<
mrow>vbioj,∀j∈C :MATH]
9
[MATH: vbioj⋅ω
j≥vbioj−<
mrow>μexpj,∀j∈C :MATH]
10
[MATH: ω≤θ,δ≥0.
mtable> :MATH]
The value for ω was bound from above by a value θ, which was set to
0.6. Constraints that enforce metabolic steady state are captured in
Eqs. ([193]5) and ([194]6) represent the lower and upper bounds in the
flux through each reaction in each condition, respectively. The
constraints in Eqs. ([195]7) and ([196]8) impose an upper bound on δ,
which is the minimum of the allowed fold change in
[MATH: kcat :MATH]
values, ε, and a cut-off value
[MATH: Kmax
:MATH]
, which denotes the maximum allowed
[MATH: kcat :MATH]
value. The value for ε was set to
[MATH: 105
:MATH]
since lower values did not yield solutions and
[MATH: Kmax
:MATH]
was set to 57,500,000 s^−1 (5.3.1.1, Pyrococcus furiosus^[197]41).
Equations ([198]9) and ([199]10) ensure that ω is equal to
[MATH: μj<
mrow>exp−vbio,j<
/mfenced>vbio,j<
/mfrac> :MATH]
.
The parameter λ controls the trade-off between both minimization
objectives (see Eq. ([200]4)). As λ is unknown and may also be
condition- and model-specific, it was fitted using a 3-fold
cross-validation scheme, which was repeated for 10 iterations. To this
end, we scanned a log-scale interval between
[MATH:
10−14
:MATH]
and
[MATH:
10−1
:MATH]
. In each iteration, we performed
[MATH: kcat :MATH]
corrections on two folds of condition-specific models and validated the
obtained corrections on the remaining fold of condition-specific
models. The validation was done by predicting growth only with a
constraint on total protein content, without constraints from measured
protein abundances. This was done to counteract overprediction in the
scenario without constraints from proteomics data. The relative errors
(ω) and the sum of δ (i.e., Δ) were then used to calculate the scores
[MATH: sλ
:MATH]
, which helped us choose the optimal value for λ:
[MATH:
sλ=110∑τ=110
ωλ,τ−ωλ,<
mi>τminωλ,τmax−ωλ,τmin
⋅log10Δλ,τ<
/mrow>Δλ,τ<
/mrow>minlog10Δλ,τ<
/mrow>maxΔλ,τ<
/mrow>min
. :MATH]
11
The score can be described as the average product of min-max-scaled ω
and Δ across the 10 cross-validation iterations per explored λ. The
optimal value was then determined by finding the first sign change in
the second numerical gradient over
[MATH: sλ
:MATH]
, starting from the maximum value for λ. In addition to the optimal λ,
we also compared our results to a second λ, where the sum of δ reached
a plateau (
[MATH:
λ=10−1
0 :MATH]
for S. cerevisiae and
[MATH:
λ=10−1
1 :MATH]
for E. coli, Supplementary Fig. [201]2a and [202]12a). The presented
approach and analysis scripts were implemented using MATLAB^[203]42.
Variability analysis for δ
While PRESTO considers multiple experimental conditions to find a set
of universal corrections for
[MATH: kcat :MATH]
values, it does not provide an exhaustive view over all possible
solutions to this problem. To assess the precision of the corrections,
we first performed a variability analysis for δ to find the minimum and
maximum possible values. To guarantee that a solution of equal quality
is found with respect to the previously determined sum of δ and the
relative errors to experimentally measured specific growth rates (i.e.,
[MATH: ωopt :MATH]
), corresponding constraints were added to arrive at the following
linear programming problem:
[MATH:
minv,δ,ω/maxv,δ,ωδi :MATH]
s.t.
[MATH: Nvj=0,∀j∈C :MATH]
[MATH: ∑i∈GPRrvrj≤kcat,imin
mi>+δi<
/mi>Ei<
mrow>j,∀r∈R,
mo>i∈M,∀j∈C :MATH]
[MATH:
vminj≤vj≤vmaxj,∀j∈C :MATH]
[MATH:
δi≤ε−1⋅kcat,imin
mi>,∀i∈M :MATH]
[MATH: kcat,imin
mi>+δi<
/mi>≤Kmax<
/mi>,∀i∈M :MATH]
[MATH: vbioj⋅ωj≥μexpj−<
mrow>vbioj,∀j∈C :MATH]
[MATH: vbioj⋅ω
j≥vbioj−<
mrow>μexpj,∀j∈C :MATH]
[MATH:
0.99⋅ωj
opt≤ωj≤1.01
⋅ωopt,∀j∈C :MATH]
12
[MATH: Δopt−10
−3≤<
mrow>∑i∈Mδi≤Δopt+10
−3 :MATH]
13
[MATH: ω≤θ,δ≥0.
mtable> :MATH]
The minimal relative error determined for each condition
[MATH: j :MATH]
was fixed within a narrow tolerance (±1%, Eq. ([204]12)) and the
minimum sum of corrections
[MATH: Δ :MATH]
was fixed with a tolerance of
[MATH:
±10−3h−1 :MATH]
(Eq. ([205]13)).
As the distribution within the obtained min/max intervals can be
skewed, we sampled 10,000 points within the obtained intervals. For
uniform random sampling, we created random vectors of corrections
[MATH: δ*
:MATH]
within the determined intervals and projected them onto the solution
space by minimizing the distance of δ to the respective random vector.
Therefore, we updated the objective of the program above:
[MATH:
minv,δ,ω∑
i∈Mδi−δi*. :MATH]
14
To ensure reproducibility and compatibility with the COBRA
toolbox^[206]40, we solved all optimization problems using the
optimizeCbModel of the COBRA toolbox. Within this environment, we used
the Gurobi solver v9.1.1^[207]43 but we note that any other supported
solver can also be used. As we observed numerical instability of the
problems in some cases, we decreased the feasibility tolerance (i.e.,
feasTol parameter) for the COBRA solver to
[MATH:
10−9
:MATH]
for all predictions. The results were visualized using MATLAB^[208]42.
Validation of corrected models
We used the adapted GECKO pipeline (fitting a condition-specific GAM;
excluding manual
[MATH: kcat :MATH]
adaptions) to obtain models with
[MATH: kcat :MATH]
values adapted according to the objective control coefficient
heuristic. We note that, when no manual modifications were introduced
to the S. cerevisiae models, the
[MATH: kcat :MATH]
adaption of the GECKO pipeline would stop because no objective control
coefficient above the threshold of 0.001 could be found, and corrected
models would still be below the predicted growth error tolerance of
10%. To compare the predictive performance of PRESTO and GECKO
corrected models, the models were adapted with the same
condition-specific GAM, biomass reaction, and total protein content,
[MATH: Ptot. :MATH]
Additionally, PRESTO models were constrained using the same
condition-specific saturation rate,
[MATH: σ :MATH]
and enzyme mass fraction, f, as obtained from the GECKO pipeline. In
contrast to the GECKO formulation, we did not subtract the mass of
measured enzymes from the total protein pool constraint but instead
introduced the measured protein concentration as the upper bound on the
enzyme usage reaction,
[MATH: Ei
:MATH]
, in the respective scenario. This formulation still guarantees that
the mass of all used enzymes is lower or equal to the approximated
cellular protein pool according to
[MATH:
∑i∈rGPR(r)ei<
/mi>,j⋅<
mi
mathvariant="normal">MWi≤<
/mo>Ptot,j⋅f⋅σj,
∀j∈C :MATH]
15
where
[MATH: MW :MATH]
is the respective molecular weight of the protein. By considering
measured and unmeasured enzymes in Eq. ([209]15) we do not have to
change f and use the same factor as for the scenario where no protein
abundance measures are used^[210]7. Maximum growth was predicted in
three different constraint scenarios: (i) using only the protein pool
constraint and default uptake rates (1000 mmol/gDW/h), (ii) using the
pool constraint and experimentally measured uptake rates, (iii) using
the previous constraints plus the absolute enzyme abundance.
The two studies which generated in vivo
[MATH: kcat :MATH]
values from pFBA^[211]12,[212]15 calculated a single value per reaction
irrespective of the presence of isoenzymes. Thus, to parameterize the
raw pcGEM (containing only uncorrected BRENDA values) we substituted
the in vitro
[MATH: kcat :MATH]
values of all isoenzyme reactions with the respective estimate provided
in the study. Reactions catalyzed by complexes were not corrected.
Since PRESTO and the pFBA studies provide a single
condition-independent model, we generated a condition-independent GECKO
model by following the maximum overall conditions approach: For the
comparisons, shown in Supplementary Figs. [213]11 and [214]15, the
condition-wise GECKO models were aggregated into a single union model
in which for each reaction the maximum
[MATH: kcat :MATH]
value was used.
Pathway enrichment analysis
The KEGG pathway terms^[215]44, associated with each enzyme that was
measured in all conditions, were acquired using the KEGG REST API. The
one-sided p-value, p, for significant enrichment of a pathway term
among the enzymes with corrected
[MATH: kcat :MATH]
values was calculated using the hypergeometric density distribution:
[MATH: px=1−∑i=1x−1
KiM−KN−iMN
mrow>. :MATH]
16
Only KEGG pathway terms associated with at least two corrected enzymes
were taken into consideration. The p-values associated with all tested
pathway terms were corrected for a false discovery rate of 0.05 using
the Benjamini–Hochberg correction^[216]45.
Reporting summary
Further information on research design is available in the [217]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[218]Supplementary Information^ (2.8MB, pdf)
[219]Peer Review File^ (482.3KB, pdf)
[220]41467_2023_37151_MOESM3_ESM.pdf^ (114KB, pdf)
Description of Additional Supplementary Files
[221]Supplemental Data 1^ (14.6KB, xlsx)
[222]Supplemental Data 2^ (14.5KB, xlsx)
[223]Supplemental Data 3^ (13.5KB, xlsx)
[224]Supplemental Data 4^ (12.9KB, xlsx)
[225]Reporting Summary^ (1.4MB, pdf)
Acknowledgements