Abstract
Background
Machine learning (ML), with advances in algorithms and
computation, is seeing an increased presence in life science research.
This study investigated several ML models' efficacy in predicting
preterm birth using untargeted metabolomics from serum collected during
the third trimester of gestation.
Methods
Samples from 48 preterm and 102 term delivery mothers from the All Our
Families Cohort (Calgary, AB) were examined. Four ML algorithms,
Partial Least Squares Discriminant Analysis (PLS-DA), linear logistic
regression, artificial neural networks (ANN), and Extreme Gradient Boosting
(XGBoost), each applied with and without bootstrap resampling, were used to
examine the small-scale clinical dataset for both model performance and
metabolite interpretation.
Results
Model performance was evaluated based on confusion matrices, area under
the receiver operating characteristic (AUROC) curve analysis, and
feature importance rankings. Linear models such as PLS-DA and logistic
regression demonstrated moderate classification performance (AUROC ≈
0.60), whereas non-linear approaches, including ANN and XGBoost,
exhibited marginal improvements. Among all models, XGBoost combined
with bootstrap resampling achieved the highest performance, yielding an
AUROC of 0.85 (95 % CI: 0.57–0.99, p < 0.001), indicating a significant
improvement in classification accuracy. Metabolite importance, derived
from Shapley Additive Explanations (SHAP), consistently identified
acylcarnitines and amino acid derivatives as principal discriminative
features. Pathway analysis revealed disruptions to tyrosine metabolism
as well as phenylalanine, tyrosine and tryptophan biosynthesis to be
associated with preterm delivery.
Conclusions
Our results highlight the complexity of metabolomics-based modelling
for preterm birth and support an iterative, model-driven approach for
optimizing predictive accuracy in small-scale clinical datasets.
Keywords: Preterm birth, Pregnancy, Metabolomics, Machine learning,
Predictive models, XGBoost
Highlights
• Machine learning on a maternal serum metabolomics dataset to predict preterm birth.
• XGBoost with bootstrapping achieved the highest AUROC of 0.85.
• Acylcarnitines and amino acids were key markers via SHAP analysis.
• Tyrosine and tryptophan pathways linked to preterm delivery.
1. Introduction
Advancements in machine learning (ML) with better computation and
sophisticated algorithms have greatly impacted the landscape of data
science, and their applications into clinical domains have drawn
interest from health care professionals. Metabolites can serve as
sensitive indicators of physiological alterations linked to disease
progression, and ML models are increasingly applied to them for studies
of pathogenesis, diagnostics and monitoring. ML algorithms can play a
pivotal role in capitalizing on metabolite signatures to analyze
complex datasets, extract meaningful biological interpretations, and
build predictive models for diagnostic purposes [1].
ML integration with metabolomics, the systematic approach for studying
metabolites, offers unique insights into disease pathogenesis and
biomarker discovery by recognizing hidden patterns and correlations
from metadata analysis of biological systems [36][2]. While partial
least squares-discriminant analysis (PLS-DA) continues to be a widely
applied and effective method [3], other machine learning techniques
have gained increasing prominence in metabolomics research. Advanced ML
models, such as t-distributed stochastic neighbor embedding (t-SNE),
apply non-linear transformations that improve data fitting, enable
effective dimensionality reduction, and enhance visualization of class
separations [38][4]. These models also employ sophisticated feature
selection methods, such as recursive feature elimination and
dimensionality reduction, to identify the most informative metabolites
[39][5].
While binary classifications with classical yes-or-no decisions are the
area drawing the most interest, multiple-decision problems have long
existed, and ML has been trained to facilitate these decision processes
[6]. However, there are often several obstacles when analyzing
clinical data. Unlike other “big data”, clinical datasets are often
restricted in sample size, even though a large sample size is a critical
requirement for many complex algorithms [7]. Thus, this project aims to take these
limitations into account to evaluate some common machine learning
approaches on a clinical maternal dataset investigating preterm birth
using untargeted metabolomics. Past research has demonstrated the
utility of machine learning-based metabolomic analysis in this field.
Employing maternal vaginal fluid metabolome samples collected between
20 and 24 weeks of gestation, an ML metabolomics approach outperformed
microbiome-only and maternal covariate-only models, achieving an AUC of
0.78, a finding subsequently validated in two independent cohorts
[8]. Likewise, Al Ghadban et al. used untargeted serum metabolomics
in a case–cohort of 399 pregnant women serially sampled at 12, 20, 28,
and 36 weeks of gestation. Six supervised machine learning methods were
trained on the top 47 features; a random forest model performed best
(AUC = 0.73), followed by a generalized boosted model (AUC = 0.71), for
predicting spontaneous preterm birth [9].
Combining metabolic profiles with clinical risk factors has also
demonstrated success in predicting preterm birth. The Screening for
Pregnancy Outcomes by Early Pregnancy Evaluation (SCOPE) study applied
untargeted serum metabolomics to asymptomatic pregnant women and found
that combining metabolic profiles with clinical risk factors
significantly improved early identification of those at risk for
preterm birth with a metabolite informed clinical prediction model
achieving an AUC of 0.73 [44][10].
Overall, many of these models leveraged advanced feature selection
techniques and classification algorithms to identify discriminatory
metabolite patterns associated with preterm birth risk. Given this, it
was hypothesized that more complex, non-linear models would be
advantageous over conventional approaches when working with this type
of data because of their ability to better fit the observed variance.
This study examined the hypothesis by evaluating the performance of
both linear and non-linear ML models in predicting preterm birth,
together with a feature importance method to facilitate biological
interpretation.
2. Materials and methods
2.1. Participant characteristics and statistical analysis
Untargeted metabolomic analysis was performed using third trimester
serum samples collected from the antecubital vein in non-fasted
pregnant participants between 28 and 32 weeks gestation. This period
was prioritized for its robust metabolic signatures identified in prior
metabolomics studies and for being the final window for clinical
intervention in preterm birth risk mitigation [11]. Participants
(n = 150) were a subset of the All Our Families (AOF; formerly All Our
Babies) pregnancy cohort who donated a blood sample for an ongoing
project monitoring maternal/fetal status in Calgary, Alberta
[12]. Additional information regarding the AOF study methods and
results can be found at the project website:
https://ucalgary.ca/allourfamilies. The study was approved by the
Child Health Research Office and the Conjoint Health Research Ethics
Board at the University of Calgary, and written informed consent was
obtained from participants. Eligibility required individuals to be
adults (≥ 18 years of age), less than 25 weeks of gestation at the time
of recruitment, receiving prenatal care in Alberta, Canada, and willing
to complete written questionnaires in English. Individuals were
classified as preterm if delivery occurred prior to 37 weeks of
gestation, while deliveries at 37 weeks of gestation or later were
considered term.
Gestational age was calculated based on the participants’ last reported
menstrual period. Participant characteristics are shown in [48]Table 1.
For continuous characteristics such as body-mass-index (BMI, kg/m^2),
the means and the standard deviations were calculated, whereas
categorical characteristics were represented by counts and
relative percentages (%). Statistical significance between the preterm
and term groups was determined using an unpaired t-test with a
two-tailed distribution and a significance threshold of p < 0.05.
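As a minimal illustration of the group comparison described above, the sketch
below runs an unpaired, two-tailed t-test in SciPy; the two arrays are
placeholder values rather than study data, and the default Student's t-test
settings are an assumption.

```python
# Hypothetical sketch of the unpaired, two-tailed t-test used for Table 1;
# the arrays below are placeholders, not study data.
import numpy as np
from scipy import stats

preterm_bmi = np.array([24.1, 27.8, 22.5, 30.2, 25.9])    # placeholder values
term_bmi = np.array([23.0, 25.4, 21.8, 24.9, 26.1, 22.7])  # placeholder values

t_stat, p_value = stats.ttest_ind(preterm_bmi, term_bmi)   # two-sided by default
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")               # significance threshold: p < 0.05
```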
Table 1.
Participant characteristics.
Characteristic Preterm (n = 48) Term Birth (n = 102) p-value
Maternal
Pre-pregnancy BMI (kg/m^2) 25.3 ± 5.7 24.3 ± 4.6 0.295
Maternal Age (y) 32.0 ± 4.8 30.9 ± 4.1 0.167
Gestation Weight Gain (kg) 10.0 ± 3.5 10.2 ± 4.2 0.759
Anxiety score^a 34.0 ± 9.3 30.8 ± 7.5 0.082
Depression score^b 7.3 ± 5.4 5.1 ± 4.4 0.098
Infant
Gestation (wk) 34.6 ± 1.8 38.8 ± 1.2 < 0.001
Fetal Birthweight (g) 2532 ± 562 3350 ± 442 < 0.001
Child Sex, female (%) 29 (60.4) 56 (54.9) 0.527
Delivery, n (%)
Spontaneous labor 23 (47.9) 75 (73.5) -
Medically indicated^* or induced 22 (45.8) - -
Induction (term) - 27 (26.5) -
Heart rate abnormality 8 (16.7) 10 (9.8) -
Hypertensive (n) (%) 7 (14.6) 4 (3.9) -
Gestational diabetes (%) 3 (6.3) 2 (2.0) -
Uterine scar 3 (6.3) 16 (15.7) -
Breech position 3 (6.3) 2 (2.0) -
Preeclampsia 1 (2.1) 3 (2.9) -
Eclampsia 6 (12.5) 1 (1.0) -
Premature membrane rupture 4 (8.3) 4 (3.9) -
Proteinuria 2 (4.2) 0 (0) -
Restricted fetal growth 1 (2.1) 0 (0) -
Uterine inertia 0 (0) 1(1.0) -
Inadequate contraction 0 (0) 2 (2.0) -
Incomplete fetal head rotation 0 (0) 4 (3.9) -
Values represent mean ± standard deviation or count (%). P-values were
calculated using an unpaired t-test with a two-tailed distribution.
Significance set at p < 0.05.
^*
Preterm. Note: Due to small sample sizes, statistics were not evaluated
for these characteristics.
^a
Spielberger State-Trait Anxiety Inventory Score (STAI) is used to
assess anxiety.
^b
Edinburgh Perinatal Depression Scale (EPDS).
2.2. Metabolomic analysis
All samples were blinded to the investigators for assessment. Upon
collection, serum was centrifuged and stored at − 80 °C until the day
of analysis using previously described methods [53][13]. Serum samples
were thawed on ice from − 80 °C storage prior to protein precipitation
using 100 % methanol. The supernatants were collected via
centrifugation followed by solvent evaporation. Reconstitution was in
50:50 methanol:water followed by centrifugation through a 200-micron
filter to remove debris. The analytical method was based on positive
mode liquid chromatography mass spectrometry (LC/MS, QTOF 6545i,
Agilent, USA) using reverse phase chromatography on an Acquity HSS
column (2.1 × 150 mm, Waters, USA). A gradient elution system with a mobile
phase of (A) 0.1 % formic acid in water and (B) 0.1 % formic acid
in acetonitrile was used for chromatographic separation prior to
detection by a time-of-flight mass spectrometer recording ions with
mass-to-charge (m/z) ratio between 50 and 1200 m/z. Collected spectra
were processed through XCMS version 3.7.1 for peak labelling and
intensity calculations [54][14]. Metabolites were identified using the
Human Metabolome Database (HMDB) to determine the most probable
compound candidate based on m/z values with a tolerance threshold of
30 ppm. Compounds with a coefficient of variation (CV) > 30 % were excluded from further analysis.
In total, 181 endogenous, known metabolites were identified.
2.3. Computational modeling approach
A schematic of the computational approach, data processing, and the
subsequent modeling for this study is shown in [55]Fig. 1. The dataset
was subjected to outlier removal by the ROUT method with a moderate
threshold of Q = 1 % (GraphPad Prism 9.0, USA) [15]. Missing
values, which resulted from either the outlier test or absence of
instrument detection, were imputed using
K-nearest neighbor imputation [16]. Data values were then
normalized via z-score transformation using the standard scaler
operation. The majority of data pre-processing, such as data filtration and
normalization, as well as the subsequent modeling, was performed in
Google Colaboratory (Colab). Each model was encoded within a single
Python script spanning the initial file upload through model performance
evaluation in the form of accuracy, area under the receiver operating
characteristic (AUROC) curve, and confusion matrix. Based on commonly
reported train/test split ratios in the literature, a ratio of 80:20
was selected [58][17]. The model randomly split the 150 total
participants into a training set of 120 and a testing set of 30,
stratified with respect to the preterm/term ratio in the original
dataset. The model was built and trained from the 120. The testing set
was then used to examine the trained model for model performance. While
many algorithms have additional hyperparameters influencing model
performance, only the optimal settings identified after hyperparameter
optimization are reported here so that model results remain comparable.
Regularization methods (L1 or L2) and dropout did not impact model performance.
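The following is a minimal sketch of the pre-processing and split described
above, assuming the metabolite table is loaded from a hypothetical CSV file
with a binary outcome column named "preterm"; file and variable names are
illustrative and not the study's actual script.

```python
# Illustrative pre-processing sketch: KNN imputation, z-score scaling,
# and a stratified 80:20 train/test split (file and column names are assumptions).
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = pd.read_csv("metabolite_intensities.csv")   # hypothetical table of 181 metabolites
y = X.pop("preterm")                            # hypothetical binary label column (0 = preterm, 1 = term)

X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)   # replace missing values
X_scaled = StandardScaler().fit_transform(X_imputed)     # z-score normalization

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=0)  # 120 train / 30 test
```

Note that this sketch fits the scaler on the full dataset to mirror the order
of operations described above; fitting it on the training set only would avoid
information leakage into the test set.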
Fig. 1.
Overview of the data analysis workflow. All comparisons were made based
on the same preterm metabolomic dataset from the All Our Family Cohort.
The analysis pipeline was structured similarly across different models
to cross-examine performance.
For metabolomic studies, models must identify and rank observed
features/metabolites based on their contributions to the model output
to draw scientifically relevant conclusions. Therefore, the present
study evaluates only models which can readily and consistently extract
feature importance for metabolite information. Machine learning
implementations were mostly conducted through Python’s Scikit-learn
library [61][18].
2.4. Model selection and discrimination
2.4.1. PLS-DA
Partial Least Squares Discriminant Analysis (PLS-DA) has been a standard
approach for multivariate classification tasks, particularly when
dealing with high-dimensional data or collinear predictors, projecting
them into lower dimensional spaces [19]. To identify the
components that best explain the data variance, PLS-DA was
conducted within Python’s Scikit-learn library using the
PLSRegression class; two components, which captured 25 % of the training
data variance, gave the highest prediction performance. The resulting
predicted values were converted to binary predictions with the
threshold set at 0.5 (via astype(int)) before
validation with the testing data for performance evaluation.
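A minimal sketch of this PLS-DA setup, assuming the X_train/X_test/y_train/y_test
objects from the pre-processing sketch above; it is illustrative rather than the
authors' exact script.

```python
# PLS-DA sketch: 2-component PLS regression thresholded at 0.5 for binary classes.
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import accuracy_score, roc_auc_score

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)

y_score = pls.predict(X_test).ravel()     # continuous predicted values
y_pred = (y_score >= 0.5).astype(int)     # binarize at the 0.5 threshold

print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUROC:", roc_auc_score(y_test, y_score))
```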
2.4.2. Linear logistic regression
Linear logistic regression establishes a relationship between the
predictor variables (x1, x2,…,xn) and the log odds of the outcome
(logit(p)) [63][20]. Mathematically, this relationship can be
represented as:
$$\mathrm{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n \tag{1}$$
where $\beta_0$ is the intercept term and $\beta_1, \beta_2, \dots, \beta_n$ are the
coefficients associated with each feature. Through the logistic
transformation, this linear combination is converted into predicted
probabilities, ensuring they fall within the range of 0–1. The logistic
function facilitates this transformation, mapping the log odds to
probabilities:
$$p = \frac{1}{1 + e^{-\mathrm{logit}(p)}} \tag{2}$$
Estimation of regression coefficients is typically done through
techniques like maximum likelihood estimation, reflecting the influence
of each predictor while holding other variables constant [64][21]. A
linear logistic regression model was implemented in this study using
Scikit-learn’s LogisticRegression class with the liblinear solver.
Regularization, which would trade a decrease in training accuracy for
an increase in generalizability, did not affect the model performance
and was not implemented. The final training and validation losses were
calculated to reflect training quality.
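A minimal sketch of the logistic regression step under the same assumed train/test
objects; note that Scikit-learn applies a light L2 penalty by default (C = 1.0), so
"no regularization" in the description above refers to not tuning that penalty further.

```python
# Logistic regression sketch: liblinear solver, with training/validation losses reported.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score

logreg = LogisticRegression(solver="liblinear")
logreg.fit(X_train, y_train)

train_proba = logreg.predict_proba(X_train)[:, 1]
test_proba = logreg.predict_proba(X_test)[:, 1]

print("Training loss:", log_loss(y_train, train_proba))   # final training loss
print("Validation loss:", log_loss(y_test, test_proba))   # final validation loss
print("Validation accuracy:", accuracy_score(y_test, logreg.predict(X_test)))
```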
2.4.3. Artificial neural network
The concept of an artificial neural network (ANN) consists of
interconnected nodes, called neurons, constructed in layers of at
minimum three distinct structures: an input layer, one (or more) hidden
layers, and an output layer [65][22]. Essentially, each neuron receives
inputs from neurons in the previous layer, applies a series of
mathematical operations to these inputs, and produces an output signal.
The output a of neuron i in layer l can be defined by:
$$a_i^{(l)} = \mathrm{activation}\left(z_i^{(l)}\right)$$
where $z_i^{(l)}$ is the weighted sum of the inputs to neuron i from the
neurons in the previous layer. In this study, a neural network with a 3-layered
architecture was constructed with rectified linear unit as the
activation function through Python’s Tensorflow library [66][23].
Activation functions influence the types of transformation leading to
very different outcomes, and the optimal approach depends on the type
and quality of data [67][24]. Each model was set to iterate 100 times
(epoch) and calculate the training loss, the training accuracy, the
validation loss, and the validation accuracy during each epoch. The
successive epoch would optimize the model to lower the loss using
backpropagation based on the prior results. Adaptive Moment Estimation
(Adam) optimization was used to minimize the binary cross-entropy loss
during training.
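A minimal Keras sketch consistent with the description above; the hidden-layer
widths (64 and 32 units) are assumptions, as the exact architecture is not
specified in the text.

```python
# Feed-forward ANN sketch: ReLU hidden layers, sigmoid output, Adam optimizer,
# binary cross-entropy loss, 100 training epochs with per-epoch validation metrics.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),   # 181 metabolite features
    tf.keras.layers.Dense(64, activation="relu"),        # assumed hidden-layer width
    tf.keras.layers.Dense(32, activation="relu"),        # assumed hidden-layer width
    tf.keras.layers.Dense(1, activation="sigmoid"),      # predicted term probability
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=100, verbose=0)   # training/validation loss and accuracy logged each epoch
```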
2.4.4. XGBoost classifier
XGBoost, or Extreme Gradient Boosting, is an ensemble of different
algorithms and learning methods particularly renowned for its efficacy
in predictive modeling and classification tasks [68][25]. At its core
lies an objective function, typically the sum of individual
instance-specific loss functions, such as squared error loss for
regression or logistic loss for classification, augmented with
regularization terms to reduce overfitting. The objective function for
XGBoost can be represented as:
$$\mathrm{Objective} = \sum_{i=1}^{n} \mathrm{loss}(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $\mathrm{loss}(y_i, \hat{y}_i)$ is the loss function measuring the discrepancy
between the true target $y_i$ and the predicted target $\hat{y}_i$, and $\Omega(f_k)$
represents the regularization term for each tree in the ensemble.
XGBoost modeling was conducted using the XGBClassifier class from the
xgboost library [25]. The learning rate was set to 0.1 with 100
weak learners implemented. ‘binary:logistic’ was selected as the
objective function to fit the preterm/term classification task. The maximum
tree depth was set at 5, while the minimum split loss (gamma) during
training was set to 0 to allow more permissive splits during model
estimation.
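A minimal sketch of the XGBoost classifier with the hyperparameters reported above,
again assuming the earlier train/test objects.

```python
# XGBoost sketch: 100 weak learners, learning rate 0.1, max depth 5, gamma 0,
# binary logistic objective.
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb = XGBClassifier(
    n_estimators=100,             # 100 weak learners
    learning_rate=0.1,
    max_depth=5,                  # maximum tree depth
    gamma=0,                      # minimum split loss
    objective="binary:logistic",
    eval_metric="logloss")

xgb.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, xgb.predict(X_test)))
```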
2.5. Prediction certainties based on probability distribution
Once a model was trained, the probability distribution for the entire
dataset was calculated using both the training and the testing sets to
visualize the distribution of data points with Python’s matplotlib
graphing library [26]. The Seaborn package was used to label and
distinguish different data categories. For the probability distribution
plots shown ranging from 0 to 1 in this study, a probability value of 0
represented the absolute certainty of the sample being preterm, whereas
a value of 1 indicated the absolute certainty of a term class sample.
The centred 0.5 axis was the classification threshold separating the
assignments of the two groups; the further away from 0.5, the higher
certainty in the prediction of each respective class.
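A minimal plotting sketch of this probability distribution, assuming a fitted
model (here the XGBoost sketch) and that preterm is coded as 0 and term as 1;
the study's exact plotting code may differ.

```python
# Probability-distribution sketch: predicted class probabilities split by true label,
# with the 0.5 classification threshold marked.
import matplotlib.pyplot as plt
import seaborn as sns

labels = y_test.map({0: "Preterm", 1: "Term"}).to_numpy()   # assumed label coding
proba = xgb.predict_proba(X_test)[:, 1]                     # predicted probability of term birth

sns.histplot(x=proba, hue=labels, bins=20)
plt.axvline(0.5, linestyle="--", color="grey")               # classification threshold
plt.xlabel("Predicted probability of term birth")
plt.show()
```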
2.6. Impact of resampling on preterm prediction
Statistical resampling refers to a group of methods used to generate
new subsets of the training dataset from existing data to estimate the
distribution of a statistic or assess the stability of a model without
relying on strong parametric assumptions. These methods involve
repeatedly drawing samples from the observed data, with replacement,
and executing subsequent analysis on each resampled dataset [71][27].
The effects of resampling can also vary across different modeling
algorithms depending on how they operate, and there can be dynamic
swings in model performance when resampling is introduced. This study
examined resampling with bootstrapping based on the method described by
Efron [28]. The resampling process was validated using a further
20 % holdout set (within the bootstrapped data) created with the
StratifiedKFold class from Scikit-learn. Each sample set, both the
training and the testing set, was split equally into 5 sections to
examine section homogeneity by comparing the variance of a single split
against that of the other four. Consequently, the training and the validation
steps were conducted with 4/5 of the original sample size, 96 samples
for training and 24 samples for testing respectively. An illustration
of the resampling design can be found in [73]Supplementary Fig. 1.
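A minimal sketch of the bootstrap loop, assuming the earlier train/test objects
and using the XGBoost configuration as an example (the same loop applies to the
other models); each of the 100 iterations refits the model on a stratified
resample drawn with replacement, and test-set probabilities are averaged across
iterations. The StratifiedKFold holdout validation described above is omitted
here for brevity.

```python
# Bootstrap resampling sketch: 100 resampled training sets drawn with replacement,
# averaging each test sample's predicted probability across iterations.
import numpy as np
from sklearn.utils import resample
from xgboost import XGBClassifier

n_boot = 100
proba_sum = np.zeros(len(X_test))

for b in range(n_boot):
    Xb, yb = resample(X_train, y_train, replace=True, stratify=y_train, random_state=b)
    model_b = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=5,
                            gamma=0, objective="binary:logistic", eval_metric="logloss")
    model_b.fit(Xb, yb)
    proba_sum += model_b.predict_proba(X_test)[:, 1]

mean_proba = proba_sum / n_boot               # averaged class probabilities
mean_pred = (mean_proba >= 0.5).astype(int)   # averaged preterm/term assignments
```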
2.7. Feature importance determination
Due to the variation in ML rationale, it is important to use a common
algorithm to evaluate the discriminant features within the models. In
the present study, Shapley Additive Explanations (SHAP) values were
employed to determine the feature importance for each model while
providing cross model comparisons. SHAP values, adopted from the
concept of game theory, consider all possible combinations of features
and calculate the average marginal contribution of each feature across
different prediction outcomes [74][29]. SHAP values were obtained by
installing and importing the SHAP library into Python from the
publisher’s repository. The Explainer function was used to describe the
observed metabolite contributions with respect to both training and
testing data. The ranking results were visualized through the summary
plot function for sample distributions among the top 20 contributors
for each model. A larger SHAP value for a metabolite indicated that it
pushed the model output toward a prediction of term birth.
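A minimal SHAP sketch for a fitted tree model, assuming the XGBoost object from
the earlier sketches; the beeswarm plot corresponds to the summary plots
described here.

```python
# SHAP feature-importance sketch for a fitted tree-based model.
import shap

explainer = shap.Explainer(xgb)      # a tree explainer is selected automatically
shap_values = explainer(X_test)      # explanations for the testing data

# Beeswarm/summary plot of the top 20 contributing features
shap.plots.beeswarm(shap_values, max_display=20)
```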
2.8. Pathway enrichment analysis
Pathway analysis was used to map metabolites to biological processes
and metabolic pathways that were significantly affected in preterm
birth. Detected metabolites were mapped onto known biochemical pathways
catalogued in KEGG (Kyoto Encyclopedia of Genes and Genomes). Analysis
was performed in MetaboAnalyst v5.0 [75][11].
3. Results
3.1. Participants
Participant characteristics are shown in Table 1. Differences
between the preterm and term groups were observed for gestational age
and birthweight (p < 0.05). Other characteristics, including maternal age,
pre-pregnancy BMI (kg/m^2), and gestational weight gain, were not
different (p > 0.05) between groups. Although not statistically
significant, mental health assessments showed higher scores for anxiety
(Spielberger State-Trait Anxiety Inventory Score, p = 0.082) and
depression (Edinburgh Perinatal Depression Scale, p = 0.098) among the
preterm participants. The incidence of spontaneous preterm birth
(n = 23) was nearly identical to that of medically indicated preterm
birth (n = 22), whereas spontaneous labor made up the majority (73.5 %)
of all recorded term births. Maternal conditions such as hypertension,
preeclampsia, or premature membrane rupture were also documented, but
these conditions were not widespread in either group and were therefore
not taken into consideration in the subsequent modelling.
3.2. Discrimination of machine learning models
Serum untargeted metabolomic data from 48 preterm mothers and 102 term
mothers were fed into the different ML models to determine
their predictive potential for preterm delivery. Among them, 120 were
used to train the model, while the remaining 30 were used to validate
the trained model performance ([77]Fig. 1).
3.2.1. Comparison of model confusion matrices
Accuracy was assessed for each model based on their confusion matrices
([78]Fig. 2A). Sensitivity, or the true positive rate, and specificity,
or the true negative rate, were also evaluated for classification
behavior within each category. The dataset was first evaluated by a
conventional PLS-DA model, which had an accuracy of 0.63. The model
appeared to experience challenges in properly classifying preterm birth
resulting in low specificity (0.40) with the overall classification
strongly favoring the term category. Linear logistic regression
achieved a prediction accuracy of 0.57 with near-equivalent numbers of
false positives (n = 7) and false negatives (n = 6), while true
negatives (n = 14) greatly exceeded true positives (n = 3); the model
had a sensitivity of 0.67 and a specificity of 0.33. The ANN model, using
rectified linear units and sigmoid transformations at the neurons, was
able to achieve a validation accuracy of 0.63. The XGBoost algorithm
reported an accuracy of 0.70, which was the highest among the models
examined. The higher accuracy could be attributed to a higher tendency
to classify a sample as term birth, as shown by the classification
distribution with 27 samples assigned to term birth versus 3 for
preterm birth.
Fig. 2.
Confusion matrix comparison between different models for binary
classification of preterm (n = 48) and term (n = 102). Each matrix
represents distributions of term and preterm assignments on the testing
set (n = 30) by the trained model (n = 120) after the train/test split.
Accuracy: (True Positives + True Negatives)/Total number of
assignments. Sensitivity: True Positives/(True Positives + False
Negatives). Specificity: True Negatives/(True Negatives + False
Positives). F1 Score: 2 × True Positives/(2 × True
Positives + False Positives + False Negatives). (Left) Non-resampled
models. (Right) Bootstrapped models. For bootstrapped models, the
standard deviation of accuracy across subsets was calculated as
sqrt[ Sum of (individual subset accuracy − mean accuracy)^2 /
(total number of subsets − 1) ].
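For reference, a minimal sketch of how the metrics in this caption can be
computed from a Scikit-learn confusion matrix, assuming test labels y_test and
binary predictions y_pred as in the earlier sketches.

```python
# Confusion-matrix metrics sketch; tn, fp, fn, tp follow Scikit-learn's ordering.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)     # true positive rate
specificity = tn / (tn + fp)     # true negative rate
f1 = 2 * tp / (2 * tp + fp + fn)
```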
3.2.2. Evaluation of model classification certainties
The PLS-DA model had most training and testing points around 0.5 with
misclassification being common among both sets ([81]Fig. 3). The linear
logistic model was overfit; while able to assign most points correctly
for the training data, it struggled with the testing set with large
numbers of mis-assigned points for both term and preterm groups. ANN
followed a similar trend and was able to capture the observed variance
in the training set to near perfection yet encountered a
generalizability issue on the validation dataset. For the XGBoost
model, most point assignments were accurate for the training set except
the few hovering around the 0.5 axis. However, the majority of the
testing set was assigned, correctly and incorrectly, to the term group,
leading to a lower validation accuracy.
Fig. 3.
Preterm/Term assignments by different models. Feature 1 on the X-axis
reflects the predicted class probabilities, while Feature 2 on the
Y-axis assigns arbitrary coordinates for visualization. The dataset was
split into a training set and a testing set. The model was constructed
from the training set and its accuracy was validated on the testing
set. For bootstrapped models, average Preterm/Term assignments were
reported. (Left) Non-resampled models (Right) Bootstrapped models.
3.2.3. Assessment of model predictive strength
Predictive potentials were evaluated based on the AUROC ([84]Fig. 4).
PLS-DA, with an AUROC of 0.60, served as a reference for comparison
across models. The linear logistic regression showed an AUROC of 0.56,
which was slightly lower than the PLS-DA result. Combining this with
the accuracy value, the linear logistic model appeared to show worse
performance than the conventional PLS-DA approach for this dataset. The
ANN model was able to achieve a slight AUROC improvement (0.66) over
the PLS-DA model. These comparisons indicated that ANN, through their
layered neuron-like structures and iterative optimizations, might
improve predictive strength over conventional approaches. However, when
coupled with the observed training and validation accuracies
([85]Supplementary Table 2), it is likely overfit. While the training
accuracy was 1.0, the validation accuracy and predictive strength were
considerably lower. The algorithms were very potent to the point where
they might be forcing mathematical explanations to every data variance
in the training set yet had trouble applying these to the validation
set. Another type of disparity observed was in the XGBoost model.
Although it had one of the highest accuracies from the previous
section, the predictive power was actually the lowest (AUROC = 0.49).
This was due to imbalanced class assignments, favoring a major class to
achieve better accuracy by sacrificing its class discrimination power.
Fig. 4.
Receiver-operating characteristic (ROC) curve evaluating predictive
potentials for preterm birth by different models. The area under the
curve (AUC) was calculated to determine the predictive strength. The 95 %
confidence interval and significance (p-value) are also shown on each
panel. (Left) Non-resampled models (Right) Bootstrapped models.
3.3. Effect of resampling on clinical datasets
Resampling was designed using a bootstrap method originally proposed by
Efron [88][28]. In this study, 100 subsample sets were created from the
training dataset with randomized replacement from the same set. Each
of the four models (PLS-DA, linear logistic regression, artificial neural
network, and XGBoost) was configured to iterate through the 100
bootstrapped sets for comparison.
3.3.1. Comparison of model confusion matrices, resampled dataset
Overall, the accuracies for bootstrapped models were notably higher
than those from the base models. Improvements in confusion matrix
metrics were also evident ([89]Fig. 2). For PLS-DA, the mean accuracy
increased from 0.63 to 0.71 with bootstrapping. While there was an
improved true term rate, the main drive for better accuracy was the
improved specificity (0.40–0.54), indicating a much-improved ability to
correctly assign preterm samples. Likewise, linear logistic regression
saw a 0.33 elevation in accuracy that could be attributed to much
improved sensitivity (0.67–0.86) and specificity (0.33–0.66) for the
same reason. On the contrary, when trained on bootstrapped data, the
neural network only marginally improved its performance. However, the
XGBoost model saw considerable improvement with the bootstrapped data.
While the base model already had an accuracy of 0.70, the mean accuracy
improved to reach a value of 0.82 with bootstrapping. This improvement
was achieved by a more balanced class distribution closer to the 1:2
preterm to term ratio in the original dataset as shown in [90]Fig. 2A.
3.3.2. Evaluation of model classification certainties, resampled dataset
Introducing bootstrap resampling had different effects on how points
were assigned in each model. The displayed probability distributions
represent the mean probability value for each sample across 100
iterations ([91]Fig. 3). The bootstrapped PLS-DA model diverged
slightly away from the 0.5 axis for better certainties with less
misclassifications of the test set compared to its base model.
Interestingly, the linear logistic regression model, which saw greatly
improved performance in the confusion matrix, had points moving closer
to the centre, indicating an actual reduction in certainties for most
samples. This effect was stronger in ANN, with bootstrapping resulting
in a wider spread of samples compared to the base model.
In contrast, the bootstrapped XGBoost model moved in the
opposite direction, with more points diverging away from the centre,
indicating improved certainties. Overall, bootstrapping reduced
misclassifications across all models. This echoed the improved
sensitivity and specificity from the confusion matrices of these models
([92]Fig. 2B).
3.3.3. Assessment of predictive strength, resampled dataset
When bootstrapping was applied, the predictive potential by AUROC
improved for all models ([93]Fig. 4). One important observation from
the resampling models was a smoother ROC curvature compared to the
stair-like shapes of their respective base models. This could be
attributed to the combining and averaging of the individual variations
from subsample sets. PLS-DA showed an improvement from 0.60 to 0.71,
whereas the linear logistic regression increased from 0.60 to 0.80.
Bootstrapping’s improvement on the predictive strength was also
observed for the ANN model, jumping slightly from 0.66 to 0.68. The
most noticeable improvement by bootstrapping occurred with the XGBoost
algorithm, which in this study raised the AUROC from 0.49 to 0.85,
significantly alleviating the class imbalance observed in the base
XGBoost model.
3.4. Metabolite identification by SHAP analysis
SHAP rankings are shown for the best-performing model, the bootstrapped
XGBoost (Figs. 5A and 5B). While the ranking orders and the list
of metabolites varied across models (Supplementary Fig. 3), the
types of moieties changing with preterm birth were consistent, and the
total number of representative compound classes based on the top 20
most contributing metabolites within each model was counted. A shared
characteristic across all models was the large number of acylcarnitines
and amino acid derivatives, accounting for at least half of all ranked
metabolites. In most cases, these two classes were reduced in mothers
delivering preterm. Several metabolites also stood out for their
consistent presence in many of the tested models. Notable metabolites
of interest included acylcarnitines, amino acid derivatives, kynurenic
acid, and pipecolic acid.
Fig. 5.
Metabolites ranking using Shapley additive explanations (SHAP) for the
bootstrapped XGBoost model. (A) The ranking by the mean absolute impact
on model output for the top 20 metabolites. (B) SHAP summary plot for
sample distribution within each metabolite. The Shapley value is the
average marginal contribution of a feature value across all possible
coalitions of features. Each sample was represented by a dot with a
red/blue spectrum representing high/low levels. A positive SHAP value
indicated that the metabolite led the model to predict term birth,
whereas a negative value did the opposite. (C) Pathway analysis
displaying changes in maternal serum between preterm and term births.
The plot is based on pathway impact and –log₁₀(p-value). Each circle represents a
metabolic pathway (KEGG), with larger and more intensely colored
circles indicating higher pathway impact and statistical significance.
Significance threshold set at p < 0.05 as indicated by the dashed line.
3.5. Pathway enrichment analysis
Pathway analysis revealed several significantly perturbed metabolic
pathways to be associated with preterm birth ([99]Fig. 5C). The most
enriched pathway was tyrosine metabolism, displaying the highest
statistical significance (p < 0.001) and pathway impact. This was
followed by the phenylalanine, tyrosine, and tryptophan biosynthesis
pathway (p = 0.01), phenylalanine metabolism (p = 0.02), and ubiquinone and
other terpenoid-quinone biosynthesis (p = 0.046), all of which reached
statistical significance but with lower pathway impact scores. Lastly,
a trend for enrichment of pyrimidine metabolism was identified but this
failed to reach statistical significance (p = 0.09).
4. Discussion
There is growing interest in machine learning applications for clinical
studies. Unlike some other fields in which ML models have excelled, clinical
data often face obstacles such as limited sample size and
accessibility due to the nature of the data source. Some of these
issues may be compensated for in exploratory studies, yet additional
challenges arise in identifying scientifically critical information
among non-relevant features. The present study retrospectively investigated
the predictive ability of several ML algorithms for binary
classification of preterm birth using untargeted metabolomic serum data
from an established pregnancy cohort.
4.1. Non-linear algorithms did not guarantee better preterm prediction
It was hypothesized that non-linear models would perform better than
conventional linear models such as PLS-DA or linear logistic regression
on this preterm birth dataset. While some improvements were observed
from the ANN model, the XGBoost model encountered the issue of
classification disparity, leading to a high training-set accuracy but
the worst AUROC owing to poor generalizability. An advantage shown by
layered models such as ANN was the increased predictive certainty
assigned to each sample. However, extensive overfitting on the training set
can produce a highly accurate training model that struggles with
generalizability on the testing set [30]. This was highlighted in
this study by higher sensitivity and specificity, supported by higher
prediction certainties, observed for the ANN model over the PLS-DA and
linear logistic regression models, yet the accuracy and AUROC did not
see similar improvements.
Our findings, in conjunction with past literature, add evidence that
non-linear models are not necessarily superior to their linear
counterparts. In many cases, the two types of approaches
were comparable and showed similar performance, with corresponding
equivalents allowing bridging and cross-examination between them [31]. In fact, the
majority of research suggests that there is no single best modeling approach,
but rather that the optimal choice for a given situation is found on a
trial-and-error basis.
4.2. Resampling by bootstrapping improved model performance by varying
degrees
This work shows that introducing bootstrap resampling improved model
performance in predicting preterm birth, though the improvement varied
across model types. It improved both predictive accuracies and
strengths by mitigating the overfitting of models for the term class.
The XGBoost model, whose predictive strength was impacted the most by
the shifted class imbalance, had the largest improvements with
bootstrapping because of this. Another outcome was that resampling
improved the performance by reducing the numbers of false
positives/negatives, which was most evident in the bootstrapped linear
logistic regression model. It was able to lower the number of
misassignments by more than half by taking the average classification
scores across all subsample iterations. Sampling with replacement
reduced the variability within the subsets, and thus enabled more
distant and straightforward group separations that greatly benefited
the linear models [32]. It led to fewer false positives/negatives
from subset results, which then lowered the overall reported values
when averaging out each category [103][33]. A similar concept was
applicable to the bootstrapped PLS-DA model, though to a lesser extent.
This benefit was less apparent in the linear logistic regression and
the ANN models in our study.
The XGBoost model achieved the highest prediction accuracy and strength
for this preterm metabolomic dataset with bootstrapping. The XGBoost
algorithm has been widely utilized with great success in various
metabolomic studies and has been the most effective model in several
investigations of targeted conditions [34], [35]. Our study showed a
similar result for the data but also highlighted the training disparity
problem for the XGBoost model when working with an imbalanced dataset
and proposed a solution via bootstrap resampling to achieve better
prediction results. This consideration is essential for predicting
preterm birth given that the condition impacts approximately 10 % of
the general newborn population, meaning that the preterm/term data
disparity is even more extreme under normal circumstances.
4.3. Altered metabolites were identified by ML for prediction of preterm
birth
SHAP analysis was effective in highlighting metabolites contributing to
the prediction outputs, primarily acylcarnitines and amino acid
derivatives. Acylcarnitines are molecular transporters of fatty acids
across the mitochondrial membrane for energy metabolism, while amino
acid derivatives are primarily a result of protein degradation
[106][36], [107][37]. Alterations of these compounds have previously
been linked to increased risk of preterm birth and associated
conditions during gestation [108][38], [109][39]. Pipecolic acid, a key
metabolite of lysine degradation, was highlighted in all bootstrapped
models (4/4) and was increased in mothers experiencing preterm birth
[110][40], [111][41]. Another candidate metabolite appearing in 2/4
bootstrapped models was kynurenic acid, a metabolite from tryptophan
metabolism that was reduced in mothers experiencing preterm birth.
Kynurenic acid is involved in immune regulation and cellular energy
activities, and alterations to this pathway have previously been
associated with pregnancy complications [112][42].
4.4. Pathway analysis
The identified pathways are well known to influence immune regulation,
oxidative stress, and placental function, all of which are critical to
maintaining pregnancy. Notably, our findings reveal alterations in
pathways related to tyrosine metabolism as well as to phenylalanine,
tyrosine and tryptophan biosynthesis. Tyrosine is a non-essential amino
acid involved in protein synthesis and catecholamine production. This
pathway is not typically implicated in preterm birth risk. On the other
hand, disruptions to tryptophan have been previously linked to various
preterm phenotypes (e.g. preterm labor with intact membranes and
preterm premature rupture of the membranes) and maternal inflammation
[113][43]. While SHAP analysis also highlighted fatty acid related
species including acylcarnitines and peptides, pathway analysis often
overlooks these signals due to limited database coverage and the use of
generalized pathway representations [114][44].
4.5. Study strengths and limitations
Results demonstrated that both linear and non-linear algorithms were
predictive of preterm birth, and resampling could further improve the
model performance for this type of data. A key limitation, however, is
the absence of validation using a comparable cohort with a larger
sample size and greater discriminatory data variance. Robust external
validation using independent and diverse cohorts is essential to
confirm the reproducibility and reliability of these findings before
they can be translated into clinical practice.
Another limitation is that, while our work demonstrated that resampling
techniques could mitigate class imbalance, which is inherently high in
preterm versus term cases, generalization issues may still exist due to
varying sampling ratios across different analyses. Finally, this study
focused solely on binary classification of preterm birth, a condition
with diverse aetiologies that could be further sub-categorized. Future
work should explore multiclass classification approaches, considering
factors such as time of sampling and distinctions between types of
preterm birth (e.g. spontaneous vs. medically indicated). Additionally,
incorporating clinical risk factors could enhance model granularity,
ultimately supporting more informed clinical decision-making.
5. Conclusions
This study examined different machine learning applications on a
clinical untargeted metabolomic dataset to predict preterm birth. The
dataset was characterized by its small sample size and low class
discrimination, with the presence of uninformative features. Linear models
achieved moderate performance in predictive accuracy and strength,
whereas non-linear models either had slightly improved outcomes or
experienced lower predictive strength due to class imbalance.
Introducing resampling such as bootstrapping improved both accuracy and
predictive strength, but the extent varied considerably with the type of
model, XGBoost benefiting the
most. Acylcarnitines and amino acid derivatives were the major classes
of compounds contributing to preterm prediction. Multiple models also
identified kynurenic acid and pipecolic acid as key contributing
metabolites. Although additional research is necessary before clinical
implementation, these findings provide valuable considerations for
advancing the use of machine learning to predict preterm birth.
Author statement
Each named author has substantially contributed to conducting the
underlying research and drafting this manuscript. Authors approved the
submission to CSBJ.
CRediT authorship contribution statement
Ying-Chieh Han: Visualization, Methodology, Investigation, Formal
analysis, Data curation. Slater Donna C: Project administration,
Methodology, Investigation. Jane Shearer: Writing – review & editing,
Supervision, Resources, Investigation, Conceptualization. Tough Suzanne
C: Resources, Investigation, Funding acquisition. Chunlong Mu:
Visualization, Validation, Supervision, Software, Resources. Gavin E.
Duggan: Writing – review & editing, Visualization, Validation,
Software, Resources, Project administration, Formal analysis.
Ethics Approval
The study was conducted according to the guidelines of the Declaration
of Helsinki and approved by the Institutional Review Board of the
University of Calgary (REB15-0248).
Funding
This research was funded by NSERC (J.S., RGPIN/04238-2018). All Our
Families was funded through Alberta Innovates Interdisciplinary Team
Grant #200700595 and Alberta Children’s Hospital Foundation.
Software Availability
The code for the machine learning models can be downloaded with open
access from Google Drive:
[115]https://drive.google.com/drive/folders/14VNrHVUyZH1PLgycjfWCBfmqFU
jCcSJV?usp=sharing.
The downloaded files can be viewed and executed through Google
Colaboratory: [116]https://colab.research.google.com/. Alternatively, a
GitHub repository has been created for viewing the same code
(https://github.com/ianhstudent/ColabMLNotebook.git).
Declaration of Competing Interest
The authors declare the following financial interests/personal
relationships which may be considered as potential competing interests:
Co-Author Dr. Gavin Duggan is employed by Google Inc. This affiliation
did not influence the study design, data collection and analysis,
modeling, interpretation of results, decision to publish, or the
preparation of the manuscript. All aspects of the research were
conducted independently of the investigator’s employment.
All other authors declare that they have no known competing financial
interests or personal relationships that could have appeared to
influence the work reported in this paper.
Footnotes
Appendix A: Supplementary data associated with this article can be found in the
online version at doi:10.1016/j.csbj.2025.07.010.
Appendix A. Supplementary material
Supplementary material (mmc1.pdf, 1 MB).
Data Availability
All the raw data are available upon request. Data are being uploaded to
the NIH Common Fund's National Metabolomics Data Repository (NMDR,
https://www.metabolomicsworkbench.org/data/index.php).
References