Abstract

Background
Machine learning (ML), driven by advances in algorithms and computation, is seeing an increased presence in life science research. This study investigated the efficacy of several ML models in predicting preterm birth using untargeted metabolomics from serum collected during the third trimester of gestation.

Methods
Samples from 48 preterm and 102 term delivery mothers from the All Our Families Cohort (Calgary, AB) were examined. Four ML algorithms, Partial Least Squares Discriminant Analysis (PLS-DA), linear logistic regression, artificial neural networks (ANN), and Extreme Gradient Boosting (XGBoost), were applied with and without bootstrap resampling to examine the small-scale clinical dataset for both model performance and metabolite interpretation.

Results
Model performance was evaluated based on confusion matrices, area under the receiver operating characteristic (AUROC) curve analysis, and feature importance rankings. Linear models such as PLS-DA and logistic regression demonstrated moderate classification performance (AUROC ≈ 0.60), whereas non-linear approaches, including ANN and XGBoost, exhibited marginal improvements. Among all models, XGBoost combined with bootstrap resampling achieved the highest performance, yielding an AUROC of 0.85 (95 % CI: 0.57–0.99, p < 0.001), indicating a significant improvement in classification accuracy. Metabolite importance, derived from Shapley Additive Explanations (SHAP), consistently identified acylcarnitines and amino acid derivatives as principal discriminative features. Pathway analysis revealed that disruptions to tyrosine metabolism, as well as to phenylalanine, tyrosine and tryptophan biosynthesis, were associated with preterm delivery.

Conclusions
Our results highlight the complexity of metabolomics-based modelling for preterm birth and support an iterative, model-driven approach for optimizing predictive accuracy in small-scale clinical datasets.

Keywords: Preterm birth, Pregnancy, Metabolomics, Machine learning, Predictive models, XGBoost

Graphical Abstract

Highlights
• Machine learning on a maternal serum metabolomics dataset to predict preterm birth.
• XGBoost with bootstrapping achieved the highest AUROC of 0.85.
• Acylcarnitines and amino acids were key markers via SHAP analysis.
• Tyrosine and tryptophan pathways linked to preterm delivery.

1. Introduction

Advancements in machine learning (ML), with better computation and more sophisticated algorithms, have greatly impacted the landscape of data science, and their applications in clinical domains have drawn interest from health care professionals. Metabolites can serve as sensitive indicators of physiological alterations linked to disease progression, and ML models are increasingly applied to metabolite data for studies of pathogenesis, diagnostics and monitoring. ML algorithms can play a pivotal role in capitalizing on metabolite signatures by analyzing complex datasets, extracting meaningful biological interpretations, and building predictive models for diagnostic purposes [1]. ML integration with metabolomics, the systematic approach for studying metabolites, offers unique insights into disease pathogenesis and biomarker discovery by recognizing hidden patterns and correlations in large-scale analyses of biological systems [2]. While partial least squares-discriminant analysis (PLS-DA) continues to be a widely applied and effective method [3], machine learning techniques have gained increasing prominence in metabolomics research.
Advanced ML models, such as t-distributed stochastic neighbor embedding (t-SNE), apply non-linear transformations that improve data fitting, enable effective dimensionality reduction, and enhance visualization of class separations [4]. These models also employ sophisticated feature selection methods, such as recursive feature elimination and dimensionality reduction, to identify the most informative metabolites [5]. While binary classification with classical yes-or-no decisions is the area drawing the most interest, multiple-decision problems have long existed, and ML has been trained to facilitate these decision processes [6]. However, there are often several obstacles when analyzing clinical data. Unlike other "big data", clinical datasets are often restricted in sample size, whereas a sufficient sample size is a critical necessity for many complex algorithms [7]. Thus, this project aims to take these limitations into account to evaluate some common machine learning approaches on a clinical maternal dataset investigating preterm birth using untargeted metabolomics.

Past research has demonstrated the utility of machine learning-based metabolomic analysis in this field. Employing maternal vaginal fluid metabolome samples collected between 20 and 24 weeks of gestation, an ML metabolomics approach outperformed microbiome-only and maternal covariate-only models, achieving an AUC of 0.78, a finding subsequently validated in two independent cohorts [8]. Likewise, Al Ghadban et al. used untargeted serum metabolomics in a case–cohort of 399 pregnant women serially sampled at 12, 20, 28, and 36 weeks of gestation. Six supervised machine learning methods were trained on the top 47 features, and a random forest model performed best (AUC = 0.73), followed by a generalized boosted model (AUC = 0.71), for predicting spontaneous preterm birth [9]. Combining metabolic profiles with clinical risk factors has also demonstrated success in predicting preterm birth. The Screening for Pregnancy Outcomes by Early Pregnancy Evaluation (SCOPE) study applied untargeted serum metabolomics to asymptomatic pregnant women and found that combining metabolic profiles with clinical risk factors significantly improved early identification of those at risk for preterm birth, with a metabolite-informed clinical prediction model achieving an AUC of 0.73 [10]. Overall, many of these models leveraged advanced feature selection techniques and classification algorithms to identify discriminatory metabolite patterns associated with preterm birth risk. Given this, it was hypothesized that more complex, non-linear models would be advantageous over conventional approaches when working with this type of data due to their ability to better fit curves to the observed variances. This study examined the hypothesis by evaluating the performance of both linear and non-linear ML models in predicting preterm birth, with a feature selection method to facilitate biological interpretation.

2. Materials and methods

2.1. Participant characteristics and statistical analysis

Untargeted metabolomic analysis was performed using third trimester serum samples collected from the antecubital vein of non-fasted pregnant participants between 28 and 32 weeks of gestation. This period was prioritized for its robust metabolic signatures identified in prior metabolomics studies and for being the final window for clinical intervention in preterm birth risk mitigation [11].
Participants (n = 150) were a subset of the All Our Families (AOF) pregnancy cohort (formerly All Our Babies) who donated a blood sample for an ongoing project monitoring maternal/fetal status in Calgary, Alberta [12]. Additional information regarding the AOF study methods and results can be found at the project website: https://ucalgary.ca/allourfamilies. The study was approved by the Child Health Research Office and the Conjoint Health Research Ethics Board at the University of Calgary, and written informed consent was acquired from participants. Eligibility required that individuals were adults (≥ 18 years of age), less than 25 weeks of gestation at the time of recruitment, receiving prenatal care in Alberta, Canada, and willing to complete written questionnaires in English. Individuals were classified as preterm if delivery occurred prior to 37 weeks of gestation, while deliveries at or after 37 weeks were considered term. Gestational age was calculated based on the participants' last reported menstrual period. Participant characteristics are shown in Table 1. For continuous characteristics such as body mass index (BMI, kg/m^2), means and standard deviations were calculated, whereas categorical characteristics were represented by counts and relative percentages (%). Statistical significance was determined based on p-values, with a significance threshold set at 0.05, using an unpaired, two-tailed t-test between the preterm and term groups.

Table 1. Participant characteristics.

Characteristic | Preterm (n = 48) | Term Birth (n = 102) | p-value
Pre-pregnancy BMI (kg/m^2) | 25.3 ± 5.7 | 24.3 ± 4.6 | 0.295
Maternal Age (y) | 32.0 ± 4.8 | 30.9 ± 4.1 | 0.167
Gestation Weight Gain (kg) | 10.0 ± 3.5 | 10.2 ± 4.2 | 0.759
Anxiety score^a | 34.0 ± 9.3 | 30.8 ± 7.5 | 0.082
Depression score^b | 7.3 ± 5.4 | 5.1 ± 4.4 | 0.098
Infant Gestation (wk) | 34.6 ± 1.8 | 38.8 ± 1.2 | < 0.001
Fetal Birthweight (g) | 2532 ± 562 | 3350 ± 442 | < 0.001
Child Sex, female (%) | 29 (60.4) | 56 (54.9) | 0.527
Delivery, n (%)
  Spontaneous labor | 23 (47.9) | 75 (73.5) | -
  Medically indicated^* or Induced | 22 (45.8) | - | -
  Induction (term) | - | 27 (26.5) | -
  Heart rate abnormality | 8 (16.7) | 10 (9.8) | -
  Hypertensive | 7 (14.6) | 4 (3.9) | -
  Gestational diabetes | 3 (6.3) | 2 (2.0) | -
  Uterine scar | 3 (6.3) | 16 (15.7) | -
  Breech position | 3 (6.3) | 2 (2.0) | -
  Preeclampsia | 1 (2.1) | 3 (2.9) | -
  Eclampsia | 6 (12.5) | 1 (1.0) | -
  Premature membrane rupture | 4 (8.3) | 4 (3.9) | -
  Proteinuria | 2 (4.2) | 0 (0) | -
  Restricted fetal growth | 1 (2.1) | 0 (0) | -
  Uterine inertia | 0 (0) | 1 (1.0) | -
  Inadequate contraction | 0 (0) | 2 (2.0) | -
  Incomplete fetal head rotation | 0 (0) | 4 (3.9) | -

Values represent mean ± standard deviation or count (%). P-values were calculated using an unpaired, two-tailed t-test. Significance set at p < 0.05.
^* Preterm. Note: due to small sample sizes, statistics were not evaluated for the delivery characteristics.
^a Spielberger State-Trait Anxiety Inventory (STAI) score, used to assess anxiety.
^b Edinburgh Perinatal Depression Scale (EPDS).
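To make the group comparison concrete, the following minimal sketch performs an unpaired, two-tailed t-test on one continuous characteristic using SciPy; it is not the authors' original script, and the file name and column names ("group", "bmi") are hypothetical placeholders.

```python
import pandas as pd
from scipy import stats

# Hypothetical data frame with one row per participant.
df = pd.read_csv("participants.csv")

preterm = df.loc[df["group"] == "preterm", "bmi"].dropna()
term = df.loc[df["group"] == "term", "bmi"].dropna()

# Unpaired, two-tailed t-test between the preterm and term groups (alpha = 0.05).
t_stat, p_value = stats.ttest_ind(preterm, term, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```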
2.2. Metabolomic analysis

All samples were blinded to the investigators for assessment. Upon collection, serum was centrifuged and stored at − 80 °C until the day of analysis using previously described methods [13]. Serum samples were thawed on ice from − 80 °C storage prior to protein precipitation using 100 % methanol. The supernatants were collected via centrifugation followed by solvent evaporation. Samples were reconstituted in 50:50 methanol:water, followed by centrifugation through a 200-micron filter to remove debris. The analytical method was based on positive mode liquid chromatography mass spectrometry (LC/MS, QTOF 6545i, Agilent, USA) using reverse phase chromatography on an Acquity HSS column (2.1 × 150 mm, Waters, USA). A gradient elution system with a mobile phase of (A) 0.1 % formic acid in water and (B) 0.1 % formic acid in acetonitrile was used for chromatographic separation prior to detection by a time-of-flight mass spectrometer recording ions with mass-to-charge ratios (m/z) between 50 and 1200. Collected spectra were processed through XCMS version 3.7.1 for peak labelling and intensity calculations [14]. Metabolites were identified using the Human Metabolome Database (HMDB) to determine the most probable compound candidate based on m/z values with a tolerance threshold of 30 ppm. Compounds with a coefficient of variation (CV) > 30 % were excluded from further analysis. In total, 181 endogenous, known metabolites were identified.

2.3. Computational modeling approach

A schematic of the computational approach, data processing, and the subsequent modeling for this study is shown in Fig. 1. The dataset was subjected to outlier removal by the ROUT method with a moderate threshold of Q = 1 % (GraphPad Prism 9.0, USA) [15]. Missing values, which resulted from either the outlier test or the absence of instrument detection, were imputed using K-nearest neighbor imputation [16]. Data values were then normalized via z-score transformation using the standard scaler operation. The majority of data pre-processing, such as data filtration and normalization, as well as the subsequent modeling, was performed in Google Colaboratory (Colab). Each model was encoded within a single Python script, spanning the initial file upload through model performance evaluation in the form of accuracy, area under the receiver operating characteristic (AUROC) curve, and confusion matrix. Based on commonly reported train/test split ratios in the literature, a ratio of 80:20 was selected [17]. The 150 total participants were randomly split into a training set of 120 and a testing set of 30, stratified with respect to the preterm/term ratio in the original dataset. The model was built and trained on the 120 training samples. The testing set was then used to examine the performance of the trained model. While many algorithms have additional hyperparameters influencing model performance, only the optimal configurations after hyperparameter optimization are reported here so that model results are comparable. Regularization methods (L1 or L2) and dropout did not impact model performance.

Fig. 1. Overview of the data analysis workflow. All comparisons were made based on the same preterm metabolomic dataset from the All Our Families Cohort. The analysis pipeline was structured similarly across different models to cross-examine performance.

For metabolomic studies, models must identify and rank observed features/metabolites based on their contributions to the model output to draw scientifically relevant conclusions. Therefore, the present study evaluates only models which can readily and consistently extract feature importance for metabolite information. Machine learning implementations were mostly conducted through Python's Scikit-learn library [18].
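A minimal sketch of the pre-processing and splitting steps described above is shown below, assuming a hypothetical input file ("metabolomics.csv") with one row per participant and a binary label column "term_birth" (1 = term, 0 = preterm); it illustrates K-nearest neighbor imputation, z-score scaling with Scikit-learn's standard scaler, and the stratified 80:20 train/test split, not the authors' published script.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical input: metabolite intensity columns plus a binary label column.
data = pd.read_csv("metabolomics.csv")
X = data.drop(columns=["term_birth"])
y = data["term_birth"]

# Impute missing values (from outlier removal or undetected peaks) with K-nearest neighbors.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# z-score normalization (standard scaler): zero mean, unit variance per metabolite.
X_scaled = StandardScaler().fit_transform(X_imputed)

# 80:20 train/test split stratified on the preterm/term ratio
# (150 participants -> 120 training and 30 testing samples).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
```

In practice the scaler would often be fit on the training split only to avoid information leakage into the test set; the sketch simply mirrors the order of operations described above.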
2.4. Model selection and discrimination

2.4.1. PLS-DA

Partial Least Squares Discriminant Analysis (PLS-DA) has been a go-to approach for multivariate classification tasks, particularly when dealing with high-dimensional data or collinear predictors, which it projects into lower-dimensional spaces [19]. PLS-DA was conducted within Python's Scikit-learn library using the PLSRegression class; two components, which captured 25 % of the training data variance, provided the highest prediction performance. The resulting continuous predictions were converted to binary class assignments at a threshold of 0.5 (via the astype(int) function) before validation with the testing data for performance evaluation.

2.4.2. Linear logistic regression

Linear logistic regression establishes a relationship between the predictor variables (x1, x2, …, xn) and the log odds of the outcome (logit(p)) [20]. Mathematically, this relationship can be represented as:

logit(p) = β0 + β1x1 + β2x2 + … + βnxn    (1)

where β0 is the intercept term and β1, β2, …, βn are the coefficients associated with each feature. Through the logistic transformation, this linear combination is converted into predicted probabilities, ensuring they fall within the range of 0–1. The logistic function facilitates this transformation, mapping the log odds to probabilities:

p = 1 / (1 + e^(−logit(p)))    (2)

Estimation of regression coefficients is typically done through techniques like maximum likelihood estimation, reflecting the influence of each predictor while holding other variables constant [21]. A linear logistic regression model was implemented in this study using Scikit-learn's LogisticRegression package with a liblinear solver. Regularization, which would trade a decrease in training accuracy for an increase in generalizability, did not affect the model performance and was not implemented. The final training and validation losses were calculated to reflect training quality.

2.4.3. Artificial neural network

An artificial neural network (ANN) consists of interconnected nodes, called neurons, arranged in at minimum three distinct layers: an input layer, one (or more) hidden layers, and an output layer [22]. Essentially, each neuron receives inputs from neurons in the previous layer, applies a series of mathematical operations to these inputs, and produces an output signal. The output a of neuron i in layer l can be defined by:

ai(l) = activation(zi(l))

where zi(l) combines the input values from the neurons of the previous layer with the weights for neuron i. In this study, a neural network with a three-layer architecture was constructed, with the rectified linear unit as the activation function, through Python's TensorFlow library [23]. Activation functions influence the types of transformation, leading to very different outcomes, and the optimal approach depends on the type and quality of data [24]. Each model was set to iterate 100 times (epochs) and to calculate the training loss, the training accuracy, the validation loss, and the validation accuracy during each epoch. Each successive epoch optimized the model to lower the loss using backpropagation based on the prior results. Adaptive Moment Estimation (Adam) optimization was used to minimize the loss during training, computed as binary cross-entropy.
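As a concrete illustration of such an architecture, the sketch below builds a comparable three-layer network in TensorFlow/Keras with ReLU hidden units, a sigmoid output, the Adam optimizer, binary cross-entropy loss, and 100 epochs; the hidden-layer width and batch size are illustrative assumptions rather than the exact configuration used in the study, and X_train/y_train/X_test/y_test follow from the split shown earlier.

```python
import tensorflow as tf

# Three-layer architecture: input layer, one ReLU hidden layer, sigmoid output.
# The hidden-layer width (32 units) is an illustrative assumption.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),   # one input per metabolite feature
    tf.keras.layers.Dense(32, activation="relu"),        # hidden layer with ReLU activation
    tf.keras.layers.Dense(1, activation="sigmoid"),      # output: predicted probability of term birth
])

# Adam optimizer minimizing binary cross-entropy; accuracy tracked each epoch.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 100 epochs; training and validation loss/accuracy are recorded per epoch in `history`.
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=100,
    batch_size=16,
    verbose=0,
)
```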
2.4.4. XGBoost classifier

XGBoost, or Extreme Gradient Boosting, is an ensemble learning method based on gradient-boosted decision trees, particularly renowned for its efficacy in predictive modeling and classification tasks [25]. At its core lies an objective function, typically the sum of individual instance-specific loss functions, such as squared error loss for regression or logistic loss for classification, augmented with regularization terms to reduce overfitting. The objective function for XGBoost can be represented as:

Objective = Σ loss(yi, ŷi) + Σ Ω(fk)

where the first sum runs over the n training instances and the second over the K trees in the ensemble, loss(yi, ŷi) is the loss function measuring the discrepancy between the true target yi and the predicted target ŷi, and Ω(fk) represents the regularization term for each tree fk. XGBoost modeling was conducted using the XGBClassifier package from the xgboost library [25]. The learning rate was set to 0.1 with 100 weak learners implemented. 'binary:logistic' was selected as the objective function to fit the preterm/term classification task. The maximum tree depth was set at 5, while the minimum loss reduction required for a tree split (gamma) was set to 0 for more tolerant loss reduction and model estimation.

2.5. Prediction certainties based on probability distribution

Once a model was trained, the probability distribution for the entire dataset was calculated using both the training and the testing sets to visualize the distribution of data points with Python's matplotlib graphing library [26]. The Seaborn library, built on matplotlib, was used to label and distinguish different data categories. For the probability distribution plots in this study, which range from 0 to 1, a probability value of 0 represented absolute certainty that the sample was preterm, whereas a value of 1 indicated absolute certainty of a term-class sample. The centred 0.5 axis was the classification threshold separating the assignments of the two groups; the further away from 0.5, the higher the certainty in the prediction of the respective class.

2.6. Impact of resampling on preterm prediction

Statistical resampling refers to a group of methods used to generate new subsets of the training dataset from existing data to estimate the distribution of a statistic or assess the stability of a model without relying on strong parametric assumptions. These methods involve repeatedly drawing samples from the observed data, with replacement, and executing the subsequent analysis on each resampled dataset [27]. The effects of resampling can also vary across different modeling algorithms depending on how they operate, and there can be dynamic swings in model performance when resampling is introduced. This study examined resampling with bootstrapping based on the method described by Efron [28]. The resampling process was validated using a further 20 % holdout set (within the bootstrapped data) using the StratifiedKFold class from Scikit-learn. Each sample set, both the training and the testing set, was split equally into 5 sections to examine section homogeneity by comparing the variance of a single split versus the other four. Consequently, the training and the validation steps were conducted with 4/5 of the original sample size, 96 samples for training and 24 samples for testing, respectively. An illustration of the resampling design can be found in Supplementary Fig. 1.
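To illustrate how bootstrap resampling can be combined with one of the classifiers, the sketch below draws 100 bootstrap samples of the training set with replacement, fits an XGBoost model with the settings described in Section 2.4.4 on each sample, and averages the predicted probabilities over iterations. It is a minimal sketch under the assumption that X_train, y_train and X_test come from the split shown earlier; it is not a reproduction of the authors' exact pipeline.

```python
import numpy as np
from sklearn.utils import resample
from xgboost import XGBClassifier

n_bootstrap = 100
test_probs = np.zeros((n_bootstrap, len(X_test)))

for b in range(n_bootstrap):
    # Bootstrap sample of the training data (same size, with replacement),
    # stratified to preserve the preterm/term ratio.
    X_boot, y_boot = resample(
        X_train, y_train, replace=True, stratify=y_train, random_state=b
    )

    # XGBoost settings as described in the text: 100 weak learners, learning rate 0.1,
    # maximum depth 5, gamma 0, logistic objective for the binary task.
    model = XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        gamma=0,
        objective="binary:logistic",
        eval_metric="logloss",
    )
    model.fit(X_boot, y_boot)

    # Predicted probability of the positive class (term birth) for the held-out test samples.
    test_probs[b] = model.predict_proba(X_test)[:, 1]

# Average the per-iteration probabilities, then threshold at 0.5 for class assignments.
mean_probs = test_probs.mean(axis=0)
predictions = (mean_probs >= 0.5).astype(int)
```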
2.7. Feature importance determination

Due to the variation in ML rationale, it is important to use a common algorithm to evaluate the discriminant features within the models. In the present study, Shapley Additive Explanations (SHAP) values were employed to determine the feature importance for each model while providing cross-model comparisons. SHAP values, adapted from game theory, consider all possible combinations of features and calculate the average marginal contribution of each feature across different prediction outcomes [29]. SHAP values were obtained by installing and importing the SHAP library into Python from the publisher's repository. The Explainer function was used to describe the observed metabolite contributions with respect to both training and testing data. The ranking results were visualized through the summary plot function, showing sample distributions among the top 20 contributors for each model. A larger SHAP value for a metabolite indicated that its change pushed the model prediction toward term birth.
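The following minimal sketch shows how such values can be computed and visualized for a tree-based model with the shap library. The fitted XGBoost model and feature matrix are assumed to follow from the earlier sketches, and `metabolite_names` is a hypothetical list of metabolite identifiers; TreeExplainer is used here as the tree-specific variant of the library's explainer.

```python
import shap

# TreeExplainer is the efficient explainer for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Summary (beeswarm) plot of the top 20 contributing metabolites; each dot is one sample,
# colored by the metabolite level and positioned by its impact on the model output.
shap.summary_plot(
    shap_values, X_train,
    feature_names=metabolite_names,  # hypothetical list of metabolite identifiers
    max_display=20,
)
```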
2.8. Pathway enrichment analysis

Pathway analysis was used to map metabolites to biological processes and metabolic pathways that were significantly affected in preterm birth. Detected metabolites were mapped onto known biochemical pathways catalogued in KEGG (Kyoto Encyclopedia of Genes and Genomes). Analysis was performed in MetaboAnalyst v5.0 [11].

3. Results

3.1. Participants

Participant characteristics are shown in Table 1. Significant differences between the preterm and term groups were observed for gestational age and birthweight (p < 0.05). Other characteristics, including maternal age, pre-pregnancy BMI (kg/m^2), and gestational weight gain, were not different (p > 0.05) between groups. Although not statistically significant, mental health assessments showed higher scores for anxiety (Spielberger State-Trait Anxiety Inventory, p = 0.082) and depression (Edinburgh Perinatal Depression Scale, p = 0.098) among the preterm participants. The incidence of spontaneous preterm birth (n = 23) was nearly identical to that of medically indicated preterm birth (n = 22), whereas spontaneous labor made up the majority (73.5 %) of all recorded term births. Maternal conditions such as hypertension, preeclampsia, or premature membrane rupture were also documented, but these conditions were not widespread in either group and were therefore not taken into consideration in the subsequent modelling.

3.2. Discrimination of machine learning models

Serum untargeted metabolomic data from 48 preterm mothers and 102 term mothers were fed into the different ML models to determine their predictive potential for preterm delivery. Among them, 120 were used to train the models, while the remaining 30 were used to validate the trained model performance (Fig. 1).

3.2.1. Comparison of model confusion matrices

Accuracy was assessed for each model based on its confusion matrix (Fig. 2A). Sensitivity, or the true positive rate, and specificity, or the true negative rate, were also evaluated for classification behavior within each category. The dataset was first evaluated by a conventional PLS-DA model, which had an accuracy of 0.63. The model appeared to experience challenges in properly classifying preterm birth, resulting in low specificity (0.40), with the overall classification strongly favoring the term category. Linear logistic regression achieved a prediction accuracy of 0.57 with near-equivalent numbers of false positives (n = 7) and false negatives (n = 6), while true negatives (n = 14) greatly exceeded true positives (n = 3), and had a sensitivity of 0.67 and a specificity of 0.33. The ANN model, using rectified linear units and sigmoid transformations at the neurons, was able to achieve a validation accuracy of 0.63. The XGBoost algorithm reported an accuracy of 0.70, which was the highest among the models examined. The higher accuracy could be attributed to a higher tendency to classify a sample as term birth, as shown by the classification distribution with 27 samples assigned to term birth versus 3 to preterm birth.

Fig. 2. Confusion matrix comparison between different models for binary classification of preterm (n = 48) and term (n = 102). Each matrix represents the distribution of term and preterm assignments on the testing set (n = 30) by the trained model (n = 120) after the train/test split. Accuracy = (True Positives + True Negatives)/Total number of assignments. Sensitivity = True Positives/(True Positives + False Negatives). Specificity = True Negatives/(True Negatives + False Positives). F1 Score = 2 × True Positives/(2 × True Positives + False Positives + False Negatives). (Left) Non-resampled models. (Right) Bootstrapped models. For bootstrapped models, the standard deviation of accuracy across subsets is calculated as the square root of [1/(total number of subsets − 1) × sum of (individual subset accuracy − mean accuracy)^2].

3.2.2. Evaluation of model classification certainties

The PLS-DA model had most training and testing points around 0.5, with misclassification being common in both sets (Fig. 3). The linear logistic model was overfit; while able to assign most points correctly for the training data, it struggled with the testing set, with large numbers of mis-assigned points for both term and preterm groups. ANN followed a similar trend and was able to capture the observed variance in the training set to near perfection, yet encountered a generalizability issue on the validation dataset. For the XGBoost model, most point assignments were accurate for the training set except the few hovering around the 0.5 axis. However, the majority of the testing set was assigned, correctly and incorrectly, to the term group, leading to a lower validation accuracy.

Fig. 3. Preterm/Term assignments by different models. Feature 1 on the X-axis reflects the predicted class probabilities, while Feature 2 on the Y-axis provides arbitrary coordinates for visualization. The dataset was split into a training set and a testing set. The model was constructed on the training set and its accuracy validated on the testing set. For bootstrapped models, average preterm/term assignments are reported. (Left) Non-resampled models. (Right) Bootstrapped models.

3.2.3. Assessment of model predictive strength

Predictive potential was evaluated based on the AUROC (Fig. 4). PLS-DA, with an AUROC of 0.60, served as a reference for comparison across models. The linear logistic regression showed an AUROC of 0.56, which was slightly lower than the PLS-DA result. Combining this with the accuracy value, the linear logistic model appeared to show worse performance than the conventional PLS-DA approach for this dataset. The ANN model was able to achieve a slight AUROC improvement (0.66) over the PLS-DA model. These comparisons indicated that ANNs, through their layered neuron-like structures and iterative optimizations, might improve predictive strength over conventional approaches. However, when coupled with the observed training and validation accuracies (Supplementary Table 2), the ANN model was likely overfit.
While the training accuracy was 1.0, the validation accuracy and predictive strength were considerably lower. The algorithms were potent to the point where they might force mathematical explanations onto every data variance in the training set, yet they had trouble applying these to the validation set. Another type of disparity was observed in the XGBoost model. Although it had one of the highest accuracies in the previous section, the predictive power was actually the lowest (AUROC = 0.49). This was due to imbalanced class assignments, favoring the majority class to achieve better accuracy at the cost of class discrimination power.

Fig. 4. Receiver-operating characteristic (ROC) curves evaluating predictive potential for preterm birth by different models. The area under the curve (AUC) was calculated to determine predictive strength. The 95 % confidence interval and significance (p-value) are shown on each panel. (Left) Non-resampled models. (Right) Bootstrapped models.

3.3. Effect of resampling on clinical datasets

Resampling was designed using a bootstrap method originally proposed by Efron [28]. In this study, 100 subsample sets were created from the training dataset by random sampling with replacement from the same set. Each of the four models (PLS-DA, linear logistic regression, artificial neural network, and XGBoost) was configured to iterate through the 100 bootstrapped sets for comparison.

3.3.1. Comparison of model confusion matrices, resampled dataset

Overall, the accuracies for bootstrapped models were notably higher than those from the base models. Improvements in confusion matrix metrics were also evident (Fig. 2). For PLS-DA, the mean accuracy increased from 0.63 to 0.71 with bootstrapping. While there was an improved true term rate, the main driver of better accuracy was the improved specificity (0.40 to 0.54), indicating a much-improved ability to correctly assign preterm samples. Likewise, linear logistic regression saw a 0.33 elevation in accuracy that could be attributed to much-improved sensitivity (0.67 to 0.86) and specificity (0.33 to 0.66) for the same reason. In contrast, when trained on bootstrapped data, the neural network only marginally improved its performance. However, the XGBoost model saw considerable improvement with the bootstrapped data. While the base model already had an accuracy of 0.70, the mean accuracy improved to 0.82 with bootstrapping. This improvement was achieved by a more balanced class distribution, closer to the 1:2 preterm to term ratio in the original dataset, as shown in Fig. 2A.

3.3.2. Evaluation of model classification certainties, resampled dataset

Introducing bootstrap resampling had different effects on how points were assigned in each model. The displayed probability distributions represent the mean probability value for each sample across 100 iterations (Fig. 3). The bootstrapped PLS-DA model diverged slightly away from the 0.5 axis for better certainties, with fewer misclassifications of the test set compared to its base model. Interestingly, the linear logistic regression model, which saw greatly improved performance in the confusion matrix, had points moving closer to the centre, indicating an actual reduction in certainties for most samples. This effect was stronger in ANN, with bootstrapping resulting in a wider spread of samples compared to the base model.
In contrast, the bootstrapped XGBoost model shifted in the opposite direction, with more points diverging away from the centre for improved certainties. Overall, bootstrapping reduced misclassifications across all models, echoing the improved sensitivity and specificity in the confusion matrices of these models (Fig. 2B).

3.3.3. Assessment of predictive strength, resampled dataset

When bootstrapping was applied, the predictive potential by AUROC improved for all models (Fig. 4). One important observation from the resampled models was a smoother ROC curvature compared to the stair-like shapes of their respective base models. This could be attributed to the combining and averaging of the individual variations from subsample sets. PLS-DA showed an improvement from 0.60 to 0.71, whereas the linear logistic regression increased from 0.60 to 0.80. Bootstrapping's improvement on the predictive strength was also observed for the ANN model, increasing slightly from 0.66 to 0.68. The most noticeable improvement by bootstrapping occurred with the XGBoost algorithm, which in this study raised the AUROC from 0.49 to 0.85, significantly alleviating the effect of the class imbalance observed in the base XGBoost model.

3.4. Metabolite identification by SHAP analysis

SHAP rankings for the best-performing model, the bootstrapped XGBoost model, are shown in Fig. 5A and B. While the ranking orders and the list of metabolites varied across models (Supplementary Fig. 3), the types of moieties changing with preterm birth were consistent, and the total number of representative compound classes among the top 20 most contributing metabolites within each model was counted. A shared characteristic across all models was the large number of acylcarnitines and amino acid derivatives, accounting for at least half of all ranked metabolites. In most cases, these two classes were reduced in mothers delivering preterm. Several metabolites also stood out for their consistent presence in many of the tested models. Notable metabolites of interest included acylcarnitines, amino acid derivatives, kynurenic acid and pipecolic acid.

Fig. 5. Metabolite ranking using Shapley additive explanations (SHAP) for the bootstrapped XGBoost model. (A) Ranking by the mean absolute impact on model output for the top 20 metabolites. (B) SHAP summary plot of the sample distribution within each metabolite. The Shapley value is the average marginal contribution of a feature value across all possible coalitions of features. Each sample is represented by a dot, with a red/blue spectrum representing high/low levels. A positive SHAP value indicated that the metabolite led the model to predict term birth, whereas a negative value did the opposite. (C) Pathway analysis displaying changes in maternal serum between preterm and term births, plotted by pathway impact and −log10(p-value). Each circle represents a metabolic pathway (KEGG), with larger and more intensely colored circles indicating higher pathway impact and statistical significance. The significance threshold was set at p < 0.05, as indicated by the dashed line.

3.5. Pathway enrichment analysis

Pathway analysis revealed several significantly perturbed metabolic pathways associated with preterm birth (Fig. 5C). The most enriched pathway was tyrosine metabolism, displaying the highest statistical significance (p < 0.001) and pathway impact.
This was followed by the phenylalanine, tyrosine and tryptophan biosynthesis pathway (p = 0.01), phenylalanine metabolism (p = 0.02), and ubiquinone and other terpenoid-quinone biosynthesis (p = 0.046), all of which reached statistical significance but with lower pathway impact scores. Lastly, a trend for enrichment of pyrimidine metabolism was identified, but this failed to reach statistical significance (p = 0.09).

4. Discussion

There is growing interest in machine learning applications for clinical studies. Unlike some other fields in which ML models have excelled, clinical data often face obstacles such as limited sample size and restricted accessibility due to the nature of the data source. Some of these issues may be compensated for in explorative studies, yet additional challenges arise in separating scientifically critical information from non-relevant features. The present study retrospectively investigated the predictive ability of several ML algorithms for binary classification of preterm birth using untargeted metabolomic serum data from an established pregnancy cohort.

4.1. Non-linear algorithms did not guarantee better preterm prediction

It was hypothesized that non-linear models would perform better than conventional linear models such as PLS-DA or linear logistic regression on this preterm birth dataset. While some improvements were observed for the ANN model, the XGBoost model encountered the issue of classification disparity, leading to a high training-set accuracy but the worst AUROC, reflecting poor generalizability. An advantage shown by the layered models such as ANN was the increased predictive probabilities of a sample. However, extensive overfitting on the training set could produce a highly accurate model that nonetheless struggled with generalizability on the testing set [30]. This was highlighted in this study by the higher sensitivity and specificity, supported by higher prediction certainties, observed for the ANN model over the PLS-DA and linear logistic regression models, while the accuracy and AUROC did not see similar improvements. Our findings, in conjunction with past literature, add evidence that non-linear models are not necessarily superior to their linear counterparts. In many cases, the two types of approaches were comparable and showed similar performance, with correspondences that allow bridging and cross-examination between them [31]. In fact, the majority of research suggests that there is no single best modeling approach, but rather that the optimal choice for a given situation is found on a trial-and-error basis.

4.2. Resampling by bootstrapping improved model performance by varying degrees

This work shows that introducing bootstrap resampling improved model performance in predicting preterm birth, though the improvement varied across model types. It improved both predictive accuracy and strength by mitigating the overfitting of models toward the term class. The XGBoost model, whose predictive strength was most affected by the class imbalance, therefore had the largest improvements with bootstrapping. Another outcome was that resampling improved performance by reducing the number of false positives/negatives, which was most evident in the bootstrapped linear logistic regression model. It was able to lower the number of misassignments by more than half by taking the average classification scores across all subsample iterations.
Sampling with replacement reduced the variability within each subset, and thus enabled clearer and more straightforward group separations that greatly benefited the linear models [32]. It led to fewer false positives/negatives in the subset results, which then lowered the overall reported values when averaging across each category [33]. A similar concept was applicable to the bootstrapped PLS-DA model, though to a lesser extent. This benefit was less apparent in the linear logistic regression and the ANN models in our study. The XGBoost model achieved the highest prediction accuracy and strength for this preterm metabolomic dataset with bootstrapping. The XGBoost algorithm has been widely utilized with great success in various metabolomic studies, being the most effective model in several studies of targeted conditions [34], [35]. Our study showed a similar result for this dataset, but also highlighted the training disparity problem for the XGBoost model when working with an imbalanced dataset and proposed a solution via bootstrap resampling to achieve better prediction results. This consideration is essential for predicting preterm birth, given that the condition impacts approximately 10 % of the general newborn population, meaning that the preterm/term data disparity is even more extreme under normal circumstances.

4.3. Altered metabolites were identified by ML for prediction of preterm birth

SHAP analysis was effective in highlighting metabolites contributing to the prediction outputs, primarily acylcarnitines and amino acid derivatives. Acylcarnitines are molecular transporters of fatty acids across the mitochondrial membrane for energy metabolism, while amino acid derivatives are primarily a result of protein degradation [36], [37]. Alterations of these compounds have previously been linked to increased risk of preterm birth and associated conditions during gestation [38], [39]. Pipecolic acid, a key metabolite of lysine degradation, was highlighted in all bootstrapped models (4/4) and was increased in mothers experiencing preterm birth [40], [41]. Another candidate metabolite, appearing in 2/4 bootstrapped models, was kynurenic acid, a metabolite from tryptophan metabolism that was reduced in mothers experiencing preterm birth. Kynurenic acid is involved in immune regulation and cellular energy activities, and alterations to this pathway have previously been associated with pregnancy complications [42].

4.4. Pathway analysis

The identified pathways are well known to influence immune regulation, oxidative stress, and placental function, all of which are critical to maintaining pregnancy. Notably, our findings reveal alterations in pathways related to tyrosine metabolism as well as to phenylalanine, tyrosine and tryptophan biosynthesis. Tyrosine is a non-essential amino acid involved in protein synthesis and catecholamine production. This pathway is not typically implicated in preterm birth risk. On the other hand, disruptions to tryptophan metabolism have previously been linked to various preterm phenotypes (e.g. preterm labor with intact membranes and preterm premature rupture of the membranes) and maternal inflammation [43]. While SHAP analysis also highlighted fatty acid-related species, including acylcarnitines and peptides, pathway analysis often overlooks these signals due to limited database coverage and the use of generalized pathway representations [44].
4.5. Study strengths and limitations

Results demonstrated that both linear and non-linear algorithms were predictive of preterm birth, and that resampling could further improve model performance for this type of data. A key limitation, however, is the absence of validation using a comparable cohort with a larger sample size and greater discriminatory data variance. Robust external validation using independent and diverse cohorts is essential to confirm the reproducibility and reliability of these findings before they can be translated into clinical practice. Another limitation is that, while our work demonstrated that resampling techniques could mitigate class imbalance, which is inherently high in preterm versus term cases, generalization issues may still exist due to varying sampling ratios across different analyses. Finally, this study focused solely on binary classification of preterm birth, a condition with diverse aetiologies that could be further sub-categorized. Future work should explore multiclass classification approaches, considering factors such as time of sampling and distinctions between types of preterm birth (e.g. spontaneous vs. medically indicated). Additionally, incorporating clinical risk factors could enhance model granularity, ultimately supporting more informed clinical decision-making.

5. Conclusions

This study examined different machine learning applications on a clinical untargeted metabolomic dataset to predict preterm birth. The dataset was characterized by its small sample size, low class discrimination, and the presence of non-discriminative features. Linear models achieved moderate performance in predictive accuracy and strength, whereas non-linear models either had slightly improved outcomes or experienced lower predictive strength due to class imbalance. Introducing resampling such as bootstrapping improved both accuracy and predictive strength, but the extent varied significantly based on the type of model, with XGBoost benefiting the most. Acylcarnitines and amino acid derivatives were the major classes of compounds contributing to preterm prediction. Multiple models also identified kynurenic acid and pipecolic acid as key contributing metabolites. Although additional research is necessary before clinical implementation, these findings provide valuable considerations for advancing the use of machine learning to predict preterm birth.

Author statement

Each named author has substantially contributed to conducting the underlying research and drafting this manuscript. All authors approved the submission to CSBJ.

CRediT authorship contribution statement

Ying-Chieh Han: Visualization, Methodology, Investigation, Formal analysis, Data curation. Slater Donna C: Project administration, Methodology, Investigation. Jane Shearer: Writing – review & editing, Supervision, Resources, Investigation, Conceptualization. Tough Suzanne C: Resources, Investigation, Funding acquisition. Chunlong Mu: Visualization, Validation, Supervision, Software, Resources. Gavin E. Duggan: Writing – review & editing, Visualization, Validation, Software, Resources, Project administration, Formal analysis.

Ethics Approval

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the University of Calgary (REB15-0248).

Funding

This research was funded by NSERC (J.S., RGPIN/04238-2018). All Our Families was funded through Alberta Innovates Interdisciplinary Team Grant #200700595 and the Alberta Children's Hospital Foundation.
Software Availability

The code for the machine learning models can be downloaded with open access from Google Drive: https://drive.google.com/drive/folders/14VNrHVUyZH1PLgycjfWCBfmqFUjCcSJV?usp=sharing. The downloaded files can be viewed and executed through Google Colaboratory: https://colab.research.google.com/. Alternatively, a GitHub repository has been created for viewing the same code (https://github.com/ianhstudent/ColabMLNotebook.git).

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: co-author Dr. Gavin Duggan is employed by Google Inc. This affiliation did not influence the study design, data collection and analysis, modeling, interpretation of results, decision to publish, or the preparation of the manuscript. All aspects of the research were conducted independently of the investigator's employment. All other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.07.010 (mmc1.pdf, 1 MB).

Data Availability

All raw data are available upon request. Data are being uploaded to the NIH Common Fund's National Metabolomics Data Repository (NMDR, https://www.metabolomicsworkbench.org/data/index.php).

References