Abstract

Background
Machine learning (ML), driven by advances in algorithms and computation, is seeing an increased presence in life science research. This study investigated the efficacy of several ML models in predicting preterm birth using untargeted metabolomics from serum collected during the third trimester of gestation.

Methods
Samples from 48 preterm and 102 term delivery mothers from the All Our Families Cohort (Calgary, AB) were examined. Four ML algorithms, Partial Least Squares Discriminant Analysis (PLS-DA), linear logistic regression, artificial neural networks (ANN), and Extreme Gradient Boosting (XGBoost), were applied with and without bootstrap resampling to examine the small-scale clinical dataset for both model performance and metabolite interpretation.

Results
Model performance was evaluated based on confusion matrices, area under the receiver operating characteristic (AUROC) curve analysis, and feature importance rankings. Linear models such as PLS-DA and logistic regression demonstrated moderate classification performance (AUROC ≈ 0.60), whereas non-linear approaches, including ANN and XGBoost, exhibited marginal improvements. Among all models, XGBoost combined with bootstrap resampling achieved the highest performance, yielding an AUROC of 0.85 (95 % CI: 0.57–0.99, p < 0.001), indicating a significant improvement in classification accuracy. Metabolite importance, derived from Shapley Additive Explanations (SHAP), consistently identified acylcarnitines and amino acid derivatives as principal discriminative features. Pathway analysis revealed that disruptions to tyrosine metabolism, as well as to phenylalanine, tyrosine and tryptophan biosynthesis, were associated with preterm delivery.

Conclusions
Our results highlight the complexity of metabolomics-based modelling for preterm birth and support an iterative, model-driven approach for optimizing predictive accuracy in small-scale clinical datasets.

Keywords: Preterm birth, Pregnancy, Metabolomics, Machine learning, Predictive models, XGBoost

Graphical Abstract

Highlights
• Machine learning on a maternal serum metabolomics dataset to predict preterm birth.
• XGBoost with bootstrapping achieved the highest AUROC of 0.85.
• Acylcarnitines and amino acids were key markers via SHAP analysis.
• Tyrosine and tryptophan pathways linked to preterm delivery.

1. Introduction

Advancements in machine learning (ML), with better computation and more sophisticated algorithms, have greatly impacted the landscape of data science, and their applications in clinical domains have drawn interest from health care professionals. Metabolites can serve as sensitive indicators of physiological alterations linked to disease progression, and ML models are increasingly applied to metabolite data for studies of pathogenesis, diagnostics and monitoring. ML algorithms can play a pivotal role in capitalizing on metabolite signatures by analyzing complex datasets, extracting meaningful biological interpretations, and building predictive models for diagnostic purposes [1]. ML integration with metabolomics, the systematic approach for studying metabolites, offers unique insights into disease pathogenesis and biomarker discovery by recognizing hidden patterns and correlations in large-scale analyses of biological systems [2]. While partial least squares-discriminant analysis (PLS-DA) continues to be a widely applied and effective method [3], machine learning techniques have gained increasing prominence in metabolomics research.
Advanced ML models, such as t-distributed stochastic neighbor embedding (t-SNE), apply non-linear transformations that improve data fitting, enable effective dimensionality reduction, and enhance visualization of class separations [4]. These models also employ sophisticated feature selection methods, such as recursive feature elimination and dimensionality reduction, to identify the most informative metabolites [5]. While binary classification with classical yes-or-no decisions is the area drawing the most interest, multiple-decision problems have long existed, and ML has been trained to facilitate these decision processes [6]. However, there are often several obstacles when analyzing clinical data. Unlike other "big data", clinical datasets are often restricted in sample size, whereas a sufficient sample size is a critical necessity for many complex algorithms [7]. Thus, this project aims to take these limitations into account to evaluate some common machine learning approaches on a clinical maternal dataset investigating preterm birth using untargeted metabolomics.

Past research has demonstrated the utility of machine learning-based metabolomic analysis in this field. Employing maternal vaginal fluid metabolome samples collected between 20 and 24 weeks of gestation, an ML metabolomics approach outperformed microbiome-only and maternal covariate-only models, achieving an AUC of 0.78, a finding subsequently validated in two independent cohorts [8]. Likewise, Al Ghadban et al. used untargeted serum metabolomics in a case–cohort of 399 pregnant women serially sampled at 12, 20, 28, and 36 weeks of gestation. Six supervised machine learning methods were trained on the top 47 features, and a random forest model performed best (AUC = 0.73), followed by a generalized boosted model (AUC = 0.71), for predicting spontaneous preterm birth [9]. Combining metabolic profiles with clinical risk factors has also demonstrated success in predicting preterm birth. The Screening for Pregnancy Outcomes by Early Pregnancy Evaluation (SCOPE) study applied untargeted serum metabolomics to asymptomatic pregnant women and found that combining metabolic profiles with clinical risk factors significantly improved early identification of those at risk for preterm birth, with a metabolite-informed clinical prediction model achieving an AUC of 0.73 [10]. Overall, many of these models leveraged advanced feature selection techniques and classification algorithms to identify discriminatory metabolite patterns associated with preterm birth risk. Given this, it was hypothesized that more complex, non-linear models would be advantageous over conventional approaches when working with this type of data due to their ability to better fit curves to the observed variances. This study examined the hypothesis by evaluating the performance of both linear and non-linear ML models in predicting preterm birth, with a feature selection method to facilitate biological interpretation.

2. Materials and methods

2.1. Participant characteristics and statistical analysis

Untargeted metabolomic analysis was performed using third trimester serum samples collected from the antecubital vein of non-fasted pregnant participants between 28 and 32 weeks of gestation. This period was prioritized for its robust metabolic signatures identified in prior metabolomics studies and for being the final window for clinical intervention in preterm birth risk mitigation [11].
Participants (n = 150) were a subset of the All Our Families (AOF) pregnancy cohort (formerly All Our Babies) who donated a blood sample for an ongoing project monitoring maternal/fetal status in Calgary, Alberta [12]. Additional information regarding the AOF study methods and results can be found at the project website: https://ucalgary.ca/allourfamilies. The study was approved by the Child Health Research Office and the Conjoint Health Research Ethics Board at the University of Calgary, and written informed consent was acquired from participants. Eligibility required that individuals were adults (≥ 18 years of age), less than 25 weeks of gestation at the time of recruitment, receiving prenatal care in Alberta, Canada, and willing to complete written questionnaires in English. Individuals were classified as preterm if delivery occurred prior to 37 weeks of gestation, while deliveries at or after 37 weeks were considered term. Gestational age was calculated based on the participants' last reported menstrual period. Participant characteristics are shown in Table 1. For continuous characteristics such as body mass index (BMI, kg/m^2), means and standard deviations were calculated, whereas categorical characteristics were represented by counts and relative percentages (%). Statistical significance was determined based on p-values, with a significance threshold set at 0.05, using an unpaired, two-tailed t-test between the preterm and term groups.

Table 1. Participant characteristics.

Characteristic | Preterm (n = 48) | Term Birth (n = 102) | p-value
Pre-pregnancy BMI (kg/m^2) | 25.3 ± 5.7 | 24.3 ± 4.6 | 0.295
Maternal Age (y) | 32.0 ± 4.8 | 30.9 ± 4.1 | 0.167
Gestation Weight Gain (kg) | 10.0 ± 3.5 | 10.2 ± 4.2 | 0.759
Anxiety score^a | 34.0 ± 9.3 | 30.8 ± 7.5 | 0.082
Depression score^b | 7.3 ± 5.4 | 5.1 ± 4.4 | 0.098
Infant Gestation (wk) | 34.6 ± 1.8 | 38.8 ± 1.2 | < 0.001
Fetal Birthweight (g) | 2532 ± 562 | 3350 ± 442 | < 0.001
Child Sex, female (%) | 29 (60.4) | 56 (54.9) | 0.527
Delivery, n (%)
  Spontaneous labor | 23 (47.9) | 75 (73.5) | -
  Medically indicated^* or Induced | 22 (45.8) | - | -
  Induction (term) | - | 27 (26.5) | -
  Heart rate abnormality | 8 (16.7) | 10 (9.8) | -
  Hypertensive | 7 (14.6) | 4 (3.9) | -
  Gestational diabetes | 3 (6.3) | 2 (2.0) | -
  Uterine scar | 3 (6.3) | 16 (15.7) | -
  Breech position | 3 (6.3) | 2 (2.0) | -
  Preeclampsia | 1 (2.1) | 3 (2.9) | -
  Eclampsia | 6 (12.5) | 1 (1.0) | -
  Premature membrane rupture | 4 (8.3) | 4 (3.9) | -
  Proteinuria | 2 (4.2) | 0 (0) | -
  Restricted fetal growth | 1 (2.1) | 0 (0) | -
  Uterine inertia | 0 (0) | 1 (1.0) | -
  Inadequate contraction | 0 (0) | 2 (2.0) | -
  Incomplete fetal head rotation | 0 (0) | 4 (3.9) | -

Values represent mean ± standard deviation or count (%). P-values were calculated using an unpaired, two-tailed t-test. Significance set at p < 0.05.
^* Preterm. Note: due to small sample sizes, statistics were not evaluated for the delivery characteristics.
^a Spielberger State-Trait Anxiety Inventory (STAI) score, used to assess anxiety.
^b Edinburgh Perinatal Depression Scale (EPDS).
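To make the group comparison concrete, the following minimal sketch performs an unpaired, two-tailed t-test on one continuous characteristic using SciPy; it is not the authors' original script, and the file name and column names ("group", "bmi") are hypothetical placeholders.

```python
import pandas as pd
from scipy import stats

# Hypothetical data frame with one row per participant.
df = pd.read_csv("participants.csv")

preterm = df.loc[df["group"] == "preterm", "bmi"].dropna()
term = df.loc[df["group"] == "term", "bmi"].dropna()

# Unpaired, two-tailed t-test between the preterm and term groups (alpha = 0.05).
t_stat, p_value = stats.ttest_ind(preterm, term, equal_var=True)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, significant = {p_value < 0.05}")
```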
2.2. Metabolomic analysis

All samples were blinded to the investigators for assessment. Upon collection, serum was centrifuged and stored at − 80 °C until the day of analysis using previously described methods [13]. Serum samples were thawed on ice from − 80 °C storage prior to protein precipitation using 100 % methanol. The supernatants were collected via centrifugation followed by solvent evaporation. Samples were reconstituted in 50:50 methanol:water, followed by centrifugation through a 200-micron filter to remove debris. The analytical method was based on positive mode liquid chromatography mass spectrometry (LC/MS, QTOF 6545i, Agilent, USA) using reverse phase chromatography on an Acquity HSS column (2.1 × 150 mm, Waters, USA). A gradient elution system with a mobile phase of (A) 0.1 % formic acid in water and (B) 0.1 % formic acid in acetonitrile was used for chromatographic separation prior to detection by a time-of-flight mass spectrometer recording ions with mass-to-charge ratios (m/z) between 50 and 1200. Collected spectra were processed through XCMS version 3.7.1 for peak labelling and intensity calculations [14]. Metabolites were identified using the Human Metabolome Database (HMDB) to determine the most probable compound candidate based on m/z values with a tolerance threshold of 30 ppm. Compounds with a coefficient of variation (CV) > 30 % were excluded from further analysis. In total, 181 endogenous, known metabolites were identified.

2.3. Computational modeling approach

A schematic of the computational approach, data processing, and the subsequent modeling for this study is shown in Fig. 1. The dataset was subjected to outlier removal by the ROUT method with a moderate threshold of Q = 1 % (GraphPad Prism 9.0, USA) [15]. Missing values, which resulted from either the outlier test or the absence of instrument detection, were imputed using K-nearest neighbor imputation [16]. Data values were then normalized via z-score transformation using the standard scaler operation. The majority of data pre-processing, such as data filtration and normalization, as well as the subsequent modeling, was performed in Google Colaboratory (Colab). Each model was encoded within a single Python script, spanning the initial file upload through model performance evaluation in the form of accuracy, area under the receiver operating characteristic (AUROC) curve, and confusion matrix. Based on commonly reported train/test split ratios in the literature, a ratio of 80:20 was selected [17]. The 150 total participants were randomly split into a training set of 120 and a testing set of 30, stratified with respect to the preterm/term ratio in the original dataset. The model was built and trained on the 120 training samples. The testing set was then used to examine the performance of the trained model. While many algorithms have additional hyperparameters influencing model performance, only the optimal configurations after hyperparameter optimization are reported here so that model results are comparable. Regularization methods (L1 or L2) and dropout did not impact model performance.

Fig. 1. Overview of the data analysis workflow. All comparisons were made based on the same preterm metabolomic dataset from the All Our Families Cohort. The analysis pipeline was structured similarly across different models to cross-examine performance.

For metabolomic studies, models must identify and rank observed features/metabolites based on their contributions to the model output to draw scientifically relevant conclusions. Therefore, the present study evaluates only models which can readily and consistently extract feature importance for metabolite information. Machine learning implementations were mostly conducted through Python's Scikit-learn library [18].
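A minimal sketch of the pre-processing and splitting steps described above is shown below, assuming a hypothetical input file ("metabolomics.csv") with one row per participant and a binary label column "term_birth" (1 = term, 0 = preterm); it illustrates K-nearest neighbor imputation, z-score scaling with Scikit-learn's standard scaler, and the stratified 80:20 train/test split, not the authors' published script.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Hypothetical input: metabolite intensity columns plus a binary label column.
data = pd.read_csv("metabolomics.csv")
X = data.drop(columns=["term_birth"])
y = data["term_birth"]

# Impute missing values (from outlier removal or undetected peaks) with K-nearest neighbors.
X_imputed = KNNImputer(n_neighbors=5).fit_transform(X)

# z-score normalization (standard scaler): zero mean, unit variance per metabolite.
X_scaled = StandardScaler().fit_transform(X_imputed)

# 80:20 train/test split stratified on the preterm/term ratio
# (150 participants -> 120 training and 30 testing samples).
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)
```

In practice the scaler would often be fit on the training split only to avoid information leakage into the test set; the sketch simply mirrors the order of operations described above.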
2.4. Model selection and discrimination

2.4.1. PLS-DA

Partial Least Squares Discriminant Analysis (PLS-DA) has been a go-to approach for multivariate classification tasks, particularly when dealing with high-dimensional data or collinear predictors, which it projects into lower-dimensional spaces [19]. PLS-DA was conducted within Python's Scikit-learn library using the PLSRegression class; two components, which captured 25 % of the training data variance, provided the highest prediction performance. The resulting continuous predictions were converted to binary class assignments at a threshold of 0.5 (via the astype(int) function) before validation with the testing data for performance evaluation.

2.4.2. Linear logistic regression

Linear logistic regression establishes a relationship between the predictor variables (x1, x2, …, xn) and the log odds of the outcome (logit(p)) [20]. Mathematically, this relationship can be represented as:

logit(p) = β0 + β1x1 + β2x2 + … + βnxn    (1)

where β0 is the intercept term and β1, β2, …, βn are the coefficients associated with each feature. Through the logistic transformation, this linear combination is converted into predicted probabilities, ensuring they fall within the range of 0–1. The logistic function facilitates this transformation, mapping the log odds to probabilities:

p = 1 / (1 + e^(−logit(p)))    (2)

Estimation of regression coefficients is typically done through techniques like maximum likelihood estimation, reflecting the influence of each predictor while holding other variables constant [21]. A linear logistic regression model was implemented in this study using Scikit-learn's LogisticRegression package with a liblinear solver. Regularization, which would trade a decrease in training accuracy for an increase in generalizability, did not affect the model performance and was not implemented. The final training and validation losses were calculated to reflect training quality.

2.4.3. Artificial neural network

An artificial neural network (ANN) consists of interconnected nodes, called neurons, arranged in at minimum three distinct layers: an input layer, one (or more) hidden layers, and an output layer [22]. Essentially, each neuron receives inputs from neurons in the previous layer, applies a series of mathematical operations to these inputs, and produces an output signal. The output a of neuron i in layer l can be defined by:

ai(l) = activation(zi(l))

where zi(l) combines the input values from the neurons of the previous layer with the weights for neuron i. In this study, a neural network with a three-layer architecture was constructed, with the rectified linear unit as the activation function, through Python's TensorFlow library [23]. Activation functions influence the types of transformation, leading to very different outcomes, and the optimal approach depends on the type and quality of data [24]. Each model was set to iterate 100 times (epochs) and to calculate the training loss, the training accuracy, the validation loss, and the validation accuracy during each epoch. Each successive epoch optimized the model to lower the loss using backpropagation based on the prior results. Adaptive Moment Estimation (Adam) optimization was used to minimize the loss during training, computed as binary cross-entropy.
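As a concrete illustration of such an architecture, the sketch below builds a comparable three-layer network in TensorFlow/Keras with ReLU hidden units, a sigmoid output, the Adam optimizer, binary cross-entropy loss, and 100 epochs; the hidden-layer width and batch size are illustrative assumptions rather than the exact configuration used in the study, and X_train/y_train/X_test/y_test follow from the split shown earlier.

```python
import tensorflow as tf

# Three-layer architecture: input layer, one ReLU hidden layer, sigmoid output.
# The hidden-layer width (32 units) is an illustrative assumption.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X_train.shape[1],)),   # one input per metabolite feature
    tf.keras.layers.Dense(32, activation="relu"),        # hidden layer with ReLU activation
    tf.keras.layers.Dense(1, activation="sigmoid"),      # output: predicted probability of term birth
])

# Adam optimizer minimizing binary cross-entropy; accuracy tracked each epoch.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# 100 epochs; training and validation loss/accuracy are recorded per epoch in `history`.
history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=100,
    batch_size=16,
    verbose=0,
)
```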
2.4.4. XGBoost classifier

XGBoost, or Extreme Gradient Boosting, is an ensemble learning method based on gradient-boosted decision trees, particularly renowned for its efficacy in predictive modeling and classification tasks [25]. At its core lies an objective function, typically the sum of individual instance-specific loss functions, such as squared error loss for regression or logistic loss for classification, augmented with regularization terms to reduce overfitting. The objective function for XGBoost can be represented as:

Objective = Σ loss(yi, ŷi) + Σ Ω(fk)

where the first sum runs over the n training instances and the second over the K trees in the ensemble, loss(yi, ŷi) is the loss function measuring the discrepancy between the true target yi and the predicted target ŷi, and Ω(fk) represents the regularization term for each tree fk. XGBoost modeling was conducted using the XGBClassifier package from the xgboost library [25]. The learning rate was set to 0.1 with 100 weak learners implemented. 'binary:logistic' was selected as the objective function to fit the preterm/term classification task. The maximum tree depth was set at 5, while the minimum loss reduction required for a tree split (gamma) was set to 0 for more tolerant loss reduction and model estimation.

2.5. Prediction certainties based on probability distribution

Once a model was trained, the probability distribution for the entire dataset was calculated using both the training and the testing sets to visualize the distribution of data points with Python's matplotlib graphing library [26]. The Seaborn library, built on matplotlib, was used to label and distinguish different data categories. For the probability distribution plots in this study, which range from 0 to 1, a probability value of 0 represented absolute certainty that the sample was preterm, whereas a value of 1 indicated absolute certainty of a term-class sample. The centred 0.5 axis was the classification threshold separating the assignments of the two groups; the further away from 0.5, the higher the certainty in the prediction of the respective class.

2.6. Impact of resampling on preterm prediction

Statistical resampling refers to a group of methods used to generate new subsets of the training dataset from existing data to estimate the distribution of a statistic or assess the stability of a model without relying on strong parametric assumptions. These methods involve repeatedly drawing samples from the observed data, with replacement, and executing the subsequent analysis on each resampled dataset [27]. The effects of resampling can also vary across different modeling algorithms depending on how they operate, and there can be dynamic swings in model performance when resampling is introduced. This study examined resampling with bootstrapping based on the method described by Efron [28]. The resampling process was validated using a further 20 % holdout set (within the bootstrapped data) using the StratifiedKFold class from Scikit-learn. Each sample set, both the training and the testing set, was split equally into 5 sections to examine section homogeneity by comparing the variance of a single split versus the other four. Consequently, the training and the validation steps were conducted with 4/5 of the original sample size, 96 samples for training and 24 samples for testing, respectively. An illustration of the resampling design can be found in Supplementary Fig. 1.
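To illustrate how bootstrap resampling can be combined with one of the classifiers, the sketch below draws 100 bootstrap samples of the training set with replacement, fits an XGBoost model with the settings described in Section 2.4.4 on each sample, and averages the predicted probabilities over iterations. It is a minimal sketch under the assumption that X_train, y_train and X_test come from the split shown earlier; it is not a reproduction of the authors' exact pipeline.

```python
import numpy as np
from sklearn.utils import resample
from xgboost import XGBClassifier

n_bootstrap = 100
test_probs = np.zeros((n_bootstrap, len(X_test)))

for b in range(n_bootstrap):
    # Bootstrap sample of the training data (same size, with replacement),
    # stratified to preserve the preterm/term ratio.
    X_boot, y_boot = resample(
        X_train, y_train, replace=True, stratify=y_train, random_state=b
    )

    # XGBoost settings as described in the text: 100 weak learners, learning rate 0.1,
    # maximum depth 5, gamma 0, logistic objective for the binary task.
    model = XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=5,
        gamma=0,
        objective="binary:logistic",
        eval_metric="logloss",
    )
    model.fit(X_boot, y_boot)

    # Predicted probability of the positive class (term birth) for the held-out test samples.
    test_probs[b] = model.predict_proba(X_test)[:, 1]

# Average the per-iteration probabilities, then threshold at 0.5 for class assignments.
mean_probs = test_probs.mean(axis=0)
predictions = (mean_probs >= 0.5).astype(int)
```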
2.7. Feature importance determination

Due to the variation in ML rationale, it is important to use a common algorithm to evaluate the discriminant features within the models. In the present study, Shapley Additive Explanations (SHAP) values were employed to determine the feature importance for each model while providing cross-model comparisons. SHAP values, adapted from game theory, consider all possible combinations of features and calculate the average marginal contribution of each feature across different prediction outcomes [29]. SHAP values were obtained by installing and importing the SHAP library into Python from the publisher's repository. The Explainer function was used to describe the observed metabolite contributions with respect to both training and testing data. The ranking results were visualized through the summary plot function, showing sample distributions among the top 20 contributors for each model. A larger SHAP value for a metabolite indicated that its change pushed the model prediction toward term birth.
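The following minimal sketch shows how such values can be computed and visualized for a tree-based model with the shap library. The fitted XGBoost model and feature matrix are assumed to follow from the earlier sketches, and `metabolite_names` is a hypothetical list of metabolite identifiers; TreeExplainer is used here as the tree-specific variant of the library's explainer.

```python
import shap

# TreeExplainer is the efficient explainer for tree ensembles such as XGBoost.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_train)

# Summary (beeswarm) plot of the top 20 contributing metabolites; each dot is one sample,
# colored by the metabolite level and positioned by its impact on the model output.
shap.summary_plot(
    shap_values, X_train,
    feature_names=metabolite_names,  # hypothetical list of metabolite identifiers
    max_display=20,
)
```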
2.8. Pathway enrichment analysis

Pathway analysis was used to map metabolites to biological processes and metabolic pathways that were significantly affected in preterm birth. Detected metabolites were mapped onto known biochemical pathways catalogued in KEGG (Kyoto Encyclopedia of Genes and Genomes). Analysis was performed in MetaboAnalyst v5.0 [11].

3. Results

3.1. Participants

Participant characteristics are shown in Table 1. Significant differences between the preterm and term groups were observed for gestational age and birthweight (p < 0.05). Other characteristics, including maternal age, pre-pregnancy BMI (kg/m^2), and gestational weight gain, were not different (p > 0.05) between groups. Although not statistically significant, mental health assessments showed higher scores for anxiety (Spielberger State-Trait Anxiety Inventory, p = 0.082) and depression (Edinburgh Perinatal Depression Scale, p = 0.098) among the preterm participants. The incidence of spontaneous preterm birth (n = 23) was nearly identical to that of medically indicated preterm birth (n = 22), whereas spontaneous labor made up the majority (73.5 %) of all recorded term births. Maternal conditions such as hypertension, preeclampsia, or premature membrane rupture were also documented, but these conditions were not widespread in either group and were therefore not taken into consideration in the subsequent modelling.

3.2. Discrimination of machine learning models

Serum untargeted metabolomic data from 48 preterm mothers and 102 term mothers were fed into the different ML models to determine their predictive potential for preterm delivery. Among them, 120 were used to train the models, while the remaining 30 were used to validate the trained model performance (Fig. 1).

3.2.1. Comparison of model confusion matrices

Accuracy was assessed for each model based on its confusion matrix (Fig. 2A). Sensitivity, or the true positive rate, and specificity, or the true negative rate, were also evaluated for classification behavior within each category. The dataset was first evaluated by a conventional PLS-DA model, which had an accuracy of 0.63. The model appeared to experience challenges in properly classifying preterm birth, resulting in low specificity (0.40), with the overall classification strongly favoring the term category. Linear logistic regression achieved a prediction accuracy of 0.57 with near-equivalent numbers of false positives (n = 7) and false negatives (n = 6), while true negatives (n = 14) greatly exceeded true positives (n = 3), and had a sensitivity of 0.67 and a specificity of 0.33. The ANN model, using rectified linear units and sigmoid transformations at the neurons, was able to achieve a validation accuracy of 0.63. The XGBoost algorithm reported an accuracy of 0.70, which was the highest among the models examined. The higher accuracy could be attributed to a higher tendency to classify a sample as term birth, as shown by the classification distribution with 27 samples assigned to term birth versus 3 to preterm birth.

Fig. 2. Confusion matrix comparison between different models for binary classification of preterm (n = 48) and term (n = 102). Each matrix represents the distribution of term and preterm assignments on the testing set (n = 30) by the trained model (n = 120) after the train/test split. Accuracy = (True Positives + True Negatives)/Total number of assignments. Sensitivity = True Positives/(True Positives + False Negatives). Specificity = True Negatives/(True Negatives + False Positives). F1 Score = 2 × True Positives/(2 × True Positives + False Positives + False Negatives). (Left) Non-resampled models. (Right) Bootstrapped models. For bootstrapped models, the standard deviation of accuracy across subsets is calculated as the square root of [1/(total number of subsets − 1) × sum of (individual subset accuracy − mean accuracy)^2].

3.2.2. Evaluation of model classification certainties

The PLS-DA model had most training and testing points around 0.5, with misclassification being common in both sets (Fig. 3). The linear logistic model was overfit; while able to assign most points correctly for the training data, it struggled with the testing set, with large numbers of mis-assigned points for both term and preterm groups. ANN followed a similar trend and was able to capture the observed variance in the training set to near perfection, yet encountered a generalizability issue on the validation dataset. For the XGBoost model, most point assignments were accurate for the training set except the few hovering around the 0.5 axis. However, the majority of the testing set was assigned, correctly and incorrectly, to the term group, leading to a lower validation accuracy.

Fig. 3. Preterm/Term assignments by different models. Feature 1 on the X-axis reflects the predicted class probabilities, while Feature 2 on the Y-axis provides arbitrary coordinates for visualization. The dataset was split into a training set and a testing set. The model was constructed on the training set and its accuracy validated on the testing set. For bootstrapped models, average preterm/term assignments are reported. (Left) Non-resampled models. (Right) Bootstrapped models.

3.2.3. Assessment of model predictive strength

Predictive potential was evaluated based on the AUROC (Fig. 4). PLS-DA, with an AUROC of 0.60, served as a reference for comparison across models. The linear logistic regression showed an AUROC of 0.56, which was slightly lower than the PLS-DA result. Combining this with the accuracy value, the linear logistic model appeared to show worse performance than the conventional PLS-DA approach for this dataset. The ANN model was able to achieve a slight AUROC improvement (0.66) over the PLS-DA model. These comparisons indicated that ANNs, through their layered neuron-like structures and iterative optimizations, might improve predictive strength over conventional approaches. However, when coupled with the observed training and validation accuracies (Supplementary Table 2), the ANN model was likely overfit.
While the training accuracy was 1.0, the validation accuracy and predictive strength were considerably lower. The algorithms were potent to the point where they might force mathematical explanations onto every data variance in the training set, yet they had trouble applying these to the validation set. Another type of disparity was observed in the XGBoost model. Although it had one of the highest accuracies in the previous section, the predictive power was actually the lowest (AUROC = 0.49). This was due to imbalanced class assignments, favoring the majority class to achieve better accuracy at the cost of class discrimination power.

Fig. 4. Receiver-operating characteristic (ROC) curves evaluating predictive potential for preterm birth by different models. The area under the curve (AUC) was calculated to determine predictive strength. The 95 % confidence interval and significance (p-value) are shown on each panel. (Left) Non-resampled models. (Right) Bootstrapped models.

3.3. Effect of resampling on clinical datasets

Resampling was designed using a bootstrap method originally proposed by Efron [28]. In this study, 100 subsample sets were created from the training dataset by random sampling with replacement from the same set. Each of the four models (PLS-DA, linear logistic regression, artificial neural network, and XGBoost) was configured to iterate through the 100 bootstrapped sets for comparison.

3.3.1. Comparison of model confusion matrices, resampled dataset

Overall, the accuracies for bootstrapped models were notably higher than those from the base models. Improvements in confusion matrix metrics were also evident (Fig. 2). For PLS-DA, the mean accuracy increased from 0.63 to 0.71 with bootstrapping. While there was an improved true term rate, the main driver of better accuracy was the improved specificity (0.40 to 0.54), indicating a much-improved ability to correctly assign preterm samples. Likewise, linear logistic regression saw a 0.33 elevation in accuracy that could be attributed to much-improved sensitivity (0.67 to 0.86) and specificity (0.33 to 0.66) for the same reason. In contrast, when trained on bootstrapped data, the neural network only marginally improved its performance. However, the XGBoost model saw considerable improvement with the bootstrapped data. While the base model already had an accuracy of 0.70, the mean accuracy improved to 0.82 with bootstrapping. This improvement was achieved by a more balanced class distribution, closer to the 1:2 preterm to term ratio in the original dataset, as shown in Fig. 2A.

3.3.2. Evaluation of model classification certainties, resampled dataset

Introducing bootstrap resampling had different effects on how points were assigned in each model. The displayed probability distributions represent the mean probability value for each sample across 100 iterations (Fig. 3). The bootstrapped PLS-DA model diverged slightly away from the 0.5 axis for better certainties, with fewer misclassifications of the test set compared to its base model. Interestingly, the linear logistic regression model, which saw greatly improved performance in the confusion matrix, had points moving closer to the centre, indicating an actual reduction in certainties for most samples. This effect was stronger in ANN, with bootstrapping resulting in a wider spread of samples compared to the base model.
In contrast, the bootstrapped XGBoost model shifted in the opposite direction, with more points diverging away from the centre for improved certainties. Overall, bootstrapping reduced misclassifications across all models, echoing the improved sensitivity and specificity in the confusion matrices of these models (Fig. 2B).

3.3.3. Assessment of predictive strength, resampled dataset

When bootstrapping was applied, the predictive potential by AUROC improved for all models (Fig. 4). One important observation from the resampled models was a smoother ROC curvature compared to the stair-like shapes of their respective base models. This could be attributed to the combining and averaging of the individual variations from subsample sets. PLS-DA showed an improvement from 0.60 to 0.71, whereas the linear logistic regression increased from 0.60 to 0.80. Bootstrapping's improvement on the predictive strength was also observed for the ANN model, increasing slightly from 0.66 to 0.68. The most noticeable improvement by bootstrapping occurred with the XGBoost algorithm, which in this study raised the AUROC from 0.49 to 0.85, significantly alleviating the effect of the class imbalance observed in the base XGBoost model.

3.4. Metabolite identification by SHAP analysis

SHAP rankings for the best-performing model, the bootstrapped XGBoost model, are shown in Fig. 5A and B. While the ranking orders and the list of metabolites varied across models (Supplementary Fig. 3), the types of moieties changing with preterm birth were consistent, and the total number of representative compound classes among the top 20 most contributing metabolites within each model was counted. A shared characteristic across all models was the large number of acylcarnitines and amino acid derivatives, accounting for at least half of all ranked metabolites. In most cases, these two classes were reduced in mothers delivering preterm. Several metabolites also stood out for their consistent presence in many of the tested models. Notable metabolites of interest included acylcarnitines, amino acid derivatives, kynurenic acid and pipecolic acid.

Fig. 5. Metabolite ranking using Shapley additive explanations (SHAP) for the bootstrapped XGBoost model. (A) Ranking by the mean absolute impact on model output for the top 20 metabolites. (B) SHAP summary plot of the sample distribution within each metabolite. The Shapley value is the average marginal contribution of a feature value across all possible coalitions of features. Each sample is represented by a dot, with a red/blue spectrum representing high/low levels. A positive SHAP value indicated that the metabolite led the model to predict term birth, whereas a negative value did the opposite. (C) Pathway analysis displaying changes in maternal serum between preterm and term births, plotted by pathway impact and −log10(p-value). Each circle represents a metabolic pathway (KEGG), with larger and more intensely colored circles indicating higher pathway impact and statistical significance. The significance threshold was set at p < 0.05, as indicated by the dashed line.

3.5. Pathway enrichment analysis

Pathway analysis revealed several significantly perturbed metabolic pathways associated with preterm birth (Fig. 5C). The most enriched pathway was tyrosine metabolism, displaying the highest statistical significance (p < 0.001) and pathway impact.
This was followed by the phenylalanine, tyrosine and tryptophan biosynthesis pathway (p = 0.01), phenylalanine metabolism (p = 0.02), and ubiquinone and other terpenoid-quinone biosynthesis (p = 0.046), all of which reached statistical significance but with lower pathway impact scores. Lastly, a trend for enrichment of pyrimidine metabolism was identified, but this failed to reach statistical significance (p = 0.09).

4. Discussion

There is growing interest in machine learning applications for clinical studies. Unlike some other fields in which ML models have excelled, clinical data often face obstacles such as limited sample size and restricted accessibility due to the nature of the data source. Some of these issues may be compensated for in explorative studies, yet additional challenges arise in separating scientifically critical information from non-relevant features. The present study retrospectively investigated the predictive ability of several ML algorithms for binary classification of preterm birth using untargeted metabolomic serum data from an established pregnancy cohort.

4.1. Non-linear algorithms did not guarantee better preterm prediction

It was hypothesized that non-linear models would perform better than conventional linear models such as PLS-DA or linear logistic regression on this preterm birth dataset. While some improvements were observed for the ANN model, the XGBoost model encountered the issue of classification disparity, leading to a high training-set accuracy but the worst AUROC, reflecting poor generalizability. An advantage shown by the layered models such as ANN was the increased predictive probabilities of a sample. However, extensive overfitting on the training set could produce a highly accurate model that nonetheless struggled with generalizability on the testing set [30]. This was highlighted in this study by the higher sensitivity and specificity, supported by higher prediction certainties, observed for the ANN model over the PLS-DA and linear logistic regression models, while the accuracy and AUROC did not see similar improvements. Our findings, in conjunction with past literature, add evidence that non-linear models are not necessarily superior to their linear counterparts. In many cases, the two types of approaches were comparable and showed similar performance, with correspondences that allow bridging and cross-examination between them [31]. In fact, the majority of research suggests that there is no single best modeling approach, but rather that the optimal choice for a given situation is found on a trial-and-error basis.

4.2. Resampling by bootstrapping improved model performance by varying degrees

This work shows that introducing bootstrap resampling improved model performance in predicting preterm birth, though the improvement varied across model types. It improved both predictive accuracy and strength by mitigating the overfitting of models toward the term class. The XGBoost model, whose predictive strength was most affected by the class imbalance, therefore had the largest improvements with bootstrapping. Another outcome was that resampling improved performance by reducing the number of false positives/negatives, which was most evident in the bootstrapped linear logistic regression model. It was able to lower the number of misassignments by more than half by taking the average classification scores across all subsample iterations.
Sampling with replacement reduced the variability within each subset, and thus enabled clearer and more straightforward group separations that greatly benefited the linear models [32]. It led to fewer false positives/negatives in the subset results, which then lowered the overall reported values when averaging across each category [33]. A similar concept was applicable to the bootstrapped PLS-DA model, though to a lesser extent. This benefit was less apparent in the linear logistic regression and the ANN models in our study. The XGBoost model achieved the highest prediction accuracy and strength for this preterm metabolomic dataset with bootstrapping. The XGBoost algorithm has been widely utilized with great success in various metabolomic studies, being the most effective model in several studies of targeted conditions [34], [35]. Our study showed a similar result for this dataset, but also highlighted the training disparity problem for the XGBoost model when working with an imbalanced dataset and proposed a solution via bootstrap resampling to achieve better prediction results. This consideration is essential for predicting preterm birth, given that the condition impacts approximately 10 % of the general newborn population, meaning that the preterm/term data disparity is even more extreme under normal circumstances.

4.3. Altered metabolites were identified by ML for prediction of preterm birth

SHAP analysis was effective in highlighting metabolites contributing to the prediction outputs, primarily acylcarnitines and amino acid derivatives. Acylcarnitines are molecular transporters of fatty acids across the mitochondrial membrane for energy metabolism, while amino acid derivatives are primarily a result of protein degradation [36], [37]. Alterations of these compounds have previously been linked to increased risk of preterm birth and associated conditions during gestation [38], [39]. Pipecolic acid, a key metabolite of lysine degradation, was highlighted in all bootstrapped models (4/4) and was increased in mothers experiencing preterm birth [40], [41]. Another candidate metabolite, appearing in 2/4 bootstrapped models, was kynurenic acid, a metabolite from tryptophan metabolism that was reduced in mothers experiencing preterm birth. Kynurenic acid is involved in immune regulation and cellular energy activities, and alterations to this pathway have previously been associated with pregnancy complications [42].

4.4. Pathway analysis

The identified pathways are well known to influence immune regulation, oxidative stress, and placental function, all of which are critical to maintaining pregnancy. Notably, our findings reveal alterations in pathways related to tyrosine metabolism as well as to phenylalanine, tyrosine and tryptophan biosynthesis. Tyrosine is a non-essential amino acid involved in protein synthesis and catecholamine production. This pathway is not typically implicated in preterm birth risk. On the other hand, disruptions to tryptophan metabolism have previously been linked to various preterm phenotypes (e.g. preterm labor with intact membranes and preterm premature rupture of the membranes) and maternal inflammation [43]. While SHAP analysis also highlighted fatty acid-related species, including acylcarnitines and peptides, pathway analysis often overlooks these signals due to limited database coverage and the use of generalized pathway representations [44].
4.5. Study strengths and limitations

Results demonstrated that both linear and non-linear algorithms were predictive of preterm birth, and that resampling could further improve model performance for this type of data. A key limitation, however, is the absence of validation using a comparable cohort with a larger sample size and greater discriminatory data variance. Robust external validation using independent and diverse cohorts is essential to confirm the reproducibility and reliability of these findings before they can be translated into clinical practice. Another limitation is that, while our work demonstrated that resampling techniques could mitigate class imbalance, which is inherently high in preterm versus term cases, generalization issues may still exist due to varying sampling ratios across different analyses. Finally, this study focused solely on binary classification of preterm birth, a condition with diverse aetiologies that could be further sub-categorized. Future work should explore multiclass classification approaches, considering factors such as time of sampling and distinctions between types of preterm birth (e.g. spontaneous vs. medically indicated). Additionally, incorporating clinical risk factors could enhance model granularity, ultimately supporting more informed clinical decision-making.

5. Conclusions

This study examined different machine learning applications on a clinical untargeted metabolomic dataset to predict preterm birth. The dataset was characterized by its small sample size, low class discrimination, and the presence of non-discriminative features. Linear models achieved moderate performance in predictive accuracy and strength, whereas non-linear models either had slightly improved outcomes or experienced lower predictive strength due to class imbalance. Introducing resampling such as bootstrapping improved both accuracy and predictive strength, but the extent varied significantly based on the type of model, with XGBoost benefiting the most. Acylcarnitines and amino acid derivatives were the major classes of compounds contributing to preterm prediction. Multiple models also identified kynurenic acid and pipecolic acid as key contributing metabolites. Although additional research is necessary before clinical implementation, these findings provide valuable considerations for advancing the use of machine learning to predict preterm birth.

Author statement

Each named author has substantially contributed to conducting the underlying research and drafting this manuscript. All authors approved the submission to CSBJ.

CRediT authorship contribution statement

Ying-Chieh Han: Visualization, Methodology, Investigation, Formal analysis, Data curation. Slater Donna C: Project administration, Methodology, Investigation. Jane Shearer: Writing – review & editing, Supervision, Resources, Investigation, Conceptualization. Tough Suzanne C: Resources, Investigation, Funding acquisition. Chunlong Mu: Visualization, Validation, Supervision, Software, Resources. Gavin E. Duggan: Writing – review & editing, Visualization, Validation, Software, Resources, Project administration, Formal analysis.

Ethics Approval

The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Institutional Review Board of the University of Calgary (REB15-0248).

Funding

This research was funded by NSERC (J.S., RGPIN/04238-2018). All Our Families was funded through Alberta Innovates Interdisciplinary Team Grant #200700595 and the Alberta Children's Hospital Foundation.
Software Availability

The code for the machine learning models can be downloaded with open access from Google Drive: https://drive.google.com/drive/folders/14VNrHVUyZH1PLgycjfWCBfmqFUjCcSJV?usp=sharing. The downloaded files can be viewed and executed through Google Colaboratory: https://colab.research.google.com/. Alternatively, a GitHub repository has been created for viewing the same code (https://github.com/ianhstudent/ColabMLNotebook.git).

Declaration of Competing Interest

The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: co-author Dr. Gavin Duggan is employed by Google Inc. This affiliation did not influence the study design, data collection and analysis, modeling, interpretation of results, decision to publish, or the preparation of the manuscript. All aspects of the research were conducted independently of the investigator's employment. All other authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.07.010 (mmc1.pdf, 1 MB).

Data Availability

All raw data are available upon request. Data are being uploaded to the NIH Common Fund's National Metabolomics Data Repository (NMDR, https://www.metabolomicsworkbench.org/data/index.php).

References