Abstract

Preterm birth (PTB) is the leading cause of death in children under five, yet comprehensive studies are hindered by its multiple complex etiologies. Epidemiological associations between PTB and maternal characteristics have been previously described. This work used multiomic profiling and multivariate modeling to investigate the biological signatures of these characteristics. Maternal covariates were collected during pregnancy from 13,841 pregnant women across five sites. Plasma samples from 231 participants were analyzed to generate proteomic, metabolomic, and lipidomic datasets. Machine learning models showed robust performance for the prediction of PTB (AUROC = 0.70), time-to-delivery (r = 0.65), maternal age (r = 0.59), gravidity (r = 0.56), and BMI (r = 0.81). Time-to-delivery biological correlates included fetal-associated proteins (e.g., ALPP, AFP, and PGF) and immune proteins (e.g., PD-L1, CCL28, and LIFR). Maternal age negatively correlated with collagen COL9A1, gravidity with endothelial NOS and inflammatory chemokine CXCL13, and BMI with leptin and structural protein FABP4. These results provide an integrated view of epidemiological factors associated with PTB and identify biological signatures of clinical covariates affecting this disease.

Deep biological profiling and machine learning reveal insights into the epidemiology of preterm birth.

INTRODUCTION

Preterm birth (PTB), defined as birth before 37 weeks of gestation, is the leading cause of mortality and morbidity in children under 5 years of age across the globe, with a particularly high prevalence in low- and middle-income countries (LMICs) (1). Children born preterm are at increased risk of multiple life-threatening short-term complications and long-term neurological, cardiovascular, and metabolic morbidities (2).
PTB therefore affects the lives of the mother, child, and family and poses a high burden to global public health that falls overwhelmingly on the health systems of LMICs (3). However, efforts to improve our understanding of PTB are hindered by its complex etiologies. Proposed mechanisms for spontaneous PTB include infection, inflammation, loss of maternal-fetal immune tolerance, placental senescence, cervical insufficiency, and vascular disease, with each mechanism possibly affecting different populations to varying degrees (4–6). The complexity of PTB has also led to a scarcity of possible interventions and screening procedures, underscoring the need to further the understanding of this disease (7).

Epidemiologic associations between PTB and maternal clinical history, demographic characteristics, and social determinants have been extensively described, but have been limited in inference owing to a lack of attendant biological data (8). For example, PTB has a well-known association with maternal obstetric history. In particular, history of a previous PTB remains among the strongest risk factors for PTB in an active pregnancy (9), although short interpregnancy intervals (10) and histories of stillbirth (11), Cesarean delivery (12), and other previous adverse pregnancy outcomes all contribute to increased risk of PTB (9, 13, 14). Socioeconomic determinants of health have recently come into focus as key players that affect the outcome of gestation (14). Environmental exposures (15), quality and number of antenatal care visits (16, 17), psychological distress levels (18, 19), and experiences of racial bias (20) may also influence the health of a woman’s pregnancy. Furthermore, these maternal characteristics intersect and interact in complex ways that may further compound PTB risks (15, 21).
The recent development of high-throughput technologies to profile different levels of an individual’s biology in health and disease has brought forth a new era of precision medicine (14). By using high-dimensional measurements of an individual’s biological systems, such as their transcripts (transcriptome), proteins (proteome), metabolites (metabolome), and immune cell frequencies and functions (immunome), biomarkers of disease can be found and mechanistic hypotheses on biological processes can be inferred (4, 5). In the context of pregnancy, multiomic profiling, together with the development and application of appropriate machine learning frameworks to interpret these complex data, has led to a better understanding of the immunologic and proteomic changes that regulate the chronicity of pregnancy (22–24) and the cross-talk between biological modalities that determines the onset of labor (25). Furthermore, this approach has previously clarified the biology of adverse pregnancy outcomes (26–28), underscoring its potential for improving our understanding of PTB. However, population-specific differences in the biological processes that precede pregnancy complications and the presence of other possible confounding factors have limited the generalizability of multiomic models (29).

Here, we leveraged a large multinational cohort of pregnant women across four LMICs to evaluate a set of epidemiologically derived maternal factors associated with PTB. We used proteomic, metabolomic, and lipidomic profiling and performed multivariate modeling in a subcohort of these women to investigate the biological correlates of some of these maternal covariates for their underlying association with PTB. By combining epidemiological data from the full cohort and biological insights from the multiomic data of a subcohort, we built an integrated view of how epidemiological associations with PTB can affect maternal biologic measures.
Our approach simultaneously identified signatures for maternal covariates affecting PTB and allowed us to propose an innovative conceptual framework for studying the connections between population-level risk factors and underlying pathological mechanisms of PTB.

RESULTS

Participants and study design

Maternal covariates and plasma samples were collected during early to mid pregnancy from a multinational cohort of 13,841 pregnant women recruited by the Alliance for Maternal and Newborn Health Improvement (AMANHI) and the Global Alliance to Prevent Prematurity and Stillbirth (GAPPS) (Fig. 1A). AMANHI sites included Sylhet, Bangladesh (AMANHIB); Karachi, Pakistan (AMANHIP); and Pemba, Tanzania (AMANHIT); while GAPPS sites included Matlab, Bangladesh (GAPPSB) and Lusaka, Zambia (GAPPSZ). Of the 13,841 pregnancies, 1578 (11.4%) resulted in PTBs (defined here as delivery before 37 gestational weeks). Maternal covariates included maternal characteristics previously associated with PTB (e.g., weight and height) as well as maternal clinical history, demographics, and socioeconomic determinants. Participant demographics, antepartum parameters, and pregnancy characteristics are listed in Table 1. A full list of the maternal covariates measured can be found in the Supplementary Materials.

Fig. 1. Study overview. (A) Maternal clinical data and plasma samples were collected from a cohort of 13,841 pregnant women across five sites in four low- and middle-income countries (LMICs). Plasma samples taken during early and mid pregnancy from a subcohort of 231 of these women were further analyzed to generate targeted lipidomic, untargeted metabolomic, and targeted proteomic datasets. Clinical data from the full cohort were used for the prediction of preterm birth (PTB). Multiomic profiling data from the multiomics subcohort were used for interomic correlation analysis and for the prediction of maternal clinical covariates.
Clinical data from the full cohort and multiomic data from the multiomics subcohort were used for the epidemiological multiomic integration. (B) Raster plot depicting the gestational age (GA) at sampling for each woman in the multiomics subcohort stratified by site of origin, where each line represents an individual woman and the dark blue circles and light blue circles represent sampling dates and delivery dates, respectively. The dashed red line at 37 weeks of GA indicates the boundary between preterm (PTB) and term births.

Table 1. Full cohort and multiomics subcohort characteristics. Summary of the demographic, clinical, and other participant characteristics for the cohorts in the study. Values are percentages or medians [interquartile range].

Maternal characteristics | Full cohort (N = 13,841): n | Full cohort: value | Multiomics subcohort (N = 231): n | Multiomics subcohort: value
Age (years) | 13,815 | 25 [22–30] | 229 | 25 [22–29]
Body mass index (kg/m^2) | 13,200 | 22.0 [19.8–25.3] | 226 | 22.0 [19.9–25.3]
Gravidity | 13,831 | 3 [1–4] | 231 | 3 [1–4]
Parity (% nulliparous) | 13,831 | 3,905 (28.2%) | 231 | 70 (30.3%)
Level of education (years) | 13,816 | 8 [5–10] | 231 | 8 [5–10]
Pregnancy characteristics
Gestational age at delivery (weeks) | 13,841 | 39.3 [38.1–40.1] | 231 | 37.1 [34.4–39.6]
Preterm delivery (<37 weeks) | 13,841 | 1,578 (11.4%) | 231 | 113 (48.9%)
Infant sex | 13,620 | Female (48.6%) | 231 | Female (53.6%)
Birthweight (g) | 11,985 | 2,970 [2,605–3,300] | 213 | 2,575 [2,210–3,025]
Spontaneous delivery | 13,555 | 13,555 (79.9%) | 229 | 224 (97.8%)
Site
Sylhet, Bangladesh (AMANHIB) | 13,841 | 2,845 (20.6%) | 231 | 48 (20.8%)
Karachi, Pakistan (AMANHIP) | 13,841 | 2,335 (16.9%) | 231 | 48 (20.8%)
Pemba, Tanzania (AMANHIT) | 13,841 | 4,115 (29.7%) | 231 | 46 (19.9%)
Matlab, Bangladesh (GAPPSB) | 13,841 | 3,434 (24.8%) | 231 | 48 (20.8%)
Lusaka, Zambia (GAPPSZ) | 13,841 | 1,112 (8.0%) | 231 | 41 (17.7%)
Comorbidities
History of preterm birth | 10,001 | 8.4% | 161 | 18.6%
History of stillbirth | 10,207 | 8.7% | 165 | 8.5%
History of miscarriage | 10,317 | 23.0% | 165 | 25.5%
History of cesarean | 10,169 | 9.6% | 163 | 4.3%
Hypertension | 13,450 | 3.0% | 228 | 4.8%
Diabetes | 13,435 | 0.5% | 227 | 0.4%

Measurement of the maternal proteome, metabolome, and lipidome in early and mid pregnancy

Plasma samples taken in early to mid pregnancy [median gestational age (GA) at sampling = 12.2 weeks; SD = 3.2 weeks] from 231 participants were further analyzed to generate proteomic, metabolomic, and lipidomic datasets (Fig. 1A). This subcohort was chosen in a case-control design and included 113 pregnant women with a PTB (48.9%) and 118 with a term delivery (51.1%). Samples were collected from singleton pregnancies without congenital malformations and with spontaneous live deliveries. Case and control samples were matched within each site based on GA at sampling, and maternal covariates were not significantly different between cases and controls. The individual variability in the timing of each visit and eventual delivery enabled the creation of a continuous variable for time at sampling (GA at sampling) and for birth (GA at birth) (Fig. 1B).

The targeted multiplexed proteomic immunoassay measured 1161 proteins using 1196 probes, while the untargeted metabolomic and targeted lipidomic assays produced 4329 and 632 measurements, respectively, for a total of 6157 multiomic features (Fig. 2A). Dataset modularity, as estimated by the number of principal components needed to explain 90% of the variance, indicated a similar information content in the proteomic and metabolomic assays despite their difference in size, with a lower modularity in the lipidomic data (Fig. 2B). The combined multiomic dataset was visualized by first calculating the Spearman correlation between each pair of features and then creating a two-dimensional representation of the feature correlation space using the Uniform Manifold Approximation and Projection (UMAP) algorithm (Fig. 2C) (30).

Fig. 2. Multiomic characterization of early and mid pregnancy.
Plasma samples taken during early and mid pregnancy from a subcohort of 231 women were analyzed to generate targeted lipidomic, untargeted metabolomic, and targeted proteomic datasets. (A) Quantification of the number of measurements (features) of each different omic analyzed. (B) Estimation of the modularity (i.e., the degree of internal correlation between features in a given omic dataset) using the number of principal components needed to explain 90% of the variance. (C) A two-dimensional representation of the multiomic correlation space was generated by first calculating the correlation matrix of the feature space and then using the UMAP dimensionality reduction algorithm for visualization, where green, purple, and brown circles represent lipidomic, metabolomic, and proteomic features, respectively. (D and E) Interomic and intraomic Spearman correlations were quantified and assessed for significance using a cutoff value of Bonferroni-adjusted Spearman P < 0.05, which corresponded to an absolute Spearman Rho > 0.38. (D) Distribution of all correlations by absolute strength of association for the multiomic dataset (dark blue) and a random dataset (light blue), where each simulated feature was generated through bootstrapping of the true feature distribution. (E) Visualization of significant interomic correlations. Left: Chord diagram showing the relative distribution of significant interomic correlations, where the outer ring depicts each individual omic and links correspond to interomic correlations, with colors assigned as shown in the legend. The size of the links is proportional to the total number of significant interactions normalized to the total number of possible correlations between each omic pair. Right: Quantification of the number of significant weak (0.38 to 0.6), moderate (0.6 to 0.8), and strong (0.8 to 1.0) absolute correlations between each omic pair.
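The size of the correlation search space and the per-test Bonferroni threshold follow directly from the feature counts reported for the three assays. The sketch below is only an arithmetic check of those counts; the variable names are illustrative and are not taken from the study's code.

```python
from math import comb

# Feature counts reported for each assay (probes or measurements).
n_features = {"proteome": 1196, "metabolome": 4329, "lipidome": 632}

total_features = sum(n_features.values())

# Every unordered pair of distinct features is one correlation test.
n_pairs = comb(total_features, 2)

# Bonferroni correction: divide the family-wise alpha by the number of tests.
alpha = 0.05
per_test_alpha = alpha / n_pairs

# Split the tests into pairs within one assay (intraomic)
# and pairs that span two different assays (interomic).
n_intra = sum(comb(n, 2) for n in n_features.values())
n_inter = n_pairs - n_intra

print(total_features)            # 6157
print(n_pairs)                   # 18951246
print(f"{per_test_alpha:.2e}")   # 2.64e-09
print(n_intra, n_inter)          # 10281962 8669284
```

Because fewer than half of all possible pairs are interomic, raw counts of significant correlations are not directly comparable between omic pairs, which is why the chord diagram normalizes by the number of possible correlations for each pair.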
The distribution of significant correlations within (intra-) and between (inter-) omics in pregnancy revealed the rigorous orchestration of biological processes during gestation (Fig. 2, D and E). Given the different scales and statistical distribution properties of each biological modality, the absolute nonparametric Spearman correlation coefficient was used to quantify the strength of the associations between features. After adjusting for multiple hypothesis testing, 540,653 of all 18,951,246 feature pairs (2.9%) were significantly correlated (Bonferroni-adjusted Spearman P < 0.05, |Spearman Rho| > 0.38) (Fig. 2D). Significant correlations were driven by robust intraomic coordination so that associations within individual omics accounted for 84% of all significant correlations observed. Nonbiological sources of bias, such as technical bias introduced during data generation, can lead to correlation structures within each omic. As such, interomic correlations can more faithfully reflect real biological concordance. After normalizing for the size of each modality, interactions between the lipidome and the metabolome emerged as the largest relative contributor to interomic correlations, accounting for 52% of all significant interomic associations (Fig. 2E). The early and mid pregnancy interactome consisted of 73,714 weak (|Rho| = 0.38 to 0.6), 10,585 moderate (|Rho| = 0.6 to 0.8), and 2549 strong (|Rho| = 0.8 to 1.0) interomic correlations, with strong interactions almost exclusively driven by associations between metabolome and proteome features (Fig. 2E). Gene set overrepresentation analysis for Gene Ontology (GO) terms (31, 32) performed on the proteins with strong correlations with metabolomic features revealed significant enrichment of pathways associated with cell cycle and cellular metabolism (fig. S1A).
Pathway analysis (33) performed on the set of metabolites strongly correlated with proteomic features indicated the enrichment of pathways associated with amino acid metabolism (fig. S1B). These results suggest that the strongest interomic correlations were driven by proteins and metabolites with interconnected metabolic functions. Overall, the correlations found indicate a strong alignment between the different biological systems profiled, which greatly surpasses what would be expected by random chance. Thus, the pregnancy interactome highlights the strict regulation of these biological systems and their cross-talk during gestation.

Epidemiological factors: Maternal covariates and PTB

Epidemiological characteristics associated with PTB have been previously described (3, 34). To investigate these associations in the full cohort with a multivariate approach, a cross-validated gradient-boosted tree model (XGBoost) (35) was built using only the maternal covariates measured in antenatal visits before 28 weeks. The model was able to distinguish between PTB and term births (Wilcoxon rank sum test P = 6.1 × 10^−147, N = 13,841) (Fig. 3A). The model used a repeated cross-validation scheme to get an unbiased estimate of the performance of the model for unseen participants (Materials and Methods). The area under the receiver operating characteristic curve (AUROC) for the model was 0.70 [95% confidence interval (CI), 0.69 to 0.71], highlighting the difficulty of the predictive task (Fig. 3B and fig. S2A). Given the class imbalance in the complete cohort, the area under the precision-recall curve (AUPRC) was also used to assess the performance of the model (36), with an AUPRC of 0.27 (95% CI, 0.25 to 0.29) (Fig. 3C and fig. S2B). When adjusted for the prevalence of PTB in this population, the relative AUPRC, or Lift, of the model showed a 2.4-fold improvement over the baseline (an AUPRC of 0.27 relative to a PTB prevalence of 11.4%).
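The Lift in the preceding sentence is the AUPRC divided by the PTB prevalence, which is the AUPRC expected from a classifier with no skill. The sketch below reproduces that arithmetic and also illustrates the rank-based identity that links the AUROC to the Wilcoxon rank sum (Mann-Whitney) statistic. The function names and the toy risk scores are illustrative assumptions, not the study's code or data.

```python
def lift(auprc: float, n_positive: int, n_total: int) -> float:
    """AUPRC relative to the prevalence baseline of a no-skill classifier."""
    prevalence = n_positive / n_total
    return auprc / prevalence


def auroc(scores_pos, scores_neg) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    This equals the Mann-Whitney U statistic divided by the number of pairs;
    tied scores count as half a correctly ranked pair.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Reported AUPRC and PTB counts in the full cohort.
print(round(lift(0.27, 1578, 13841), 1))  # 2.4

# Toy risk scores (hypothetical): 11 of the 12 pairs are ranked correctly.
toy_auroc = auroc([0.9, 0.6, 0.4], [0.5, 0.3, 0.2, 0.1])
print(round(toy_auroc, 3))  # 0.917
```

The pairwise loop is quadratic and is meant only to make the definition explicit; a rank-based implementation would be preferable for a cohort of this size.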
The performance of the model built only on maternal covariates demonstrated the existence of a potential signature for PTB risk in this population.

Fig. 3. An epidemiological model for the prediction of PTB. A cross-validated gradient-boosted tree (XGBoost) model for the prediction of PTB was trained on the epidemiological data of the full cohort of 13,841 pregnant women. (A) Distribution of the cross-validated preterm risk scores (the probability of PTB assigned by the epidemiological model to each participant) stratified by PTB status (P = 6.1 × 10^−147, N = 13,841). (B) Receiver operating characteristic (ROC) curve for the epidemiological model [area under the ROC curve (AUROC) = 0.70, 95% CI: 0.69 to 0.71]. (C) Precision-recall curve (PRC) for the epidemiological model [area under the PRC (AUPRC) = 0.27, 95% CI: 0.25 to 0.29]. (D) Correlation network of the maternal and fetal covariates collected during the study, where nodes are grouped by clinical categories and were arranged into a two-dimensional layout using a minimum spanning tree. Each node represents a different covariate, where the color of the node represents the strength of the univariate association of the covariate with PTB and the size of the node represents the predictability of the covariate using multiomic data from the multiomics subcohort. Edges were drawn between significantly associated covariates after adjusting for multiple hypothesis testing, and edge thickness represents the strength of the Spearman correlation between the two covariates. (E) A cross-validated XGBoost model was built for the prediction of PTB on the full cohort of women using only the covariates from each of the clinical categories used, and the performance of each model was visualized using an ROC curve. See the Supplementary Materials for more details on the clinical covariates included in each category.
The gray shadows in the ROC curve and PRC plots represent the 95% CI of the respective curves.

The maternal covariates and measurements of the growing fetus taken during pregnancy and at birth formed a correlation network, which was visualized using a minimum spanning tree (Fig. 3D and fig. S3). The interconnectedness of maternal covariates within and across the clinical categories considered (e.g., maternal obstetric history and socioeconomic determinants; see the Supplementary Materials for more details) underscored the importance of a holistic view of maternal and fetal health when studying PTB. Top features associated with PTB included whether the pregnancy had multiple fetuses (Bonferroni-adjusted Wilcoxon rank sum test P = 3.0 × 10^−72, N = 13,841), the mother’s history of previous PTB (Bonferroni-adjusted Wilcoxon rank sum test P = 1.1 × 10^−26, N = 10,001), and the mother’s history of previous Cesarean delivery (Bonferroni-adjusted Wilcoxon rank sum test P = 8.2 × 10^−25, N = 10,169).

Given the importance of maternal obstetric history for the prediction of PTB in the full cohort, the role of these features was further characterized in parous and nulliparous participants separately. To quantify the impact of including obstetric history when predicting PTB in these populations, models were built with and without these features in the nulliparous, parous, and combined populations and their performances were compared (fig. S4). As expected, inclusion of the maternal obstetric history did not change the performance of the model in nulliparous participants, while the performance of the model in parous participants markedly improved when these features were included. Furthermore, training on the full cohort with or without the obstetric history features led to an overall improvement in the performance of the model, demonstrating the benefit to model performance of a large and heterogeneous population.
For a deeper assessment of the importance of each of these facets of maternal and fetal health in the prediction of PTB, cross-validated XGBoost models using only the features associated with each of these facets were built. The performance of each of these models was assessed using AUROC and AUPRC and is shown in Table 2. Notably, the maternal bloodwork module, consisting of common clinical analytes measured during pregnancy as part of routine antenatal care (see the Supplementary Materials for the full list of analytes), was the clinical category with the highest AUROC (0.65), while the model built using the maternal nonobstetric medical history had the worst performance (AUROC = 0.54) (Fig. 3E). Information from all clinical categories was necessary to obtain the best performance.

Table 2. Performance statistics of the model built on each clinical category for the prediction of PTB. Machine learning–based cross-validated models were built on the full study cohort for the prediction of PTB using epidemiological features. Model performances were assessed using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and the Wilcoxon rank sum test. See the Supplementary Materials for more details on the clinical covariates included in each category.

Clinical category | AUROC [95% CI] | AUPRC [95% CI] | Wilcoxon rank sum test P
Maternal anthropometry | 0.59 [0.58–0.61] | 0.16 [0.15–0.17] | 9.8 × 10^−33
Maternal bloodwork | 0.65 [0.63–0.66] | 0.19 [0.18–0.21] | 9.3 × 10^−82
Medical history | 0.54 [0.53–0.56] | 0.13 [0.13–0.14] | 8.9 × 10^−8
Obstetric history | 0.59 [0.57–0.60] | 0.17 [0.16–0.18] | 1.1 × 10^−26
Socioeconomic determinants | 0.58 [0.57–0.60] | 0.14 [0.14–0.15] | 3.9 × 10^−26
Ultrasound measurements | 0.61 [0.60–0.63] | 0.19 [0.18–0.21] | 3.1 × 10^−48

The correlation network was also used to visualize the impact of each covariate on maternal omics.
This impact was estimated with the covariate’s predictability by multivariate models built with multiomic data in the subcohort for which such data were available (Fig. 3D). Highly predictable covariates clustered together in the minimum spanning tree (MST) layout, with the strongest biological associations arising from covariates in the maternal anthropometry, maternal obstetric history, GA measurements and labor variable, and socioeconomic determinant modules. In summary, this analysis described a multivariate risk factor association of PTB (conserved across multiple geographic regions) and revealed which early and mid pregnancy measurements were most useful for predicting PTB.

Multiomic modeling of the maternal proteome, metabolome, and lipidome predicts time-to-delivery

The progression of pregnancy and the onset of labor are highly coordinated biological processes that can be measured using multiomic profiling (5, 22–25, 37). To determine whether biological signatures were present in the multiomics subcohort that could capture the time from sampling to delivery, a cross-validated XGBoost model was built with the integrated maternal proteome, metabolome, and lipidome during early and mid pregnancy and strongly predicted the time-to-delivery [Pearson’s r (r) = 0.65, 95% CI: 0.57 to 0.72, P = 2.6 × 10^−28, root mean square error (RMSE) = 3.7 weeks, mean absolute error (MAE) = 3.0 weeks, N = 226] (Fig. 4, A and B). Models built on each omic dataset individually had statistically significant performance in predicting time-to-delivery, with the proteome (P = 8.5 × 10^−27, r = 0.63) having the highest predictive power, matching that of the integrated model, followed by the metabolome (P = 7.5 × 10^−21, r = 0.57) and the lipidome (P = 9.6 × 10^−9, r = 0.37) (Fig. 4A).

Fig. 4. Multiomic modeling of the maternal proteome, metabolome, and lipidome predicts time-to-delivery.
A cross-validated gradient-boosted tree (XGBoost) model for the prediction of time from sampling to delivery was trained on each individual omic as well as the combined multiomic dataset. (A) Comparison of the cross-validated performance of each model for the prediction of time-to-delivery. (B) Time-to-delivery predictions of the combined multiomic model (Pearson’s r = 0.65, 95% CI: 0.57 to 0.72, P = 2.6 × 10^−28, RMSE = 3.7 weeks, MAE = 3.0 weeks, N = 226). (C) Two-dimensional UMAP visualization of the multiomic features significantly correlated with time-to-delivery. Each circle represents a feature, with sizes proportional to Spearman correlation with time-to-delivery, and colors representing modality. (D) Gene Ontology (GO) overrepresentation analysis performed on the plasma proteomic features significantly correlated with time-to-delivery as assessed using Fisher’s exact test. (E) Metabolic pathway enrichment analysis performed on the metabolomic features significantly correlated with time-to-delivery. (F) Participants were randomly split into a training set (N = 159, 70%) and a test set (N = 67, 30%) to build a minimal XGBoost model for the prediction of time-to-delivery. Left: Cross-validated model predictions in the training set (Pearson’s r = 0.64, 95% CI: 0.54 to 0.73, P = 7.4 × 10^−20, RMSE = 3.9 weeks, MAE = 3.1 weeks, N = 159). Right: Model predictions in the test set (Pearson’s r = 0.68, 95% CI: 0.53 to 0.79, P = 1.7 × 10^−10, RMSE = 3.4 weeks, MAE = 2.8 weeks, N = 67). (G) A cross-validated XGBoost model for the prediction of GA at sampling was trained on the combined multiomic dataset. Boxplot depicts the discrepancy between predicted GA and ultrasound GA stratified by PTB status. The blue lines and blue shadows in scatterplots represent the regression line and 95% CI of the predicted values versus the ground truth.
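The regression metrics quoted for the time-to-delivery models (Pearson's r with a 95% CI, RMSE, and MAE) can be sketched as follows. The interval calculation assumes the Fisher z-transformation, which the text does not state; under that assumption it reproduces the interval reported for the combined model. The helper names and the toy predictions are illustrative, not the study's code or data.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def pearson_ci(r: float, n: int, level: float = 0.95):
    """Confidence interval for Pearson's r via the Fisher z-transformation."""
    z = atanh(r)                 # transform r to an approximately normal scale
    se = 1.0 / sqrt(n - 3)       # standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return tanh(z - crit * se), tanh(z + crit * se)


def rmse(y_true, y_pred) -> float:
    """Root mean square error, in the units of the target (here, weeks)."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred) -> float:
    """Mean absolute error, in the units of the target (here, weeks)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Combined multiomic model: r = 0.65, N = 226.
lo, hi = pearson_ci(0.65, 226)
print(f"{lo:.2f} to {hi:.2f}")  # 0.57 to 0.72

# Toy time-to-delivery values in weeks (hypothetical).
observed = [26.0, 24.0, 20.0]
predicted = [23.0, 25.0, 22.0]
print(round(rmse(observed, predicted), 2), mae(observed, predicted))  # 2.16 2.0
```

RMSE penalizes large errors more heavily than MAE, so a gap between the two (for example, 3.7 versus 3.0 weeks) suggests that a subset of participants has comparatively large prediction errors.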
Understanding how the model’s error distributes among classes and the range of target values is important for its interpretability and application. The multiomic model for the time from sampling to delivery underestimates the time-to-delivery when far from delivery, while it slightly overestimates the time-to-delivery when closer to delivery (fig. S5A). Consequently, the model generally overestimates PTB risk by underestimating the time-to-delivery in term deliveries (fig. S5B). However, when incorporating the sampling time (which is known) into the model’s time-to-delivery predictions, the model is able to discriminate between PTB and term births (Wilcoxon rank sum test P = 0.04, N = 226) (fig. S5C).

Generalizability across sites is critical for the translational potential of predictive multiomic models. However, the integrated model’s performance varied by site, with the highest performance in the AMANHI Pakistan site (r = 0.83) and the lowest in the GAPPS Bangladesh site (r = 0.41; Table 3). To better understand the impact of integrating heterogeneous sites during training, site-specific models were trained and their performance was compared to the performance of the integrated model in each site (fig. S6A). This analysis revealed that all sites benefit from training using the full cohort, with a median improvement of 0.17 in the Pearson correlation between each site’s predicted values and the ground truth. In particular, the sites that benefit most from integrated training are those where the site-specific models have poor performance, such as the GAPPS Bangladesh site, with an increase of 0.41 in the Pearson correlation of the site’s predictions to the ground truth. Similarly, assessing the performance of the model in cohorts with standard PTB rates is necessary to establish its generalizability to real-world populations.
To estimate the performance of the model in a population with a PTB rate of 11.4% (the prevalence of PTB in the full cohort), random cohorts with this PTB rate were sampled from the multiomics subcohort and the model performance metrics in the sampled cohorts were calculated (fig. S6B). The model’s performance in the sampled cohorts was not significantly inferior to its performance in the multiomics subcohort, indicating its applicability to real-world populations.

Table 3. Performance statistics for the model for the prediction of time-to-delivery. A machine learning–based cross-validated model was built for the prediction of time-to-delivery using multiomic features. Model performance in each study site was assessed using Pearson’s correlation coefficient (Pearson’s r), the root mean square error (RMSE), and the mean absolute error (MAE).

Site | N | Pearson’s r [95% CI] | Pearson correlation P | RMSE (weeks) | MAE (weeks)
Sylhet, Bangladesh (AMANHIB) | 48 | 0.64 [0.43–0.78] | 1.1 × 10^−6 | 3.2 | 2.6
Karachi, Pakistan (AMANHIP) | 48 | 0.83 [0.72–0.90] | 1.7 × 10^−13 | 3.4 | 2.9
Pemba, Tanzania (AMANHIT) | 43 | 0.52 [0.26–0.71] | 3.8 × 10^−4 | 4.4 | 3.3
Matlab, Bangladesh (GAPPSB) | 48 | 0.41 [0.14–0.62] | 3.7 × 10^−3 | 3.5 | 3.1
Lusaka, Zambia (GAPPSZ) | 39 | 0.66 [0.43–0.81] | 4.9 × 10^−6 | 4.2 | 3.4
Combined | 226 | 0.65 [0.57–0.72] | 1.6 × 10^−29 | 3.7 | 3.0

Features significantly associated with the time-to-delivery included some that have been previously described in the literature as well as proteins and metabolites correlated to the chronicity of pregnancy and time-to-delivery (Fig. 4C). Pregnancy-associated steroid hormones such as estriol glucuronide (Spearman’s Rho (Rho) = 0.56) and progesterone (Rho = 0.42) clustered together with placental and fetal proteins such as placental alkaline phosphatase (ALPP; Rho = 0.67), α fetoprotein (AFP; Rho = 0.62), and placental growth factor (PGF; Rho = 0.55).
Notably, immune system–associated proteins programmed death ligand 1 (PD-L1; Rho = 0.43), C-C motif chemokine ligand 28 (CCL28; Rho = 0.44), and leukemia inhibitory factor (LIF) receptor (LIFR; Rho = 0.53) had a strong positive correlation with time-to-delivery. GO term overrepresentation analysis performed on the proteins significantly associated with time-to-delivery revealed an enrichment in the reproductive process GO term (Fisher’s exact test P = 4.5 × 10^−4) and the steroid hormone biosynthesis GO term (P = 0.002) (Fig. 4D). Pathway analysis performed on the set of metabolites significantly correlated with time-to-delivery indicated the enrichment of the linoleate metabolism pathway (P = 7.8 × 10^−7) (Fig. 4E).

Predictive multiomic models that maintain high performance with a minimal number of input measurements are of particular interest for their potential impact in LMICs. To explore the possibility of a minimal model for the prediction of time-to-delivery, participants were randomly split into a training set (N = 159, 70%) for feature selection and a test set (N = 67, 30%) for independent assessment of the performance of the minimal model. The feature selection paradigm is described in Materials and Methods. The minimal model consisted of only three features and achieved the performance of the full model in the independent test set (r = 0.68, 95% CI: 0.53 to 0.79, P = 1.7 × 10^−10, RMSE = 3.4 weeks, MAE = 2.8 weeks, N = 67) (Fig. 4F). The features chosen for the minimal model included proteins ALPP, AFP, and serine peptidase inhibitor Kunitz type 1 (SPINT1; Rho = 0.59) (fig. S7).

Previous work on the chronicity of pregnancy has postulated that models for predicting GA at sampling show “accelerated gestational clocks” in participants who will deliver prematurely (24).
In the multiomics subcohort, a cross-validated XGBoost model accurately predicted GA at sampling (r = 0.84, 95% CI: 0.80 to 0.87, P = 1.4 × 10^−62, RMSE = 1.9 weeks, MAE = 1.4 weeks, N = 231) (fig. S8), and the discrepancy between the multiomic GA and the ultrasound-based GA showed an accelerated gestational clock in the PTB group (P = 0.007, N = 231) ([217]Fig. 4G). In sum, multiomic profiling robustly captures biological processes in early and mid pregnancy, which can predict time-to-delivery and further show evidence of accelerated gestational clocks in women who will deliver prematurely. CXCL13 and eNOS are age-independent markers of pregnancy memory A mother’s age and obstetric history both have an impact on her risk of PTB, yet the independent risk contribution of these intercorrelated factors to PTB is unclear ([218]11, [219]13, [220]38). In the multiomics subcohort, maternal age and gravidity were both predictable with the integrated multiomic data (Age: r = 0.59, 95% CI: 0.50 to 0.67, P = 9.0 × 10^−23, RMSE = 5.0 years, MAE = 3.8, N = 229; Gravidity: r = 0.56, 95% CI: 0.47 to 0.64, P = 1.1 × 10^−20, RMSE = 2.0, MAE = 1.5, N = 231) ([221]Fig. 5, A and B). Maternal age and gravidity shared multiple significantly correlated features, of which type IX collagen (COL9A1) had the strongest association with either covariate (Age: Bonferroni-adjusted Spearman correlation P = 3.1 × 10^−18, Rho = −0.58; Gravidity: Bonferroni-adjusted Spearman correlation P = 1.1 × 10^−19, Rho = −0.60) ([222]Fig. 5C and fig. S9, A and B). This is partially explained by the positive correlation between maternal age and gravidity (r = 0.61 in the full cohort; r = 0.59 in the multiomics subcohort). Despite this positive correlation, proteome features C-X-C motif chemokine ligand 13 (CXCL13) and endothelial nitric oxide (NO) synthase (eNOS) were significantly correlated with gravidity but not significantly associated with maternal age ([223]Fig. 
5D), suggesting age-independent effects on the maternal proteome associated with the mother’s obstetric history. Other features with age-independent significant associations with maternal gravidity included soyasapogenol A and dolichosterone.

Fig. 5. CXCL13 and eNOS are age-independent markers of pregnancy memory. Cross-validated gradient-boosted tree (XGBoost) models for the prediction of maternal age and gravidity were trained on each individual omic as well as the combined multiomic dataset. (A) Comparison of the cross-validated performance of each model for the prediction of maternal age and gravidity. (B) Left: Maternal age predictions of the combined multiomic model (Pearson’s r = 0.59, 95% CI: 0.50 to 0.67, P = 9.0 × 10^−23, RMSE = 5.0 years, MAE = 3.8, N = 229). Right: Gravidity predictions of the combined multiomic model (Pearson’s r = 0.56, 95% CI: 0.47 to 0.64, P = 1.1 × 10^−20, RMSE = 2.0, MAE = 1.5, N = 231). The blue line and blue shadow represent the regression line and 95% CI of the predicted values versus the ground truth. (C) Example features significantly associated with maternal age, gravidity, or both. (D) Tile chart depicting the strength of the association between the top features significantly correlated with maternal age and gravidity, where a darker color depicts a stronger association. (E) CXCL13 (left) and eNOS (right) levels in maternal plasma in early and mid pregnancy stratified by gravidity. (F) Validation of findings in plasma proteome in an independent set of early pregnancy samples in the AMANHI and GAPPS biorepositories (N = 70). CXCL13 (left) and eNOS (right) levels in early pregnancy stratified by nulliparity. (G) Validation of findings in plasma proteome in an independent cohort from Lucile Packard Children’s Hospital at Stanford (N = 17). CXCL13 levels in early pregnancy stratified by nulliparity.
A multiple regression model for each of these features fit to maternal age and gravidity adjusted with random intercepts for site of origin confirmed that the age-independent associations between gravidity and CXCL13 and eNOS are also site independent (CXCL13: mixed-model fit P = 6.4 × 10^−9; eNOS: mixed-model fit P = 5.8 × 10^−6) (fig. S10). Closer interrogation of the relationship between CXCL13 and eNOS and maternal gravidity revealed inverse correlations between these two proteins and maternal gravidity during early and mid pregnancy (CXCL13: Rho = −0.46; eNOS: Rho = −0.39) ([226]Fig. 5E). The findings regarding eNOS and CXCL13 expression were further validated in two independent cohorts—a set of early pregnancy samples from the AMANHI and GAPPS biorepositories ([227]27) (N = 70) and a cohort from Lucile Packard Children’s Hospital at Stanford University ([228]37) (N = 17). In the AMANHI and GAPPS cohort, nulliparous women had significantly increased levels of CXCL13 (Wilcoxon rank sum test P = 0.003, N = 70) and eNOS (Wilcoxon rank sum test P = 0.0003, N = 70) in early pregnancy when compared with women with previous pregnancies ([229]Fig. 5F). In the Stanford cohort, nulliparous women and those with previous pregnancies had significant differences in CXCL13 expression in the first trimester (Wilcoxon rank sum test P = 0.008, N = 17), validating the results obtained in the multiomics subcohort of this study ([230]Fig. 5G and fig. S11). Overall, these findings reveal previously unknown age-independent effects of previous pregnancies on the maternal biology during future pregnancies. Predictive modeling of the maternal body mass index Maternal body mass index (BMI) is known to have a U-shaped relationship with PTB, with an increased risk of PTB at both ends of the BMI spectrum ([231]39, [232]40). 
To further understand the impact of BMI on maternal biology in the multiomics subcohort, a cross-validated XGBoost model was built with the integrated multiomic data, which was highly predictive of maternal BMI (r = 0.81, 95% CI: 0.76 to 0.85, P = 8.7 × 10^−54, RMSE = 3.3, MAE = 2.5, N = 226) and revealed a particularly strong signature in the maternal proteome (r = 0.80, P = 1.1 × 10^−51) ([233]Fig. 6, A and B). Features significantly correlated with BMI consisted of proteins associated with metabolism and biomolecule storage and a cluster of highly intercorrelated lipid species ([234]Fig. 6C). Adipocyte hormone leptin (LEP; Rho = 0.68), structural protein fatty acid binding protein 4 (FABP4; Rho = 0.51), and lipid-associated perilipin 1 (PLIN1; Rho = 0.41) positively correlated with maternal BMI, findings supported by previous reports ([235]41–[236]43). Conversely, insulin-like growth factor binding protein 2 (IGFBP2; Rho = −0.54) and high-density lipoprotein (HDL)–associated paraoxonase 3 (PON3; Rho = −0.40) exhibited inverse correlations with maternal BMI ([237]43, [238]44). The lipidome features most correlated with BMI all showed positive correlations and included lipids PC(16:0/20:4) (Rho = 0.43), CE(20:4) (Rho = 0.44), and TAG58:10-FA20:4 (Rho = 0.42).

Fig. 6. Predictive modeling of the maternal BMI. A cross-validated gradient-boosted tree (XGBoost) model for the prediction of maternal BMI was trained on each individual omic as well as the combined multiomic dataset. (A) Comparison of the cross-validated performance of each model for the prediction of maternal BMI. (B) Maternal BMI predictions of the combined multiomic model (Pearson’s r = 0.81, 95% CI: 0.76 to 0.85, P = 8.7 × 10^−54, RMSE = 3.3, MAE = 2.5, N = 226). The blue line and blue shadow represent the regression line and 95% CI of the predicted values versus the ground truth.
(C) Two-dimensional UMAP visualization of the multiomic features significantly correlated with maternal BMI. Each circle represents a feature, with sizes proportional to Spearman correlation with maternal BMI, and colors representing modality. (D) GO overrepresentation analysis performed on the plasma proteomic features significantly correlated with maternal BMI as assessed using Fisher’s exact test. (E) Metabolic pathway enrichment analysis performed on the metabolomic features significantly correlated with maternal BMI.

GO term overrepresentation analysis performed on the proteome features significantly correlated with BMI indicated an enrichment of metabolic gene sets such as fatty acid transport (Fisher’s exact test P = 0.0019) and lipid catabolic process (Fisher’s exact test P = 0.0098) ([241]Fig. 6D). Pathway analysis of the set of metabolites with a significant association with BMI revealed the enrichment of metabolism-related pathways like ascorbate and aldarate metabolism (P = 7.7 × 10^−5), butanoate metabolism (P = 9.9 × 10^−5), and arginine and proline metabolism (P = 0.003) ([242]Fig. 6E). A cross-validated XGBoost model built on the maternal proteome from the multiomics subcohort was further able to predict maternal BMI in an independent set of early pregnancy samples from the AMANHI and GAPPS biorepositories (Pearson’s r = 0.50, 95% CI: 0.29 to 0.67, P = 12.5 × 10^−5, RMSE = 3.9, MAE = 2.9, N = 63) (fig. S12). In summary, these results provide an integrated view of epidemiologically associated variables for PTB and their respective biological correlates. These observations highlight the unprecedented opportunity to study how such variables can affect maternal biology.

DISCUSSION

This study leveraged a large, multinational cohort with harmonized data collection through two major biorepositories in LMICs to describe a generalizable epidemiological set of factors associated with PTB.
We augmented those data with attendant high-dimensional proteomic, metabolomic, and lipidomic profiling of maternal plasma in a subcohort to build an integrated view of how such maternal factors, associated with PTB, can affect omics during pregnancy. The epidemiological model found both site-agnostic correlates and cohort-specific associations of PTB (fig. S2C), shedding light on the consistent risks of this complex disease as well as how other risks vary across populations. The integrated multiomic modeling of maternal factors revealed the biological signatures of the time from sampling to delivery, maternal age, gravidity, and BMI, deepening our understanding of maternal biological adaptations to pregnancy. Predicting PTB using only information from early and mid pregnancy remains a challenge in clinical practice, with most screening procedures achieving modest performance ([243]45, [244]46). While complex machine learning models trained on electronic health record (EHR) data have achieved high performance, deploying these models in LMICs—where PTB is most prevalent—poses a challenge, limiting their scalability ([245]47, [246]48). The epidemiological model described here uses easily collectable data from the mother’s medical history and early and mid pregnancy antenatal care visits to build a generalizable model that works across four different LMICs with an overall AUROC of 0.70, a performance comparable to previously published predictive models of PTB ([247]49–[248]52). The model’s AUPRC of 0.27, while modest, reflects a Lift of 2.4 over the baseline prevalence of PTB in our cohort, which is in line with the Lift of 2.5 reported by the only comparable previous study that also used AUPRC for assessing model performance ([249]52). The epidemiological risk prediction of PTB inferred from our full cohort confirmed previously known associations between maternal covariates and incidence of PTB. 
The most useful features for the prediction of PTB spanned across all the clinical categories used, highlighting the importance of an integrated view of maternal health when predicting PTB. History of previous PTB, Cesarean delivery, stillbirth, or miscarriage were strongly associated with PTB in our cohort, all of which are known to be risk factors of PTB ([250]11–[251]13). Maternal anthropometry and overall health were also relevant for PTB prediction, with lower maternal height ([252]53) and weight ([253]54), diabetes ([254]55), and hypertension ([255]56) all contributing to increased risk of PTB. Notably, the epidemiological risk prediction included several social determinants of health, with a lowered incidence of PTB in women with higher levels of education ([256]57, [257]58) and active employment. Determining the time-to-delivery using noninvasive methods (such as targeted measurements in blood and urine) is of high interest given their potential for scalability and deployment in LMICs ([258]28). In our multiomic model, predictions for time-to-delivery were powered by a set of highly intercorrelated plasma proteins and metabolites, including multiple placental and fetal proteins and pregnancy-associated steroid hormones. Many of the proteins in this module have been previously associated with the chronicity of pregnancy ([259]23, [260]27, [261]34, [262]59), and the set of steroid hormones recapitulates what is known regarding the dynamics of estriol glucuronide and progesterone during the course of pregnancy ([263]24, [264]60), both of which validate our approach. We also observed a negative correlation between cortolone glucuronide and time-to-delivery (Rho = −0.40), contrasting with previous reports on cortisol-related hormone levels during pregnancy ([265]61). 
This is possibly explained by a pregnancy-associated shift of the metabolism of cortisol-related metabolites, particularly since glucuronidation is a key modification for the excretion of steroid hormones ([266]62). The minimal model for the prediction of time-to-delivery needed only three proteins to achieve the performance of the full model in an independent test set ([267]Fig. 4F). Of these proteins, one was of fetal origin (AFP), while the other two were derived from the placenta (ALPP and SPINT1). Notably, low levels of SPINT1 have recently been shown to be a marker of placental insufficiency and fetal growth restriction ([268]63, [269]64). Together with our findings relating SPINT1 to the chronicity of pregnancy, these results highlight the need for further interrogation of the role of SPINT1 in the progression of pregnancy. Recent work focusing on the dynamics preceding labor onset highlighted the importance of the interactions between plasma proteins and immune system cellular responses for maintaining fetal tolerance and initiating parturition ([270]25). Mirroring this finding, we observed increasing levels of immune system–associated proteins such as PD-L1, CCL28, and LIFR in maternal plasma closer to the onset of labor. PD-L1 is a key regulator of T cell–mediated immunity and peripheral tolerance, and there is a growing body of evidence that it is also critical for the establishment and progression of a healthy pregnancy ([271]65, [272]66). CCL28 is a chemokine with diverse immune functions, including antimicrobial activity and immune regulation through the recruitment of regulatory T cells (Tregs) to mucosal surfaces. LIFR and its ligand LIF play a role in maintaining a tolerogenic environment during pregnancy by promoting Treg differentiation ([273]67, [274]68).
Soluble PD-L1 levels have been reported to increase in maternal plasma throughout pregnancy ([275]69), while both CCL28 and LIFR levels were previously found to be elevated in the third trimester when compared to postpartum ([276]70). Our results complement these findings by providing a longitudinal assessment of these immunological proteins during pregnancy, showing increasing levels of all three proteins, which could help maintain immune tolerance of the fetus, maintain an anti-inflammatory environment, and prevent an early onset of parturition. Our work also revealed an age- and site-independent association between a woman’s gravidity and levels of eNOS and CXCL13 during early and mid pregnancy. This finding was further validated in an independent Stanford cohort, suggesting its generalizability across populations. Dysregulation of eNOS and CXCL13 has been previously associated with adverse pregnancy outcomes such as PTB ([277]71, [278]72) and preeclampsia ([279]73). eNOS produces NO, a strong vasodilator, making eNOS critical for proper vascular and endothelial function and the regulation of blood pressure ([280]72). CXCL13 is a potent lymphocyte (and particularly B cell) chemoattractant with antimicrobial and antiangiogenic activity that also plays a role in the recruitment of lymphocytes to sites of inflammation ([281]74). Both eNOS and CXCL13 in the periphery have been observed to increase during pregnancy and originate from the placenta ([282]73) and decidua ([283]75), respectively. In these tissues, eNOS plays a key role in supporting maternal-fetal exchange through the production of NO ([284]76), while CXCL13 is hypothesized to recruit B cells and Tregs to support maternal-fetal immune tolerance ([285]71, [286]75). Our results suggest that multiparous women might have an incrementally decreased up-regulation of eNOS and CXCL13, reflecting a form of “pregnancy memory” ([287]77) in which maternal adaptations to pregnancy affect future pregnancies.
Pregnancy memory has previously been studied in the context of the prediction of PTB ([288]59), specifically given that having a previous PTB and nulliparity are both risk factors of PTB ([289]8, [290]9, [291]78). In particular, permanent epigenetic changes ([292]79) and the development of immunological memory to fetal antigens ([293]80) after the first pregnancy guide the healthy progression of subsequent pregnancies. On the basis of our findings, previous experience of a successful embryonic implantation and establishment of a functional maternal-fetal interface could lead to a decreased need to up-regulate eNOS and CXCL13 in the beginning of subsequent pregnancies. While our multiomics subcohort allowed us to show this in early and mid pregnancy, the Stanford validation cohort further illustrated the differences seen regarding CXCL13 levels between nulliparous and parous women across pregnancy and postpartum (fig. S11). In the Stanford cohort, each trimester showed increased levels of CXCL13 in the nulliparous women, with the difference between groups decreasing with the progression of pregnancy and vanishing postpartum. However, it has also been proposed that pregnancies permanently impair the responsiveness of the maternal vasculature to NO, leading to oxidative stress, reduced eNOS expression, and endothelial dysfunction in multiparous women ([294]81), which could also explain our findings but does not address why nulliparity is a risk factor for adverse pregnancy outcomes. To more completely understand how a mother’s obstetric history will affect her risk of complications in future pregnancies, further studies on how maternal adaptations to pregnancy persist as pregnancy memory are needed. Both low and high maternal BMI have been shown to increase the risk of PTB, making the impact of BMI on prematurity difficult to disentangle ([295]39, [296]40).
In the full cohort, maternal BMI had a weak positive correlation with GA at delivery (Rho = 0.08, P = 6.4 × 10^−22, N = 13,200) and a moderate association with birth weight (Rho = 0.29, P = 1.8 × 10^−213, N = 11,985), echoing previously reported findings ([297]82). After stratifying by low (<18.5 kg/m^2) and high (>25.0 kg/m^2) BMI, we saw an increased risk of PTB in underweight women [risk ratio (RR) = 1.33, 95% CI = 1.16 to 1.51, P = 3.0 × 10^−5] when compared to women in the reference range, but no substantial increased risk was observed in women with high BMI when compared with those with normal BMI (RR = 1.07, 95% CI = 0.95 to 1.19, P = 0.26). These results are possibly due to the nature of the populations included in the study, particularly the study setting in LMICs. Maternal BMI had a significant impact on the maternal proteome, metabolome, and lipidome, with our cross-validated multiomic model exhibiting strong performance in predicting BMI (r = 0.80). Model predictions were driven by plasma proteins previously associated with BMI, such as LEP, FABP4, and IGFBP2 ([298]41, [299]42). Notably, dysregulation of these proteins has also been associated with pregnancy complications such as gestational diabetes mellitus ([300]83) and PTB ([301]84, [302]85). Our work therefore confirms the relationship between maternal BMI and specific plasma proteins with previously described roles in adverse pregnancy outcomes. This study had certain limitations. First, a case-control study design was used when selecting participants for the multiomics subcohort so that the findings described here were not inferred from a random subset of the full cohort. While the epidemiological cohort was clinically highly heterogeneous, this heterogeneity was not fully reflected in the multiomics subcohort due to the design used to select participants for multiomic profiling. 
The homogeneity of the multiomics subcohort thus limited the degree of inference that could be applied using the multiomic data to understand the biological impact of the variables most associated with PTB. Second, the sample size, while unprecedented for this type of multiomic study, is small compared to the substantially larger dimensionality of the feature space, which leads to problems with overfitting and lack of generalizability. However, the significant associations and models presented in this study were validated where possible with independent data, demonstrating the robustness of some of our observations and predictive models. Third, the clinical data collected lacked fasting status and nutritional status, both of which are critical to contextualize the findings presented in this study. Specifically, recent diet and nutritional quality can affect the abundance of metabolites and proteins and the risk of PTB, which could inform some of the results shown. Fourth, our study lacks systemic profiling of the immune system by technologies such as mass cytometry, which will be critical to gain further insight into the potential biological mechanisms proposed here. Overall, future studies with larger omic sample sizes and a more clinically diverse population will be needed to validate the findings described here, particularly as it relates to fully integrating the epidemiological variables with the multiomic data to predict PTB. Our study provides an integrated view of how previously identified epidemiological associations such as parity, BMI, and age are etiologically connected to an underlying biology evidenced by various omic correlates. This integrated view was possible by leveraging a large, multinational cohort together with multiomic profiling in a subcohort to simultaneously explore the population-level correlates of PTB and how these associate with individual women’s biology. 
Furthermore, our results put forward a conceptual framework for studying the biological mechanisms underlying some of the long-standing observed population-level risk factors of PTB. In addition, more frequent prenatal visits are associated with lower PTB rates in LMICs ([303]16), suggesting that early identification of at-risk pregnancies could enable targeted use of more frequent antenatal visits, a limited resource in LMICs. Finally, this approach promises to address the lack of generalizability that plagues predictive models for PTB and provides insights for the development of biological and socioeconomic interventions to prevent PTB that can be generalized across multiple populations. MATERIALS AND METHODS Experimental design The study population comprised pregnant women selected from five biorepository-supported cohorts in Matlab, Bangladesh; Lusaka, Zambia; Sylhet, Bangladesh; Karachi, Pakistan; and Pemba, Tanzania. The study was approved by the Stanford University Institutional Review Board, and ethical exemptions were sought and obtained independently from the respective country by each birth cohort supported by the AMANHI and GAPPS biorepositories. Written informed participant consent was obtained from each participant in the original cohorts and extends to the present study. No compensation or incentives were provided for participating in this study. We followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting guideline. This study analyzed plasma collected from May 2014 to August 2018, and data were analyzed from October 2020 to August 2021. GA at the time of sampling was determined by clinical and ultrasonographic assessment. 
To focus on prediction of PTB with measurements taken before 28 weeks GA, pregnant women enrolled in the study with an extremely preterm birth (GA at birth less than 28 weeks) were excluded from the analysis, as were pregnant women with only one antenatal care visit before 28 weeks. The multiomics subcohort was chosen in a case-control design and included 113 pregnant women with a PTB (48.9%) and 118 with a term delivery (51.1%). Samples were collected from singleton pregnancies without congenital malformations and with spontaneous live deliveries. Case and control samples were matched within each site based on GA at sampling, and maternal covariates were not significantly different between cases and controls. Missing values were imputed using mean imputation. Biological assays At all AMANHI and GAPPS sites, trained phlebotomists or nursing staff collected blood samples. After centrifugation, plasma was aliquoted and stored at −80°C until future analyses. Collection and processing of plasma samples were performed according to harmonized operating procedures at all study cohorts. Blood collected in EDTA tubes was cold centrifuged at 3000 rpm for 10 min within 4 hours of collection. Plasma was separated and stored at −80°C until shipment. From each repository, 0.5 ml of plasma for proteome, 0.5 ml of plasma for metabolome, and 0.5 ml of plasma for lipidome analysis were shipped. Samples were shipped on dry ice as a single batch and under continuous temperature monitoring. Lipidomic and metabolomic data were generated using targeted mass spectrometry and untargeted liquid chromatography–mass spectrometry (LC-MS). Metabolites and complex lipids were extracted using a biphasic separation with cold methyl tert-butyl ether (MTBE), methanol, and water in a deep-well plate format. 
Metabolite extracts were analyzed using a broad-spectrum untargeted LC-MS platform as previously described ([304]86), while complex lipids were quantified using a targeted MS-based approach ([305]87). Metabolite data from each mode were independently analyzed using Progenesis QI software (v2.3) (Nonlinear Dynamics, Durham, NC), and metabolite abundances were reported as spectral counts. Lipidyzer data were reported by the Lipidomics Workflow Manager (LWM; v1.0.5.0) software, which calculates concentrations for each detected lipid as average intensity of the analyte multiple reaction monitoring (MRM)/average intensity of the most structurally similar internal standard (IS) MRM multiplied by its concentration. Lipid abundances were reported as concentrations in nmol/g. The proteomic analysis was performed by Olink Proteomics (Watertown, MA) with a highly multiplexed proteomic platform using proximity extension technology, which quantified relative amounts of protein via real-time polymerase chain reaction ([306]88). For this study, 13 panels were used, each measuring 92 different proteins simultaneously in 1 μl of plasma. Relative amounts of protein were quantified as normalized protein expression (NPX). NPX was derived by subtracting the Ct value of the extension control reaction from the raw Ct value (threshold cycle) to adjust for technical variations (dCt), then subtracting differences in Ct values between plates (inter-plate control) from the dCt value (ddCt value) to adjust for inter-assay variability, and then subtracting the ddCt value from a correction factor to adjust for background noise and invert the scale. See the Supplementary Materials for more details.

Statistical analyses

All analyses were performed with R version 4.0.4. A repeated cross-validation scheme was used in all multivariate modeling to prevent overfitting and get an estimate of the performance of the model on unseen participants.
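A repeated cross-validation loop of this kind can be sketched as follows. This is a minimal illustration rather than the study's R code: the 50/50 split, the 50 repeats, and the averaging of each participant's test-set predictions follow the description in this section, while the one-variable least-squares fit is only a stand-in for the XGBoost model, and every function name and the synthetic data are hypothetical.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares line; a simple stand-in for the XGBoost model."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def repeated_split_predictions(xs, ys, n_iter=50, seed=0):
    """Average out-of-sample predictions over repeated random 50/50 splits."""
    rng = random.Random(seed)
    n = len(xs)
    collected = [[] for _ in range(n)]
    for _ in range(n_iter):
        order = list(range(n))
        rng.shuffle(order)
        train, test = order[: n // 2], order[n // 2:]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Only participants held out in this iteration receive a prediction.
        for i in test:
            collected[i].append(slope * xs[i] + intercept)
    # A participant never drawn into a test set would get None.
    return [mean(p) if p else None for p in collected]


def pearson_r(a, b):
    """Pearson correlation between predictions and ground truth."""
    ma, mb = mean(a), mean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den


# Synthetic example: one feature with a noisy linear relation to the outcome.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(40)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]
preds = repeated_split_predictions(xs, ys)
```

Because each averaged prediction comes only from iterations in which that participant was held out, the resulting Pearson's r estimates performance on unseen participants rather than goodness of fit.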
In brief, participants were randomly split evenly into a training set (50%) and a test set (50%). A gradient-boosted tree (XGBoost) ([307]35) model was built on the training data (see the Supplementary Materials for hyperparameters used). The XGBoost model was then used to predict the outcome for the participants in the test data. This procedure was performed 50 times using different train and test sets in each iteration, and final test predictions for each participant were generated by averaging the participant’s predictions from the iterations in which the participant was present in the test set. For the prediction of PTB in the full cohort, a weighted loss was used to address the class imbalance present in the data ([308]Table 1). Model performance was assessed using AUROC and AUPRC for classification tasks. For regression tasks, the Pearson correlation coefficient (Pearson’s r) between model predictions and ground truth was used.

Feature selection for minimal model

The feature selection paradigm consisted of repeated subsampling of the training set to build an XGBoost model in the sampled participants using the full feature set. In each iteration, the top 10 features were picked based on feature importance. After 100 iterations, the 3 features picked with the highest frequency were used to train a cross-validated minimal model on the whole training set and to make predictions on the independent test set. Model performance was then assessed in the cross-validated training set and the independent test set using Pearson’s r.

Multiomic interactome analysis

The multiomic interactome was described by calculating the Spearman correlation coefficients within (intra-) and between (inter-) omics. The Bonferroni correction was applied to control for multiple hypothesis testing.
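As a small illustration of the Bonferroni correction, each raw P value is multiplied by the number of tests performed. The sketch below assumes the adjusted value is capped at 1, which is a common convention but is not stated in the text; the function name and the example P values are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw P value by the number of tests, capped at 1 (assumed)."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Three made-up raw P values from three tests.
raw = [0.001, 0.02, 0.4]
adjusted = bonferroni_adjust(raw)
# A feature is called significant when its adjusted P value is below 0.05.
significant = [p < 0.05 for p in adjusted]
```

Comparing adjusted P values with 0.05 is equivalent to comparing raw P values with 0.05 divided by the number of tests.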
A Bonferroni-adjusted P-value significance threshold of 0.05 was used, which corresponded to an absolute Spearman correlation coefficient threshold of 0.38 based on the t distribution with 229 degrees of freedom. To generate a background distribution of random correlations, a set of random features was generated by sampling each feature with replacement and performing the same correlation calculation on the random feature set. Interomic correlations were further visualized using a chord diagram generated in R with the circlize package (version 0.4.13) ([309]89). The number of significant interomic correlations was normalized to the total number of possible correlations between each omic pair to visualize the relative contribution of each omic pair to the interactome.

Correlation network generation

The maternal and fetal covariates collected during the study were visualized using an MST where nodes represent covariates and are grouped by clinical categories. The underlying correlation structure for each clinical category was calculated using the Spearman correlation matrix and then reduced using an MST. The full MST was extracted by integrating the MSTs for each clinical category together and then arranged into a two-dimensional layout using the Fruchterman-Reingold force-directed algorithm ([310]90). Edges were drawn between significantly associated covariates after adjusting for multiple hypothesis testing using Bonferroni’s correction. The color and size of each node are proportional to the −log10(Wilcoxon rank sum test P value) of the covariate’s association with PTB and the −log10(Pearson P value) of the covariate’s predictability using multiomic data, respectively. The thickness of each edge is proportional to the Spearman’s rank correlation coefficient between the two covariates.

Pathway enrichment analysis

Pathway enrichment was performed on the top proteomic and metabolomic features associated with the clinical covariates of interest.
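At its core, an overrepresentation test of this kind is a one-sided Fisher's exact test on the overlap between the selected features and a gene set or pathway. The sketch below illustrates that calculation only; it is not the topGO or Mummichog implementation, and the function name and all counts are made up.

```python
from math import comb


def enrichment_p(n_universe, n_in_term, n_selected, n_overlap):
    """One-sided Fisher's exact test: P(overlap >= n_overlap) under random selection."""
    upper = min(n_in_term, n_selected)
    total = comb(n_universe, n_selected)
    # Hypergeometric upper tail over all overlaps at least as large as observed.
    tail = sum(
        comb(n_in_term, k) * comb(n_universe - n_in_term, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Made-up counts: 20 measured proteins, 5 annotated to a term,
# 4 selected as significant, 3 of which carry the annotation.
p_value = enrichment_p(n_universe=20, n_in_term=5, n_selected=4, n_overlap=3)
```

The universe here is the set of measured features rather than the whole genome, since only assayed proteins or metabolites can appear among the selected ones.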
Top proteomic features were analyzed using gene set overrepresentation analysis of GO terms ([311]32) performed in R with the topGO package (version 2.42.0) ([312]31). Specifically, Fisher’s exact test was used to determine the enrichment of each GO term in the Biological Process ontology. Top metabolomic features were analyzed in R using the Mummichog algorithm of the MetaboAnalystR package (version 3.0.3) ([313]33). The metabolic model used was Homo sapiens MFN, and significance was assessed by Fisher’s exact test.

Validation cohorts

Previous studies of pregnancy with comparable multiomic approaches were used for independent validation of hypotheses inferred from observations made in this study. Particularly, these studies used similar multiomic modalities and had publicly available data.

AMANHI and GAPPS pilot study ([314]27): This study profiled 81 pregnant women from five AMANHI and GAPPS biorepository-supported sites. Plasma and urine samples were collected during early pregnancy and used to generate proteomic, metabolomic, and lipidomic datasets.

Lucile Packard Children’s Hospital at Stanford University study ([315]37): This study profiled 17 pregnant women from Lucile Packard Children’s Hospital. Peripheral blood, plasma, serum, and culture swab samples were collected at each trimester and postpartum and used to generate immunome, transcriptome, microbiome, proteome, and metabolome datasets.

P-value adjustment

All P values were adjusted for multiple hypothesis testing using the Bonferroni correction where appropriate, in which each raw P value is multiplied by the number of statistical tests performed to calculate the adjusted P value.

Acknowledgments