Abstract

Importance

   Worldwide, preterm birth (PTB) is the single largest cause of deaths in
   the perinatal and neonatal period and is associated with increased
   morbidity in young children. The cause of PTB is multifactorial, and
   the development of generalizable biological models may enable early
   detection and guide therapeutic studies.

Objective

   To investigate the ability of transcriptomics and proteomics profiling
   of plasma and metabolomics analysis of urine to identify early
   biological measurements associated with PTB.

Design, Setting, and Participants

   This diagnostic/prognostic study analyzed plasma and urine samples
   collected from May 2014 to June 2017 from pregnant women in 5
   biorepository cohorts in low- and middle-income countries (LMICs; ie,
   Matlab, Bangladesh; Lusaka, Zambia; Sylhet, Bangladesh; Karachi,
   Pakistan; and Pemba, Tanzania). These cohorts were established to study
   maternal and fetal outcomes and were supported by the Alliance for
   Maternal and Newborn Health Improvement and the Global Alliance to
   Prevent Prematurity and Stillbirth biorepositories. Data were analyzed
   from December 2018 to July 2019.

Exposures

   Blood and urine specimens that were collected early during pregnancy
   (median sampling time of 13.6 weeks of gestation, according to
   ultrasonography) were processed, stored, and shipped to the
   laboratories under uniform protocols. Plasma samples were assayed for
   targeted measurement of proteins and untargeted cell-free ribonucleic
   acid profiling; urine samples were assayed for metabolites.

Main Outcomes and Measures

   The PTB phenotype was defined as the delivery of a live infant before
   completing 37 weeks of gestation.

Results

   Of the 81 pregnant women included in this study, 39 had PTBs (48.1%)
   and 42 had term pregnancies (51.9%) (mean [SD] age of 24.8 [5.3]
   years). Univariate analysis demonstrated functional biological
   differences across the 5 cohorts. A cohort-adjusted machine learning
   algorithm was applied to each biological data set, and then a
   higher-level machine learning modeling combined the results into a
   final integrative model. The integrated model was more accurate, with
   an area under the receiver operating characteristic curve (AUROC) of
   0.83 (95% CI, 0.72-0.91) compared with the models derived for each
   independent biological modality (transcriptomics AUROC, 0.73 [95% CI,
   0.61-0.83]; metabolomics AUROC, 0.59 [95% CI, 0.47-0.72]; and
   proteomics AUROC, 0.75 [95% CI, 0.64-0.85]). Primary features
   associated with PTB included an inflammatory module as well as a
   metabolomic module measured in urine associated with the glutamine and
   glutamate metabolism and valine, leucine, and isoleucine biosynthesis
   pathways.

Conclusions and Relevance

   This study found that, in LMICs and high PTB settings, major biological
   adaptations during term pregnancy follow a generalizable model and the
   predictive accuracy for PTB was augmented by combining various omics
   data sets, suggesting that PTB is a condition that manifests within
   multiple biological systems. These data sets, with machine learning
   partnerships, may be a key step in developing valuable predictive tests
   and intervention candidates for preventing PTB.

Introduction

   Preterm birth (PTB) is defined by the World Health Organization as the
   delivery of a live infant before the completion of 37 weeks of
   gestation.^[180]1,[181]2 The worldwide rate of PTB in 2014 was
   estimated to be 10.6% (uncertainty interval, 9.0%-12.0%), with 80% of
   all cases occurring in South Asia and sub-Saharan Africa.^[182]2 Many
   risk factors for PTB have been highlighted in previous studies and
   include obstetrical (eg, previous PTB and multiple gestation), medical
   (eg, maternal obesity, diabetes, and chronodisruption), and external
   (eg, smoking and maternal stress)
   conditions.^[183]3,[184]4,[185]5,[186]6,[187]7,[188]8,[189]9 For
   example, a meta-analysis of individual- and population-level attributes
   among 4.1 million births concluded that “unknown factors requiring
   further research to act upon account for ~2/3 of the preterm birth
   rate.”^[190]10^(p13) Unveiling and elucidating the role of early
   biological antecedents of PTB has been deemed a necessary step toward
   developing new diagnostic tests and therapeutic
   interventions.^[191]11,[192]12,[193]13 Biological investigations into
   the mechanisms of PTB are complicated, as indicated by accumulating
   evidence that distinct patient subpopulations follow divergent
   biological trajectories.^[194]14,[195]15 Given this heterogeneity,
   simultaneously studying diverse cohorts is critical for identification
   of generalizable biological pathways.^[196]16

   Recent technological advances have enabled the characterization of a
   broad range of biological changes during pregnancy. Biological layers
   explored include single-cell profiling of signaling pathways,^[197]17
   measurements of plasma cell-free ribonucleic acid (cfRNA),^[198]18
   proteome^[199]19,[200]20 and metabolome^[201]21 characterization of the
   microbiome,^[202]14,[203]22 and detailed genomics analysis.^[204]23 In
   addition, a recent multiomics investigation demonstrated that
   biological changes during normal pregnancy involve a number of
   intricate interactions of biological processes, which can be measured
   using a coordinated set of assays.^[205]24 The integration of the
   large, multidimensional data sets generated in a multiomics setting
   requires complex machine learning pipelines that will remain robust in
   the face of the inconsistent intrinsic properties of these
   high-throughput assays and cohort-specific variations.^[206]15

   To our knowledge, this is the first multiomics analysis of term and
   preterm pregnancies from multiple cohorts in low- and middle-income
   countries (LMICs). These cohorts were established using biorepositories
   of samples and phenotypic data for studying maternal and fetal outcomes
   collected and stored from diverse populations of South Asia and
   sub-Saharan Africa. The study aimed to investigate the ability of
   transcriptomics and proteomics profiling of blood plasma and
   metabolomics analysis of urine to identify early biological
   measurements associated with PTB.

Methods

   Approval was obtained from the Stanford University Institutional Review
   Board, and ethical exemptions were sought and obtained independently
   from the respective country by each birth cohort supported by the
   Alliance for Maternal and Newborn Health Improvement (AMANHI) and the
   Global Alliance to Prevent Prematurity and Stillbirth (GAPPS)
   biorepositories. Written informed patient consent was obtained from
   each participant in the original cohorts and extends to the present
   study. We followed the Transparent Reporting of a Multivariable
   Prediction Model for Individual Prognosis or Diagnosis ([207]TRIPOD)
   reporting guideline. This study analyzed plasma and urine samples
   collected from May 2014 to June 2017, and data were analyzed from
   December 2018 to July 2019.

Participants and Study Design

   The study population comprised pregnant women selected from 5
   biorepository-supported cohorts in Matlab, Bangladesh; Lusaka, Zambia;
   Sylhet, Bangladesh; Karachi, Pakistan; and Pemba, Tanzania. No
   compensation or incentives were provided for participating in this
   study.

   Plasma samples were assayed to measure targeted proteins and cfRNA, and
   urine samples were analyzed for metabolites. The cfRNA analysis
   resulted in 20 659 measurements, the targeted proteomics assay measured
   1002 proteins in plasma, and 6630 metabolites were measured in urine.
   The number of measurements of these assays did not correlate with their
   modularity, as indicated by the number of principal components needed
   to account for 90% of the total variance ([208]Figure 1A). This result
   highlighted the need for a 2-layer metadimensional integrative approach
   to prevent the assays with more measurements to bias the predictive
   models (eMethods in the [209]Supplement). An overview of the entire
   data set was produced by first calculating a correlation network of all
   available measurements and then producing a 2-dimensional layout for
   visualization using the t-SNE^[210]25 algorithm ([211]Figure 1B).

Figure 1. Study Overview.

   [212]Figure 1.
   [213]Open in a new tab

   A, The 3 data sets (plasma cell-free ribonucleic acid [cfRNA] or
   transcriptomics, metabolomics, and proteomics) produced a number of
   different features and had a range of correlations among the measured
   features. The internal correlation between features from each data set
   was quantified using the number of principal components (PCs) needed to
   capture 90% variance (eg, the cf-RNA data set had the most features but
   was highly correlated internally; therefore, fewer PCs were needed). B,
   A 2-dimensional representation of all measurements demonstrates the
   correlation between subsets of urine metabolites and cfRNA detected in
   plasma as well as a limited number of plasma proteins.

Biological Assays

   From all AMANHI and GAPPS cohorts, trained phlebotomists collected
   blood samples for centrifugation and aliquoting of serum, plasma, and
   buffy coat for storage and future analyses. In addition, maternal urine
   was collected in parallel. Collection and processing of all sample
   types were performed according to harmonized operating procedures at
   all study cohorts. The eMethods in the [214]Supplement provides details
   on the biological assays.

Statistical Analysis

   Data were analyzed from December 2018 to July 2019. All analyses were
   performed with R, version 3.6.1 (R Foundation for Statistical
   Computing). All multivariate modeling was performed with a 2-layer
   cross-validation strategy to prevent overfitting of the data and to
   ensure generalizability. Mixed-effect models were used to account for
   cohort-specific variations (eMethods in the [215]Supplement). The
   analysis is independently reproducible. The measured features from all
   3 omics data sets (transcriptomics, metabolomics, and proteomics); the
   algorithms and source codes for reproduction of the results; and an
   interactive website for visualizing the entire data set, the feature
   evaluation scores for PTB and gestational age (GA) at sampling, and the
   pathway enrichment analysis are available online
   ([216]https://nalab.stanford.edu/multiomicsmulticohortpreterm/).

   We used linear discriminant analysis and principal component analysis
   (PCA), respectively, to create a 2-dimensional representation of the
   entire cohort with cohort labels as the supervised guide and without
   supervised information. To confirm the presence of cohort-specific
   signatures, we used random forest analysis. We created models for each
   patient to estimate GA at the time of sample collection. To
   simultaneously optimize the integrative model and test the performance
   of the model on previously unseen patients, we applied a
   cross-validation strategy. To predict PTB (GA at delivery <37 weeks),
   we used a leave-one-out cross-validation procedure to test the models
   on blinded participants.

Results

   Of the 81 pregnant women included in this study, 39 had PTBs (48.1%)
   and 42 had term pregnancies (51.9%). The mean (SD) maternal age was
   24.8 (5.3) years. The median sampling time was 13.6 weeks of gestation,
   according to ultrasonography ([217]Figure 1A).

Data Quality Assessment

   To investigate cohort-specific data signatures, PCA was used to create
   a 2-dimensional representation of the entire cohort for each biological
   modality and all modalities combined (eFigure 1A in the
   [218]Supplement). The PCA demonstrated that the largest source of
   variation in the data was not driven by fundamental differences between
   the cohorts. Supervised linear discriminant analysis^[219]26 confirmed
   the existence of more subtle cohort-specific signatures that were not
   statistically significant enough to be visualized in an unsupervised
   PCA (eFigure 1B in the [220]Supplement). The presence of
   cohort-specific signatures was confirmed using random forest
   analysis^[221]27 that underwent cross-validation to predict the cohort
   from which the patient was selected exclusively on the basis of each
   biological modality (eFigure 1C in the [222]Supplement).

   The impact of sample storage time was quantified with random forest
   analysis that underwent cross-validation in which the number of days
   between sample collection and laboratory analyses was used as a
   continuous prediction target. The results were statistically
   significant (thresholds of P = 1.25762 × 10^−01 for transcriptomics, P
   = 8.83433 × 10^−06 for metabolomics, and P = 5.56758 × 10^−02 for
   proteomics) only in the case of the urine metabolomics data set,
   indicating the potential for sample degradation over time (eFigure 1D
   in the [223]Supplement). However, this result did not confound the
   design of this study as GA at delivery did not correlate with storage
   time (r = –0.092; P > .41).

Predictive Modeling of Chronicity of Pregnancy

   We built models to estimate GA at the time of sample collection (as a
   surrogate for the chronicity of pregnancy) for each patient. A
   cross-validation strategy was used to simultaneously optimize the
   integrative model and test the performance of the model on previously
   unseen patients. Models built on all 3 modalities (transcriptomics,
   metabolomics, and proteomics) as well as the integrated model were
   statistically significantly correlated with GA at the time of sample
   collection (transcriptomics: 1.736089 × 10^−03; metabolomics: 8.936983
   × 10^−23; proteomics: 2.227379 × 10^−19; and integrated model: 8.990768
   × 10^−22; Bonferroni-adjusted Spearman correlation P < .05)
   ([224]Figure 2A and B). The features that most correlated with the
   progression of pregnancy (Spearman correlation P < .05) are color-coded
   in [225]Figure 2C. A cluster of highly correlated metabolomics and
   proteomics features was identified that included the
   trophoblast-derived placental growth factor (PGF). Previous studies
   have demonstrated that PGF plays a substantial role in the pathogenesis
   of preeclampsia but has not been associated with spontaneous
   PTB.^[226]28,[227]29 Pathway analysis^[228]30 of the metabolites in
   this module indicated the enrichment of the steroid hormone
   biosynthesis pathway (Fisher test for pathway enrichment analysis
   P < 1.2 × 10^−12). The purine metabolism pathway was enriched in an
   additional module of metabolites (Fisher test for pathway enrichment
   analysis P < 1.7 × 10^−5). Other proteins that were included in the
   model and close to this cluster were PAPP-A (pregnancy-associated
   plasma protein A), MMP-7 (matrix metallopeptidase 7), FGF and FGFBP1
   (fibroblast growth factors), and SIGLEC6 (sialic acid binding Ig-like
   lectin 6), all of which play important roles in placental
   development.^[229]31,[230]32,[231]33,[232]34 An additional cluster of
   proteins associated with cell migration and localization was identified
   by gene ontology analysis (Protein Analysis Through Evolutionary
   Relationships overrepresentation P < 10 × 10^−7).

Figure 2. Prediction of Gestational Age (GA) at the Time of Sample
Collection.

   [233]Figure 2.
   [234]Open in a new tab

   A, A cross-validation strategy was used to simultaneously optimize the
   integrated model and test the performance of the model on previously
   unseen patients. Models built on all 3 modalities (transcriptomics,
   metabolomics, and proteomics) and the integrated model were
   statistically significantly correlated with GA at the time of sample
   collection (Bonferroni-adjusted Spearman correlation P < .05). B, The
   correlation between GA at the time of sample collection and the
   estimated values on the blinded samples are shown. The shaded area
   represents the 95% CI. C, The features correlated with the progression
   of pregnancy (Spearman correlation P < .05) are color-coded according
   to biological modality. FGF indicates fibroblast growth factor; IGSF3,
   immunoglobulin superfamily member 3; PAPP-A, pregnancy-associated
   plasma protein A; PGF, placental growth factor; and SIGLEC6, sialic
   acid binding Ig-like lectin 6.

   To further highlight the interplay between plasma proteins and urine
   metabolites, we developed a random forest model to estimate the PGF
   levels of each patient using only the urine metabolomics data set
   (eFigure 2 in the [235]Supplement). Overall, this analysis highlighted
   the potential for biological profiling for estimating GA during
   pregnancy (a substantial challenge in LMICs) and the use of urine-based
   metabolite biomarkers as low-cost surrogates for models developed
   through multiomics analysis.

Predictive Modeling of PTB

   For prediction of PTB (GA at delivery <37 weeks), we used a
   leave-one-out cross-validation procedure to test the models on blinded
   participants. Before training the model using the entire data set, the
   feature space was limited to the top features in the cohort that
   corresponded to the blinded sample based on univariate testing.
   Overall, the models relied on a subset of all available features. The
   median number of features used by the models during cross-validation
   was 36 for transcriptomics, 35 for metabolomics, and 9 for proteomics.
   To combine predictions from each model, we developed an additional
   integration layer to produce the final weighted probabilities for
   statistical testing. The integrated model was more accurate than the
   model for each independent modality ([236]Figure 3A). The mean area
   under the receiver operating characteristic curve (AUROC) and 95% CI
   for each modality were as follows: transcriptomics (AUROC, 0.73; 95%
   CI, 0.61-0.83), metabolomics (AUROC, 0.59; 95% CI, 0.47-0.72),
   proteomics (AUROC, 0.75; 95% CI, 0.64-0.85), and integrated (AUROC,
   0.83; 95% CI, 0.72-0.91) ([237]Figure 3A). eFigure 3 in the
   [238]Supplement provides a comparison against other machine learning
   strategies applied to the same data set (support vector regression
   AUROC, 0.57; random forest AUROC, 0.66; lasso AUROC, 0.68; Gaussian
   process AUROC, 0.71; supervised learning cohort-adjusted model AUROC,
   0.83; merging AUROC, 0.71; stacked generalization AUROC, 0.76; data
   integration cohort-adjusted model AUROC, 0.83). In an independent
   analysis, this same pipeline was used to model participants who were
   randomly assigned to case and control groups, confirming that the
   findings presented in [239]Figure 3 did not result from model
   overfitting (transcriptomics AUROC, 0.54; metabolomics AUROC, 0.50;
   proteomics AUROC, 0.50; integrated AUROC, 0.50) (eFigure 4 in the
   [240]Supplement).

Figure 3. Predictive Modeling of Preterm Birth (PTB).

   Figure 3.
   [241]Open in a new tab

   A, This receiver operating characteristic (ROC) curve analysis used
   each biological modality and the integrated approach. The mean area
   under the ROC curve and 95% CI for each modality were as follows:
   transcriptomics (AUROC, 0.73; 95% CI, 0.61-0.83), metabolomics (AUROC,
   0.59; 95% CI, 0.47-0.72), proteomics (AUROC, 0.75; 95% CI, 0.64-0.85),
   and integrated (AUROC, 0.83; 95% CI, 0.72-0.91). B, Circle size is
   proportional to −log[10] (Wilcoxon) P value for discrimination between
   term pregnancies and PTBs. Top features included an inflammatory module
   (which included interleukin 6 [IL-6]; IL-1 receptor antagonist
   [IL-1RA], a regulatory member of the IL-1 family whose expression is
   induced IL-1β under inflammatory conditions; granulocyte
   colony-stimulating factor [G-CSF]; retinoic acid receptor responder
   protein 2 [RARRES2]; chemokine ligand 3 [CCL3]; angiopoietin-like 4
   [ANGPTL4]; protein-arginine deiminase type II [PADI2]; and transferrin
   receptor [TfR]) and a metabolomic module (which was enriched for
   glutamine and glutamate metabolism [Fisher test for pathway enrichment
   analysis P < 4.4 × 10^−9] and valine, leucine, and isoleucine
   biosynthesis pathways [P < 7.3 × 10^−6]).

   Field workers were trained to collect detailed phenotypic and
   demographic data from the women and their families through scheduled
   household visits during pregnancy and postpartum. Clinical covariates
   were manually harmonized across all 5 cohorts. Of all the variables
   collected, only the weight of the baby and GA at delivery were
   statistically significantly correlated with the final outcome of the
   model predicting PTB (Spearman correlation = 0.73). (eFigure 5 and
   eTable in the [242]Supplement). This finding confirmed that the model
   was not confounded by the other measured clinical covariates.

   Given the statistically significant differences observed across various
   cohorts, we used mixed-effect models (with each cohort encoded as a
   random effect) to compare the distribution of each measurement between
   term pregnancies and PTBs ([243]Figure 3B). Top features were contained
   within 2 correlated modules: (1) an inflammatory module, which included
   interleukin 6 (IL-6), IL-1 receptor antagonist (IL-1RA, a regulatory
   member of the IL-1 family whose expression is induced IL-1β under
   inflammatory conditions^[244]35,[245]36), granulocyte
   colony-stimulating factor (G-CSF), retinoic acid receptor responder 2
   (RARRES2), and chemokine ligand 3 (CCL3), and (2) a metabolomic module,
   which primarily consisted of urine metabolites enriched for glutamine
   and glutamate metabolism (Fisher test for pathway enrichment analysis
   P < 4.4 × 10^−9)^[246]30 and valine, leucine, and isoleucine
   biosynthesis pathways (P < 7.3 × 10^−6).^[247]37

   The presence of inflammatory mediators among the features correlated
   with PTB is consistent with finding in previous studies that suggested
   dysfunctional immune adaptations during pregnancy was central to the
   pathogenesis of PTB.^[248]38,[249]39 However, the predictive model also
   highlighted a set of proteomic features with no known inflammatory
   properties that were correlated with features from the inflammatory
   module. These proteins included protein-arginine deiminase type II
   (PADI2), a peptidylarginine deiminase that is responsible for protein
   citrullination and implicated in parturition and sensing
   infections^[250]40,[251]41; transferrin receptor (TfR), which is
   implicated in iron transport; angiopoietin-like 4 (ANGPTL4), which
   regulates glucose homeostasis and lipid metabolism^[252]42; and
   RARRES2, an adipokine that is increased in metabolic syndrome and
   gestational diabetes.^[253]43,[254]44

   To ascertain whether observed correlations between these proteins and
   the inflammatory module reflected biologically relevant inflammatory
   properties, we examined the capacity of each of these factors to
   stimulate human peripheral blood leukocytes using an ex vivo mass
   cytometry assay.^[255]45 The activity of major intracellular signaling
   responses previously^[256]17 implicated in maternal immune adaptations
   during pregnancy was assessed at baseline and after a 15-minute
   stimulation in major innate and adaptive immune cell types (eMethods in
   the [257]Supplement). As expected, robust and cell-specific signaling
   responses along the JAK/STAT and MyD88 signaling pathways were observed
   in classical monocytes (CMC) after stimulation with known
   proinflammatory cytokines, including IL-6 (mean [SD] pSTAT3 ArcSinh
   ratio over endogenous signal, 2.64 [0.22]; false discovery rate
   [FDR]–adjusted vs unstimulated P < 1.0 × 10^−6), G-CSF (mean [SD]
   pSTAT5 ArcSinh ratio over endogenous signal, 0.42 [0.12]; P = .007),
   and CCL3 (mean [SD] pCREB ArcSinh ratio over endogenous signal, 0.35
   [0.09]; P < 1.0 × 10^−6) (eFigures 6 and 7 and the eMethods in the
   [258]Supplement). Stimulation with PADI2 activated the key elements of
   the MyD88 pathway, including P38 (mean [SD] ArcSinh ratio over
   endogenous signal, 0.91 [0.52]; FDR-adjusted vs unstimulated P = .007),
   MK2 (mean [SD] ArcSinh ratio over endogenous signal, 0.38 [0.10]; P =
   .002), and NFkB (mean [SD] ArcSinh ratio over endogenous signal, 0.14
   [0.03]; P = .009), in monocytes, although little or no signaling
   responses were observed after stimulation with ANGPTL4, TfR, or
   RARRES2.

   We also tested whether stimulation with the most informative proteomic
   features of the predictive model of PTB would alter the effector
   function of circulating immune cells. To this end, we quantified the
   intracellular expression of select cytokines in circulating immune
   cells that were stimulated with the target proteins for 4 hours. In
   addition to the expected cytokine responses after exposure to CCL3,
   IL-6, and G-CSF, the results show that PADI2 and ANGPTL4 stimulated
   proinflammatory cytokine production in CMC (mean [SD] frequency of
   PADI2-stimulated IL-1β + CMC: 18.66 [1.93], FDR-adjusted vs
   unstimulated P < 1.0 × 10^−6; mean [SD] frequency of PADI2-stimulated
   IL-6 + CMC: 8.01 [1.47], P = 1.0 × 10^−6; mean [SD] frequency of
   PADI2-stimulated TNF + CMC: 7.43 [1.44], P = 1.0 × 10^−6) (eFigure 8
   and eMethods in the [259]Supplement).

   In contrast, stimulation with RARRES2 or TfR elicited little
   intracellular cytokine responses (mean [SD] frequency of
   RARRES2-stimulated IL-1β + CMC: 5.63 [0.25], FDR-adjusted vs
   unstimulated P < 1.0 × 10^−6; mean [SD] frequency of TfR-stimulated
   IL-1β + CMC: 2.25 [0.66], P = .16). These results provide evidence of
   the potential communication between different biological systems and
   add new elements to the complex pathogenesis of preterm birth.
   Furthermore, the results suggest that PADI2, in conjunction with other
   inflammatory cytokines (such as IL-1β), may exacerbate proinflammatory
   innate immune responses during PTBs, thereby playing a role in the
   early onset of labor.

Discussion

   To our knowledge, this study is the first multicohort and multiomics
   analyses of term and preterm birth conducted in LMICs through use of
   biorepository samples from relevant geographies in a harmonized
   fashion. The plasma and urine samples were collected, processed,
   stored, and shipped to the laboratories under uniform protocols. In
   this proof-of-concept study, a machine learning approach was
   implemented for quality control, analysis of the timing of pregnancy,
   and prediction of PTB. Cohort-specific signatures were observed in all
   cohorts, and data quality was consistent across all modalities.

   The prediction of GA at the time of sample collection was driven by an
   internally correlated module of placenta-related plasma proteins and
   urine metabolites. Correlations within this module provided an
   excellent example of leveraging multiomics data for identification of
   low-cost surrogates in an accessible biological sample (in this case,
   urine) for an otherwise complex plasma-based measurement with direct
   applications in LMICs. Accurate prediction of GA through laboratory
   testing of blood or urine, if validated in larger and more diverse
   cohorts, has the potential for widespread implementation in settings in
   which ultrasonography-based GA dating is not available or is
   impractical.

   Prediction of PTB using a multiomics model adjusted for each cohort
   resulted in an AUROC of 0.83. The sparse nature of the developed
   methods indicated the possibility of developing simplified models in a
   validation cohort for scalable analysis of larger cohorts. Mixed-effect
   modeling revealed several features of interest. The top-ranked
   features, including IL-1RA, pointed to promising anti-inflammatory
   therapy candidates that were under active development.^[260]46 Although
   the prediction of GA at the time of sample collection was consistent
   across all 5 cohorts, models for prediction of PTB required
   cohort-specific adjustments. This finding is consistent with that in
   previous publications that indicated that, although the normal
   chronicity of pregnancy may be shared across populations, pathological
   pregnancies are likely to be population-specific.^[261]47,[262]48

   Each multiomics data set differed not only across the subcohorts but
   also in terms of their size and internal complexities. Therefore, we
   used a 2-step machine learning strategy in which a model was first
   built on each omics data set and then combined for final predictions.
   This approach prevented large untargeted data sets from overwhelming
   small yet carefully targeted assays that could have a similar or even
   more discriminatory information content. This approach resulted in an
   increase in predictive power and improved interpretability of the
   results.

   In the present study, the predictive accuracy for PTB was augmented by
   combining various omics data sets, which was consistent with previous
   studies suggesting that PTB was a condition manifesting within multiple
   biological systems.^[263]18,[264]49,[265]50,[266]51,[267]52 Observed
   differences between cohorts also highlighted that the causes of PTB may
   be associated with varying environmental and socioeconomic
   factors.^[268]53 From a biological standpoint, examination of
   individual components of the multiomics model emphasized the role of
   inflammation in the pathobiological features of PTB. As such,
   inflammatory cytokines previously shown to be elevated in PTBs,
   including IL-6 and IL-1RA (often considered as a surrogate marker of
   IL-1β expression^[269]54) were among the most informative features of
   the multiomics model.^[270]55 These cytokines were integrated within a
   broader inflammatory module that revealed novel factors associated with
   preterm labor with previously unsuspected properties (eg, PADI2). In
   neutrophils, citrullination of histones by PADI2 is an important step
   in the formation of neutrophil extracellular traps, a defensive
   immunity tool that allows neutrophils to trap and kill
   bacteria.^[271]56,[272]57,[273]58,[274]59,[275]60 Increased soluble
   PADI2 observed in PTBs may potentially reflect heightened inflammatory
   responses to a bacterial pathogen, consistent with an infectious cause
   for PTB. We show that soluble PADI2 can also directly activate
   proinflammatory signaling pathways and cytokine production in classical
   monocytes, highlighting a synergistic mechanism that may further
   enhance the inflammatory state of PTB.

Strengths and Limitations

   This study had several strengths. First, the AMANHI and GAPPS
   biorepositories used accurate early trimester ultrasonography scans for
   GA dating. Second, urine and plasma specimens were collected,
   processed, and transported in a harmonized manner. All samples
   underwent a single freeze-thaw cycle only at Stanford University before
   final processing and analysis. Third, the machine learning strategy
   used was able to detect patterns that were generalizable across
   cohorts.

   This study also had several limitations. First, it used a small sample
   size compared with the number of measurements (which we accounted for
   through a rigorous 2-step cross-validation process). Therefore,
   reproduction of these results in larger and more diverse cohorts
   remains a major priority for our future efforts. For reproduction of
   these results to be successful, the validation of a reduced model with
   increased scalability will be a key step. Second, given the exploratory
   nature of this study, the cohort was clinically homogeneous (eTable and
   eFigure 2 in the [276]Supplement), which limits the generalizability of
   the results to real-world heterogeneous populations. Therefore, a
   future area of investigation is the direct integration of clinical
   covariates into the predictive models^[277]61 to increase the
   generalizability in data sets with diverse phenotypes.

Conclusions

   This diagnostic/prognostic study found that, in LMICs and high PTB
   settings, major biological adaptations during pregnancy may follow a
   generalizable model, but the biological signals that correlate with or
   are potentially associated with PTB can be detected using robust
   machine learning algorithms. In addition, this study demonstrated that
   a multiomics approach has the potential to both improve and help
   identify low-cost predictive surrogates in accessible biological
   samples for LMICs. Research to expand this analysis to a larger patient
   population and to broader cohorts and omics platforms are already under
   way. The data sets, together with state-of-the-art machine learning
   partnerships,^[278]62 will be a key step in developing valuable
   predictive tests and intervention candidates to tackle the long-term
   clinical challenge of preventing PTB.
   Supplement.

   eMethods.

   eFigure 1. Data Quality Assessment

   eFigure 2. Urine Metabolites as a Surrogate for PGF in Plasma

   eFigure 3. Empirical Algorithm Comparison

   eFigure 4. A Lower Bound for the Analysis Pipeline Using a Negative
   Example

   eFigure 5. Analysis of Clinical Covariates

   eTable. Table of Clinical Covariates Harmonized Across All Cohorts

   eFigure 6. Comprehensive Visualization of Single-Cell-Level
   Intracellular Signaling in Response to Selected Plasma Proteins

   eFigure 7. Top Proteomics Features Activate Intracellular Signaling
   Pathways in Peripheral Blood Classical Monocytes

   eFigure 8. Top Proteomics Features Activate Cytokine Production in
   Peripheral Blood Classical Monocytes

   eReferences