Graphical abstract

   graphic file with name fx1.jpg
   [85]Open in a new tab

Highlights

     * •
       Gestational diabetes mellitus (GDM) exhibits distinct cfDNA
       physical properties
     * •
       Dynamic cfDNA variations between GDM and controls
     * •
       Altered exocrine pancreas identified by islet acinar marker gene
       PRSS1
     * •
       CfDNA links between GDM and altered lipid metabolism
     __________________________________________________________________

   Tang et al. present a longitudinal integrative analysis of cell-free
   DNA, revealing a distinctive signature and dynamic patterns specific to
   gestational diabetes mellitus. Lipidomic alterations and changes in
   pancreatic exocrine markers are discerned.

Introduction

   Globally, gestational diabetes mellitus (GDM) is an increasingly
   prevalent disease among pregnant women and affects approximately one in
   six pregnancies.[86]^1 GDM is caused by insulin resistance and
   pancreatic β-cell dysfunction during pregnancy and is associated with
   long-term adverse outcomes such as type 2 diabetes mellitus and
   cardiovascular disease.[87]^2^,[88]^3 GDM is also associated with
   metabolic imbalance that increases the risk of neonatal complications,
   primarily fetal growth deviations[89]^4 and preterm birth.[90]^5
   Furthermore, increasing evidence now suggests that GDM has long-term
   consequences for children’s development such as cardiovascular diseases
   and insulin resistance.[91]^2^,[92]^3 Research also shows that promptly
   treating GDM before 20 weeks of gestation slightly reduces a
   combination of negative outcomes in newborns.[93]^6 Despite great
   advances in the diagnosis and clinical management of GDM, the disease
   mechanisms are still unclear and early condition predictions remain
   difficult.[94]^7 Since current diagnostic criteria for GDM are mainly
   based on glycemic levels, dynamic changes in genetic factors are often
   ignored.[95]^8 Thus, understanding temporal genetic markers in response
   to a growing fetus during GDM and exploring associated dynamic genetic
   changes can provide insights into gene function that influences insulin
   actions and regulates metabolism.[96]^9

   The association of cell-free DNA (cfDNA) in maternal plasma with
   gestational disease by recent studies may provide a new insight into
   GDM prediction using cfDNA.[97]^10^,[98]^11 Plasma cfDNA are short DNA
   fragments primarily derived from cell apoptosis, with fragments from
   different tissue-specific nucleosome arrangements.[99]^12 Initially
   used as a non-invasive prenatal testing (NIPT) approach for fetal
   chromosome abnormalities,[100]^13^,[101]^14 cfDNA was found to contain
   non-random fragments associated with tissue damage, gene expression,
   methylation levels, and other factors.[102]^15 Previous studies
   reported that cfDNA molecules manifested fragmental
   characteristics,[103]^16 nucleosome relationships,[104]^17 and
   endpoints that reflected tissue origins and turnover
   mechanisms.[105]^18 Guo et al.[106]^19 found that cfDNA coverage
   patterns in gene promoter regions reflected gene expression levels and
   could be used to predict pregnancy complications. Importantly, in
   patients with GDM, the cfDNA fetal fraction is lower when compared with
   non-GDM pregnant women.[107]^20

   Recently, it was shown that cfDNA methylation and transcriptomic
   signatures had the potential to predict adverse pregnancy
   outcomes.[108]^21 However, more research is required to characterize
   cfDNA feature associations with patients with GDM. In this study, we
   compared longitudinal cfDNA profiles between women with GDM and healthy
   control to identify distinctive changes in cfDNA features not
   previously observed. We then correlated the underlying implications of
   these cfDNA features and demonstrated the utility of cfDNA in
   unraveling the biological mechanisms underpinning GDM. Furthermore, the
   validation in external datasets from NIPT data supported the clinical
   relevance of our results, indicating their predictive applicability
   regardless of sequencing depth.

Results

Population demographics and pregnancy characteristics

   A total of 299 women with GDM and 299 matched healthy non-GDM pregnant
   women were selected from the Tianjin Birth Cohort (TJBC)[109]^22 as the
   discovery dataset. A small prospective cohort containing 8 women with
   GDM and 106 healthy controls was used as a validation cohort
   (validation dataset1) ([110]Figure 1A). Additionally, a dataset from
   Zhu et al.[111]^23 contained the NIPT data of 21,813 pregnant Chinese
   women. Following the inclusion criteria ([112]Figure S5C), we
   ultimately arrived at a final set of 6,104 samples (1,045 GDM and 5,059
   healthy controls) for validation dataset2. No significant differences
   in baseline characteristics were recorded in women who underwent oral
   glucose tolerance tests or fasting plasma glucose tests to diagnose GDM
   ([113]Table S1). Peripheral blood samples from the 1^st, 2^nd, and 3^rd
   trimester women underwent high-depth sequencing (∼60×) to analyze cfDNA
   features, including fragment size, fetal fraction, motif patterns, and
   transcription start site (TSS) scores. We found no significant
   differences between different sequencing batches ([114]Figure S1A).
   Women’s demographics and pregnancy characteristics are shown in
   [115]Tables 1 and [116]S1. The mean ages of control and GDM were
   31.20 ± 3.87 and 31.23 ± 3.93 years, respectively. Women with GDM had
   lower education levels, higher pre-pregnancy body mass index (BMI), a
   family history of diabetes, a lower gestational age for delivery, and a
   higher rate of cesarean section compared with control. As
   expected,[117]^24 women with GDM had higher total cholesterol (CHO),
   triglyceride (TG), and low-density lipoprotein cholesterol (LDLC)
   levels, but lower high-density lipoprotein cholesterol levels.

Figure 1.

   [118]Figure 1
   [119]Open in a new tab

   High-depth cfDNA sequencing captures fragmentation changes in GDM
   patient plasma vs. control plasma samples

   (A) Study design showing how samples were collected and analyzed
   (created with [120]BioRender.com).

   (B) Size profile distributions varied across different groups; solid
   and dashed lines represent nucleosome-derived and mitochondria-derived
   cfDNA fragments, respectively.

   (C) Longer cfDNA fragment ratios across GDM patient and control plasma
   samples. A longer size ratio was defined as the proportion of the
   number of reads from 160 to 300 bp (p < 0.001, Wilcoxon rank-sum test).

   (D) Longitudinal variations between GDM and controls. The fetal
   fraction differed significantly between GDM and controls (p < 0.01,
   LMM). The mean is represented by the central point and the error bars
   indicate the standard deviation from the mean.

Table 1.

   Participants' characteristics in the TJBC
   Variable Control, n = 299[121]^a GDM, n = 299[122]^a p value[123]^b
   Age, years 31.20 (3.87) 31.23 (3.93) 0.92
   Education level 0.01
    Below senior high school 36 (12.04%) 59 (19.73%) –
    Above college or others 263 (87.96%) 240 (80.27%) –
   Pre-pregnancy BMI, kg/m^2 22.88 (3.32) 25.22 (4.49) <0.001
   Primipara 0.6
    Yes 200 (66.89%) 206 (68.90%) –
    No 99 (33.11%) 93 (31.10%) –
   Family history of diabetes 0.039
    Yes 48 (16.05%) 68/299 (22.74%) –
    No 251 (83.95%) 231 (77.26%) –
   Current or former smoking 0.055
    Yes 72 (24.08%) 93 (31.10%) –
    No 227 (75.92%) 206 (68.90%) –
   Drinking after pregnancy 0.073
    Yes 12 (4.01%) 4 (1.34%) –
    No 287 (95.99%) 295 (98.66%) –
   Gestational age for delivery, weeks 39.15 (1.27) 38.68 (1.40) <0.001
   Delivery mode 0.072
    Caesarean section 144 (48.16%) 166 (55.52%) –
    Vaginal delivery 155 (51.84%) 133 (44.48%) –
   Gender of child 0.51
    Female 141 (47.16%) 133 (44.48%) –
    Male 158 (52.84%) 166 (55.52%) –
   Birth weight of child, g 3,374.03 (436.61) 3,380.50 (466.07) 0.89
   Birth length of child, cm 50.08 (1.44) 50.06 (1.53) 0.66
   BMI of child at 2 years, kg/m^2 16.02 (1.34) 16.19 (1.78) 0.61
    Unknown 183 180 –
   CHO, mmol/L 4.78 (0.74) 4.90 (0.84) 0.045
   TG, mmol/L 1.45 (0.56) 1.60 (0.63) 0.002
   HDLC, mmol/L 1.76 (0.37) 1.71 (0.40) 0.03
   LDLC, mmol/L 2.69 (0.63) 2.91 (0.78) 0.001
   UA, μmol/L 203.83 (45.45) 221.37 (67.91) <0.001
   PRO 0.65
    Positive 22 (7.36%) 25 (8.36%) –
    Negative 277 (92.64%) 274 (91.64%) –
   ALT, U/L 18.75 (12.38) 23.14 (18.45) <0.001
   AST, U/L 17.49 (7.05) 18.64 (10.07) 0.18
   BUN, mmol/L 3.08 (0.80) 3.09 (0.89) 0.92
   Cr, μmol/L 55.24 (14.09) 54.10 (17.81) 0.093
   Systolic blood pressure, mm Hg 108.44 (10.62) 113.27 (12.10) <0.001
   Diastolic blood pressure, mm Hg 70.34 (7.72) 72.58 (9.22) 0.002
   [124]Open in a new tab

   All variables were investigated or measured in early pregnancy; CHO,
   plasma total cholesterol; TG, plasma triglycerides; HDLC, high-density
   lipoprotein cholesterol; LDLC, low-density lipoprotein cholesterol; UA,
   blood uric acid; PRO, urine protein; ALT, alanine aminotransferase;
   AST, aspartate aminotransferase; BUN, blood urea nitrogen; Cr, serum
   creatinine.
   ^a

   Mean (SD); n (%).
   ^b

   Wilcoxon rank-sum tests; Pearson’s chi-squared tests; Fisher’s exact
   tests (drinking after pregnancy).

cfDNA physical properties are different in women with GDM

   We mapped cfDNA paired-end sequencing data to the human reference
   genome (GRCh38.p13) and divided it into nuclear and mitochondria
   origins ([125]Figures 1B and [126]S1B). In both healthy and GDM groups,
   nucleosome and mitochondria-derived cfDNA had peak fragment sizes of
   167 and 80 bp, respectively. Nuclear-derived cfDNA had smaller fragment
   peaks with periodic 10 bp intervals, patterns consistent with previous
   investigations,[127]^25 and indicating potential associations with
   histone-bound DNA lengths. Such patterns were similar in control and
   GDM groups ([128]Figure 1C). However, women with GDM had a
   significantly higher ratio of longer cfDNA than control (p < 0.001,
   Wilcoxon rank-sum test). Mitochondria-derived cfDNA distribution
   patterns between control and GDM groups were also different.
   Additionally, different gestation trimesters were associated with
   varying patterns of distribution in nuclear and mitochondria-derived
   cfDNA ([129]Figure S1B).

   Fetal fractions showed an increasing tendency as pregnancies progressed
   ([130]Figure 1D), consistent with findings reported previously.[131]^26
   Notably, fetal fractions in women with GDM were constantly lower than
   those in controls (p < 0.01, linear mixed-effects model [LMM]),
   consistent with previous studies.[132]^20^,[133]^27 We further compared
   fetal fractions in obesity and preterm subgroups and observed that
   lower fetal fractions associated with GDM were independent of multiple
   factors ([134]Figure S1C).

   Among the 256 cfDNA fragment 4-mer end motifs, CCCA was the most
   enriched in all samples, which aligned with previous research.[135]^17
   GDM had a significantly higher proportion of CCCA motifs than their
   controls (p < 0.05, LMM). Moreover, GDM had a significantly higher
   percentage of ACTT and ACCG motifs and a lower percentage of GCGG,
   GCGC, and TCGG motifs when compared with controls ([136]Figures 2A and
   [137]S2A). We then calculated the frequency diversity of all 256 motifs
   using motif diversity scores (MDSs), which showed decreased diversity
   in all fragment size categories as pregnancies progressed ([138]Figures
   2A and [139]S2B). Interestingly, the MDS in GDM was notably lower than
   in controls, especially during the 1^st and 2^nd trimesters. Moreover,
   MDS trends were comparable with the validation dataset1
   ([140]Figure S2C). We also used methylation-associated (MA) values to
   infer cfDNA methylation status[141]^15 in controls and GDM and found
   that the latter group had higher MA values than the former group
   ([142]Figure 2A). These differences represented specific cfDNA physical
   properties related to gene expression patterns during GDM.

Figure 2.

   [143]Figure 2
   [144]Open in a new tab

   CfDNA physical property differences between GDM and controls

   (A) Comparing cfDNA physical properties between GDM and controls.
   Signatures differed in motif frequency (CCCA = representative
   sequence), MDS (four subgroup sizes, (1) all: all fragments, (2) short:
   ≤150 bp, (3) peak: 160–170 bp, and (4) long: ≥250 bp), and MA
   (methylation-associated value). The center line in the boxplot
   represents the median, and lower, upper whiskers, and outliers
   correspond to the 1.5× interquartile range and outliers outside that
   range, respectively. p values for disease effects were calculated from
   the LMM.

   (B) Left: relative PIK3R1 coverage revealed differences between GDM and
   controls across different trimesters. Right: TSS PIK3R1 coverage
   revealed differences between GDM and controls across different
   trimesters (case1, case2, and case3 represent 1^st, 2^nd, and 3^rd
   trimester of GDM, respectively.).

   (C) TSS scores after a log transformation calculated for previously
   identified GDM-associated genes. Wilcoxon rank-sum tests were used to
   calculate p values.

   (D) TSS scores at various locations on multiple chromosomes. The x axis
   represents chromosome locations, while the y axis represents mean TSS
   scores after a log transformation was applied. Yellow dots on the graph
   indicate the 10 most significant TSS score differences between GDM and
   controls.

   (E) Top enriched gene set enrichment analysis terms in Wikipathways.
   Colors represent normalized enrichment scores and point size represents
   log[10](p adj) values.

TSS scores identify pathway changes in GDM

   We next analyzed cfDNA coverage and TSS scores in three previously
   reported GDM-related genes (PIK3R1, PPARG, and TCF7L2) in Asian
   populations.[145]^28^,[146]^29^,[147]^30 We found that the coverage of
   up- and downstream TSS regions in these genes differed between GDM and
   control ([148]Figures 2B, [149]S3A, and S3B). Specifically, the TSS
   score of PIK3R1 was significantly higher in women with GDM when
   compared to control ([150]Figure 2C, p < 0.001, Wilcoxon rank-sum test
   with Bonferroni correction). We further calculated sequencing coverages
   near TSS regions using a TSS score method and identified genome-wide
   changes in cfDNA coverage near these regions at different trimesters
   between control and GDM groups ([151]Figures 2D, [152]S3C, S3D, and
   S3E).

   For validation purposes, we performed a similar genome-wide TSS
   coverage analysis using a public dataset[153]^19 and observed a
   correlation (Pearson correlation coefficient = 0.61, p < 0.01,
   [154]Figure S6F and [155]Table S6) between the fold changes of TSS
   coverage in the 1^st trimester.

   To perform pathway enrichment analysis, we selected genes with TSS
   score fold-change differences between control and GDM groups. In the
   1^st trimester, no pathways were enriched. However, in 2^nd and 3^rd
   trimesters, we found increasing numbers of enriched pathways, including
   several signaling pathways related to diabetes ([156]Figure 2E). For
   instance, PI3K (phosphatidylinositol 3-kinase) and MAPK
   (mitogen-activated protein kinase) pathways, both of which are
   activated by insulin and involved in lipid
   metabolism,[157]^31^,[158]^32 were enriched in GDM (Benjamini-Hochberg
   corrected p = 0.032 for PI3K in the 2^nd trimester and 0.024 for MAPK
   in all trimesters) along with the insulin signaling pathway. PIK3R1, a
   member of PI3K/AKT pathway, contributed to this enrichment
   ([159]Table S3). Additionally, we identified enriched CHO biosynthesis
   and leptin signaling pathways, both of which are implicated in
   diabetes.[160]^33^,[161]^34 Interestingly, we also identified pathways
   involved in angiogenesis such as VEGFA (vascular endothelial growth
   factor-A)-VEGF receptor 2 pathway, synaptogenesis, neurogenesis
   (hippocampal synaptogenesis and neurogenesis), osteoblast
   differentiation and related diseases, and ectoderm differentiation
   ([162]Table S4). Gene ontology (GO) analysis confirmed that biological
   processes, such as brain development, tissue morphogenesis, cell
   proliferation, and growth regulation were altered in GDM
   ([163]Table S4), which putatively suggested GDM effects on fetal growth
   and development.

TSS score signatures identify genes related to preterm births during GDM

   Using an LMM, we identified 55 TSS regions spanning across
   all trimesters that exhibited significantly different TSS scores
   between control and GDM ([164]Figure 3A). These TSS scores,
   representing alterations in 50 genes to GDM, were defined as TSS score
   signatures. To investigate the biological implications of these TSS
   score signatures, we categorized the 50 genes into four temporal
   categories using a Euclidean distance approach ([165]Figure 3B). The
   first category comprised two genes with decreasing TSS scores
   throughout pregnancy trimesters, exhibiting consistently higher TSS
   scores in patients with GDM than in controls. The second category also
   had higher TSS scores in GDM, but scores tended to increase as
   gestation proceeded. In the third category, most TSS scores decreased
   as pregnancy progressed and were lower in women with GDM. The fourth
   category had lower TSS scores in women with GDM and showed an
   inverted-U shape trend against gestation. To understand the temporal
   patterns of these signature genes across trimesters, we applied an
   unsupervised clustering approach, segmenting the study population into
   two clusters in the 1^st trimester, four clusters in the 2^nd
   trimester, and three clusters in the 3^rd trimester. This analysis
   revealed a distinct pattern in population flow, with the majority
   transitioning from T1-2 to T2-3 and subsequently to T3-3
   ([166]Figure 3C). By comparing morbidity differences in each cluster,
   we observed an uneven GDM distribution complicated with preterm births,
   especially the T2-2 cluster showing no preterm births ([167]Figure 3C).
   We then analyzed correlation networks using GO pathways in TSS score
   signatures in the 50 genes and found the associations of PIK3R1,
   previously linked to diabetes,[168]^35 with multiple pathways
   ([169]Figure 3D and [170]Table S4). We then focused on associations
   between the TSS score signatures of the 50 genes and preterm birth. Six
   genes were involved in growth and development processes ([171]Tables S3
   and [172]S4), in which CNTN4 and RNF213 had TSS scores associated with
   preterm birth. Notably, CNTN4 is a member of the immunoglobulin
   contactin family, with previous research reporting that contactin-2 may
   exhibit changes before a preterm delivery diagnosis.[173]^36

Figure 3.

   [174]Figure 3
   [175]Open in a new tab

   CfDNA signatures between GDM and controls

   (A) Upset diagram shows selected TSS score signatures.

   (B) Heatmap shows TSS score signatures. Differential TSS scores
   separate GDM and control samples despite dynamic changes in trimesters.
   Four clusters show different temporal patterns.

   (C) Sankey diagram shows temporal cluster changes. Rectangle width
   corresponds to sample numbers in each trimester, and connections
   between rectangles represent subject flow between trimesters (T
   referred to trimester). The three bar charts below show preterm birth
   percentages in each trimester.

   (D) A correlation network of 50 TSS scores. Line color represents
   pathways and line size represents log(weight) values.

   (E) The odds ratios of six growth and development-related TSS scores.
   Data are presented as odds ratios with 95% confidence intervals.

TSS score signatures identified genes associated with altered lipid
metabolism in GDM

   As higher TG and LDLC levels ([176]Table 1) and enriched lipid-related
   pathways were observed in women with GDM in our study ([177]Figure 2E),
   we examined associations between lipid metabolism and TSS score
   signatures from 50 genes. TSS scores of multiple genes were correlated
   with lipid-related phenotypes from clinical records, including CHO, TG,
   and LDLC levels ([178]Figure 4A). Notably, a correlation was identified
   between PSD3 and plasma lipid profiles, consistent with previous
   findings showing that down-regulated PSD3 (via short interfering RNA)
   reduced intracellular lipid content in primary human
   hepatocytes.[179]^37 Furthermore, we found that the TSS score of
   SMARCD1, a molecular linker of the switch/sucrose non-fermentable
   (SWI/SNF) chromatin remodeling complexes and hepatic lipid
   metabolism,[180]^38 was correlated with CHO and LDLC levels
   ([181]Figure 4A). These results reveal the possibility of using TSS
   scores as the genetic markers of lipid homeostasis in GDM. To further
   investigate the link between TSS score signatures and lipid changes in
   GDM, we extracted plasma lipids from the study population and performed
   untargeted lipidomic profiling. When compared with controls, GDM had
   significantly elevated fatty acid (FA) 24:6, sphingomyelin (SM) d36:0,
   and phosphatidylcholine (PC) 16:0/20:4 levels and decreased PC 44:7e
   and PC 44:2e levels ([182]Figures 4B and [183]S4A).

Figure 4.

   [184]Figure 4
   [185]Open in a new tab

   Lipid profile associations

   (A) TSS score associations with different laboratory measurements. The
   upper section of the plot shows Pearson correlation analysis on
   multiple TSS scores. Line segment thickness corresponds to the
   magnitude of Pearson correlation coefficients, indicating relationship
   strength. Line color represents p values, which show statistical
   significance in observed correlations (Age, age of mother pregnancy;
   BMI_pre, BMI of the mother before pregnancy; CHO, plasma total
   cholesterol; TG, plasma triglycerides; LDLC, low-density lipoprotein
   cholesterol; ALT, alanine aminotransferase; AST, aspartate
   aminotransferase; birthBMI, BMI of the child at birth).

   (B) Lipid differences between women with GDM and controls.
   |log2(FC)| < 0.4 and −log10(p value) > 2 threshold values are applied.
   Blue and red dots represent significantly down-regulated and
   up-regulated lipids in GDM, respectively. Wilcoxon rank-sum tests were
   used to calculate p values.

   (C) Up: the fragment end counts and PRSS1’s TSS downstream 500 bp.
   Middle: a schematic showing the PRSS1 gene structure. Down: the
   coverage of the corresponding region. The gray-purple area represents
   introns 1–4 in ENST00000311737. Gray dotted lines indicate the
   positions of two end peaks situated in the intronic region (case1,
   case2, and case3 represent 1^st, 2^nd, and 3^rd trimester of GDM,
   respectively. ctr1, ctr2, and ctr3 represent 1^st, 2^nd, and 3^rd
   trimester of control, respectively).

   (D) Heatmap shows PRSS1 and other acinar-specific gene expression
   levels in different women from single-cell data from Camunas-Soler
   et al.[186]^39

   We then analyzed the correlation between TSS scores of 4 lipid
   metabolism-related genes and altered lipids in GDM ([187]Figures S4B
   and S4C). We identified significant positive correlations between PRSS1
   and up-regulated FA 24:6, SM d36:0, and PC 16:0/20:4 levels in women
   with GDM in the 2^nd trimester (Spearman ρ = 0.34, 0.33, and 0.39,
   respectively). Additionally, we consistently observed a positive
   correlation between PRSS1 and down-regulated PC 44:2e levels in
   controls ([188]Figure S4B), suggesting more comprehensive lipid
   profiles associated with PRSS1 expression. PRSS1 is a pancreatic acinar
   cell marker[189]^40 and encodes an enzyme secreted from the pancreas.
   An animal study confirmed that PRSS1 transgenic mice exhibited
   disordered lipid metabolism.[190]^41 We further investigated the
   coverage characteristics of cfDNA fragments near PRSS1 and found the
   predominant distribution of reads starting at 260 bp downstream of TSS
   (U-end) and ending at 427 bp (D-end) near the TSS ([191]Figure 4C).
   These accumulated fragments precisely overlapped with the PRSS1
   intronic region ([192]Figure 4C), exhibiting a highly consistent
   nuclease cleavage pattern (chromatosome: nucleosome + linker histone;
   ∼167 bp[193]^25^,[194]^42) across samples. It has been shown that the
   maternal-origin cfDNA has longer fragment sizes (∼167 bp) than
   fetal-origin cfDNA (∼144 bp) due to different enzyme cutting positions.
   Hence, the phenomenon of 167 bp cfDNA fragments concentrated in
   the PRSS1 intron may reflect its maternal origin.[195]^43 Notably, the
   cfDNA coverage of PRSS1 differed between GDM and control throughout
   pregnancy, particularly the 1^st trimester ([196]Figure 4C). The most
   probable explanation for this disparity is that the transcriptional
   activity of PRSS1 was altered in GDM. To validate our findings, we
   examined single-cell transcriptomic data from diabetic human
   islets.[197]^39 We found that type 2 diabetes and prediabetic patients
   showed not only higher PRSS1 expression but also higher expression of
   other pancreatic acinar marker genes ([198]Figure 4D). Our analysis
   demonstrated an increased expression of PRSS1 and other pancreatic
   acinar marker expression in both type 2 diabetes and prediabetic
   patients and agreed with recent genetic studies that suggested genetic
   background differences in the exocrine pancreas of individuals with
   diabetes.[199]^44^,[200]^45

Integrative specific neural network (SNN) models powerfully predict GDM and
offspring BMI

   Given the robust associations between cfDNA features and phenotypic and
   metabolic alterations in GDM, we exploited these cfDNA features to
   predict early GDM. Leveraging diverse clinical data (including age,
   pre-pregnancy BMI, drinking habits, and smoking status) and cfDNA
   features (such as motifs, MDS, MA, fetal fraction, and TSS score
   signatures) from the 1^st trimester, we constructed a neural network
   model aimed at predicting the likelihood of GDM developing in the 3^rd
   trimester ([201]Figure 5A). Utilizing solely clinical information for
   prediction yielded an area under the curve (AUC) of 0.697, with a 95%
   confidence interval (CI) ranging from 0.534 to 0.860 ([202]Figure 5C).
   Furthermore, incorporating various cfDNA features into the model
   enhanced its predictive performance. Notably, a substantial improvement
   in AUC was observed when TSS score signatures were incorporated with
   other features, achieving an AUC of 0.877 with a 95% CI spanning from
   0.794 to 0.960 ([203]Figure 5C). We then validated our prediction model
   in validation dataset1, which showed an AUC = 0.829 (95% CI:
   0.746–0.912) using clinical information and all cfDNA features
   ([204]Figure 5D). Interestingly, clinical information and
   growth-related cfDNA features could also predict the offspring BMI
   using an integrated neural network model, showing the correlation to
   the actual value of R^2 = 0.56 (95% CI: 0.44–0.68) and a mean absolute
   error = 1.75 (95% CI: 1.19–2.31) in birth BMI ([205]Figure S6A) and a
   correlation of R^2 = 0.62 (95% CI: 0.40–0.84) and a mean absolute
   error = 1.53 (95% CI: 1.34–1.72) in 2-year BMI ([206]Figure S6B). Next,
   we assessed the significance of cfDNA features in our model and
   identified the top 35 features crucial for predicting GDM and BMI
   ([207]Figures 5B and [208]S6C). For GDM prediction, we found fetal
   fraction and BMI to be the most important factors in our SNN model,
   indicating their higher contributions to the final loss value metric
   within the SNN model. Additionally, we employed the random forest Gini
   factor to validate feature importance, revealing that these values are
   model specific ([209]Figure 5B). The key gene LILRB1 was ranked 21^st
   for early GDM predictions and 1^st for birth weight predictions.
   Another crucial gene, PIK3R1, was positioned 8^th in early GDM and
   2-year BMI predictions. The key gene PRSS1 ranked 23^rd and 4^th in
   early GDM and 2-year BMI predictions. Therefore, these top features
   were crucial in the prediction model, and their relative importance
   offered a reference for the specific trained model. Among the 35
   crucial features, 9 genes with known roles in growth and development
   were further selected to assess their association with fetal
   development. We used a derived formula, C[BMI-a], from the TSS
   component of the predictive birth BMI model ([210]STAR Methods) to
   calculate a growth and development score. As expected, we observed a
   significant distinction between normal and large for gestational age
   children (p < 0.01, Wilcoxon rank-sum test. [211]Figure S6G).

Figure 5.

   [212]Figure 5
   [213]Open in a new tab

   Neural network model

   (A) SNN structure (created with [214]BioRender.com).

   (B) Feature importance plot showing different machine learning
   algorithms. The dark blue feature represents more importance when
   compared with the light blue, indicating boost in AUC when adding the
   feature. Right color bar signifies the feature class.

   (C and D) Classifier performance as quantified by a receiver operator
   characteristic curve to predict GDM using early gestation samples from
   TJBC (C) and validation dataset1 (D). The numerical values in
   parentheses denote the mean AUC value accompanied by its 95% confidence
   interval.

   (E and F) Performance of the refined model predicting GDM in TJBC,
   validation dataset1 (E), and validation dataset2 (F).

   Given the large number of features used in our GDM prediction model, we
   performed a further feature selection to develop a refined model with
   fewer features, including SeqFF (fetal fraction), BMI, CDH23, CNTN4,
   RNF213, SMARCD1, TWSG1, PSD3, and PIK3R1. Each feature was implicated
   in some aspect of lipid metabolism and GDM pathology. The refined model
   predicted GDM with an AUC = 0.849 (95% CI: 0.798–0.900) in the TJBC and
   0.734 (95% CI: 0.713–0.755) in the validation dataset1
   ([215]Figure 5E). Using the same features, the refined model was
   further tailored to adapt the ultra-lower sequencing depth in
   validation dataset2 (1% of TJBC) ([216]STAR Methods), which gave rise
   to GDM prediction performance of an AUC = 0.758 (95% CI: 0.724–0.792)
   ([217]Figure 5F). Lastly, DeLong’s tests were performed to confirm the
   enhancement to the GDM prediction performance by adding TSS features to
   clinical phenotypes in the prediction models ([218]Figures S6D and
   S6E).

Discussion

   CfDNA in maternal plasma mainly originates from maternal and placental
   apoptotic cells,[219]^46 and its heterogeneous origins are ideal for
   studying complex metabolic disorders during pregnancy.[220]^47 In this
   study, we demonstrated the associations of cfDNA physical properties
   and fragment features with GDM by showing the direct involvement of
   cfDNA fragment features with typical GDM-related pathways, offspring
   BMI, preterm birth, and lipid metabolite traits. We identified PRSS1 as
   a biologically significant and valuable GDM marker. Furthermore, by
   externally validating both medium-depth sequencing data and low-depth
   NIPT datasets, we demonstrated that dynamic cfDNA changes may represent
   an important strategy for predicting GDM in early pregnancy.

   Previous studies reported dynamic changes in cfDNA physical properties
   during pregnancy and gestational disease.[221]^21 For instance, the
   fetal fraction was shown to increase in line with gestational weeks and
   served as an early screening marker of complications associated with
   placental dysfunction in pregnancy.[222]^48 In our study, the fetal
   fraction in women with GDM was significantly decreased when compared to
   control, particularly at early gestation stages, consistent with a
   previous study.[223]^20 Motif differences have also been implicated in
   gestational diseases,[224]^49 and we observed distinctive cfDNA
   fragment motif patterns and reduced cfDNA MDS in GDM, which suggested
   lower cfDNA fragment complexity in GDM. Previous studies also indicated
   nuclease involvement in cfDNA turnover and cfDNA end formation.[225]^50
   Thus, nuclease activity may be changed in GDM, leading to altered cfDNA
   physical properties. The biological implications behind these cfDNA
   motif changes and diversity in women with GDM require further
   investigation.

   TSS scores were previously suggested to inversely correlate with
   expression levels at promoter nucleosome-depleted
   regions.[226]^51^,[227]^52 A major discovery in our study was that by
   comparing TSS scores between GDM and controls, a set of genes with
   functions related to diabetes, organ development, and differentiation
   pathways were identified. For instance, PI3K and MAPK pathways were
   concurrently altered in GDM. Specifically, the TSS score for PIK3R1, a
   member of PI3K/AKT pathway with well-established roles in regulating
   insulin and metabolic diseases,[228]^35^,[229]^53^,[230]^54 was
   significantly different between GDM and controls. Additionally, PIK3R1
   was enriched in several growth-related pathways. Diabetic pregnancies
   are associated with distinct growth hormone profiles, which contribute
   to varied fetal growth patterns.[231]^55 Increased fetal fat deposition
   commonly observed in diabetic pregnancies may be attributed to modified
   cellular differentiation and underlying mechanisms that regulate body
   composition.[232]^56 The enriched pathways identified in our study also
   had important functions in cell growth differentiation. Thus, our
   findings endorse the rationale of using candidate cfDNA biomarkers to
   detect and monitor GDM.

   We also examined significantly altered TSS score signatures between GDM
   and controls to identify 50 genes as the candidate cfDNA biomarkers for
   GDM. Consistent with aforementioned results, several genes were
   directly related to diabetes and growth and development, such as
   LILRB1, PIK3R1, and CBS. Moreover, some candidate biomarker genes were
   also involved in preterm birth and lipid metabolism biological
   processes, including CNTN4 and RNF213, which were significantly
   associated with preterm births. Thus, longitudinal TSS score-signature
   patterns in candidate biomarker genes may represent altered fetal
   growth and development in women with GDM, leading to increased preterm
   birth risks.

   Lipid metabolism is a well-established process in diabetes.[233]^57
   Previously, Mak et al.[234]^58 reported a correlation between
   circulating lipids and cfDNA genetic aberrations. In our study, we
   identified lipidomic profiling shifts with correlations with TSS scores
   in the islet acinar marker PRSS1. We further used single-cell data to
   validate exocrine pancreatic alterations in type 2 diabetes or
   prediabetic status. Exocrine pancreatic disorder in diabetes is a
   clinically relevant but poorly understood condition.[235]^59 Thus, our
   findings contribute to evidence linking exocrine pancreas deficiencies
   to diabetes in vivo.[236]^44^,[237]^45

   We also showed that cfDNA physical properties and clinical data could
   be integrated into an SNN model to predict early GDM. This prediction
   performance was further confirmed and validated in two geographically
   distant and independent cohorts, thus highlighting the potential
   effectiveness of cfDNA in early GDM diagnoses. Furthermore, the
   addition of TSS score signatures from 50 candidate genes to the
   prediction model greatly improved GDM prediction performances in the
   1^st trimester and supported these genes as potential GDM candidate
   biomarkers. As a complex disease, GDM induces sophisticated
   physiological and metabolic changes that have a multifaceted impact on
   offspring growth, including macrosomia, hypoglycemia, respiratory
   distress syndrome, etc.[238]^60 Our study provides a platform for
   exploiting the dynamic features of lipid metabolism, preterm birth,
   insulin regulation, and growth and development genes to predict GDM.
   Interestingly, we also predicted offspring BMI using the cfDNA features
   of 50 candidate biomarker genes, possibly because offspring’s BMI
   showed a strong correlation with maternal onset in GDM. Plasma cfDNA
   has been widely used to non-invasively test for fetal trisomy and
   copy-number variants in clinical practice.[239]^13 Critically, our work
   shows the potential for integrating GDM predictions into NIPT in early
   pregnancy in the future.

   In conclusion, we conducted an integrative cfDNA analysis that reveals
   a temporal relationship between cfDNA and GDM, providing comprehensive
   insights into the genetic alterations associated with various
   biological processes in GDM. This non-invasive approach holds promise
   for clinical applications in early prediction of GDM.

Limitations of the study

   Despite our study having one of the largest sample sizes in high-depth
   sequencing cfDNA GDM cohorts, it is relatively smaller in comparison to
   other prospective NIPT studies.[240]^61^,[241]^62 Moreover, we utilized
   a nested case-control design where samples were carefully matched for
   key variables such as maternal gestational age and sampling weeks,
   thereby enhancing comparability between groups and improving
   statistical efficiency. However, this approach may limit extrapolation
   to broader populations. Furthermore, despite conducting external
   validation using two separate cohorts with substantial numbers of
   participants, determining the optimal sequencing depth for clinical
   translation into NIPT remains an open question. Despite these efforts,
   additional replication studies involving diverse and larger populations
   are imperative.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   TJBC cfDNA BED files This paper NGDC:
   OMIX004723; [242]https://ngdc.cncb.ac.cn/omix/preview/z7lvCjn0
   Single-cell RNA sequencing data related to islet dysfunction in
   diabetes Camunas-Soler, J. et al.[243]^39
   [244]https://github.com/jcamunas/patchseq.
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   Python 3.8.13 N/A [245]https://www.python.org
   Transcription Start Site (TSS) NCBI
   [246]https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_00000
   1405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
   R 4.2.2 The R Foundation [247]https://www.r-project.org/
   Fastp 0.23.2 Chen, S et al.[248]^63
   [249]https://github.com/OpenGene/fastp
   Minmap2 Li, H[250]^64 [251]https://github.com/lh3/minimap2
   biobambam Tischler, G[252]^65 [253]https://github.com/gt1/biobambam
   Samtools 1.6 Li, H et al.[254]^66
   [255]https://github.com/samtools/samtools
   Bamtools 2.5.2 Barnett, D. W.[256]^67
   [257]https://github.com/pezmaster31/bamtools
   ggcor 0.9.8.1 Huang, H et al.[258]^68
   [259]https://github.com/mj163163/ggcor-1
   igraph 1.4.1 Csardi G et al.[260]^69
   [261]https://github.com/igraph/rigraph
   ggraph 2.1.0 Pedersen, T et al.[262]^70
   [263]https://github.com/thomasp85/ggraph
   lme4 1.1–31 Bates D et al.[264]^71 [265]https://github.com/lme4/lme4
   lmerTest 3.1–3 Kuznetsova A et al.[266]^72
   [267]https://github.com/runehaubo/lmerTestR
   nlme 3.1–160 Pinheiro J et al.[268]^73
   [269]https://github.com/cran/nlme
   clusterProfiler 4.6.0 Yu, G et al.[270]^74
   [271]https://github.com/YuLab-SMU/clusterProfiler
   metascape Zhou, Y et al.[272]^75
   [273]https://metascape.org/gp/index.html#/main/step1
   Sklearn 1.1.2 Pedregosa, F et al.[274]^76
   [275]https://scikit-learn.org/
   Tensorflow 2.9.1 Martin, A[276]^77 [277]https://www.tensorflow.org/
   cfDNA features analysis This paper
   [278]https://github.com/zqr2008/cfdna/tree/main/feature_analysis
   Modelcode This paper
   [279]https://github.com/zqr2008/cfdna/tree/main/modelcode
   Figure reproduce This paper
   [280]https://github.com/zqr2008/cfdna/tree/main/figure_reproduce
   [281]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the lead contact, Dr. Qiangrong
   Zhai (zhaiqiangrong@genomics.cn).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       The data supporting our findings have been deposited into the CNGB
       Sequence Archive (CNSA)[282]^78 of China National GeneBank DataBase
       (CNGBdb)[283]^79 CNGB: CNP0004352
       ([284]https://db.cngb.org/search/?q=CNP0004352) and China National
       center for Bioinformation NGDC: OMIX004723
       ([285]https://ngdc.cncb.ac.cn/omix/preview/z7lvCjn0). These files
       are accessible in compliance with Chinese legal regulations
       (2023BAT1256), enabling future referencing and potential validation
       analyses using our data.
     * •
       Single-cell RNA sequencing data related to islet dysfunction in
       diabetes are available at
       [286]https://github.com/jcamunas/patchseq. The data used in the
       current article was preprocessed datasets provided[287]^39.
     * •
       The code used in the article is saved at
       [288]https://github.com/zqr2008/cfdna.
     * •
       Any additional information required to reanalyze the data reported
       in this work paper is available from the [289]lead contact upon
       request.

Experimental model and subject details

   This was a nested case-control study and women were selected from the
   TJBC,[290]^22 which is an ongoing prospective cohort established in
   2017 in Tianjin, China. The TJBC recruited pregnant women of ≤14 + 6
   gestational weeks (1^st trimester) who were followed-up at 15–27 weeks
   (2^nd trimester) and ≥28 weeks (3^rd trimester). The study was approved
   by the Ethics Committee of Tianjin Women and Children’s Health Center
   (No. 201706012-1), the Institutional Review Board of BGI (BGI-IRB
   17116), and was conducted in compliance with the ethical standards
   outlined in the Declaration of Helsinki and its subsequent revisions or
   similar ethical standards. Before enrollment, women provided written
   informed consent.

   A case-to-control ratio of 1:1 was selected to increase statistical
   power. Women underwent standard GDM screening; an overnight fast was
   followed by a 75 g OGTT or an FPG test at 24–28 weeks of
   gestation.[291]^80 The study consisted of 299 GDM cases and 299 healthy
   controls. Inclusion criteria for GDM cases were as follows: 1)
   singleton pregnancy, 2) blood samples donated at each visit during
   1^st, 2^nd, and 3^rd trimesters, and 3) questionnaires completed at
   each visit during pregnancy. Controls were selected using the same
   inclusion criteria. Both cases and controls were individually matched
   based on age (±2 years) and gestational week (±4 weeks) at the time of
   sample collection ([292]Figure 1). No women with GDM or controls had
   preexisting type 1 or type 2 diabetes. A GDM diagnosis was assigned
   between 24 and 28 weeks of gestation following Guidelines for GDM
   Diagnosis and Treatment (2014) (Department of Obstetrics and
   Gynecology, Chinese Medical Association).[293]^80 GDM screening and
   diagnosis tests were performed in central-regional-local healthcare
   networks in Tianjin using aforementioned tests. For FPG screening,
   women with an FPG value ≥5.1 mmol/L were diagnosed with GDM, and no
   additional diagnostic tests were required. For women with FPG values
   between 4.4 mmol/L and 5.1 mmol/L, a 75 g OGTT was recommended to
   confirm or rule out GDM. Women with no GDM had FPG values <4.4 mmol/L.
   For OGTT, women were deemed to have GDM if any of the following
   conditions were met: 1) fasting plasma glucose levels ≥5.1 mmol/L, 2)
   plasma glucose levels at 1 h ≥ 10.0 mmol/L, or 3) plasma glucose levels
   at 2 h ≥ 8.5 mmol/L. Of the 598 study women, 524 were diagnosed using
   the OGTT method and 74 using the FPG method. Obesity[294]^81 was
   defined as a pre-pregnancy body mass index (BMI) ≥ 28 kg/m^2.
   Gestational age at birth was defined as completed gestation weeks based
   on the estimated delivery date in women’s clinical records. Preterm
   birth was defined as a live birth before 37 completed gestation weeks.
   Large for gestational age (LGA) was defined as > the 90^th birth weight
   percentile for gestational age by infant gender. Birth weight, weight
   at 2 years old, and length/height data were also collected from
   clinical records. Low birth weight was defined at < 2500 g in live
   births and macrosomia was defined as ≥ 4000 g.

   To validate cfDNA dynamic signature changes throughout the three
   trimesters with respect to GDM, we selected cfDNA sequencing data
   (∼30×, paired-end sequencing) from an independent study of 202 samples
   (8 women with GDM and 106 healthy controls, each pregnant woman is
   sampled more than once) based on gestational age at plasma collection
   and subsequent follow-up as a Validation dataset1 ([295]Figure 1).
   These 114 individuals from the Shenzhen locality were recruited.
   Written informed consent was obtained from all individuals (No. BGI-IRB
   20012) ([296]Figure S5B).

   To confirm trait generalizability across datasets, characterized by
   lower sequencing depths and larger sample sizes, we used cfDNA
   sequencing data from Zhu et al.[297]^23 (∼0.6× and single-end
   sequencing) as a Validation dataset2. Using the same methodology, we
   computed TSS scores and fetal fraction for this dataset. Initially, we
   excluded samples with missing BMI values and fetal fraction <0.01.
   Subsequently, using the refined model, we validated 6104 samples
   (comprising 1045 women with GDM and 5059 healthy controls) with no more
   than three missing feature values ([298]Figure S5C).

Method details

Library construction and sequencing data

   At all visits, 5 mL of peripheral blood was drawn into
   ethylene-diamine-tetra-acetic-acid (EDTA) blood tubes (ComWin, Beijing,
   China). Plasma was generated using a two-step centrifugation
   protocol[299]^82 and stored at −80°C. CfDNA was isolated from 200 μL
   plasma using the MagPure Circulating DNA Mini KF kit (Magen, Guangzhou,
   China) according to manufacturer’s protocols. Extracted cfDNA was then
   used to prepare cfDNA libraries for sequencing using the MGIEasy free
   DNA library preparation reagent set (MGI, Shenzhen, China) following
   manufacturer’s protocols. Prepared libraries were circularized to
   create single-stranded DNA (ssDNA) circles using the MGIEasy
   Circularization Kit (MGI) following manufacturer’s instructions. A
   Qubit ssDNA assay kit (Invitrogen, USA) was used to quantify purified
   ssDNA circles, after which they underwent rolling circle amplification
   to generate DNA nanoballs. After these steps, the final products were
   quantified using a Qubit ssDNA Assay Kit (Invitrogen) and loaded onto a
   DNBSEQ platform (MGI) for multiplex sequencing using a paired-end
   100 bp strategy.

Lipid extraction and untargeted lipidomic profiling

   Lipids were extracted from maternal blood as previously
   described.[300]^83^,[301]^84 Briefly, precooled isopropanol spiked with
   a lipid internal standard mix (SPLASH LIPIDOMIX Mass Spec Standard,
   Avanti, USA) was added to plasma. After vortexing and then overnight
   incubation at −20°C, samples were centrifuged and supernatants analyzed
   using a liquid chromatography–mass spectrometer (LC-MS). Samples were
   separated on a CSH C18 column (1.7 μm, 2.1 × 100 mm, Waters, USA) and
   analyzed using a QExactive MS (Thermo Fisher Scientific, USA). LC
   gradient and MS conditions were previously reported.[302]^84 Mobile
   phase A contained acetonitrile/water (60:40) plus 10 mM ammonium
   formate and 0.1% formic acid, while mobile phase B contained
   isopropanol/acetonitrile (90:10) plus 10 mM ammonium formate and 0.1%
   formic acid. The scan range for MS detection was 200–200m/z with a
   resolution of 70,000, and the automatic gain control (AGC) target for
   MS acquisition was set to 3e6 with a maximum ion injection time of
   100 ms. The top three precursors were selected for subsequent tandem
   mass spectrometry fragmentation with a resolution of 17,500 and a
   maximum ion injection time of 50 ms. The AGC was 1e5, and the stepped
   normalized collision energy was set to 15, 30, and 45 eV. Lipid
   identification and quantitation were performed using LipidSearch 4.1
   SP2 software (Thermo Fisher, USA), and data scaling and normalization
   were processed in metaX.[303]^85

Sequence alignments and filtering

   Raw fastq data were subjected to quality filtering using Fastp
   0.23.2[304]^63 software based on the following criteria: (1) removal of
   adapter sequences, (2) elimination of sequences containing >10% unknown
   bases, and (3) removal of low-quality sequences. Filtered reads were
   then aligned to the GRCh38.p13 reference genome using MiniMap2[305]^64
   comparison software. Resultant comparison outputs were saved as bam
   files, which were sorted using Biobambam.[306]^65 Additionally,
   duplicate reads generated during amplification steps were marked in
   sorted BAM files and filtered. Only paired reads with proper mapping
   orientation and insert size (i.e., ≤600 bp) were retained for
   downstream analyses ([307]Table S2).

Fetal fraction

   The fetal fraction in plasma samples was estimated using SeqFF,[308]^86
   a method that capitalizes on dissimilar chromatin structures between
   the mother and fetus, leading to irregular cfDNA dispersion across the
   genome. To determine the fetal fraction, several regression models
   (Enet and WRSC) were developed using read counts from multiple cfDNA
   regions in maternal plasma. We used mean results from both models as
   the fetal fraction value ([309]Table S3).

Motifs and MDS

   End motif analysis was performed on single cfDNA fragments.[310]^17 We
   used the first four nucleotides at 5′ ends to calculate the frequencies
   of 256 (
   [MATH: <mrow><mn>4</mn><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo><mn>4</mn><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo><mn>4</mn><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo><mn>4</mn></mrow> :MATH]
   ) possible motifs (4 bp sequences and 4-mer motifs), and motifs were
   normalized according to the total number of ends.

   MDS in samples were calculated using [311]Equation 1.
   [MATH: <mrow><mi>M</mi><mi>D</mi><mi>S</mi><mo
   linebreak="badbreak">=</mo><mrow><munderover><mo>∑</mo><mrow><mi>i</mi>
   <mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mn>256</mn></munderover><m
   row><mo>−</mo><msub><mi>P</mi><mi>i</mi></msub><mo
   linebreak="badbreak">×</mo><msub><mi
   mathvariant="italic">log</mi><mn>2</mn></msub><mrow><mo
   stretchy="true">(</mo><msub><mi>P</mi><mi>i</mi></msub><mo
   stretchy="true">)</mo></mrow><mo linebreak="badbreak">/</mo><msub><mi
   mathvariant="italic">log</mi><mn>2</mn></msub><mrow><mo>(</mo><mn>256</
   mn><mo>)</mo></mrow></mrow></mrow></mrow> :MATH]
   (Equation 1)

   where P[i] is the frequency of a particular motif. A higher MDS
   indicated a higher diversity (i.e., a higher degree of randomness). The
   theoretical scale ranged from 0 to 1. A previous study demonstrated
   that maternal and fetal cfDNA fragments exhibited distinct length
   patterns, with maternal cfDNA generally longer than fetal
   fragments.[312]^87 In our analysis, we computed motif frequencies and
   MDS for all samples. We also divided fragments into three subsets:
   short, peak, and long. The short subset consisted of ≤150 bp fragments,
   the peak subset comprised fragments in the 160–170 bp range, and the
   long subset encompassed >250 bp fragments. Subsets were analyzed
   separately to investigate their respective characteristics
   ([313]Table S3).

MA

   Methylated cytosine-phosphate-guanines (CpGs) have a higher likelihood
   of cleavage at cytosine when compared to unmethylated CpGs, while
   having a reduced likelihood of cleavage at the base preceding the
   CpG.[314]^15 Such differential cleavage patterns can cause increased
   CGN motifs but decreased NCG motifs. By analyzing cfDNA cleavage
   patterns and resulting CGN/NCG motifs, cfDNA methylation status was
   inferred across different regions.[315]^15 We used CGN/NCG motif ratios
   ([316]Equation 2) to assess methylation status in samples and named it
   as methylation-associated (MA) value ([317]Table S3).
   [MATH: <mrow><mi>M</mi><mi>A</mi><mo
   linebreak="badbreak">=</mo><mfrac><mrow><mi>N</mi><mi>o</mi><mo>.</mo><
   mspace width="0.25em"></mspace><mi>o</mi><mi>f</mi><mspace
   width="0.25em"></mspace><msup><mn>5</mn><mo>’</mo></msup><mspace
   width="0.25em"></mspace><mi>C</mi><mi>G</mi><mi>N</mi><mspace
   width="0.25em"></mspace><mi>e</mi><mi>n</mi><mi>d</mi><mspace
   width="0.25em"></mspace><mi>m</mi><mi>o</mi><mi>t</mi><mi>i</mi><mi>f</
   mi><mi>s</mi></mrow><mrow><mi>N</mi><mi>o</mi><mo>.</mo><mspace
   width="0.25em"></mspace><mi>o</mi><mi>f</mi><mspace
   width="0.25em"></mspace><msup><mn>5</mn><mrow><mo>’</mo><mspace
   width="0.25em"></mspace></mrow></msup><mspace
   width="0.25em"></mspace><mi>N</mi><mi>C</mi><mi>G</mi><mspace
   width="0.25em"></mspace><mi>e</mi><mi>n</mi><mi>d</mi><mspace
   width="0.25em"></mspace><mi>m</mi><mi>o</mi><mi>t</mi><mi>i</mi><mi>f</
   mi><mi>s</mi></mrow></mfrac></mrow> :MATH]
   (Equation 2)

   5′CGN end motifs (i.e., 5′- CGA, CGT, CGG, and CGC). 5′ NCG end motifs
   (i.e., 5′- ACG, TCG, GCG, and CCG).

TSS scores

   Transcriptional activity is correlated with chromatin status around the
   TSS.[318]^88^,[319]^89 To quantify gene expression, we analyzed TSS
   region coverage. Specifically, we calculated upstream and downstream
   500 bp coverage of the TSS from aligned BAM files using the SAMtools
   1.6 depth function.[320]^66 To account for sequencing depth and bias,
   we normalized TSS region coverage by dividing the 1 kb region into
   three parts. Average side bin depth was used to normalize the depth of
   the middle 500 bp (TSS -250 to TSS +250) and the normalization rate of
   the middle bin was defined as the TSS score ([321]Equation 3).
   [MATH: <mrow><mi>T</mi><mi>S</mi><mi>S</mi><mspace
   width="0.25em"></mspace><mi>s</mi><mi>c</mi><mi>o</mi><mi>r</mi><mi>e</
   mi><mo
   linebreak="badbreak">=</mo><mfrac><mrow><mi>d</mi><mi>e</mi><mi>p</mi><
   mi>t</mi><mi>h</mi><mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><mi>m</mi><mi>i</mi><mi>d
   </mi><mi>d</mi><mi>l</mi><mi>e</mi><mspace
   width="0.25em"></mspace><mi>b</mi><mi>i</mi><mi>n</mi></mrow><mo>)</mo>
   </mrow></mrow><mrow><mi>d</mi><mi>e</mi><mi>p</mi><mi>t</mi><mi>h</mi><
   mspace
   width="0.25em"></mspace><mrow><mo>(</mo><mrow><mi>s</mi><mi>i</mi><mi>d
   </mi><mi>e</mi><mspace
   width="0.25em"></mspace><mi>b</mi><mi>i</mi><mi>n</mi></mrow><mo>)</mo>
   </mrow></mrow></mfrac></mrow> :MATH]
   (Equation 3)

   where depth (side bin) is the average depth of TSS -500 to TSS -250 and
   TSS +250 to TSS +500 regions. Depth (middle bin) refers to the average
   depth of the middle 500 bp (TSS -250 to TSS +250). High TSS scores
   indicated high coverage in the TSS region, indicating that cfDNA was
   highly protected and not easily bound to transcription-related factors,
   thereby eliciting low gene expression. In contrast, lower TSS scores
   were associated with higher gene expression.[322]^19^,[323]^51 In our
   study, TSS scores were used as gene expression measures.

TSS score-signature identification

   Selected TSS score-signatures were based on statistical tests and
   log2-transformed fold-change (log2(FC)) values. We used a linear
   mixed-effects model (LMM) approach to screen TSS scores, using a
   two-sided false-discovery rate threshold of <0.1 for selection. The LMM
   model included trimesters, groups, and interactions between trimesters
   and groups as fixed effects while incorporating a subject-specific
   random effect. We used Least Squares Means estimates to test for
   differences between groups. To determine the threshold for TSS
   score-signatures, we performed pairwise comparisons using TSS scores to
   calculate log2(FC) values to represent differences between GDM and
   control samples for each trimester, and collectively for all
   trimesters. A log2(FC) > 0.05 value was the threshold. Simultaneously,
   we conducted univariate statistical tests for these four comparisons
   using Wilcoxon rank sum tests (p < 0.1 threshold). A TSS score meeting
   all aforementioned criteria was selected as a signature. If a gene
   corresponded to multiple TSS scores, the final TSS score was primarily
   selected using the maximum absolute log2(FC) value. If both positive
   and negative log2(FC) values occurred, the direction was determined
   based on the location of the majority of TSS scores, and the TSS score
   with the maximum absolute value in the determined direction of log2(FC)
   values was chosen.

Gene set enrichment analysis and gene ontology analysis

   We performed GSEA[324]^90 using the R package clusterProfiler[325]^74
   and biological pathways from Human Wikipathways[326]^91 were used as
   primary sources for analyses. The Benjamini–Hochberg (BH) method was
   used for multiple testing corrections. A gene list was generated based
   on TSS scores, after which genes were ranked using pairwise log2(FC)
   values between GDM and control samples. We also used Metascape[327]^75
   for gene ontology analysis using the following thresholds: minimum
   overlap = 3, P-value cutoff = 0.01, and minimum enrichment = 1.5. Gene
   Ontology (GO) biological process pathways were selected in analyses.

C[BMI-a] definition

   To gain deeper insights into the relationships between high-importance
   TSS score-signatures among the top 20 predictive features for birth BMI
   and growth development, we used a derived formula from the TSS
   component of the predictive birth BMI model, C[BMI-a] ([328]Equation
   4):
   [MATH:
   <mrow><msub><mi>C</mi><mrow><mi>B</mi><mi>M</mi><mi>I</mi><mo>−</mo><mi
   >a</mi></mrow></msub><mo linebreak="badbreak">=</mo><mn>13.4589</mn><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mn>0.0809</mn><mo>)</mo></
   mrow><mo
   linebreak="goodbreak">×</mo><mi>L</mi><mi>I</mi><mi>L</mi><mi>R</mi><mi
   >B</mi><mn>1</mn><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>0.1696<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>M</mi><mi>I</mi><mi>R</mi><mn>8071</mn>
   <mo>−</mo><mn>1</mn><mo
   linebreak="goodbreak">+</mo><mi>T</mi><mi>H</mi><mi>A</mi><mi>P</mi><mn
   >12</mn><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>0.0106<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>R</mi><mi>N</mi><mi>F</mi><mn>213</mn><
   mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>0.2094<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>S</mi><mi>P</mi><mi>A</mi><mi>T</mi><mi
   >S</mi><mn>2</mn><mi>L</mi><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>1.4290<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>B</mi><mi>R</mi><mi>P</mi><mi>F</mi><mn
   >1</mn><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>0.0686<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>M</mi><mi>I</mi><mi>R</mi><mn>3689</mn>
   <mi>C</mi><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>0.0530<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>P</mi><mi>C</mi><mi>M</mi><mi>T</mi><mi
   >D</mi><mn>2</mn><mo
   linebreak="goodbreak">+</mo><mrow><mo>(</mo><mrow><mo>−</mo><mn>0.0088<
   /mn></mrow><mo>)</mo></mrow><mo
   linebreak="goodbreak">×</mo><mi>L</mi><mi>P</mi><mi>G</mi><mi>A</mi><mi
   >T</mi><mn>1</mn></mrow> :MATH]
   (Equation 4)

   Where LILRB1, MIR8071-1, THAP12, RNF213, SPATS2L, BRPF1, MIR3689C,
   PCMTD2, and LPGAT1 are genes in the C[BMI-a]. In this equation, each
   gene was represented by its TSS gene score.

External dataset validation

   To rigorously authenticate our findings, we retrieved cfDNA sequencing
   data from women with GDM, as published by Guo et al..[329]^19
   Accounting for sequencing depth and reference genome distinctions, we
   meticulously determined TSS scores that displayed consistent trends
   across healthy and GDM cohorts. Using a significance FDR threshold =
   0.05, selected TSS scores were subjected to further correlation
   analysis.

The neural network models

Neural network input data

   We implemented a parallel-connected Neural Network Model. CfDNA data
   were extracted as motifs, MDS, MA, and TSS scores. In input data, we
   had 256 4-mer motif features, four MDS features, three MA features, and
   50 TSS score features from previous steps.

   The sampling process for this network model involved detailed
   extraction and feature selection from cfDNA data. We used 256 4-mer
   motif features, four MDS features, three MA features, and 50 TSS score
   features. The dataset was meticulously partitioned into training,
   validation, and testing sets, with a typical split of 80% for training
   and 20% for testing. Stratified random sampling procedures ensured the
   representation of key variables such as age and disease status across
   subsets. To address data imbalance, techniques such as oversampling the
   minority class or undersampling the majority class were used.

   For internal validation, we used 10-fold cross-validation in the
   training set, which allowed for a comprehensive evaluation of model
   performance. This method involved training the model on nine subsets
   and validating it on the remaining subset, in an iterative process.
   External validation was achieved using two external datasets to
   validate model applicability and robustness in different settings or
   populations.

Feature selection

   From preliminary studies, the majority of features from TSS genes and
   motifs included too much noise which contribute little to our results.
   Thus, we implemented forward feature selection as a first step to
   remedy this. This step started from an empty logistic model. We then
   added features one by one to determine the best performance feature for
   each step. The model was adapted from a logistic model used by Hu
   et al.[330]^92 ([331]Equation 5).
   [MATH:
   <mrow><mfrac><msup><mi>e</mi><msup><mi>y</mi><mo>′</mo></msup></msup><m
   row><mn>1</mn><mo
   linebreak="badbreak">+</mo><msup><mi>e</mi><msup><mi>y</mi><mo>′</mo></
   msup></msup></mrow></mfrac><mo linebreak="badbreak">=</mo><mrow><mo
   stretchy="true">[</mo><mrow><msub><mi>β</mi><mn>1</mn></msub><mspace
   width="0.25em"></mspace><msub><mi>β</mi><mn>2</mn></msub><mo>…</mo><msp
   ace width="0.25em"></mspace><msub><mi>β</mi><mi>n</mi></msub></mrow><mo
   stretchy="true">]</mo></mrow><mo linebreak="badbreak">×</mo><mrow><mo
   stretchy="true">[</mo><mtable
   columnalign="center"><mtr><mtd><msub><mi>f</mi><mn>1</mn></msub></mtd><
   /mtr><mtr><mtd><msub><mi>f</mi><mn>2</mn></msub></mtd></mtr><mtr><mtd><
   mo>…</mo></mtd></mtr><mtr><mtd><msub><mi>f</mi><mi>n</mi></msub></mtd><
   /mtr></mtable><mo stretchy="true">]</mo></mrow><mo
   linebreak="goodbreak">+</mo><mrow><mo>[</mo><mi>b</mi><mo>]</mo></mrow>
   </mrow> :MATH]
   (Equation 5)

   At first, there were no components in β and f vectors. Starting from an
   empty model, we first added one feature to the initial model and
   compared performance improvements using likelihood ratio Chi-square
   tests with the previous model. The feature with the smallest Chi-square
   test p value (p < 0.05) was selected for the model. Then, in the next
   step, we selected the best features from the remaining features. We
   continued this step-forward selection process until no qualified
   features were available to improve the model performance using
   likelihood ratio Chi-square test p < 0.05 values. These features were
   then used to generate an SNN model.

SNN sub-networks

   In our model, we generated the final input as a 93-features vector that
   included 31 motifs, four MDS, three MA, and 55 TSS scores. Given the
   distinct patterns in each feature, we developed different neural
   network branches in the SNN to process data prior to the final
   categorization.

   Here, we used a dense layer from Keras to build a fully connected
   neural network to process the 31 motifs ([332]Equation 6). We used Wi,
   Wj, …Wm to represent the weight matrix of layers i, j, …until m. Each
   layer processed the previous layers’ output and used the Relu function
   to generate an output for the next layer.
   [MATH: <mrow><mi>L</mi><mo linebreak="badbreak">=</mo><mrow><mo
   stretchy="true">[</mo><mrow><mi>R</mi><mi>e</mi><mi>l</mi><mi>u</mi><mr
   ow><mo
   stretchy="true">(</mo><mrow><msub><mi>W</mi><mi>m</mi></msub><mrow><mo
   stretchy="true">[</mo><mrow><mo>…</mo><mrow><mo
   stretchy="true">[</mo><mrow><mi>R</mi><mi>e</mi><mi>l</mi><mi>u</mi><mr
   ow><mo
   stretchy="true">(</mo><mrow><msub><mi>W</mi><mi>j</mi></msub><mrow><mo
   stretchy="true">[</mo><mrow><mi>R</mi><mi>e</mi><mi>l</mi><mi>u</mi><mr
   ow><mo
   stretchy="true">(</mo><mrow><msub><mi>W</mi><mi>i</mi></msub><mo>·</mo>
   <msub><mi>x</mi><mi>i</mi></msub><mo
   linebreak="badbreak">+</mo><msub><mi>b</mi><mi>i</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo
   linebreak="badbreak">+</mo><msub><mi>b</mi><mi>k</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow><mo>…</mo></mrow><mo
   stretchy="true">]</mo></mrow><mo
   linebreak="badbreak">+</mo><msub><mi>b</mi><mi>n</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mo
   stretchy="true">]</mo></mrow></mrow> :MATH]
   (Equation 6)

   Our Motif features exhibits consistent values with a fixed range. From
   this uniformity, a fully connected neural network can effectively
   interpret such patterns and create an optimal classification model. In
   other words, such a network can yield the best results for classifying
   or categorizing data based on uniform motif features. For MDS and MA,
   we processed these with a simple 1-unit neuron ([333]Equation 7), where
   x was the input matrix of MDS or MA, and W was an n×1 weight matrix.
   Here, we used three MA features as an example. These features were
   relatively straightforward and did carry much information. Due to their
   simplicity, only one neuron was required to transfer processed data to
   the classification layer. This limited information did not necessitate
   a larger, more complex network structure; hence a single output neuron
   sufficed.
   [MATH: <mrow><mi>M</mi><mo
   linebreak="badbreak">=</mo><mi>R</mi><mi>e</mi><mi>l</mi><mi>u</mi><mro
   w><mo stretchy="true">(</mo><mrow><mrow><mo
   stretchy="true">[</mo><mrow><msub><mi>x</mi><mn>1</mn></msub><mspace
   width="0.25em"></mspace><msub><mi>x</mi><mn>2</mn></msub><mspace
   width="0.25em"></mspace><msub><mi>x</mi><mn>3</mn></msub></mrow><mo
   stretchy="true">]</mo></mrow><mo>×</mo><mrow><mo
   stretchy="true">[</mo><mtable
   columnalign="center"><mtr><mtd><msub><mi>w</mi><mn>1</mn></msub></mtd><
   /mtr><mtr><mtd><msub><mi>w</mi><mn>2</mn></msub></mtd></mtr><mtr><mtd><
   msub><mi>w</mi><mn>3</mn></msub></mtd></mtr></mtable><mo
   stretchy="true">]</mo></mrow><mo>+</mo><mrow><mo>[</mo><mi>b</mi><mo>]<
   /mo></mrow></mrow><mo stretchy="true">)</mo></mrow></mrow> :MATH]
   (Equation 7)

   Next, we used convolutional layers to process TSS scores ([334]Equation
   8), which we based on the observation that the TSS score matrix had
   many internal relationships between different genes; a convolution
   layer learns and emphasizes the relationship between two genes and thus
   helps with the final classification job.
   [MATH: <mrow><mi>R</mi><mo
   linebreak="badbreak">=</mo><mi>M</mi><mi>a</mi><mi>x</mi><mi>p</mi><mi>
   o</mi><mi>o</mi><mi>l</mi><mrow><mo
   stretchy="true">(</mo><mrow><munder><mo>∑</mo><mi>j</mi></munder><mrow>
   <munder><mo>∑</mo><mi>k</mi></munder><mrow><mi>f</mi><mi>i</mi><mi>l</m
   i><mi>t</mi><mi>e</mi><mi>r</mi><mrow><mo>[</mo><mrow><mi>j</mi><mo>,</
   mo><mi>k</mi></mrow><mo>]</mo></mrow><mi>T</mi><mi>S</mi><mi>S</mi><mro
   w><mo>[</mo><mrow><mi>m</mi><mo>−</mo><mi>j</mi><mo>,</mo><mi>n</mi><mo
   >−</mo><mi>k</mi></mrow><mo>]</mo></mrow></mrow></mrow></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   (Equation 8)

   j and k represent the coordinators of the convolutional kernel filter.
   We used a 2 dimension 3 times 3 kernal filter for the TSS matrix
   ([335]Equation 7), where m and n were TSS matrix coordinates. Thus, as
   depicted ([336]Equation 7), each value in the TSS matrix used
   multiplication and summation processes with a convolutional kernel
   filter using its neighboring values to generate a new feature map
   value.

Classification and regression

   After SNN processing, all SNN outputs were concatenated as a whole
   input matrix for classification and regression ([337]Equations 9
   and [338]10).
   [MATH: <mrow><mi>c</mi><mi>l</mi><mi>a</mi><mi>s</mi><mi>s</mi><mo
   linebreak="badbreak">=</mo><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>
   m</mi><mi>a</mi><mi>x</mi><mrow><mo>(</mo><mi>p</mi><mo>)</mo></mrow><m
   o
   linebreak="badbreak">=</mo><mi>s</mi><mi>o</mi><mi>f</mi><mi>t</mi><mi>
   m</mi><mi>a</mi><mi>x</mi><mrow><mo
   stretchy="true">(</mo><mrow><mrow><mo>[</mo><mrow><mi>L</mi><mi>M</mi><
   mi>R</mi></mrow><mo>]</mo></mrow><mo>·</mo><msub><mi>W</mi><mi>b</mi></
   msub><mo
   linebreak="badbreak">+</mo><msub><mi>b</mi><mi>c</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   (Equation 9)

   where W[b] is the weight matrix of the final classification layer, and
   b refers to binary classification. As a binary classification, we used
   only one classification output p to represent GDM probability. Thus, W
   is an n × 1 matrix; n is the total output number from output L, M, and
   R from [339]formula 5, [340]6, and [341]7. Finally, we added a bias to
   the classification layer - shown as b[c] - where c stands for
   classification.
   [MATH:
   <mrow><mi>l</mi><mi>i</mi><mi>n</mi><mi>e</mi><mi>a</mi><mi>r</mi><mo
   linebreak="badbreak">=</mo><mi>R</mi><mi>e</mi><mi>l</mi><mi>u</mi><mro
   w><mo
   stretchy="true">(</mo><mrow><mrow><mo>[</mo><mrow><mi>L</mi><mi>M</mi><
   mi>R</mi></mrow><mo>]</mo></mrow><mo>·</mo><msub><mi>W</mi><mi>l</mi></
   msub><mo
   linebreak="badbreak">+</mo><msub><mi>b</mi><mi>l</mi></msub></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   (Equation 10)

   we used W[l] to represent the weight matrix of the final output in
   linear regression analyses. We used b[l] for the bias of the last
   output layer.

   We also used a fully connected layer, using Softmax activation and
   cross-entropy loss functions, for the classification output. The
   activation function of the regression output was Relu since there was
   no negative output for BMI and weight. The loss function of the
   regression output was the mean square error. We chose the Area under
   the ROC Curve (AUC) as the main SNN evaluation metric. Briefly, we
   randomly shuffled our dataset and split it (n = 598; GDM n = 299 and
   controls n = 299) according to 75% training and 25% testing. Test
   processing included a 10-fold cross-validation. Thus, we trained 10
   models for training data and performed 10 test steps for testing data,
   which yielded 10 AUCs for 95% CI.

Feature importance

   Feature importance contributed to the final model and was evaluated
   using the following steps: 1) We randomly shuffled the value of each
   feature for every data record in the test dataset; 2) We loaded the
   pre-trained model and predicted classification or regression results of
   the shuffled test dataset; 3) We recorded the loss of prediction
   results and compared it with the loss of the original non-shuffled test
   dataset; and 4) We determined differences in loss values as feature
   importance.

Random forest classification model

   We used a random forest classification model to verify feature
   importance in SNN models. In this scenario, we used features as
   classification attributes in the random forest model. This was
   implemented using sklearn[342]^76 with estimator trees = 100, the Gini
   factor as a criterion, and no limitation on tree max depth.

Forward stepwise feature selection (FSFS) in logistic regression for TSS
selection

   We further refined TSS features using FSFS which is a sequential method
   that begins with no predictors in the model and adds them one at a
   time. Each variable is chosen based on its contribution to improving
   the model’s fit, and evaluated by a statistical criterion (likelihood
   ratio test). The logistic regression model used in this selection
   process is described by the following [343]Equation 11:
   [MATH: <mrow><mi mathvariant="italic">ln</mi><mrow><mo
   stretchy="true">(</mo><mfrac><mi>p</mi><mrow><mn>1</mn><mo>−</mo><mi>p<
   /mi></mrow></mfrac><mo stretchy="true">)</mo></mrow><mo
   linebreak="badbreak">=</mo><msub><mi>β</mi><mn>0</mn></msub><mo
   linebreak="badbreak">+</mo><msub><mi>β</mi><mn>1</mn></msub><msub><mi>x
   </mi><mn>1</mn></msub><mo
   linebreak="badbreak">+</mo><msub><mi>β</mi><mn>2</mn></msub><msub><mi>x
   </mi><mn>2</mn></msub><mo linebreak="badbreak">+</mo><mo>…</mo><mo
   linebreak="goodbreak">+</mo><msub><mi>β</mi><mi>n</mi></msub><msub><mi>
   x</mi><mi>n</mi></msub></mrow> :MATH]
   (Equation 11)

   where p is the probability of a feature occurrence,
   [MATH: <mrow><msub><mi>β</mi><mn>0</mn></msub></mrow> :MATH]
   is the intercept, and
   [MATH: <mrow><msub><mi>β</mi><mn>1</mn></msub></mrow> :MATH]
   ,
   [MATH: <mrow><msub><mi>β</mi><mn>2</mn></msub></mrow> :MATH]
   , …,
   [MATH: <mrow><msub><mi>β</mi><mi>n</mi></msub></mrow> :MATH]
   are coefficients of the predictors x[1], x[2], …, x[n].

   This iterative process continued until new variable addition did not
   significantly improve the model, based on a predefined improvement
   threshold. The final model included a set of features that were highly
   predictive of GDM. By using FSFS, we built a model that was not only
   predictive but also interpretable in selecting the most important genes
   implicated in GDM.

Validation of the refined model

   To rigorously assess validation effectiveness, our refined model
   underwent training using the TJBC training dataset. This evaluation
   used a fully connected neural network architecture, incorporating nine
   distinct features (SeqFF, BMI, CDH23, CNTN4, RNF213, SMARCD1, TWSG1,
   PSD3, and PIK3R1) specified by the refined model. To evaluate the
   validation dataset, we saved the best performed model selected from the
   training TJBC and use the saved refined model to test the validation
   datasets. The validation process was conducted using two separate
   datasets: Validation dataset1 (from southern China) and Validation
   dataset2 (from central China). To validate the datasets thoroughly, we
   applied a comprehensive testing approach similar to what is known as
   10-fold cross-validation. This process involves dividing the dataset
   into ten equal parts, we used one part for testing for each fold. We
   repeated this procedure ten times, each time with a different part used
   for testing, to ensure the robustness of our validation method. This
   involved randomly shuffling validation datasets and partitioning them
   into 10 equal subsets, each constituting 10% of the data. Subsets were
   then independently tested, with the results aggregated to calculate the
   overall average performance and model CIs.

Quantification and statistical analysis

   Univariate comparisons were conducted using two-sided Wilcox Rank Sum
   Tests. For repeated collected measurements, our main goal was to
   compare how certain cfDNA features changed between cases and controls.
   We used LMM (Methods: TSS score-signature identification) to calculate
   average cfDNA values, identify differences between groups, and
   understand temporal trends.

   We developed two models: the first was unadjusted and the second LMM
   accounted for covariates. The primary focus of our data analysis
   centered on comparing cfDNA change patterns between women with GDM and
   controls. We used linear mixed-effects models to compute cfDNA least
   squares means with respect to group differences and trend assessments.
   Our primary outcome was the identification of pairwise differences
   between cases and controls. We also hypothesized a priori that
   pregnancy trimester progression was potentially associated with
   abnormal physiological changes in GDM, such as an increased
   insulin-resistant state, which may have manifested as altered cfDNA
   levels correlated to glucose levels. Therefore, our analytical models
   incorporated trimester progression and associated interactions with
   disease status. For covariates, based on known risk factors for GDM and
   cfDNA predictors, we considered pre-pregnancy BMI, maternal
   age,[344]^93 ethnicity,[345]^94 education,[346]^95 smoking
   status,[347]^96 and alcohol consumption[348]^97 as potential
   confounding variables. Some covariates were not included in the final
   adjusted model for the following reasons: 1) Age: To address maternal
   age bias, we carefully selected age-matched case-controls (methods
   Case-control study design). 2) Ethnicity: All participants were of
   Asian ethnicity. 3) Pre-pregnancy BMI: Although this is a known factor
   associated with GDM and potentially influences cfDNA features, we did
   not include it as a covariate in our analysis. This decision was based
   on previous research indicating that obesity-induced DNA release from
   adipocytes stimulated insulin resistance.[349]^98 The interplay between
   obesity, GDM, and cfDNA is complicated, and potentially introduced
   collider bias into our study. Furthermore, our primary focus was to
   examine whether GDM status affected cfDNA features, and we were less
   concerned with how these effects were mediated by obesity. We
   constructed two models: model 1 which was unadjusted, and model 2 which
   was a linear mixed-effects model that included adjustments for
   education, alcohol consumption (yes/no), and also smoking during
   pregnancy (yes/no). Our results and discussion were based on the
   outcomes from these models ([350]Table S5).

   We defined statistical significance as a two-sided p < 0.05 value and
   corrected non-parametric analyses for multiple testing using the BH
   method. To perform correlation analyses, we separately calculated
   Spearman’s correlations for each trimester between GDM and control
   groups.

Acknowledgments