Graphical abstract graphic file with name fx1.jpg [85]Open in a new tab Highlights * • Gestational diabetes mellitus (GDM) exhibits distinct cfDNA physical properties * • Dynamic cfDNA variations between GDM and controls * • Altered exocrine pancreas identified by islet acinar marker gene PRSS1 * • CfDNA links between GDM and altered lipid metabolism __________________________________________________________________ Tang et al. present a longitudinal integrative analysis of cell-free DNA, revealing a distinctive signature and dynamic patterns specific to gestational diabetes mellitus. Lipidomic alterations and changes in pancreatic exocrine markers are discerned. Introduction Globally, gestational diabetes mellitus (GDM) is an increasingly prevalent disease among pregnant women and affects approximately one in six pregnancies.[86]^1 GDM is caused by insulin resistance and pancreatic β-cell dysfunction during pregnancy and is associated with long-term adverse outcomes such as type 2 diabetes mellitus and cardiovascular disease.[87]^2^,[88]^3 GDM is also associated with metabolic imbalance that increases the risk of neonatal complications, primarily fetal growth deviations[89]^4 and preterm birth.[90]^5 Furthermore, increasing evidence now suggests that GDM has long-term consequences for children’s development such as cardiovascular diseases and insulin resistance.[91]^2^,[92]^3 Research also shows that promptly treating GDM before 20 weeks of gestation slightly reduces a combination of negative outcomes in newborns.[93]^6 Despite great advances in the diagnosis and clinical management of GDM, the disease mechanisms are still unclear and early condition predictions remain difficult.[94]^7 Since current diagnostic criteria for GDM are mainly based on glycemic levels, dynamic changes in genetic factors are often ignored.[95]^8 Thus, understanding temporal genetic markers in response to a growing fetus during GDM and exploring associated dynamic genetic changes can provide insights into gene function that influences insulin actions and regulates metabolism.[96]^9 The association of cell-free DNA (cfDNA) in maternal plasma with gestational disease by recent studies may provide a new insight into GDM prediction using cfDNA.[97]^10^,[98]^11 Plasma cfDNA are short DNA fragments primarily derived from cell apoptosis, with fragments from different tissue-specific nucleosome arrangements.[99]^12 Initially used as a non-invasive prenatal testing (NIPT) approach for fetal chromosome abnormalities,[100]^13^,[101]^14 cfDNA was found to contain non-random fragments associated with tissue damage, gene expression, methylation levels, and other factors.[102]^15 Previous studies reported that cfDNA molecules manifested fragmental characteristics,[103]^16 nucleosome relationships,[104]^17 and endpoints that reflected tissue origins and turnover mechanisms.[105]^18 Guo et al.[106]^19 found that cfDNA coverage patterns in gene promoter regions reflected gene expression levels and could be used to predict pregnancy complications. Importantly, in patients with GDM, the cfDNA fetal fraction is lower when compared with non-GDM pregnant women.[107]^20 Recently, it was shown that cfDNA methylation and transcriptomic signatures had the potential to predict adverse pregnancy outcomes.[108]^21 However, more research is required to characterize cfDNA feature associations with patients with GDM. In this study, we compared longitudinal cfDNA profiles between women with GDM and healthy control to identify distinctive changes in cfDNA features not previously observed. We then correlated the underlying implications of these cfDNA features and demonstrated the utility of cfDNA in unraveling the biological mechanisms underpinning GDM. Furthermore, the validation in external datasets from NIPT data supported the clinical relevance of our results, indicating their predictive applicability regardless of sequencing depth. Results Population demographics and pregnancy characteristics A total of 299 women with GDM and 299 matched healthy non-GDM pregnant women were selected from the Tianjin Birth Cohort (TJBC)[109]^22 as the discovery dataset. A small prospective cohort containing 8 women with GDM and 106 healthy controls was used as a validation cohort (validation dataset1) ([110]Figure 1A). Additionally, a dataset from Zhu et al.[111]^23 contained the NIPT data of 21,813 pregnant Chinese women. Following the inclusion criteria ([112]Figure S5C), we ultimately arrived at a final set of 6,104 samples (1,045 GDM and 5,059 healthy controls) for validation dataset2. No significant differences in baseline characteristics were recorded in women who underwent oral glucose tolerance tests or fasting plasma glucose tests to diagnose GDM ([113]Table S1). Peripheral blood samples from the 1^st, 2^nd, and 3^rd trimester women underwent high-depth sequencing (∼60×) to analyze cfDNA features, including fragment size, fetal fraction, motif patterns, and transcription start site (TSS) scores. We found no significant differences between different sequencing batches ([114]Figure S1A). Women’s demographics and pregnancy characteristics are shown in [115]Tables 1 and [116]S1. The mean ages of control and GDM were 31.20 ± 3.87 and 31.23 ± 3.93 years, respectively. Women with GDM had lower education levels, higher pre-pregnancy body mass index (BMI), a family history of diabetes, a lower gestational age for delivery, and a higher rate of cesarean section compared with control. As expected,[117]^24 women with GDM had higher total cholesterol (CHO), triglyceride (TG), and low-density lipoprotein cholesterol (LDLC) levels, but lower high-density lipoprotein cholesterol levels. Figure 1. [118]Figure 1 [119]Open in a new tab High-depth cfDNA sequencing captures fragmentation changes in GDM patient plasma vs. control plasma samples (A) Study design showing how samples were collected and analyzed (created with [120]BioRender.com). (B) Size profile distributions varied across different groups; solid and dashed lines represent nucleosome-derived and mitochondria-derived cfDNA fragments, respectively. (C) Longer cfDNA fragment ratios across GDM patient and control plasma samples. A longer size ratio was defined as the proportion of the number of reads from 160 to 300 bp (p < 0.001, Wilcoxon rank-sum test). (D) Longitudinal variations between GDM and controls. The fetal fraction differed significantly between GDM and controls (p < 0.01, LMM). The mean is represented by the central point and the error bars indicate the standard deviation from the mean. Table 1. Participants' characteristics in the TJBC Variable Control, n = 299[121]^a GDM, n = 299[122]^a p value[123]^b Age, years 31.20 (3.87) 31.23 (3.93) 0.92 Education level 0.01 Below senior high school 36 (12.04%) 59 (19.73%) – Above college or others 263 (87.96%) 240 (80.27%) – Pre-pregnancy BMI, kg/m^2 22.88 (3.32) 25.22 (4.49) <0.001 Primipara 0.6 Yes 200 (66.89%) 206 (68.90%) – No 99 (33.11%) 93 (31.10%) – Family history of diabetes 0.039 Yes 48 (16.05%) 68/299 (22.74%) – No 251 (83.95%) 231 (77.26%) – Current or former smoking 0.055 Yes 72 (24.08%) 93 (31.10%) – No 227 (75.92%) 206 (68.90%) – Drinking after pregnancy 0.073 Yes 12 (4.01%) 4 (1.34%) – No 287 (95.99%) 295 (98.66%) – Gestational age for delivery, weeks 39.15 (1.27) 38.68 (1.40) <0.001 Delivery mode 0.072 Caesarean section 144 (48.16%) 166 (55.52%) – Vaginal delivery 155 (51.84%) 133 (44.48%) – Gender of child 0.51 Female 141 (47.16%) 133 (44.48%) – Male 158 (52.84%) 166 (55.52%) – Birth weight of child, g 3,374.03 (436.61) 3,380.50 (466.07) 0.89 Birth length of child, cm 50.08 (1.44) 50.06 (1.53) 0.66 BMI of child at 2 years, kg/m^2 16.02 (1.34) 16.19 (1.78) 0.61 Unknown 183 180 – CHO, mmol/L 4.78 (0.74) 4.90 (0.84) 0.045 TG, mmol/L 1.45 (0.56) 1.60 (0.63) 0.002 HDLC, mmol/L 1.76 (0.37) 1.71 (0.40) 0.03 LDLC, mmol/L 2.69 (0.63) 2.91 (0.78) 0.001 UA, μmol/L 203.83 (45.45) 221.37 (67.91) <0.001 PRO 0.65 Positive 22 (7.36%) 25 (8.36%) – Negative 277 (92.64%) 274 (91.64%) – ALT, U/L 18.75 (12.38) 23.14 (18.45) <0.001 AST, U/L 17.49 (7.05) 18.64 (10.07) 0.18 BUN, mmol/L 3.08 (0.80) 3.09 (0.89) 0.92 Cr, μmol/L 55.24 (14.09) 54.10 (17.81) 0.093 Systolic blood pressure, mm Hg 108.44 (10.62) 113.27 (12.10) <0.001 Diastolic blood pressure, mm Hg 70.34 (7.72) 72.58 (9.22) 0.002 [124]Open in a new tab All variables were investigated or measured in early pregnancy; CHO, plasma total cholesterol; TG, plasma triglycerides; HDLC, high-density lipoprotein cholesterol; LDLC, low-density lipoprotein cholesterol; UA, blood uric acid; PRO, urine protein; ALT, alanine aminotransferase; AST, aspartate aminotransferase; BUN, blood urea nitrogen; Cr, serum creatinine. ^a Mean (SD); n (%). ^b Wilcoxon rank-sum tests; Pearson’s chi-squared tests; Fisher’s exact tests (drinking after pregnancy). cfDNA physical properties are different in women with GDM We mapped cfDNA paired-end sequencing data to the human reference genome (GRCh38.p13) and divided it into nuclear and mitochondria origins ([125]Figures 1B and [126]S1B). In both healthy and GDM groups, nucleosome and mitochondria-derived cfDNA had peak fragment sizes of 167 and 80 bp, respectively. Nuclear-derived cfDNA had smaller fragment peaks with periodic 10 bp intervals, patterns consistent with previous investigations,[127]^25 and indicating potential associations with histone-bound DNA lengths. Such patterns were similar in control and GDM groups ([128]Figure 1C). However, women with GDM had a significantly higher ratio of longer cfDNA than control (p < 0.001, Wilcoxon rank-sum test). Mitochondria-derived cfDNA distribution patterns between control and GDM groups were also different. Additionally, different gestation trimesters were associated with varying patterns of distribution in nuclear and mitochondria-derived cfDNA ([129]Figure S1B). Fetal fractions showed an increasing tendency as pregnancies progressed ([130]Figure 1D), consistent with findings reported previously.[131]^26 Notably, fetal fractions in women with GDM were constantly lower than those in controls (p < 0.01, linear mixed-effects model [LMM]), consistent with previous studies.[132]^20^,[133]^27 We further compared fetal fractions in obesity and preterm subgroups and observed that lower fetal fractions associated with GDM were independent of multiple factors ([134]Figure S1C). Among the 256 cfDNA fragment 4-mer end motifs, CCCA was the most enriched in all samples, which aligned with previous research.[135]^17 GDM had a significantly higher proportion of CCCA motifs than their controls (p < 0.05, LMM). Moreover, GDM had a significantly higher percentage of ACTT and ACCG motifs and a lower percentage of GCGG, GCGC, and TCGG motifs when compared with controls ([136]Figures 2A and [137]S2A). We then calculated the frequency diversity of all 256 motifs using motif diversity scores (MDSs), which showed decreased diversity in all fragment size categories as pregnancies progressed ([138]Figures 2A and [139]S2B). Interestingly, the MDS in GDM was notably lower than in controls, especially during the 1^st and 2^nd trimesters. Moreover, MDS trends were comparable with the validation dataset1 ([140]Figure S2C). We also used methylation-associated (MA) values to infer cfDNA methylation status[141]^15 in controls and GDM and found that the latter group had higher MA values than the former group ([142]Figure 2A). These differences represented specific cfDNA physical properties related to gene expression patterns during GDM. Figure 2. [143]Figure 2 [144]Open in a new tab CfDNA physical property differences between GDM and controls (A) Comparing cfDNA physical properties between GDM and controls. Signatures differed in motif frequency (CCCA = representative sequence), MDS (four subgroup sizes, (1) all: all fragments, (2) short: ≤150 bp, (3) peak: 160–170 bp, and (4) long: ≥250 bp), and MA (methylation-associated value). The center line in the boxplot represents the median, and lower, upper whiskers, and outliers correspond to the 1.5× interquartile range and outliers outside that range, respectively. p values for disease effects were calculated from the LMM. (B) Left: relative PIK3R1 coverage revealed differences between GDM and controls across different trimesters. Right: TSS PIK3R1 coverage revealed differences between GDM and controls across different trimesters (case1, case2, and case3 represent 1^st, 2^nd, and 3^rd trimester of GDM, respectively.). (C) TSS scores after a log transformation calculated for previously identified GDM-associated genes. Wilcoxon rank-sum tests were used to calculate p values. (D) TSS scores at various locations on multiple chromosomes. The x axis represents chromosome locations, while the y axis represents mean TSS scores after a log transformation was applied. Yellow dots on the graph indicate the 10 most significant TSS score differences between GDM and controls. (E) Top enriched gene set enrichment analysis terms in Wikipathways. Colors represent normalized enrichment scores and point size represents log[10](p adj) values. TSS scores identify pathway changes in GDM We next analyzed cfDNA coverage and TSS scores in three previously reported GDM-related genes (PIK3R1, PPARG, and TCF7L2) in Asian populations.[145]^28^,[146]^29^,[147]^30 We found that the coverage of up- and downstream TSS regions in these genes differed between GDM and control ([148]Figures 2B, [149]S3A, and S3B). Specifically, the TSS score of PIK3R1 was significantly higher in women with GDM when compared to control ([150]Figure 2C, p < 0.001, Wilcoxon rank-sum test with Bonferroni correction). We further calculated sequencing coverages near TSS regions using a TSS score method and identified genome-wide changes in cfDNA coverage near these regions at different trimesters between control and GDM groups ([151]Figures 2D, [152]S3C, S3D, and S3E). For validation purposes, we performed a similar genome-wide TSS coverage analysis using a public dataset[153]^19 and observed a correlation (Pearson correlation coefficient = 0.61, p < 0.01, [154]Figure S6F and [155]Table S6) between the fold changes of TSS coverage in the 1^st trimester. To perform pathway enrichment analysis, we selected genes with TSS score fold-change differences between control and GDM groups. In the 1^st trimester, no pathways were enriched. However, in 2^nd and 3^rd trimesters, we found increasing numbers of enriched pathways, including several signaling pathways related to diabetes ([156]Figure 2E). For instance, PI3K (phosphatidylinositol 3-kinase) and MAPK (mitogen-activated protein kinase) pathways, both of which are activated by insulin and involved in lipid metabolism,[157]^31^,[158]^32 were enriched in GDM (Benjamini-Hochberg corrected p = 0.032 for PI3K in the 2^nd trimester and 0.024 for MAPK in all trimesters) along with the insulin signaling pathway. PIK3R1, a member of PI3K/AKT pathway, contributed to this enrichment ([159]Table S3). Additionally, we identified enriched CHO biosynthesis and leptin signaling pathways, both of which are implicated in diabetes.[160]^33^,[161]^34 Interestingly, we also identified pathways involved in angiogenesis such as VEGFA (vascular endothelial growth factor-A)-VEGF receptor 2 pathway, synaptogenesis, neurogenesis (hippocampal synaptogenesis and neurogenesis), osteoblast differentiation and related diseases, and ectoderm differentiation ([162]Table S4). Gene ontology (GO) analysis confirmed that biological processes, such as brain development, tissue morphogenesis, cell proliferation, and growth regulation were altered in GDM ([163]Table S4), which putatively suggested GDM effects on fetal growth and development. TSS score signatures identify genes related to preterm births during GDM Using an LMM, we identified 55 TSS regions spanning across all trimesters that exhibited significantly different TSS scores between control and GDM ([164]Figure 3A). These TSS scores, representing alterations in 50 genes to GDM, were defined as TSS score signatures. To investigate the biological implications of these TSS score signatures, we categorized the 50 genes into four temporal categories using a Euclidean distance approach ([165]Figure 3B). The first category comprised two genes with decreasing TSS scores throughout pregnancy trimesters, exhibiting consistently higher TSS scores in patients with GDM than in controls. The second category also had higher TSS scores in GDM, but scores tended to increase as gestation proceeded. In the third category, most TSS scores decreased as pregnancy progressed and were lower in women with GDM. The fourth category had lower TSS scores in women with GDM and showed an inverted-U shape trend against gestation. To understand the temporal patterns of these signature genes across trimesters, we applied an unsupervised clustering approach, segmenting the study population into two clusters in the 1^st trimester, four clusters in the 2^nd trimester, and three clusters in the 3^rd trimester. This analysis revealed a distinct pattern in population flow, with the majority transitioning from T1-2 to T2-3 and subsequently to T3-3 ([166]Figure 3C). By comparing morbidity differences in each cluster, we observed an uneven GDM distribution complicated with preterm births, especially the T2-2 cluster showing no preterm births ([167]Figure 3C). We then analyzed correlation networks using GO pathways in TSS score signatures in the 50 genes and found the associations of PIK3R1, previously linked to diabetes,[168]^35 with multiple pathways ([169]Figure 3D and [170]Table S4). We then focused on associations between the TSS score signatures of the 50 genes and preterm birth. Six genes were involved in growth and development processes ([171]Tables S3 and [172]S4), in which CNTN4 and RNF213 had TSS scores associated with preterm birth. Notably, CNTN4 is a member of the immunoglobulin contactin family, with previous research reporting that contactin-2 may exhibit changes before a preterm delivery diagnosis.[173]^36 Figure 3. [174]Figure 3 [175]Open in a new tab CfDNA signatures between GDM and controls (A) Upset diagram shows selected TSS score signatures. (B) Heatmap shows TSS score signatures. Differential TSS scores separate GDM and control samples despite dynamic changes in trimesters. Four clusters show different temporal patterns. (C) Sankey diagram shows temporal cluster changes. Rectangle width corresponds to sample numbers in each trimester, and connections between rectangles represent subject flow between trimesters (T referred to trimester). The three bar charts below show preterm birth percentages in each trimester. (D) A correlation network of 50 TSS scores. Line color represents pathways and line size represents log(weight) values. (E) The odds ratios of six growth and development-related TSS scores. Data are presented as odds ratios with 95% confidence intervals. TSS score signatures identified genes associated with altered lipid metabolism in GDM As higher TG and LDLC levels ([176]Table 1) and enriched lipid-related pathways were observed in women with GDM in our study ([177]Figure 2E), we examined associations between lipid metabolism and TSS score signatures from 50 genes. TSS scores of multiple genes were correlated with lipid-related phenotypes from clinical records, including CHO, TG, and LDLC levels ([178]Figure 4A). Notably, a correlation was identified between PSD3 and plasma lipid profiles, consistent with previous findings showing that down-regulated PSD3 (via short interfering RNA) reduced intracellular lipid content in primary human hepatocytes.[179]^37 Furthermore, we found that the TSS score of SMARCD1, a molecular linker of the switch/sucrose non-fermentable (SWI/SNF) chromatin remodeling complexes and hepatic lipid metabolism,[180]^38 was correlated with CHO and LDLC levels ([181]Figure 4A). These results reveal the possibility of using TSS scores as the genetic markers of lipid homeostasis in GDM. To further investigate the link between TSS score signatures and lipid changes in GDM, we extracted plasma lipids from the study population and performed untargeted lipidomic profiling. When compared with controls, GDM had significantly elevated fatty acid (FA) 24:6, sphingomyelin (SM) d36:0, and phosphatidylcholine (PC) 16:0/20:4 levels and decreased PC 44:7e and PC 44:2e levels ([182]Figures 4B and [183]S4A). Figure 4. [184]Figure 4 [185]Open in a new tab Lipid profile associations (A) TSS score associations with different laboratory measurements. The upper section of the plot shows Pearson correlation analysis on multiple TSS scores. Line segment thickness corresponds to the magnitude of Pearson correlation coefficients, indicating relationship strength. Line color represents p values, which show statistical significance in observed correlations (Age, age of mother pregnancy; BMI_pre, BMI of the mother before pregnancy; CHO, plasma total cholesterol; TG, plasma triglycerides; LDLC, low-density lipoprotein cholesterol; ALT, alanine aminotransferase; AST, aspartate aminotransferase; birthBMI, BMI of the child at birth). (B) Lipid differences between women with GDM and controls. |log2(FC)| < 0.4 and −log10(p value) > 2 threshold values are applied. Blue and red dots represent significantly down-regulated and up-regulated lipids in GDM, respectively. Wilcoxon rank-sum tests were used to calculate p values. (C) Up: the fragment end counts and PRSS1’s TSS downstream 500 bp. Middle: a schematic showing the PRSS1 gene structure. Down: the coverage of the corresponding region. The gray-purple area represents introns 1–4 in ENST00000311737. Gray dotted lines indicate the positions of two end peaks situated in the intronic region (case1, case2, and case3 represent 1^st, 2^nd, and 3^rd trimester of GDM, respectively. ctr1, ctr2, and ctr3 represent 1^st, 2^nd, and 3^rd trimester of control, respectively). (D) Heatmap shows PRSS1 and other acinar-specific gene expression levels in different women from single-cell data from Camunas-Soler et al.[186]^39 We then analyzed the correlation between TSS scores of 4 lipid metabolism-related genes and altered lipids in GDM ([187]Figures S4B and S4C). We identified significant positive correlations between PRSS1 and up-regulated FA 24:6, SM d36:0, and PC 16:0/20:4 levels in women with GDM in the 2^nd trimester (Spearman ρ = 0.34, 0.33, and 0.39, respectively). Additionally, we consistently observed a positive correlation between PRSS1 and down-regulated PC 44:2e levels in controls ([188]Figure S4B), suggesting more comprehensive lipid profiles associated with PRSS1 expression. PRSS1 is a pancreatic acinar cell marker[189]^40 and encodes an enzyme secreted from the pancreas. An animal study confirmed that PRSS1 transgenic mice exhibited disordered lipid metabolism.[190]^41 We further investigated the coverage characteristics of cfDNA fragments near PRSS1 and found the predominant distribution of reads starting at 260 bp downstream of TSS (U-end) and ending at 427 bp (D-end) near the TSS ([191]Figure 4C). These accumulated fragments precisely overlapped with the PRSS1 intronic region ([192]Figure 4C), exhibiting a highly consistent nuclease cleavage pattern (chromatosome: nucleosome + linker histone; ∼167 bp[193]^25^,[194]^42) across samples. It has been shown that the maternal-origin cfDNA has longer fragment sizes (∼167 bp) than fetal-origin cfDNA (∼144 bp) due to different enzyme cutting positions. Hence, the phenomenon of 167 bp cfDNA fragments concentrated in the PRSS1 intron may reflect its maternal origin.[195]^43 Notably, the cfDNA coverage of PRSS1 differed between GDM and control throughout pregnancy, particularly the 1^st trimester ([196]Figure 4C). The most probable explanation for this disparity is that the transcriptional activity of PRSS1 was altered in GDM. To validate our findings, we examined single-cell transcriptomic data from diabetic human islets.[197]^39 We found that type 2 diabetes and prediabetic patients showed not only higher PRSS1 expression but also higher expression of other pancreatic acinar marker genes ([198]Figure 4D). Our analysis demonstrated an increased expression of PRSS1 and other pancreatic acinar marker expression in both type 2 diabetes and prediabetic patients and agreed with recent genetic studies that suggested genetic background differences in the exocrine pancreas of individuals with diabetes.[199]^44^,[200]^45 Integrative specific neural network (SNN) models powerfully predict GDM and offspring BMI Given the robust associations between cfDNA features and phenotypic and metabolic alterations in GDM, we exploited these cfDNA features to predict early GDM. Leveraging diverse clinical data (including age, pre-pregnancy BMI, drinking habits, and smoking status) and cfDNA features (such as motifs, MDS, MA, fetal fraction, and TSS score signatures) from the 1^st trimester, we constructed a neural network model aimed at predicting the likelihood of GDM developing in the 3^rd trimester ([201]Figure 5A). Utilizing solely clinical information for prediction yielded an area under the curve (AUC) of 0.697, with a 95% confidence interval (CI) ranging from 0.534 to 0.860 ([202]Figure 5C). Furthermore, incorporating various cfDNA features into the model enhanced its predictive performance. Notably, a substantial improvement in AUC was observed when TSS score signatures were incorporated with other features, achieving an AUC of 0.877 with a 95% CI spanning from 0.794 to 0.960 ([203]Figure 5C). We then validated our prediction model in validation dataset1, which showed an AUC = 0.829 (95% CI: 0.746–0.912) using clinical information and all cfDNA features ([204]Figure 5D). Interestingly, clinical information and growth-related cfDNA features could also predict the offspring BMI using an integrated neural network model, showing the correlation to the actual value of R^2 = 0.56 (95% CI: 0.44–0.68) and a mean absolute error = 1.75 (95% CI: 1.19–2.31) in birth BMI ([205]Figure S6A) and a correlation of R^2 = 0.62 (95% CI: 0.40–0.84) and a mean absolute error = 1.53 (95% CI: 1.34–1.72) in 2-year BMI ([206]Figure S6B). Next, we assessed the significance of cfDNA features in our model and identified the top 35 features crucial for predicting GDM and BMI ([207]Figures 5B and [208]S6C). For GDM prediction, we found fetal fraction and BMI to be the most important factors in our SNN model, indicating their higher contributions to the final loss value metric within the SNN model. Additionally, we employed the random forest Gini factor to validate feature importance, revealing that these values are model specific ([209]Figure 5B). The key gene LILRB1 was ranked 21^st for early GDM predictions and 1^st for birth weight predictions. Another crucial gene, PIK3R1, was positioned 8^th in early GDM and 2-year BMI predictions. The key gene PRSS1 ranked 23^rd and 4^th in early GDM and 2-year BMI predictions. Therefore, these top features were crucial in the prediction model, and their relative importance offered a reference for the specific trained model. Among the 35 crucial features, 9 genes with known roles in growth and development were further selected to assess their association with fetal development. We used a derived formula, C[BMI-a], from the TSS component of the predictive birth BMI model ([210]STAR Methods) to calculate a growth and development score. As expected, we observed a significant distinction between normal and large for gestational age children (p < 0.01, Wilcoxon rank-sum test. [211]Figure S6G). Figure 5. [212]Figure 5 [213]Open in a new tab Neural network model (A) SNN structure (created with [214]BioRender.com). (B) Feature importance plot showing different machine learning algorithms. The dark blue feature represents more importance when compared with the light blue, indicating boost in AUC when adding the feature. Right color bar signifies the feature class. (C and D) Classifier performance as quantified by a receiver operator characteristic curve to predict GDM using early gestation samples from TJBC (C) and validation dataset1 (D). The numerical values in parentheses denote the mean AUC value accompanied by its 95% confidence interval. (E and F) Performance of the refined model predicting GDM in TJBC, validation dataset1 (E), and validation dataset2 (F). Given the large number of features used in our GDM prediction model, we performed a further feature selection to develop a refined model with fewer features, including SeqFF (fetal fraction), BMI, CDH23, CNTN4, RNF213, SMARCD1, TWSG1, PSD3, and PIK3R1. Each feature was implicated in some aspect of lipid metabolism and GDM pathology. The refined model predicted GDM with an AUC = 0.849 (95% CI: 0.798–0.900) in the TJBC and 0.734 (95% CI: 0.713–0.755) in the validation dataset1 ([215]Figure 5E). Using the same features, the refined model was further tailored to adapt the ultra-lower sequencing depth in validation dataset2 (1% of TJBC) ([216]STAR Methods), which gave rise to GDM prediction performance of an AUC = 0.758 (95% CI: 0.724–0.792) ([217]Figure 5F). Lastly, DeLong’s tests were performed to confirm the enhancement to the GDM prediction performance by adding TSS features to clinical phenotypes in the prediction models ([218]Figures S6D and S6E). Discussion CfDNA in maternal plasma mainly originates from maternal and placental apoptotic cells,[219]^46 and its heterogeneous origins are ideal for studying complex metabolic disorders during pregnancy.[220]^47 In this study, we demonstrated the associations of cfDNA physical properties and fragment features with GDM by showing the direct involvement of cfDNA fragment features with typical GDM-related pathways, offspring BMI, preterm birth, and lipid metabolite traits. We identified PRSS1 as a biologically significant and valuable GDM marker. Furthermore, by externally validating both medium-depth sequencing data and low-depth NIPT datasets, we demonstrated that dynamic cfDNA changes may represent an important strategy for predicting GDM in early pregnancy. Previous studies reported dynamic changes in cfDNA physical properties during pregnancy and gestational disease.[221]^21 For instance, the fetal fraction was shown to increase in line with gestational weeks and served as an early screening marker of complications associated with placental dysfunction in pregnancy.[222]^48 In our study, the fetal fraction in women with GDM was significantly decreased when compared to control, particularly at early gestation stages, consistent with a previous study.[223]^20 Motif differences have also been implicated in gestational diseases,[224]^49 and we observed distinctive cfDNA fragment motif patterns and reduced cfDNA MDS in GDM, which suggested lower cfDNA fragment complexity in GDM. Previous studies also indicated nuclease involvement in cfDNA turnover and cfDNA end formation.[225]^50 Thus, nuclease activity may be changed in GDM, leading to altered cfDNA physical properties. The biological implications behind these cfDNA motif changes and diversity in women with GDM require further investigation. TSS scores were previously suggested to inversely correlate with expression levels at promoter nucleosome-depleted regions.[226]^51^,[227]^52 A major discovery in our study was that by comparing TSS scores between GDM and controls, a set of genes with functions related to diabetes, organ development, and differentiation pathways were identified. For instance, PI3K and MAPK pathways were concurrently altered in GDM. Specifically, the TSS score for PIK3R1, a member of PI3K/AKT pathway with well-established roles in regulating insulin and metabolic diseases,[228]^35^,[229]^53^,[230]^54 was significantly different between GDM and controls. Additionally, PIK3R1 was enriched in several growth-related pathways. Diabetic pregnancies are associated with distinct growth hormone profiles, which contribute to varied fetal growth patterns.[231]^55 Increased fetal fat deposition commonly observed in diabetic pregnancies may be attributed to modified cellular differentiation and underlying mechanisms that regulate body composition.[232]^56 The enriched pathways identified in our study also had important functions in cell growth differentiation. Thus, our findings endorse the rationale of using candidate cfDNA biomarkers to detect and monitor GDM. We also examined significantly altered TSS score signatures between GDM and controls to identify 50 genes as the candidate cfDNA biomarkers for GDM. Consistent with aforementioned results, several genes were directly related to diabetes and growth and development, such as LILRB1, PIK3R1, and CBS. Moreover, some candidate biomarker genes were also involved in preterm birth and lipid metabolism biological processes, including CNTN4 and RNF213, which were significantly associated with preterm births. Thus, longitudinal TSS score-signature patterns in candidate biomarker genes may represent altered fetal growth and development in women with GDM, leading to increased preterm birth risks. Lipid metabolism is a well-established process in diabetes.[233]^57 Previously, Mak et al.[234]^58 reported a correlation between circulating lipids and cfDNA genetic aberrations. In our study, we identified lipidomic profiling shifts with correlations with TSS scores in the islet acinar marker PRSS1. We further used single-cell data to validate exocrine pancreatic alterations in type 2 diabetes or prediabetic status. Exocrine pancreatic disorder in diabetes is a clinically relevant but poorly understood condition.[235]^59 Thus, our findings contribute to evidence linking exocrine pancreas deficiencies to diabetes in vivo.[236]^44^,[237]^45 We also showed that cfDNA physical properties and clinical data could be integrated into an SNN model to predict early GDM. This prediction performance was further confirmed and validated in two geographically distant and independent cohorts, thus highlighting the potential effectiveness of cfDNA in early GDM diagnoses. Furthermore, the addition of TSS score signatures from 50 candidate genes to the prediction model greatly improved GDM prediction performances in the 1^st trimester and supported these genes as potential GDM candidate biomarkers. As a complex disease, GDM induces sophisticated physiological and metabolic changes that have a multifaceted impact on offspring growth, including macrosomia, hypoglycemia, respiratory distress syndrome, etc.[238]^60 Our study provides a platform for exploiting the dynamic features of lipid metabolism, preterm birth, insulin regulation, and growth and development genes to predict GDM. Interestingly, we also predicted offspring BMI using the cfDNA features of 50 candidate biomarker genes, possibly because offspring’s BMI showed a strong correlation with maternal onset in GDM. Plasma cfDNA has been widely used to non-invasively test for fetal trisomy and copy-number variants in clinical practice.[239]^13 Critically, our work shows the potential for integrating GDM predictions into NIPT in early pregnancy in the future. In conclusion, we conducted an integrative cfDNA analysis that reveals a temporal relationship between cfDNA and GDM, providing comprehensive insights into the genetic alterations associated with various biological processes in GDM. This non-invasive approach holds promise for clinical applications in early prediction of GDM. Limitations of the study Despite our study having one of the largest sample sizes in high-depth sequencing cfDNA GDM cohorts, it is relatively smaller in comparison to other prospective NIPT studies.[240]^61^,[241]^62 Moreover, we utilized a nested case-control design where samples were carefully matched for key variables such as maternal gestational age and sampling weeks, thereby enhancing comparability between groups and improving statistical efficiency. However, this approach may limit extrapolation to broader populations. Furthermore, despite conducting external validation using two separate cohorts with substantial numbers of participants, determining the optimal sequencing depth for clinical translation into NIPT remains an open question. Despite these efforts, additional replication studies involving diverse and larger populations are imperative. STAR★Methods Key resources table REAGENT or RESOURCE SOURCE IDENTIFIER Deposited data __________________________________________________________________ TJBC cfDNA BED files This paper NGDC: OMIX004723; [242]https://ngdc.cncb.ac.cn/omix/preview/z7lvCjn0 Single-cell RNA sequencing data related to islet dysfunction in diabetes Camunas-Soler, J. et al.[243]^39 [244]https://github.com/jcamunas/patchseq. __________________________________________________________________ Software and algorithms __________________________________________________________________ Python 3.8.13 N/A [245]https://www.python.org Transcription Start Site (TSS) NCBI [246]https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_00000 1405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz R 4.2.2 The R Foundation [247]https://www.r-project.org/ Fastp 0.23.2 Chen, S et al.[248]^63 [249]https://github.com/OpenGene/fastp Minmap2 Li, H[250]^64 [251]https://github.com/lh3/minimap2 biobambam Tischler, G[252]^65 [253]https://github.com/gt1/biobambam Samtools 1.6 Li, H et al.[254]^66 [255]https://github.com/samtools/samtools Bamtools 2.5.2 Barnett, D. W.[256]^67 [257]https://github.com/pezmaster31/bamtools ggcor 0.9.8.1 Huang, H et al.[258]^68 [259]https://github.com/mj163163/ggcor-1 igraph 1.4.1 Csardi G et al.[260]^69 [261]https://github.com/igraph/rigraph ggraph 2.1.0 Pedersen, T et al.[262]^70 [263]https://github.com/thomasp85/ggraph lme4 1.1–31 Bates D et al.[264]^71 [265]https://github.com/lme4/lme4 lmerTest 3.1–3 Kuznetsova A et al.[266]^72 [267]https://github.com/runehaubo/lmerTestR nlme 3.1–160 Pinheiro J et al.[268]^73 [269]https://github.com/cran/nlme clusterProfiler 4.6.0 Yu, G et al.[270]^74 [271]https://github.com/YuLab-SMU/clusterProfiler metascape Zhou, Y et al.[272]^75 [273]https://metascape.org/gp/index.html#/main/step1 Sklearn 1.1.2 Pedregosa, F et al.[274]^76 [275]https://scikit-learn.org/ Tensorflow 2.9.1 Martin, A[276]^77 [277]https://www.tensorflow.org/ cfDNA features analysis This paper [278]https://github.com/zqr2008/cfdna/tree/main/feature_analysis Modelcode This paper [279]https://github.com/zqr2008/cfdna/tree/main/modelcode Figure reproduce This paper [280]https://github.com/zqr2008/cfdna/tree/main/figure_reproduce [281]Open in a new tab Resource availability Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Dr. Qiangrong Zhai (zhaiqiangrong@genomics.cn). Materials availability This study did not generate new unique reagents. Data and code availability * • The data supporting our findings have been deposited into the CNGB Sequence Archive (CNSA)[282]^78 of China National GeneBank DataBase (CNGBdb)[283]^79 CNGB: CNP0004352 ([284]https://db.cngb.org/search/?q=CNP0004352) and China National center for Bioinformation NGDC: OMIX004723 ([285]https://ngdc.cncb.ac.cn/omix/preview/z7lvCjn0). These files are accessible in compliance with Chinese legal regulations (2023BAT1256), enabling future referencing and potential validation analyses using our data. * • Single-cell RNA sequencing data related to islet dysfunction in diabetes are available at [286]https://github.com/jcamunas/patchseq. The data used in the current article was preprocessed datasets provided[287]^39. * • The code used in the article is saved at [288]https://github.com/zqr2008/cfdna. * • Any additional information required to reanalyze the data reported in this work paper is available from the [289]lead contact upon request. Experimental model and subject details This was a nested case-control study and women were selected from the TJBC,[290]^22 which is an ongoing prospective cohort established in 2017 in Tianjin, China. The TJBC recruited pregnant women of ≤14 + 6 gestational weeks (1^st trimester) who were followed-up at 15–27 weeks (2^nd trimester) and ≥28 weeks (3^rd trimester). The study was approved by the Ethics Committee of Tianjin Women and Children’s Health Center (No. 201706012-1), the Institutional Review Board of BGI (BGI-IRB 17116), and was conducted in compliance with the ethical standards outlined in the Declaration of Helsinki and its subsequent revisions or similar ethical standards. Before enrollment, women provided written informed consent. A case-to-control ratio of 1:1 was selected to increase statistical power. Women underwent standard GDM screening; an overnight fast was followed by a 75 g OGTT or an FPG test at 24–28 weeks of gestation.[291]^80 The study consisted of 299 GDM cases and 299 healthy controls. Inclusion criteria for GDM cases were as follows: 1) singleton pregnancy, 2) blood samples donated at each visit during 1^st, 2^nd, and 3^rd trimesters, and 3) questionnaires completed at each visit during pregnancy. Controls were selected using the same inclusion criteria. Both cases and controls were individually matched based on age (±2 years) and gestational week (±4 weeks) at the time of sample collection ([292]Figure 1). No women with GDM or controls had preexisting type 1 or type 2 diabetes. A GDM diagnosis was assigned between 24 and 28 weeks of gestation following Guidelines for GDM Diagnosis and Treatment (2014) (Department of Obstetrics and Gynecology, Chinese Medical Association).[293]^80 GDM screening and diagnosis tests were performed in central-regional-local healthcare networks in Tianjin using aforementioned tests. For FPG screening, women with an FPG value ≥5.1 mmol/L were diagnosed with GDM, and no additional diagnostic tests were required. For women with FPG values between 4.4 mmol/L and 5.1 mmol/L, a 75 g OGTT was recommended to confirm or rule out GDM. Women with no GDM had FPG values <4.4 mmol/L. For OGTT, women were deemed to have GDM if any of the following conditions were met: 1) fasting plasma glucose levels ≥5.1 mmol/L, 2) plasma glucose levels at 1 h ≥ 10.0 mmol/L, or 3) plasma glucose levels at 2 h ≥ 8.5 mmol/L. Of the 598 study women, 524 were diagnosed using the OGTT method and 74 using the FPG method. Obesity[294]^81 was defined as a pre-pregnancy body mass index (BMI) ≥ 28 kg/m^2. Gestational age at birth was defined as completed gestation weeks based on the estimated delivery date in women’s clinical records. Preterm birth was defined as a live birth before 37 completed gestation weeks. Large for gestational age (LGA) was defined as > the 90^th birth weight percentile for gestational age by infant gender. Birth weight, weight at 2 years old, and length/height data were also collected from clinical records. Low birth weight was defined at < 2500 g in live births and macrosomia was defined as ≥ 4000 g. To validate cfDNA dynamic signature changes throughout the three trimesters with respect to GDM, we selected cfDNA sequencing data (∼30×, paired-end sequencing) from an independent study of 202 samples (8 women with GDM and 106 healthy controls, each pregnant woman is sampled more than once) based on gestational age at plasma collection and subsequent follow-up as a Validation dataset1 ([295]Figure 1). These 114 individuals from the Shenzhen locality were recruited. Written informed consent was obtained from all individuals (No. BGI-IRB 20012) ([296]Figure S5B). To confirm trait generalizability across datasets, characterized by lower sequencing depths and larger sample sizes, we used cfDNA sequencing data from Zhu et al.[297]^23 (∼0.6× and single-end sequencing) as a Validation dataset2. Using the same methodology, we computed TSS scores and fetal fraction for this dataset. Initially, we excluded samples with missing BMI values and fetal fraction <0.01. Subsequently, using the refined model, we validated 6104 samples (comprising 1045 women with GDM and 5059 healthy controls) with no more than three missing feature values ([298]Figure S5C). Method details Library construction and sequencing data At all visits, 5 mL of peripheral blood was drawn into ethylene-diamine-tetra-acetic-acid (EDTA) blood tubes (ComWin, Beijing, China). Plasma was generated using a two-step centrifugation protocol[299]^82 and stored at −80°C. CfDNA was isolated from 200 μL plasma using the MagPure Circulating DNA Mini KF kit (Magen, Guangzhou, China) according to manufacturer’s protocols. Extracted cfDNA was then used to prepare cfDNA libraries for sequencing using the MGIEasy free DNA library preparation reagent set (MGI, Shenzhen, China) following manufacturer’s protocols. Prepared libraries were circularized to create single-stranded DNA (ssDNA) circles using the MGIEasy Circularization Kit (MGI) following manufacturer’s instructions. A Qubit ssDNA assay kit (Invitrogen, USA) was used to quantify purified ssDNA circles, after which they underwent rolling circle amplification to generate DNA nanoballs. After these steps, the final products were quantified using a Qubit ssDNA Assay Kit (Invitrogen) and loaded onto a DNBSEQ platform (MGI) for multiplex sequencing using a paired-end 100 bp strategy. Lipid extraction and untargeted lipidomic profiling Lipids were extracted from maternal blood as previously described.[300]^83^,[301]^84 Briefly, precooled isopropanol spiked with a lipid internal standard mix (SPLASH LIPIDOMIX Mass Spec Standard, Avanti, USA) was added to plasma. After vortexing and then overnight incubation at −20°C, samples were centrifuged and supernatants analyzed using a liquid chromatography–mass spectrometer (LC-MS). Samples were separated on a CSH C18 column (1.7 μm, 2.1 × 100 mm, Waters, USA) and analyzed using a QExactive MS (Thermo Fisher Scientific, USA). LC gradient and MS conditions were previously reported.[302]^84 Mobile phase A contained acetonitrile/water (60:40) plus 10 mM ammonium formate and 0.1% formic acid, while mobile phase B contained isopropanol/acetonitrile (90:10) plus 10 mM ammonium formate and 0.1% formic acid. The scan range for MS detection was 200–200m/z with a resolution of 70,000, and the automatic gain control (AGC) target for MS acquisition was set to 3e6 with a maximum ion injection time of 100 ms. The top three precursors were selected for subsequent tandem mass spectrometry fragmentation with a resolution of 17,500 and a maximum ion injection time of 50 ms. The AGC was 1e5, and the stepped normalized collision energy was set to 15, 30, and 45 eV. Lipid identification and quantitation were performed using LipidSearch 4.1 SP2 software (Thermo Fisher, USA), and data scaling and normalization were processed in metaX.[303]^85 Sequence alignments and filtering Raw fastq data were subjected to quality filtering using Fastp 0.23.2[304]^63 software based on the following criteria: (1) removal of adapter sequences, (2) elimination of sequences containing >10% unknown bases, and (3) removal of low-quality sequences. Filtered reads were then aligned to the GRCh38.p13 reference genome using MiniMap2[305]^64 comparison software. Resultant comparison outputs were saved as bam files, which were sorted using Biobambam.[306]^65 Additionally, duplicate reads generated during amplification steps were marked in sorted BAM files and filtered. Only paired reads with proper mapping orientation and insert size (i.e., ≤600 bp) were retained for downstream analyses ([307]Table S2). Fetal fraction The fetal fraction in plasma samples was estimated using SeqFF,[308]^86 a method that capitalizes on dissimilar chromatin structures between the mother and fetus, leading to irregular cfDNA dispersion across the genome. To determine the fetal fraction, several regression models (Enet and WRSC) were developed using read counts from multiple cfDNA regions in maternal plasma. We used mean results from both models as the fetal fraction value ([309]Table S3). Motifs and MDS End motif analysis was performed on single cfDNA fragments.[310]^17 We used the first four nucleotides at 5′ ends to calculate the frequencies of 256 ( [MATH: 4×4×4×4 :MATH] ) possible motifs (4 bp sequences and 4-mer motifs), and motifs were normalized according to the total number of ends. MDS in samples were calculated using [311]Equation 1. [MATH: MDS=i =1256Pi×log2(Pi)/log2(256) :MATH] (Equation 1) where P[i] is the frequency of a particular motif. A higher MDS indicated a higher diversity (i.e., a higher degree of randomness). The theoretical scale ranged from 0 to 1. A previous study demonstrated that maternal and fetal cfDNA fragments exhibited distinct length patterns, with maternal cfDNA generally longer than fetal fragments.[312]^87 In our analysis, we computed motif frequencies and MDS for all samples. We also divided fragments into three subsets: short, peak, and long. The short subset consisted of ≤150 bp fragments, the peak subset comprised fragments in the 160–170 bp range, and the long subset encompassed >250 bp fragments. Subsets were analyzed separately to investigate their respective characteristics ([313]Table S3). MA Methylated cytosine-phosphate-guanines (CpGs) have a higher likelihood of cleavage at cytosine when compared to unmethylated CpGs, while having a reduced likelihood of cleavage at the base preceding the CpG.[314]^15 Such differential cleavage patterns can cause increased CGN motifs but decreased NCG motifs. By analyzing cfDNA cleavage patterns and resulting CGN/NCG motifs, cfDNA methylation status was inferred across different regions.[315]^15 We used CGN/NCG motif ratios ([316]Equation 2) to assess methylation status in samples and named it as methylation-associated (MA) value ([317]Table S3). [MATH: MA=No.< mspace width="0.25em">of5CGNendmotifsNo.of5NCGendmotifs :MATH] (Equation 2) 5′CGN end motifs (i.e., 5′- CGA, CGT, CGG, and CGC). 5′ NCG end motifs (i.e., 5′- ACG, TCG, GCG, and CCG). TSS scores Transcriptional activity is correlated with chromatin status around the TSS.[318]^88^,[319]^89 To quantify gene expression, we analyzed TSS region coverage. Specifically, we calculated upstream and downstream 500 bp coverage of the TSS from aligned BAM files using the SAMtools 1.6 depth function.[320]^66 To account for sequencing depth and bias, we normalized TSS region coverage by dividing the 1 kb region into three parts. Average side bin depth was used to normalize the depth of the middle 500 bp (TSS -250 to TSS +250) and the normalization rate of the middle bin was defined as the TSS score ([321]Equation 3). [MATH: TSSscore=dep< mi>th(mid dlebin) depth< mspace width="0.25em">(sid ebin) :MATH] (Equation 3) where depth (side bin) is the average depth of TSS -500 to TSS -250 and TSS +250 to TSS +500 regions. Depth (middle bin) refers to the average depth of the middle 500 bp (TSS -250 to TSS +250). High TSS scores indicated high coverage in the TSS region, indicating that cfDNA was highly protected and not easily bound to transcription-related factors, thereby eliciting low gene expression. In contrast, lower TSS scores were associated with higher gene expression.[322]^19^,[323]^51 In our study, TSS scores were used as gene expression measures. TSS score-signature identification Selected TSS score-signatures were based on statistical tests and log2-transformed fold-change (log2(FC)) values. We used a linear mixed-effects model (LMM) approach to screen TSS scores, using a two-sided false-discovery rate threshold of <0.1 for selection. The LMM model included trimesters, groups, and interactions between trimesters and groups as fixed effects while incorporating a subject-specific random effect. We used Least Squares Means estimates to test for differences between groups. To determine the threshold for TSS score-signatures, we performed pairwise comparisons using TSS scores to calculate log2(FC) values to represent differences between GDM and control samples for each trimester, and collectively for all trimesters. A log2(FC) > 0.05 value was the threshold. Simultaneously, we conducted univariate statistical tests for these four comparisons using Wilcoxon rank sum tests (p < 0.1 threshold). A TSS score meeting all aforementioned criteria was selected as a signature. If a gene corresponded to multiple TSS scores, the final TSS score was primarily selected using the maximum absolute log2(FC) value. If both positive and negative log2(FC) values occurred, the direction was determined based on the location of the majority of TSS scores, and the TSS score with the maximum absolute value in the determined direction of log2(FC) values was chosen. Gene set enrichment analysis and gene ontology analysis We performed GSEA[324]^90 using the R package clusterProfiler[325]^74 and biological pathways from Human Wikipathways[326]^91 were used as primary sources for analyses. The Benjamini–Hochberg (BH) method was used for multiple testing corrections. A gene list was generated based on TSS scores, after which genes were ranked using pairwise log2(FC) values between GDM and control samples. We also used Metascape[327]^75 for gene ontology analysis using the following thresholds: minimum overlap = 3, P-value cutoff = 0.01, and minimum enrichment = 1.5. Gene Ontology (GO) biological process pathways were selected in analyses. C[BMI-a] definition To gain deeper insights into the relationships between high-importance TSS score-signatures among the top 20 predictive features for birth BMI and growth development, we used a derived formula from the TSS component of the predictive birth BMI model, C[BMI-a] ([328]Equation 4): [MATH: CBMIa=13.4589+(0.0809)×LILRB1+(0.1696< /mn>)×MIR8071 1+THAP12+(0.0106< /mn>)×RNF213< mo linebreak="goodbreak">+(0.2094< /mn>)×SPATS2L+(1.4290< /mn>)×BRPF1+(0.0686< /mn>)×MIR3689 C+(0.0530< /mn>)×PCMTD2+(0.0088< /mn>)×LPGAT1 :MATH] (Equation 4) Where LILRB1, MIR8071-1, THAP12, RNF213, SPATS2L, BRPF1, MIR3689C, PCMTD2, and LPGAT1 are genes in the C[BMI-a]. In this equation, each gene was represented by its TSS gene score. External dataset validation To rigorously authenticate our findings, we retrieved cfDNA sequencing data from women with GDM, as published by Guo et al..[329]^19 Accounting for sequencing depth and reference genome distinctions, we meticulously determined TSS scores that displayed consistent trends across healthy and GDM cohorts. Using a significance FDR threshold = 0.05, selected TSS scores were subjected to further correlation analysis. The neural network models Neural network input data We implemented a parallel-connected Neural Network Model. CfDNA data were extracted as motifs, MDS, MA, and TSS scores. In input data, we had 256 4-mer motif features, four MDS features, three MA features, and 50 TSS score features from previous steps. The sampling process for this network model involved detailed extraction and feature selection from cfDNA data. We used 256 4-mer motif features, four MDS features, three MA features, and 50 TSS score features. The dataset was meticulously partitioned into training, validation, and testing sets, with a typical split of 80% for training and 20% for testing. Stratified random sampling procedures ensured the representation of key variables such as age and disease status across subsets. To address data imbalance, techniques such as oversampling the minority class or undersampling the majority class were used. For internal validation, we used 10-fold cross-validation in the training set, which allowed for a comprehensive evaluation of model performance. This method involved training the model on nine subsets and validating it on the remaining subset, in an iterative process. External validation was achieved using two external datasets to validate model applicability and robustness in different settings or populations. Feature selection From preliminary studies, the majority of features from TSS genes and motifs included too much noise which contribute little to our results. Thus, we implemented forward feature selection as a first step to remedy this. This step started from an empty logistic model. We then added features one by one to determine the best performance feature for each step. The model was adapted from a logistic model used by Hu et al.[330]^92 ([331]Equation 5). [MATH: ey1+ey=[β1β2βn]×[f1< /mtr>f2< mo>…fn< /mtr>]+[b] :MATH] (Equation 5) At first, there were no components in β and f vectors. Starting from an empty model, we first added one feature to the initial model and compared performance improvements using likelihood ratio Chi-square tests with the previous model. The feature with the smallest Chi-square test p value (p < 0.05) was selected for the model. Then, in the next step, we selected the best features from the remaining features. We continued this step-forward selection process until no qualified features were available to improve the model performance using likelihood ratio Chi-square test p < 0.05 values. These features were then used to generate an SNN model. SNN sub-networks In our model, we generated the final input as a 93-features vector that included 31 motifs, four MDS, three MA, and 55 TSS scores. Given the distinct patterns in each feature, we developed different neural network branches in the SNN to process data prior to the final categorization. Here, we used a dense layer from Keras to build a fully connected neural network to process the 31 motifs ([332]Equation 6). We used Wi, Wj, …Wm to represent the weight matrix of layers i, j, …until m. Each layer processed the previous layers’ output and used the Relu function to generate an output for the next layer. [MATH: L=[Relu(Wm[[Relu(Wj[Relu(Wi· xi+bi)]+bk)]]+bn)] :MATH] (Equation 6) Our Motif features exhibits consistent values with a fixed range. From this uniformity, a fully connected neural network can effectively interpret such patterns and create an optimal classification model. In other words, such a network can yield the best results for classifying or categorizing data based on uniform motif features. For MDS and MA, we processed these with a simple 1-unit neuron ([333]Equation 7), where x was the input matrix of MDS or MA, and W was an n×1 weight matrix. Here, we used three MA features as an example. These features were relatively straightforward and did carry much information. Due to their simplicity, only one neuron was required to transfer processed data to the classification layer. This limited information did not necessitate a larger, more complex network structure; hence a single output neuron sufficed. [MATH: M=Relu([x1x2x3]×[w1< /mtr>w2< msub>w3]+[b]< /mo>) :MATH] (Equation 7) Next, we used convolutional layers to process TSS scores ([334]Equation 8), which we based on the observation that the TSS score matrix had many internal relationships between different genes; a convolution layer learns and emphasizes the relationship between two genes and thus helps with the final classification job. [MATH: R=Maxp ool(j kfilter[j,k]TSS[mj,nk]) :MATH] (Equation 8) j and k represent the coordinators of the convolutional kernel filter. We used a 2 dimension 3 times 3 kernal filter for the TSS matrix ([335]Equation 7), where m and n were TSS matrix coordinates. Thus, as depicted ([336]Equation 7), each value in the TSS matrix used multiplication and summation processes with a convolutional kernel filter using its neighboring values to generate a new feature map value. Classification and regression After SNN processing, all SNN outputs were concatenated as a whole input matrix for classification and regression ([337]Equations 9 and [338]10). [MATH: class=soft max(p)=soft max([LM< mi>R]·Wb+bc) :MATH] (Equation 9) where W[b] is the weight matrix of the final classification layer, and b refers to binary classification. As a binary classification, we used only one classification output p to represent GDM probability. Thus, W is an n × 1 matrix; n is the total output number from output L, M, and R from [339]formula 5, [340]6, and [341]7. Finally, we added a bias to the classification layer - shown as b[c] - where c stands for classification. [MATH: linear=Relu([LM< mi>R]·Wl+bl) :MATH] (Equation 10) we used W[l] to represent the weight matrix of the final output in linear regression analyses. We used b[l] for the bias of the last output layer. We also used a fully connected layer, using Softmax activation and cross-entropy loss functions, for the classification output. The activation function of the regression output was Relu since there was no negative output for BMI and weight. The loss function of the regression output was the mean square error. We chose the Area under the ROC Curve (AUC) as the main SNN evaluation metric. Briefly, we randomly shuffled our dataset and split it (n = 598; GDM n = 299 and controls n = 299) according to 75% training and 25% testing. Test processing included a 10-fold cross-validation. Thus, we trained 10 models for training data and performed 10 test steps for testing data, which yielded 10 AUCs for 95% CI. Feature importance Feature importance contributed to the final model and was evaluated using the following steps: 1) We randomly shuffled the value of each feature for every data record in the test dataset; 2) We loaded the pre-trained model and predicted classification or regression results of the shuffled test dataset; 3) We recorded the loss of prediction results and compared it with the loss of the original non-shuffled test dataset; and 4) We determined differences in loss values as feature importance. Random forest classification model We used a random forest classification model to verify feature importance in SNN models. In this scenario, we used features as classification attributes in the random forest model. This was implemented using sklearn[342]^76 with estimator trees = 100, the Gini factor as a criterion, and no limitation on tree max depth. Forward stepwise feature selection (FSFS) in logistic regression for TSS selection We further refined TSS features using FSFS which is a sequential method that begins with no predictors in the model and adds them one at a time. Each variable is chosen based on its contribution to improving the model’s fit, and evaluated by a statistical criterion (likelihood ratio test). The logistic regression model used in this selection process is described by the following [343]Equation 11: [MATH: ln(p1p< /mi>)=β0+β1x 1+β2x 2++βn xn :MATH] (Equation 11) where p is the probability of a feature occurrence, [MATH: β0 :MATH] is the intercept, and [MATH: β1 :MATH] , [MATH: β2 :MATH] , …, [MATH: βn :MATH] are coefficients of the predictors x[1], x[2], …, x[n]. This iterative process continued until new variable addition did not significantly improve the model, based on a predefined improvement threshold. The final model included a set of features that were highly predictive of GDM. By using FSFS, we built a model that was not only predictive but also interpretable in selecting the most important genes implicated in GDM. Validation of the refined model To rigorously assess validation effectiveness, our refined model underwent training using the TJBC training dataset. This evaluation used a fully connected neural network architecture, incorporating nine distinct features (SeqFF, BMI, CDH23, CNTN4, RNF213, SMARCD1, TWSG1, PSD3, and PIK3R1) specified by the refined model. To evaluate the validation dataset, we saved the best performed model selected from the training TJBC and use the saved refined model to test the validation datasets. The validation process was conducted using two separate datasets: Validation dataset1 (from southern China) and Validation dataset2 (from central China). To validate the datasets thoroughly, we applied a comprehensive testing approach similar to what is known as 10-fold cross-validation. This process involves dividing the dataset into ten equal parts, we used one part for testing for each fold. We repeated this procedure ten times, each time with a different part used for testing, to ensure the robustness of our validation method. This involved randomly shuffling validation datasets and partitioning them into 10 equal subsets, each constituting 10% of the data. Subsets were then independently tested, with the results aggregated to calculate the overall average performance and model CIs. Quantification and statistical analysis Univariate comparisons were conducted using two-sided Wilcox Rank Sum Tests. For repeated collected measurements, our main goal was to compare how certain cfDNA features changed between cases and controls. We used LMM (Methods: TSS score-signature identification) to calculate average cfDNA values, identify differences between groups, and understand temporal trends. We developed two models: the first was unadjusted and the second LMM accounted for covariates. The primary focus of our data analysis centered on comparing cfDNA change patterns between women with GDM and controls. We used linear mixed-effects models to compute cfDNA least squares means with respect to group differences and trend assessments. Our primary outcome was the identification of pairwise differences between cases and controls. We also hypothesized a priori that pregnancy trimester progression was potentially associated with abnormal physiological changes in GDM, such as an increased insulin-resistant state, which may have manifested as altered cfDNA levels correlated to glucose levels. Therefore, our analytical models incorporated trimester progression and associated interactions with disease status. For covariates, based on known risk factors for GDM and cfDNA predictors, we considered pre-pregnancy BMI, maternal age,[344]^93 ethnicity,[345]^94 education,[346]^95 smoking status,[347]^96 and alcohol consumption[348]^97 as potential confounding variables. Some covariates were not included in the final adjusted model for the following reasons: 1) Age: To address maternal age bias, we carefully selected age-matched case-controls (methods Case-control study design). 2) Ethnicity: All participants were of Asian ethnicity. 3) Pre-pregnancy BMI: Although this is a known factor associated with GDM and potentially influences cfDNA features, we did not include it as a covariate in our analysis. This decision was based on previous research indicating that obesity-induced DNA release from adipocytes stimulated insulin resistance.[349]^98 The interplay between obesity, GDM, and cfDNA is complicated, and potentially introduced collider bias into our study. Furthermore, our primary focus was to examine whether GDM status affected cfDNA features, and we were less concerned with how these effects were mediated by obesity. We constructed two models: model 1 which was unadjusted, and model 2 which was a linear mixed-effects model that included adjustments for education, alcohol consumption (yes/no), and also smoking during pregnancy (yes/no). Our results and discussion were based on the outcomes from these models ([350]Table S5). We defined statistical significance as a two-sided p < 0.05 value and corrected non-parametric analyses for multiple testing using the BH method. To perform correlation analyses, we separately calculated Spearman’s correlations for each trimester between GDM and control groups. Acknowledgments