Graphical abstract
graphic file with name fx1.jpg
[85]Open in a new tab
Highlights
* •
Gestational diabetes mellitus (GDM) exhibits distinct cfDNA
physical properties
* •
Dynamic cfDNA variations between GDM and controls
* •
Altered exocrine pancreas identified by islet acinar marker gene
PRSS1
* •
CfDNA links between GDM and altered lipid metabolism
__________________________________________________________________
Tang et al. present a longitudinal integrative analysis of cell-free
DNA, revealing a distinctive signature and dynamic patterns specific to
gestational diabetes mellitus. Lipidomic alterations and changes in
pancreatic exocrine markers are discerned.
Introduction
Globally, gestational diabetes mellitus (GDM) is an increasingly
prevalent disease among pregnant women and affects approximately one in
six pregnancies.[86]^1 GDM is caused by insulin resistance and
pancreatic β-cell dysfunction during pregnancy and is associated with
long-term adverse outcomes such as type 2 diabetes mellitus and
cardiovascular disease.[87]^2^,[88]^3 GDM is also associated with
metabolic imbalance that increases the risk of neonatal complications,
primarily fetal growth deviations[89]^4 and preterm birth.[90]^5
Furthermore, increasing evidence now suggests that GDM has long-term
consequences for children’s development such as cardiovascular diseases
and insulin resistance.[91]^2^,[92]^3 Research also shows that promptly
treating GDM before 20 weeks of gestation slightly reduces a
combination of negative outcomes in newborns.[93]^6 Despite great
advances in the diagnosis and clinical management of GDM, the disease
mechanisms are still unclear and early condition predictions remain
difficult.[94]^7 Since current diagnostic criteria for GDM are mainly
based on glycemic levels, dynamic changes in genetic factors are often
ignored.[95]^8 Thus, understanding temporal genetic markers in response
to a growing fetus during GDM and exploring associated dynamic genetic
changes can provide insights into gene function that influences insulin
actions and regulates metabolism.[96]^9
The association of cell-free DNA (cfDNA) in maternal plasma with
gestational disease by recent studies may provide a new insight into
GDM prediction using cfDNA.[97]^10^,[98]^11 Plasma cfDNA are short DNA
fragments primarily derived from cell apoptosis, with fragments from
different tissue-specific nucleosome arrangements.[99]^12 Initially
used as a non-invasive prenatal testing (NIPT) approach for fetal
chromosome abnormalities,[100]^13^,[101]^14 cfDNA was found to contain
non-random fragments associated with tissue damage, gene expression,
methylation levels, and other factors.[102]^15 Previous studies
reported that cfDNA molecules manifested fragmental
characteristics,[103]^16 nucleosome relationships,[104]^17 and
endpoints that reflected tissue origins and turnover
mechanisms.[105]^18 Guo et al.[106]^19 found that cfDNA coverage
patterns in gene promoter regions reflected gene expression levels and
could be used to predict pregnancy complications. Importantly, in
patients with GDM, the cfDNA fetal fraction is lower when compared with
non-GDM pregnant women.[107]^20
Recently, it was shown that cfDNA methylation and transcriptomic
signatures had the potential to predict adverse pregnancy
outcomes.[108]^21 However, more research is required to characterize
cfDNA feature associations with patients with GDM. In this study, we
compared longitudinal cfDNA profiles between women with GDM and healthy
control to identify distinctive changes in cfDNA features not
previously observed. We then correlated the underlying implications of
these cfDNA features and demonstrated the utility of cfDNA in
unraveling the biological mechanisms underpinning GDM. Furthermore, the
validation in external datasets from NIPT data supported the clinical
relevance of our results, indicating their predictive applicability
regardless of sequencing depth.
Results
Population demographics and pregnancy characteristics
A total of 299 women with GDM and 299 matched healthy non-GDM pregnant
women were selected from the Tianjin Birth Cohort (TJBC)[109]^22 as the
discovery dataset. A small prospective cohort containing 8 women with
GDM and 106 healthy controls was used as a validation cohort
(validation dataset1) ([110]Figure 1A). Additionally, a dataset from
Zhu et al.[111]^23 contained the NIPT data of 21,813 pregnant Chinese
women. Following the inclusion criteria ([112]Figure S5C), we
ultimately arrived at a final set of 6,104 samples (1,045 GDM and 5,059
healthy controls) for validation dataset2. No significant differences
in baseline characteristics were recorded in women who underwent oral
glucose tolerance tests or fasting plasma glucose tests to diagnose GDM
([113]Table S1). Peripheral blood samples from the 1^st, 2^nd, and 3^rd
trimester women underwent high-depth sequencing (∼60×) to analyze cfDNA
features, including fragment size, fetal fraction, motif patterns, and
transcription start site (TSS) scores. We found no significant
differences between different sequencing batches ([114]Figure S1A).
Women’s demographics and pregnancy characteristics are shown in
[115]Tables 1 and [116]S1. The mean ages of control and GDM were
31.20 ± 3.87 and 31.23 ± 3.93 years, respectively. Women with GDM had
lower education levels, higher pre-pregnancy body mass index (BMI), a
family history of diabetes, a lower gestational age for delivery, and a
higher rate of cesarean section compared with control. As
expected,[117]^24 women with GDM had higher total cholesterol (CHO),
triglyceride (TG), and low-density lipoprotein cholesterol (LDLC)
levels, but lower high-density lipoprotein cholesterol levels.
Figure 1.
[118]Figure 1
[119]Open in a new tab
High-depth cfDNA sequencing captures fragmentation changes in GDM
patient plasma vs. control plasma samples
(A) Study design showing how samples were collected and analyzed
(created with [120]BioRender.com).
(B) Size profile distributions varied across different groups; solid
and dashed lines represent nucleosome-derived and mitochondria-derived
cfDNA fragments, respectively.
(C) Longer cfDNA fragment ratios across GDM patient and control plasma
samples. A longer size ratio was defined as the proportion of the
number of reads from 160 to 300 bp (p < 0.001, Wilcoxon rank-sum test).
(D) Longitudinal variations between GDM and controls. The fetal
fraction differed significantly between GDM and controls (p < 0.01,
LMM). The mean is represented by the central point and the error bars
indicate the standard deviation from the mean.
Table 1.
Participants' characteristics in the TJBC
Variable Control, n = 299[121]^a GDM, n = 299[122]^a p value[123]^b
Age, years 31.20 (3.87) 31.23 (3.93) 0.92
Education level 0.01
Below senior high school 36 (12.04%) 59 (19.73%) –
Above college or others 263 (87.96%) 240 (80.27%) –
Pre-pregnancy BMI, kg/m^2 22.88 (3.32) 25.22 (4.49) <0.001
Primipara 0.6
Yes 200 (66.89%) 206 (68.90%) –
No 99 (33.11%) 93 (31.10%) –
Family history of diabetes 0.039
Yes 48 (16.05%) 68/299 (22.74%) –
No 251 (83.95%) 231 (77.26%) –
Current or former smoking 0.055
Yes 72 (24.08%) 93 (31.10%) –
No 227 (75.92%) 206 (68.90%) –
Drinking after pregnancy 0.073
Yes 12 (4.01%) 4 (1.34%) –
No 287 (95.99%) 295 (98.66%) –
Gestational age for delivery, weeks 39.15 (1.27) 38.68 (1.40) <0.001
Delivery mode 0.072
Caesarean section 144 (48.16%) 166 (55.52%) –
Vaginal delivery 155 (51.84%) 133 (44.48%) –
Gender of child 0.51
Female 141 (47.16%) 133 (44.48%) –
Male 158 (52.84%) 166 (55.52%) –
Birth weight of child, g 3,374.03 (436.61) 3,380.50 (466.07) 0.89
Birth length of child, cm 50.08 (1.44) 50.06 (1.53) 0.66
BMI of child at 2 years, kg/m^2 16.02 (1.34) 16.19 (1.78) 0.61
Unknown 183 180 –
CHO, mmol/L 4.78 (0.74) 4.90 (0.84) 0.045
TG, mmol/L 1.45 (0.56) 1.60 (0.63) 0.002
HDLC, mmol/L 1.76 (0.37) 1.71 (0.40) 0.03
LDLC, mmol/L 2.69 (0.63) 2.91 (0.78) 0.001
UA, μmol/L 203.83 (45.45) 221.37 (67.91) <0.001
PRO 0.65
Positive 22 (7.36%) 25 (8.36%) –
Negative 277 (92.64%) 274 (91.64%) –
ALT, U/L 18.75 (12.38) 23.14 (18.45) <0.001
AST, U/L 17.49 (7.05) 18.64 (10.07) 0.18
BUN, mmol/L 3.08 (0.80) 3.09 (0.89) 0.92
Cr, μmol/L 55.24 (14.09) 54.10 (17.81) 0.093
Systolic blood pressure, mm Hg 108.44 (10.62) 113.27 (12.10) <0.001
Diastolic blood pressure, mm Hg 70.34 (7.72) 72.58 (9.22) 0.002
[124]Open in a new tab
All variables were investigated or measured in early pregnancy; CHO,
plasma total cholesterol; TG, plasma triglycerides; HDLC, high-density
lipoprotein cholesterol; LDLC, low-density lipoprotein cholesterol; UA,
blood uric acid; PRO, urine protein; ALT, alanine aminotransferase;
AST, aspartate aminotransferase; BUN, blood urea nitrogen; Cr, serum
creatinine.
^a
Mean (SD); n (%).
^b
Wilcoxon rank-sum tests; Pearson’s chi-squared tests; Fisher’s exact
tests (drinking after pregnancy).
cfDNA physical properties are different in women with GDM
We mapped cfDNA paired-end sequencing data to the human reference
genome (GRCh38.p13) and divided it into nuclear and mitochondria
origins ([125]Figures 1B and [126]S1B). In both healthy and GDM groups,
nucleosome and mitochondria-derived cfDNA had peak fragment sizes of
167 and 80 bp, respectively. Nuclear-derived cfDNA had smaller fragment
peaks with periodic 10 bp intervals, patterns consistent with previous
investigations,[127]^25 and indicating potential associations with
histone-bound DNA lengths. Such patterns were similar in control and
GDM groups ([128]Figure 1C). However, women with GDM had a
significantly higher ratio of longer cfDNA than control (p < 0.001,
Wilcoxon rank-sum test). Mitochondria-derived cfDNA distribution
patterns between control and GDM groups were also different.
Additionally, different gestation trimesters were associated with
varying patterns of distribution in nuclear and mitochondria-derived
cfDNA ([129]Figure S1B).
Fetal fractions showed an increasing tendency as pregnancies progressed
([130]Figure 1D), consistent with findings reported previously.[131]^26
Notably, fetal fractions in women with GDM were constantly lower than
those in controls (p < 0.01, linear mixed-effects model [LMM]),
consistent with previous studies.[132]^20^,[133]^27 We further compared
fetal fractions in obesity and preterm subgroups and observed that
lower fetal fractions associated with GDM were independent of multiple
factors ([134]Figure S1C).
Among the 256 cfDNA fragment 4-mer end motifs, CCCA was the most
enriched in all samples, which aligned with previous research.[135]^17
GDM had a significantly higher proportion of CCCA motifs than their
controls (p < 0.05, LMM). Moreover, GDM had a significantly higher
percentage of ACTT and ACCG motifs and a lower percentage of GCGG,
GCGC, and TCGG motifs when compared with controls ([136]Figures 2A and
[137]S2A). We then calculated the frequency diversity of all 256 motifs
using motif diversity scores (MDSs), which showed decreased diversity
in all fragment size categories as pregnancies progressed ([138]Figures
2A and [139]S2B). Interestingly, the MDS in GDM was notably lower than
in controls, especially during the 1^st and 2^nd trimesters. Moreover,
MDS trends were comparable with the validation dataset1
([140]Figure S2C). We also used methylation-associated (MA) values to
infer cfDNA methylation status[141]^15 in controls and GDM and found
that the latter group had higher MA values than the former group
([142]Figure 2A). These differences represented specific cfDNA physical
properties related to gene expression patterns during GDM.
Figure 2.
[143]Figure 2
[144]Open in a new tab
CfDNA physical property differences between GDM and controls
(A) Comparing cfDNA physical properties between GDM and controls.
Signatures differed in motif frequency (CCCA = representative
sequence), MDS (four subgroup sizes, (1) all: all fragments, (2) short:
≤150 bp, (3) peak: 160–170 bp, and (4) long: ≥250 bp), and MA
(methylation-associated value). The center line in the boxplot
represents the median, and lower, upper whiskers, and outliers
correspond to the 1.5× interquartile range and outliers outside that
range, respectively. p values for disease effects were calculated from
the LMM.
(B) Left: relative PIK3R1 coverage revealed differences between GDM and
controls across different trimesters. Right: TSS PIK3R1 coverage
revealed differences between GDM and controls across different
trimesters (case1, case2, and case3 represent 1^st, 2^nd, and 3^rd
trimester of GDM, respectively.).
(C) TSS scores after a log transformation calculated for previously
identified GDM-associated genes. Wilcoxon rank-sum tests were used to
calculate p values.
(D) TSS scores at various locations on multiple chromosomes. The x axis
represents chromosome locations, while the y axis represents mean TSS
scores after a log transformation was applied. Yellow dots on the graph
indicate the 10 most significant TSS score differences between GDM and
controls.
(E) Top enriched gene set enrichment analysis terms in Wikipathways.
Colors represent normalized enrichment scores and point size represents
log[10](p adj) values.
TSS scores identify pathway changes in GDM
We next analyzed cfDNA coverage and TSS scores in three previously
reported GDM-related genes (PIK3R1, PPARG, and TCF7L2) in Asian
populations.[145]^28^,[146]^29^,[147]^30 We found that the coverage of
up- and downstream TSS regions in these genes differed between GDM and
control ([148]Figures 2B, [149]S3A, and S3B). Specifically, the TSS
score of PIK3R1 was significantly higher in women with GDM when
compared to control ([150]Figure 2C, p < 0.001, Wilcoxon rank-sum test
with Bonferroni correction). We further calculated sequencing coverages
near TSS regions using a TSS score method and identified genome-wide
changes in cfDNA coverage near these regions at different trimesters
between control and GDM groups ([151]Figures 2D, [152]S3C, S3D, and
S3E).
For validation purposes, we performed a similar genome-wide TSS
coverage analysis using a public dataset[153]^19 and observed a
correlation (Pearson correlation coefficient = 0.61, p < 0.01,
[154]Figure S6F and [155]Table S6) between the fold changes of TSS
coverage in the 1^st trimester.
To perform pathway enrichment analysis, we selected genes with TSS
score fold-change differences between control and GDM groups. In the
1^st trimester, no pathways were enriched. However, in 2^nd and 3^rd
trimesters, we found increasing numbers of enriched pathways, including
several signaling pathways related to diabetes ([156]Figure 2E). For
instance, PI3K (phosphatidylinositol 3-kinase) and MAPK
(mitogen-activated protein kinase) pathways, both of which are
activated by insulin and involved in lipid
metabolism,[157]^31^,[158]^32 were enriched in GDM (Benjamini-Hochberg
corrected p = 0.032 for PI3K in the 2^nd trimester and 0.024 for MAPK
in all trimesters) along with the insulin signaling pathway. PIK3R1, a
member of PI3K/AKT pathway, contributed to this enrichment
([159]Table S3). Additionally, we identified enriched CHO biosynthesis
and leptin signaling pathways, both of which are implicated in
diabetes.[160]^33^,[161]^34 Interestingly, we also identified pathways
involved in angiogenesis such as VEGFA (vascular endothelial growth
factor-A)-VEGF receptor 2 pathway, synaptogenesis, neurogenesis
(hippocampal synaptogenesis and neurogenesis), osteoblast
differentiation and related diseases, and ectoderm differentiation
([162]Table S4). Gene ontology (GO) analysis confirmed that biological
processes, such as brain development, tissue morphogenesis, cell
proliferation, and growth regulation were altered in GDM
([163]Table S4), which putatively suggested GDM effects on fetal growth
and development.
TSS score signatures identify genes related to preterm births during GDM
Using an LMM, we identified 55 TSS regions spanning across
all trimesters that exhibited significantly different TSS scores
between control and GDM ([164]Figure 3A). These TSS scores,
representing alterations in 50 genes to GDM, were defined as TSS score
signatures. To investigate the biological implications of these TSS
score signatures, we categorized the 50 genes into four temporal
categories using a Euclidean distance approach ([165]Figure 3B). The
first category comprised two genes with decreasing TSS scores
throughout pregnancy trimesters, exhibiting consistently higher TSS
scores in patients with GDM than in controls. The second category also
had higher TSS scores in GDM, but scores tended to increase as
gestation proceeded. In the third category, most TSS scores decreased
as pregnancy progressed and were lower in women with GDM. The fourth
category had lower TSS scores in women with GDM and showed an
inverted-U shape trend against gestation. To understand the temporal
patterns of these signature genes across trimesters, we applied an
unsupervised clustering approach, segmenting the study population into
two clusters in the 1^st trimester, four clusters in the 2^nd
trimester, and three clusters in the 3^rd trimester. This analysis
revealed a distinct pattern in population flow, with the majority
transitioning from T1-2 to T2-3 and subsequently to T3-3
([166]Figure 3C). By comparing morbidity differences in each cluster,
we observed an uneven GDM distribution complicated with preterm births,
especially the T2-2 cluster showing no preterm births ([167]Figure 3C).
We then analyzed correlation networks using GO pathways in TSS score
signatures in the 50 genes and found the associations of PIK3R1,
previously linked to diabetes,[168]^35 with multiple pathways
([169]Figure 3D and [170]Table S4). We then focused on associations
between the TSS score signatures of the 50 genes and preterm birth. Six
genes were involved in growth and development processes ([171]Tables S3
and [172]S4), in which CNTN4 and RNF213 had TSS scores associated with
preterm birth. Notably, CNTN4 is a member of the immunoglobulin
contactin family, with previous research reporting that contactin-2 may
exhibit changes before a preterm delivery diagnosis.[173]^36
Figure 3.
[174]Figure 3
[175]Open in a new tab
CfDNA signatures between GDM and controls
(A) Upset diagram shows selected TSS score signatures.
(B) Heatmap shows TSS score signatures. Differential TSS scores
separate GDM and control samples despite dynamic changes in trimesters.
Four clusters show different temporal patterns.
(C) Sankey diagram shows temporal cluster changes. Rectangle width
corresponds to sample numbers in each trimester, and connections
between rectangles represent subject flow between trimesters (T
referred to trimester). The three bar charts below show preterm birth
percentages in each trimester.
(D) A correlation network of 50 TSS scores. Line color represents
pathways and line size represents log(weight) values.
(E) The odds ratios of six growth and development-related TSS scores.
Data are presented as odds ratios with 95% confidence intervals.
TSS score signatures identified genes associated with altered lipid
metabolism in GDM
As higher TG and LDLC levels ([176]Table 1) and enriched lipid-related
pathways were observed in women with GDM in our study ([177]Figure 2E),
we examined associations between lipid metabolism and TSS score
signatures from 50 genes. TSS scores of multiple genes were correlated
with lipid-related phenotypes from clinical records, including CHO, TG,
and LDLC levels ([178]Figure 4A). Notably, a correlation was identified
between PSD3 and plasma lipid profiles, consistent with previous
findings showing that down-regulated PSD3 (via short interfering RNA)
reduced intracellular lipid content in primary human
hepatocytes.[179]^37 Furthermore, we found that the TSS score of
SMARCD1, a molecular linker of the switch/sucrose non-fermentable
(SWI/SNF) chromatin remodeling complexes and hepatic lipid
metabolism,[180]^38 was correlated with CHO and LDLC levels
([181]Figure 4A). These results reveal the possibility of using TSS
scores as the genetic markers of lipid homeostasis in GDM. To further
investigate the link between TSS score signatures and lipid changes in
GDM, we extracted plasma lipids from the study population and performed
untargeted lipidomic profiling. When compared with controls, GDM had
significantly elevated fatty acid (FA) 24:6, sphingomyelin (SM) d36:0,
and phosphatidylcholine (PC) 16:0/20:4 levels and decreased PC 44:7e
and PC 44:2e levels ([182]Figures 4B and [183]S4A).
Figure 4.
[184]Figure 4
[185]Open in a new tab
Lipid profile associations
(A) TSS score associations with different laboratory measurements. The
upper section of the plot shows Pearson correlation analysis on
multiple TSS scores. Line segment thickness corresponds to the
magnitude of Pearson correlation coefficients, indicating relationship
strength. Line color represents p values, which show statistical
significance in observed correlations (Age, age of mother pregnancy;
BMI_pre, BMI of the mother before pregnancy; CHO, plasma total
cholesterol; TG, plasma triglycerides; LDLC, low-density lipoprotein
cholesterol; ALT, alanine aminotransferase; AST, aspartate
aminotransferase; birthBMI, BMI of the child at birth).
(B) Lipid differences between women with GDM and controls.
|log2(FC)| < 0.4 and −log10(p value) > 2 threshold values are applied.
Blue and red dots represent significantly down-regulated and
up-regulated lipids in GDM, respectively. Wilcoxon rank-sum tests were
used to calculate p values.
(C) Up: the fragment end counts and PRSS1’s TSS downstream 500 bp.
Middle: a schematic showing the PRSS1 gene structure. Down: the
coverage of the corresponding region. The gray-purple area represents
introns 1–4 in ENST00000311737. Gray dotted lines indicate the
positions of two end peaks situated in the intronic region (case1,
case2, and case3 represent 1^st, 2^nd, and 3^rd trimester of GDM,
respectively. ctr1, ctr2, and ctr3 represent 1^st, 2^nd, and 3^rd
trimester of control, respectively).
(D) Heatmap shows PRSS1 and other acinar-specific gene expression
levels in different women from single-cell data from Camunas-Soler
et al.[186]^39
We then analyzed the correlation between TSS scores of 4 lipid
metabolism-related genes and altered lipids in GDM ([187]Figures S4B
and S4C). We identified significant positive correlations between PRSS1
and up-regulated FA 24:6, SM d36:0, and PC 16:0/20:4 levels in women
with GDM in the 2^nd trimester (Spearman ρ = 0.34, 0.33, and 0.39,
respectively). Additionally, we consistently observed a positive
correlation between PRSS1 and down-regulated PC 44:2e levels in
controls ([188]Figure S4B), suggesting more comprehensive lipid
profiles associated with PRSS1 expression. PRSS1 is a pancreatic acinar
cell marker[189]^40 and encodes an enzyme secreted from the pancreas.
An animal study confirmed that PRSS1 transgenic mice exhibited
disordered lipid metabolism.[190]^41 We further investigated the
coverage characteristics of cfDNA fragments near PRSS1 and found the
predominant distribution of reads starting at 260 bp downstream of TSS
(U-end) and ending at 427 bp (D-end) near the TSS ([191]Figure 4C).
These accumulated fragments precisely overlapped with the PRSS1
intronic region ([192]Figure 4C), exhibiting a highly consistent
nuclease cleavage pattern (chromatosome: nucleosome + linker histone;
∼167 bp[193]^25^,[194]^42) across samples. It has been shown that the
maternal-origin cfDNA has longer fragment sizes (∼167 bp) than
fetal-origin cfDNA (∼144 bp) due to different enzyme cutting positions.
Hence, the phenomenon of 167 bp cfDNA fragments concentrated in
the PRSS1 intron may reflect its maternal origin.[195]^43 Notably, the
cfDNA coverage of PRSS1 differed between GDM and control throughout
pregnancy, particularly the 1^st trimester ([196]Figure 4C). The most
probable explanation for this disparity is that the transcriptional
activity of PRSS1 was altered in GDM. To validate our findings, we
examined single-cell transcriptomic data from diabetic human
islets.[197]^39 We found that type 2 diabetes and prediabetic patients
showed not only higher PRSS1 expression but also higher expression of
other pancreatic acinar marker genes ([198]Figure 4D). Our analysis
demonstrated an increased expression of PRSS1 and other pancreatic
acinar marker expression in both type 2 diabetes and prediabetic
patients and agreed with recent genetic studies that suggested genetic
background differences in the exocrine pancreas of individuals with
diabetes.[199]^44^,[200]^45
Integrative specific neural network (SNN) models powerfully predict GDM and
offspring BMI
Given the robust associations between cfDNA features and phenotypic and
metabolic alterations in GDM, we exploited these cfDNA features to
predict early GDM. Leveraging diverse clinical data (including age,
pre-pregnancy BMI, drinking habits, and smoking status) and cfDNA
features (such as motifs, MDS, MA, fetal fraction, and TSS score
signatures) from the 1^st trimester, we constructed a neural network
model aimed at predicting the likelihood of GDM developing in the 3^rd
trimester ([201]Figure 5A). Utilizing solely clinical information for
prediction yielded an area under the curve (AUC) of 0.697, with a 95%
confidence interval (CI) ranging from 0.534 to 0.860 ([202]Figure 5C).
Furthermore, incorporating various cfDNA features into the model
enhanced its predictive performance. Notably, a substantial improvement
in AUC was observed when TSS score signatures were incorporated with
other features, achieving an AUC of 0.877 with a 95% CI spanning from
0.794 to 0.960 ([203]Figure 5C). We then validated our prediction model
in validation dataset1, which showed an AUC = 0.829 (95% CI:
0.746–0.912) using clinical information and all cfDNA features
([204]Figure 5D). Interestingly, clinical information and
growth-related cfDNA features could also predict the offspring BMI
using an integrated neural network model, showing the correlation to
the actual value of R^2 = 0.56 (95% CI: 0.44–0.68) and a mean absolute
error = 1.75 (95% CI: 1.19–2.31) in birth BMI ([205]Figure S6A) and a
correlation of R^2 = 0.62 (95% CI: 0.40–0.84) and a mean absolute
error = 1.53 (95% CI: 1.34–1.72) in 2-year BMI ([206]Figure S6B). Next,
we assessed the significance of cfDNA features in our model and
identified the top 35 features crucial for predicting GDM and BMI
([207]Figures 5B and [208]S6C). For GDM prediction, we found fetal
fraction and BMI to be the most important factors in our SNN model,
indicating their higher contributions to the final loss value metric
within the SNN model. Additionally, we employed the random forest Gini
factor to validate feature importance, revealing that these values are
model specific ([209]Figure 5B). The key gene LILRB1 was ranked 21^st
for early GDM predictions and 1^st for birth weight predictions.
Another crucial gene, PIK3R1, was positioned 8^th in early GDM and
2-year BMI predictions. The key gene PRSS1 ranked 23^rd and 4^th in
early GDM and 2-year BMI predictions. Therefore, these top features
were crucial in the prediction model, and their relative importance
offered a reference for the specific trained model. Among the 35
crucial features, 9 genes with known roles in growth and development
were further selected to assess their association with fetal
development. We used a derived formula, C[BMI-a], from the TSS
component of the predictive birth BMI model ([210]STAR Methods) to
calculate a growth and development score. As expected, we observed a
significant distinction between normal and large for gestational age
children (p < 0.01, Wilcoxon rank-sum test. [211]Figure S6G).
Figure 5.
[212]Figure 5
[213]Open in a new tab
Neural network model
(A) SNN structure (created with [214]BioRender.com).
(B) Feature importance plot showing different machine learning
algorithms. The dark blue feature represents more importance when
compared with the light blue, indicating boost in AUC when adding the
feature. Right color bar signifies the feature class.
(C and D) Classifier performance as quantified by a receiver operator
characteristic curve to predict GDM using early gestation samples from
TJBC (C) and validation dataset1 (D). The numerical values in
parentheses denote the mean AUC value accompanied by its 95% confidence
interval.
(E and F) Performance of the refined model predicting GDM in TJBC,
validation dataset1 (E), and validation dataset2 (F).
Given the large number of features used in our GDM prediction model, we
performed a further feature selection to develop a refined model with
fewer features, including SeqFF (fetal fraction), BMI, CDH23, CNTN4,
RNF213, SMARCD1, TWSG1, PSD3, and PIK3R1. Each feature was implicated
in some aspect of lipid metabolism and GDM pathology. The refined model
predicted GDM with an AUC = 0.849 (95% CI: 0.798–0.900) in the TJBC and
0.734 (95% CI: 0.713–0.755) in the validation dataset1
([215]Figure 5E). Using the same features, the refined model was
further tailored to adapt the ultra-lower sequencing depth in
validation dataset2 (1% of TJBC) ([216]STAR Methods), which gave rise
to GDM prediction performance of an AUC = 0.758 (95% CI: 0.724–0.792)
([217]Figure 5F). Lastly, DeLong’s tests were performed to confirm the
enhancement to the GDM prediction performance by adding TSS features to
clinical phenotypes in the prediction models ([218]Figures S6D and
S6E).
Discussion
CfDNA in maternal plasma mainly originates from maternal and placental
apoptotic cells,[219]^46 and its heterogeneous origins are ideal for
studying complex metabolic disorders during pregnancy.[220]^47 In this
study, we demonstrated the associations of cfDNA physical properties
and fragment features with GDM by showing the direct involvement of
cfDNA fragment features with typical GDM-related pathways, offspring
BMI, preterm birth, and lipid metabolite traits. We identified PRSS1 as
a biologically significant and valuable GDM marker. Furthermore, by
externally validating both medium-depth sequencing data and low-depth
NIPT datasets, we demonstrated that dynamic cfDNA changes may represent
an important strategy for predicting GDM in early pregnancy.
Previous studies reported dynamic changes in cfDNA physical properties
during pregnancy and gestational disease.[221]^21 For instance, the
fetal fraction was shown to increase in line with gestational weeks and
served as an early screening marker of complications associated with
placental dysfunction in pregnancy.[222]^48 In our study, the fetal
fraction in women with GDM was significantly decreased when compared to
control, particularly at early gestation stages, consistent with a
previous study.[223]^20 Motif differences have also been implicated in
gestational diseases,[224]^49 and we observed distinctive cfDNA
fragment motif patterns and reduced cfDNA MDS in GDM, which suggested
lower cfDNA fragment complexity in GDM. Previous studies also indicated
nuclease involvement in cfDNA turnover and cfDNA end formation.[225]^50
Thus, nuclease activity may be changed in GDM, leading to altered cfDNA
physical properties. The biological implications behind these cfDNA
motif changes and diversity in women with GDM require further
investigation.
TSS scores were previously suggested to inversely correlate with
expression levels at promoter nucleosome-depleted
regions.[226]^51^,[227]^52 A major discovery in our study was that by
comparing TSS scores between GDM and controls, a set of genes with
functions related to diabetes, organ development, and differentiation
pathways were identified. For instance, PI3K and MAPK pathways were
concurrently altered in GDM. Specifically, the TSS score for PIK3R1, a
member of PI3K/AKT pathway with well-established roles in regulating
insulin and metabolic diseases,[228]^35^,[229]^53^,[230]^54 was
significantly different between GDM and controls. Additionally, PIK3R1
was enriched in several growth-related pathways. Diabetic pregnancies
are associated with distinct growth hormone profiles, which contribute
to varied fetal growth patterns.[231]^55 Increased fetal fat deposition
commonly observed in diabetic pregnancies may be attributed to modified
cellular differentiation and underlying mechanisms that regulate body
composition.[232]^56 The enriched pathways identified in our study also
had important functions in cell growth differentiation. Thus, our
findings endorse the rationale of using candidate cfDNA biomarkers to
detect and monitor GDM.
We also examined significantly altered TSS score signatures between GDM
and controls to identify 50 genes as the candidate cfDNA biomarkers for
GDM. Consistent with aforementioned results, several genes were
directly related to diabetes and growth and development, such as
LILRB1, PIK3R1, and CBS. Moreover, some candidate biomarker genes were
also involved in preterm birth and lipid metabolism biological
processes, including CNTN4 and RNF213, which were significantly
associated with preterm births. Thus, longitudinal TSS score-signature
patterns in candidate biomarker genes may represent altered fetal
growth and development in women with GDM, leading to increased preterm
birth risks.
Lipid metabolism is a well-established process in diabetes.[233]^57
Previously, Mak et al.[234]^58 reported a correlation between
circulating lipids and cfDNA genetic aberrations. In our study, we
identified lipidomic profiling shifts with correlations with TSS scores
in the islet acinar marker PRSS1. We further used single-cell data to
validate exocrine pancreatic alterations in type 2 diabetes or
prediabetic status. Exocrine pancreatic disorder in diabetes is a
clinically relevant but poorly understood condition.[235]^59 Thus, our
findings contribute to evidence linking exocrine pancreas deficiencies
to diabetes in vivo.[236]^44^,[237]^45
We also showed that cfDNA physical properties and clinical data could
be integrated into an SNN model to predict early GDM. This prediction
performance was further confirmed and validated in two geographically
distant and independent cohorts, thus highlighting the potential
effectiveness of cfDNA in early GDM diagnoses. Furthermore, the
addition of TSS score signatures from 50 candidate genes to the
prediction model greatly improved GDM prediction performances in the
1^st trimester and supported these genes as potential GDM candidate
biomarkers. As a complex disease, GDM induces sophisticated
physiological and metabolic changes that have a multifaceted impact on
offspring growth, including macrosomia, hypoglycemia, respiratory
distress syndrome, etc.[238]^60 Our study provides a platform for
exploiting the dynamic features of lipid metabolism, preterm birth,
insulin regulation, and growth and development genes to predict GDM.
Interestingly, we also predicted offspring BMI using the cfDNA features
of 50 candidate biomarker genes, possibly because offspring’s BMI
showed a strong correlation with maternal onset in GDM. Plasma cfDNA
has been widely used to non-invasively test for fetal trisomy and
copy-number variants in clinical practice.[239]^13 Critically, our work
shows the potential for integrating GDM predictions into NIPT in early
pregnancy in the future.
In conclusion, we conducted an integrative cfDNA analysis that reveals
a temporal relationship between cfDNA and GDM, providing comprehensive
insights into the genetic alterations associated with various
biological processes in GDM. This non-invasive approach holds promise
for clinical applications in early prediction of GDM.
Limitations of the study
Despite our study having one of the largest sample sizes in high-depth
sequencing cfDNA GDM cohorts, it is relatively smaller in comparison to
other prospective NIPT studies.[240]^61^,[241]^62 Moreover, we utilized
a nested case-control design where samples were carefully matched for
key variables such as maternal gestational age and sampling weeks,
thereby enhancing comparability between groups and improving
statistical efficiency. However, this approach may limit extrapolation
to broader populations. Furthermore, despite conducting external
validation using two separate cohorts with substantial numbers of
participants, determining the optimal sequencing depth for clinical
translation into NIPT remains an open question. Despite these efforts,
additional replication studies involving diverse and larger populations
are imperative.
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
TJBC cfDNA BED files This paper NGDC:
OMIX004723; [242]https://ngdc.cncb.ac.cn/omix/preview/z7lvCjn0
Single-cell RNA sequencing data related to islet dysfunction in
diabetes Camunas-Soler, J. et al.[243]^39
[244]https://github.com/jcamunas/patchseq.
__________________________________________________________________
Software and algorithms
__________________________________________________________________
Python 3.8.13 N/A [245]https://www.python.org
Transcription Start Site (TSS) NCBI
[246]https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_00000
1405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gff.gz
R 4.2.2 The R Foundation [247]https://www.r-project.org/
Fastp 0.23.2 Chen, S et al.[248]^63
[249]https://github.com/OpenGene/fastp
Minmap2 Li, H[250]^64 [251]https://github.com/lh3/minimap2
biobambam Tischler, G[252]^65 [253]https://github.com/gt1/biobambam
Samtools 1.6 Li, H et al.[254]^66
[255]https://github.com/samtools/samtools
Bamtools 2.5.2 Barnett, D. W.[256]^67
[257]https://github.com/pezmaster31/bamtools
ggcor 0.9.8.1 Huang, H et al.[258]^68
[259]https://github.com/mj163163/ggcor-1
igraph 1.4.1 Csardi G et al.[260]^69
[261]https://github.com/igraph/rigraph
ggraph 2.1.0 Pedersen, T et al.[262]^70
[263]https://github.com/thomasp85/ggraph
lme4 1.1–31 Bates D et al.[264]^71 [265]https://github.com/lme4/lme4
lmerTest 3.1–3 Kuznetsova A et al.[266]^72
[267]https://github.com/runehaubo/lmerTestR
nlme 3.1–160 Pinheiro J et al.[268]^73
[269]https://github.com/cran/nlme
clusterProfiler 4.6.0 Yu, G et al.[270]^74
[271]https://github.com/YuLab-SMU/clusterProfiler
metascape Zhou, Y et al.[272]^75
[273]https://metascape.org/gp/index.html#/main/step1
Sklearn 1.1.2 Pedregosa, F et al.[274]^76
[275]https://scikit-learn.org/
Tensorflow 2.9.1 Martin, A[276]^77 [277]https://www.tensorflow.org/
cfDNA features analysis This paper
[278]https://github.com/zqr2008/cfdna/tree/main/feature_analysis
Modelcode This paper
[279]https://github.com/zqr2008/cfdna/tree/main/modelcode
Figure reproduce This paper
[280]https://github.com/zqr2008/cfdna/tree/main/figure_reproduce
[281]Open in a new tab
Resource availability
Lead contact
Further information and requests for resources and reagents should be
directed to and will be fulfilled by the lead contact, Dr. Qiangrong
Zhai (zhaiqiangrong@genomics.cn).
Materials availability
This study did not generate new unique reagents.
Data and code availability
* •
The data supporting our findings have been deposited into the CNGB
Sequence Archive (CNSA)[282]^78 of China National GeneBank DataBase
(CNGBdb)[283]^79 CNGB: CNP0004352
([284]https://db.cngb.org/search/?q=CNP0004352) and China National
center for Bioinformation NGDC: OMIX004723
([285]https://ngdc.cncb.ac.cn/omix/preview/z7lvCjn0). These files
are accessible in compliance with Chinese legal regulations
(2023BAT1256), enabling future referencing and potential validation
analyses using our data.
* •
Single-cell RNA sequencing data related to islet dysfunction in
diabetes are available at
[286]https://github.com/jcamunas/patchseq. The data used in the
current article was preprocessed datasets provided[287]^39.
* •
The code used in the article is saved at
[288]https://github.com/zqr2008/cfdna.
* •
Any additional information required to reanalyze the data reported
in this work paper is available from the [289]lead contact upon
request.
Experimental model and subject details
This was a nested case-control study and women were selected from the
TJBC,[290]^22 which is an ongoing prospective cohort established in
2017 in Tianjin, China. The TJBC recruited pregnant women of ≤14 + 6
gestational weeks (1^st trimester) who were followed-up at 15–27 weeks
(2^nd trimester) and ≥28 weeks (3^rd trimester). The study was approved
by the Ethics Committee of Tianjin Women and Children’s Health Center
(No. 201706012-1), the Institutional Review Board of BGI (BGI-IRB
17116), and was conducted in compliance with the ethical standards
outlined in the Declaration of Helsinki and its subsequent revisions or
similar ethical standards. Before enrollment, women provided written
informed consent.
A case-to-control ratio of 1:1 was selected to increase statistical
power. Women underwent standard GDM screening; an overnight fast was
followed by a 75 g OGTT or an FPG test at 24–28 weeks of
gestation.[291]^80 The study consisted of 299 GDM cases and 299 healthy
controls. Inclusion criteria for GDM cases were as follows: 1)
singleton pregnancy, 2) blood samples donated at each visit during
1^st, 2^nd, and 3^rd trimesters, and 3) questionnaires completed at
each visit during pregnancy. Controls were selected using the same
inclusion criteria. Both cases and controls were individually matched
based on age (±2 years) and gestational week (±4 weeks) at the time of
sample collection ([292]Figure 1). No women with GDM or controls had
preexisting type 1 or type 2 diabetes. A GDM diagnosis was assigned
between 24 and 28 weeks of gestation following Guidelines for GDM
Diagnosis and Treatment (2014) (Department of Obstetrics and
Gynecology, Chinese Medical Association).[293]^80 GDM screening and
diagnosis tests were performed in central-regional-local healthcare
networks in Tianjin using aforementioned tests. For FPG screening,
women with an FPG value ≥5.1 mmol/L were diagnosed with GDM, and no
additional diagnostic tests were required. For women with FPG values
between 4.4 mmol/L and 5.1 mmol/L, a 75 g OGTT was recommended to
confirm or rule out GDM. Women with no GDM had FPG values <4.4 mmol/L.
For OGTT, women were deemed to have GDM if any of the following
conditions were met: 1) fasting plasma glucose levels ≥5.1 mmol/L, 2)
plasma glucose levels at 1 h ≥ 10.0 mmol/L, or 3) plasma glucose levels
at 2 h ≥ 8.5 mmol/L. Of the 598 study women, 524 were diagnosed using
the OGTT method and 74 using the FPG method. Obesity[294]^81 was
defined as a pre-pregnancy body mass index (BMI) ≥ 28 kg/m^2.
Gestational age at birth was defined as completed gestation weeks based
on the estimated delivery date in women’s clinical records. Preterm
birth was defined as a live birth before 37 completed gestation weeks.
Large for gestational age (LGA) was defined as > the 90^th birth weight
percentile for gestational age by infant gender. Birth weight, weight
at 2 years old, and length/height data were also collected from
clinical records. Low birth weight was defined at < 2500 g in live
births and macrosomia was defined as ≥ 4000 g.
To validate cfDNA dynamic signature changes throughout the three
trimesters with respect to GDM, we selected cfDNA sequencing data
(∼30×, paired-end sequencing) from an independent study of 202 samples
(8 women with GDM and 106 healthy controls, each pregnant woman is
sampled more than once) based on gestational age at plasma collection
and subsequent follow-up as a Validation dataset1 ([295]Figure 1).
These 114 individuals from the Shenzhen locality were recruited.
Written informed consent was obtained from all individuals (No. BGI-IRB
20012) ([296]Figure S5B).
To confirm trait generalizability across datasets, characterized by
lower sequencing depths and larger sample sizes, we used cfDNA
sequencing data from Zhu et al.[297]^23 (∼0.6× and single-end
sequencing) as a Validation dataset2. Using the same methodology, we
computed TSS scores and fetal fraction for this dataset. Initially, we
excluded samples with missing BMI values and fetal fraction <0.01.
Subsequently, using the refined model, we validated 6104 samples
(comprising 1045 women with GDM and 5059 healthy controls) with no more
than three missing feature values ([298]Figure S5C).
Method details
Library construction and sequencing data
At all visits, 5 mL of peripheral blood was drawn into
ethylene-diamine-tetra-acetic-acid (EDTA) blood tubes (ComWin, Beijing,
China). Plasma was generated using a two-step centrifugation
protocol[299]^82 and stored at −80°C. CfDNA was isolated from 200 μL
plasma using the MagPure Circulating DNA Mini KF kit (Magen, Guangzhou,
China) according to manufacturer’s protocols. Extracted cfDNA was then
used to prepare cfDNA libraries for sequencing using the MGIEasy free
DNA library preparation reagent set (MGI, Shenzhen, China) following
manufacturer’s protocols. Prepared libraries were circularized to
create single-stranded DNA (ssDNA) circles using the MGIEasy
Circularization Kit (MGI) following manufacturer’s instructions. A
Qubit ssDNA assay kit (Invitrogen, USA) was used to quantify purified
ssDNA circles, after which they underwent rolling circle amplification
to generate DNA nanoballs. After these steps, the final products were
quantified using a Qubit ssDNA Assay Kit (Invitrogen) and loaded onto a
DNBSEQ platform (MGI) for multiplex sequencing using a paired-end
100 bp strategy.
Lipid extraction and untargeted lipidomic profiling
Lipids were extracted from maternal blood as previously
described.[300]^83^,[301]^84 Briefly, precooled isopropanol spiked with
a lipid internal standard mix (SPLASH LIPIDOMIX Mass Spec Standard,
Avanti, USA) was added to plasma. After vortexing and then overnight
incubation at −20°C, samples were centrifuged and supernatants analyzed
using a liquid chromatography–mass spectrometer (LC-MS). Samples were
separated on a CSH C18 column (1.7 μm, 2.1 × 100 mm, Waters, USA) and
analyzed using a QExactive MS (Thermo Fisher Scientific, USA). LC
gradient and MS conditions were previously reported.[302]^84 Mobile
phase A contained acetonitrile/water (60:40) plus 10 mM ammonium
formate and 0.1% formic acid, while mobile phase B contained
isopropanol/acetonitrile (90:10) plus 10 mM ammonium formate and 0.1%
formic acid. The scan range for MS detection was 200–200m/z with a
resolution of 70,000, and the automatic gain control (AGC) target for
MS acquisition was set to 3e6 with a maximum ion injection time of
100 ms. The top three precursors were selected for subsequent tandem
mass spectrometry fragmentation with a resolution of 17,500 and a
maximum ion injection time of 50 ms. The AGC was 1e5, and the stepped
normalized collision energy was set to 15, 30, and 45 eV. Lipid
identification and quantitation were performed using LipidSearch 4.1
SP2 software (Thermo Fisher, USA), and data scaling and normalization
were processed in metaX.[303]^85
Sequence alignments and filtering
Raw fastq data were subjected to quality filtering using Fastp
0.23.2[304]^63 software based on the following criteria: (1) removal of
adapter sequences, (2) elimination of sequences containing >10% unknown
bases, and (3) removal of low-quality sequences. Filtered reads were
then aligned to the GRCh38.p13 reference genome using MiniMap2[305]^64
comparison software. Resultant comparison outputs were saved as bam
files, which were sorted using Biobambam.[306]^65 Additionally,
duplicate reads generated during amplification steps were marked in
sorted BAM files and filtered. Only paired reads with proper mapping
orientation and insert size (i.e., ≤600 bp) were retained for
downstream analyses ([307]Table S2).
Fetal fraction
The fetal fraction in plasma samples was estimated using SeqFF,[308]^86
a method that capitalizes on dissimilar chromatin structures between
the mother and fetus, leading to irregular cfDNA dispersion across the
genome. To determine the fetal fraction, several regression models
(Enet and WRSC) were developed using read counts from multiple cfDNA
regions in maternal plasma. We used mean results from both models as
the fetal fraction value ([309]Table S3).
Motifs and MDS
End motif analysis was performed on single cfDNA fragments.[310]^17 We
used the first four nucleotides at 5′ ends to calculate the frequencies
of 256 (
[MATH: 4×4×4×4 :MATH]
) possible motifs (4 bp sequences and 4-mer motifs), and motifs were
normalized according to the total number of ends.
MDS in samples were calculated using [311]Equation 1.
[MATH: MDS=∑i
=1256−Pi×log2(Pi)/log2(256
mn>) :MATH]
(Equation 1)
where P[i] is the frequency of a particular motif. A higher MDS
indicated a higher diversity (i.e., a higher degree of randomness). The
theoretical scale ranged from 0 to 1. A previous study demonstrated
that maternal and fetal cfDNA fragments exhibited distinct length
patterns, with maternal cfDNA generally longer than fetal
fragments.[312]^87 In our analysis, we computed motif frequencies and
MDS for all samples. We also divided fragments into three subsets:
short, peak, and long. The short subset consisted of ≤150 bp fragments,
the peak subset comprised fragments in the 160–170 bp range, and the
long subset encompassed >250 bp fragments. Subsets were analyzed
separately to investigate their respective characteristics
([313]Table S3).
MA
Methylated cytosine-phosphate-guanines (CpGs) have a higher likelihood
of cleavage at cytosine when compared to unmethylated CpGs, while
having a reduced likelihood of cleavage at the base preceding the
CpG.[314]^15 Such differential cleavage patterns can cause increased
CGN motifs but decreased NCG motifs. By analyzing cfDNA cleavage
patterns and resulting CGN/NCG motifs, cfDNA methylation status was
inferred across different regions.[315]^15 We used CGN/NCG motif ratios
([316]Equation 2) to assess methylation status in samples and named it
as methylation-associated (MA) value ([317]Table S3).
[MATH: MA=No.<
mspace width="0.25em">of5’CGNendmotif
mi>sNo.of5’NCGendmotif
mi>s :MATH]
(Equation 2)
5′CGN end motifs (i.e., 5′- CGA, CGT, CGG, and CGC). 5′ NCG end motifs
(i.e., 5′- ACG, TCG, GCG, and CCG).
TSS scores
Transcriptional activity is correlated with chromatin status around the
TSS.[318]^88^,[319]^89 To quantify gene expression, we analyzed TSS
region coverage. Specifically, we calculated upstream and downstream
500 bp coverage of the TSS from aligned BAM files using the SAMtools
1.6 depth function.[320]^66 To account for sequencing depth and bias,
we normalized TSS region coverage by dividing the 1 kb region into
three parts. Average side bin depth was used to normalize the depth of
the middle 500 bp (TSS -250 to TSS +250) and the normalization rate of
the middle bin was defined as the TSS score ([321]Equation 3).
[MATH: TSSscore
mi>=dep<
mi>th(mid
dlebin)
depth<
mspace
width="0.25em">(sid
ebin)
:MATH]
(Equation 3)
where depth (side bin) is the average depth of TSS -500 to TSS -250 and
TSS +250 to TSS +500 regions. Depth (middle bin) refers to the average
depth of the middle 500 bp (TSS -250 to TSS +250). High TSS scores
indicated high coverage in the TSS region, indicating that cfDNA was
highly protected and not easily bound to transcription-related factors,
thereby eliciting low gene expression. In contrast, lower TSS scores
were associated with higher gene expression.[322]^19^,[323]^51 In our
study, TSS scores were used as gene expression measures.
TSS score-signature identification
Selected TSS score-signatures were based on statistical tests and
log2-transformed fold-change (log2(FC)) values. We used a linear
mixed-effects model (LMM) approach to screen TSS scores, using a
two-sided false-discovery rate threshold of <0.1 for selection. The LMM
model included trimesters, groups, and interactions between trimesters
and groups as fixed effects while incorporating a subject-specific
random effect. We used Least Squares Means estimates to test for
differences between groups. To determine the threshold for TSS
score-signatures, we performed pairwise comparisons using TSS scores to
calculate log2(FC) values to represent differences between GDM and
control samples for each trimester, and collectively for all
trimesters. A log2(FC) > 0.05 value was the threshold. Simultaneously,
we conducted univariate statistical tests for these four comparisons
using Wilcoxon rank sum tests (p < 0.1 threshold). A TSS score meeting
all aforementioned criteria was selected as a signature. If a gene
corresponded to multiple TSS scores, the final TSS score was primarily
selected using the maximum absolute log2(FC) value. If both positive
and negative log2(FC) values occurred, the direction was determined
based on the location of the majority of TSS scores, and the TSS score
with the maximum absolute value in the determined direction of log2(FC)
values was chosen.
Gene set enrichment analysis and gene ontology analysis
We performed GSEA[324]^90 using the R package clusterProfiler[325]^74
and biological pathways from Human Wikipathways[326]^91 were used as
primary sources for analyses. The Benjamini–Hochberg (BH) method was
used for multiple testing corrections. A gene list was generated based
on TSS scores, after which genes were ranked using pairwise log2(FC)
values between GDM and control samples. We also used Metascape[327]^75
for gene ontology analysis using the following thresholds: minimum
overlap = 3, P-value cutoff = 0.01, and minimum enrichment = 1.5. Gene
Ontology (GO) biological process pathways were selected in analyses.
C[BMI-a] definition
To gain deeper insights into the relationships between high-importance
TSS score-signatures among the top 20 predictive features for birth BMI
and growth development, we used a derived formula from the TSS
component of the predictive birth BMI model, C[BMI-a] ([328]Equation
4):
[MATH:
CBMI−a=13.4589+(0.0809)
mrow>×LILRB1+(−0.1696<
/mn>)×MIR8071
−1+THAP12+(−0.0106<
/mn>)×RNF213<
mo
linebreak="goodbreak">+(−0.2094<
/mn>)×SPATS2L+(−1.4290<
/mn>)×BRPF1+(−0.0686<
/mn>)×MIR3689
C+(−0.0530<
/mn>)×PCMTD2+(−0.0088<
/mn>)×LPGAT1 :MATH]
(Equation 4)
Where LILRB1, MIR8071-1, THAP12, RNF213, SPATS2L, BRPF1, MIR3689C,
PCMTD2, and LPGAT1 are genes in the C[BMI-a]. In this equation, each
gene was represented by its TSS gene score.
External dataset validation
To rigorously authenticate our findings, we retrieved cfDNA sequencing
data from women with GDM, as published by Guo et al..[329]^19
Accounting for sequencing depth and reference genome distinctions, we
meticulously determined TSS scores that displayed consistent trends
across healthy and GDM cohorts. Using a significance FDR threshold =
0.05, selected TSS scores were subjected to further correlation
analysis.
The neural network models
Neural network input data
We implemented a parallel-connected Neural Network Model. CfDNA data
were extracted as motifs, MDS, MA, and TSS scores. In input data, we
had 256 4-mer motif features, four MDS features, three MA features, and
50 TSS score features from previous steps.
The sampling process for this network model involved detailed
extraction and feature selection from cfDNA data. We used 256 4-mer
motif features, four MDS features, three MA features, and 50 TSS score
features. The dataset was meticulously partitioned into training,
validation, and testing sets, with a typical split of 80% for training
and 20% for testing. Stratified random sampling procedures ensured the
representation of key variables such as age and disease status across
subsets. To address data imbalance, techniques such as oversampling the
minority class or undersampling the majority class were used.
For internal validation, we used 10-fold cross-validation in the
training set, which allowed for a comprehensive evaluation of model
performance. This method involved training the model on nine subsets
and validating it on the remaining subset, in an iterative process.
External validation was achieved using two external datasets to
validate model applicability and robustness in different settings or
populations.
Feature selection
From preliminary studies, the majority of features from TSS genes and
motifs included too much noise which contribute little to our results.
Thus, we implemented forward feature selection as a first step to
remedy this. This step started from an empty logistic model. We then
added features one by one to determine the best performance feature for
each step. The model was adapted from a logistic model used by Hu
et al.[330]^92 ([331]Equation 5).
[MATH:
ey′1+ey′
msup>=[β1β2…βn]×[f1<
/mtr>f2<
mo>…fn<
/mtr>]+[b]
:MATH]
(Equation 5)
At first, there were no components in β and f vectors. Starting from an
empty model, we first added one feature to the initial model and
compared performance improvements using likelihood ratio Chi-square
tests with the previous model. The feature with the smallest Chi-square
test p value (p < 0.05) was selected for the model. Then, in the next
step, we selected the best features from the remaining features. We
continued this step-forward selection process until no qualified
features were available to improve the model performance using
likelihood ratio Chi-square test p < 0.05 values. These features were
then used to generate an SNN model.
SNN sub-networks
In our model, we generated the final input as a 93-features vector that
included 31 motifs, four MDS, three MA, and 55 TSS scores. Given the
distinct patterns in each feature, we developed different neural
network branches in the SNN to process data prior to the final
categorization.
Here, we used a dense layer from Keras to build a fully connected
neural network to process the 31 motifs ([332]Equation 6). We used Wi,
Wj, …Wm to represent the weight matrix of layers i, j, …until m. Each
layer processed the previous layers’ output and used the Relu function
to generate an output for the next layer.
[MATH: L=[Relu(Wm[…[Relu(Wj[Relu(Wi·
xi+bi)]+bk)]…]+bn)] :MATH]
(Equation 6)
Our Motif features exhibits consistent values with a fixed range. From
this uniformity, a fully connected neural network can effectively
interpret such patterns and create an optimal classification model. In
other words, such a network can yield the best results for classifying
or categorizing data based on uniform motif features. For MDS and MA,
we processed these with a simple 1-unit neuron ([333]Equation 7), where
x was the input matrix of MDS or MA, and W was an n×1 weight matrix.
Here, we used three MA features as an example. These features were
relatively straightforward and did carry much information. Due to their
simplicity, only one neuron was required to transfer processed data to
the classification layer. This limited information did not necessitate
a larger, more complex network structure; hence a single output neuron
sufficed.
[MATH: M=Relu([x1x2x3]×[w1<
/mtr>w2<
msub>w3]+[b]<
/mo>) :MATH]
(Equation 7)
Next, we used convolutional layers to process TSS scores ([334]Equation
8), which we based on the observation that the TSS score matrix had
many internal relationships between different genes; a convolution
layer learns and emphasizes the relationship between two genes and thus
helps with the final classification job.
[MATH: R=Maxp
ool(∑j
∑kfilter[j,
mo>k]TSS[m−j,n−k]) :MATH]
(Equation 8)
j and k represent the coordinators of the convolutional kernel filter.
We used a 2 dimension 3 times 3 kernal filter for the TSS matrix
([335]Equation 7), where m and n were TSS matrix coordinates. Thus, as
depicted ([336]Equation 7), each value in the TSS matrix used
multiplication and summation processes with a convolutional kernel
filter using its neighboring values to generate a new feature map
value.
Classification and regression
After SNN processing, all SNN outputs were concatenated as a whole
input matrix for classification and regression ([337]Equations 9
and [338]10).
[MATH: class=soft
max(p)=soft
max([LM<
mi>R]·Wb
msub>+bc) :MATH]
(Equation 9)
where W[b] is the weight matrix of the final classification layer, and
b refers to binary classification. As a binary classification, we used
only one classification output p to represent GDM probability. Thus, W
is an n × 1 matrix; n is the total output number from output L, M, and
R from [339]formula 5, [340]6, and [341]7. Finally, we added a bias to
the classification layer - shown as b[c] - where c stands for
classification.
[MATH:
linear=Relu([LM<
mi>R]·Wl
msub>+bl) :MATH]
(Equation 10)
we used W[l] to represent the weight matrix of the final output in
linear regression analyses. We used b[l] for the bias of the last
output layer.
We also used a fully connected layer, using Softmax activation and
cross-entropy loss functions, for the classification output. The
activation function of the regression output was Relu since there was
no negative output for BMI and weight. The loss function of the
regression output was the mean square error. We chose the Area under
the ROC Curve (AUC) as the main SNN evaluation metric. Briefly, we
randomly shuffled our dataset and split it (n = 598; GDM n = 299 and
controls n = 299) according to 75% training and 25% testing. Test
processing included a 10-fold cross-validation. Thus, we trained 10
models for training data and performed 10 test steps for testing data,
which yielded 10 AUCs for 95% CI.
Feature importance
Feature importance contributed to the final model and was evaluated
using the following steps: 1) We randomly shuffled the value of each
feature for every data record in the test dataset; 2) We loaded the
pre-trained model and predicted classification or regression results of
the shuffled test dataset; 3) We recorded the loss of prediction
results and compared it with the loss of the original non-shuffled test
dataset; and 4) We determined differences in loss values as feature
importance.
Random forest classification model
We used a random forest classification model to verify feature
importance in SNN models. In this scenario, we used features as
classification attributes in the random forest model. This was
implemented using sklearn[342]^76 with estimator trees = 100, the Gini
factor as a criterion, and no limitation on tree max depth.
Forward stepwise feature selection (FSFS) in logistic regression for TSS
selection
We further refined TSS features using FSFS which is a sequential method
that begins with no predictors in the model and adds them one at a
time. Each variable is chosen based on its contribution to improving
the model’s fit, and evaluated by a statistical criterion (likelihood
ratio test). The logistic regression model used in this selection
process is described by the following [343]Equation 11:
[MATH: ln(p1−p<
/mi>)=β0+β1x
1+β2x
2+…+βn
xn :MATH]
(Equation 11)
where p is the probability of a feature occurrence,
[MATH: β0 :MATH]
is the intercept, and
[MATH: β1 :MATH]
,
[MATH: β2 :MATH]
, …,
[MATH: βn :MATH]
are coefficients of the predictors x[1], x[2], …, x[n].
This iterative process continued until new variable addition did not
significantly improve the model, based on a predefined improvement
threshold. The final model included a set of features that were highly
predictive of GDM. By using FSFS, we built a model that was not only
predictive but also interpretable in selecting the most important genes
implicated in GDM.
Validation of the refined model
To rigorously assess validation effectiveness, our refined model
underwent training using the TJBC training dataset. This evaluation
used a fully connected neural network architecture, incorporating nine
distinct features (SeqFF, BMI, CDH23, CNTN4, RNF213, SMARCD1, TWSG1,
PSD3, and PIK3R1) specified by the refined model. To evaluate the
validation dataset, we saved the best performed model selected from the
training TJBC and use the saved refined model to test the validation
datasets. The validation process was conducted using two separate
datasets: Validation dataset1 (from southern China) and Validation
dataset2 (from central China). To validate the datasets thoroughly, we
applied a comprehensive testing approach similar to what is known as
10-fold cross-validation. This process involves dividing the dataset
into ten equal parts, we used one part for testing for each fold. We
repeated this procedure ten times, each time with a different part used
for testing, to ensure the robustness of our validation method. This
involved randomly shuffling validation datasets and partitioning them
into 10 equal subsets, each constituting 10% of the data. Subsets were
then independently tested, with the results aggregated to calculate the
overall average performance and model CIs.
Quantification and statistical analysis
Univariate comparisons were conducted using two-sided Wilcox Rank Sum
Tests. For repeated collected measurements, our main goal was to
compare how certain cfDNA features changed between cases and controls.
We used LMM (Methods: TSS score-signature identification) to calculate
average cfDNA values, identify differences between groups, and
understand temporal trends.
We developed two models: the first was unadjusted and the second LMM
accounted for covariates. The primary focus of our data analysis
centered on comparing cfDNA change patterns between women with GDM and
controls. We used linear mixed-effects models to compute cfDNA least
squares means with respect to group differences and trend assessments.
Our primary outcome was the identification of pairwise differences
between cases and controls. We also hypothesized a priori that
pregnancy trimester progression was potentially associated with
abnormal physiological changes in GDM, such as an increased
insulin-resistant state, which may have manifested as altered cfDNA
levels correlated to glucose levels. Therefore, our analytical models
incorporated trimester progression and associated interactions with
disease status. For covariates, based on known risk factors for GDM and
cfDNA predictors, we considered pre-pregnancy BMI, maternal
age,[344]^93 ethnicity,[345]^94 education,[346]^95 smoking
status,[347]^96 and alcohol consumption[348]^97 as potential
confounding variables. Some covariates were not included in the final
adjusted model for the following reasons: 1) Age: To address maternal
age bias, we carefully selected age-matched case-controls (methods
Case-control study design). 2) Ethnicity: All participants were of
Asian ethnicity. 3) Pre-pregnancy BMI: Although this is a known factor
associated with GDM and potentially influences cfDNA features, we did
not include it as a covariate in our analysis. This decision was based
on previous research indicating that obesity-induced DNA release from
adipocytes stimulated insulin resistance.[349]^98 The interplay between
obesity, GDM, and cfDNA is complicated, and potentially introduced
collider bias into our study. Furthermore, our primary focus was to
examine whether GDM status affected cfDNA features, and we were less
concerned with how these effects were mediated by obesity. We
constructed two models: model 1 which was unadjusted, and model 2 which
was a linear mixed-effects model that included adjustments for
education, alcohol consumption (yes/no), and also smoking during
pregnancy (yes/no). Our results and discussion were based on the
outcomes from these models ([350]Table S5).
We defined statistical significance as a two-sided p < 0.05 value and
corrected non-parametric analyses for multiple testing using the BH
method. To perform correlation analyses, we separately calculated
Spearman’s correlations for each trimester between GDM and control
groups.
Acknowledgments