Graphical abstract

   graphic file with name fx1.jpg
   [59]Open in a new tab

Highlights

     * •
       An assay employs machine learning models for early detection of
       esophageal cancer
     * •
       Integrates 4 cfDNA fragmentomics features extracted from
       whole-genome sequencing
     * •
       Achieves consistent sensitivity and specificity across low-coverage
       sequencing
     * •
       Shows potential in the application of cfDNA fragmentomics in cancer
       diagnostics
     __________________________________________________________________

   The utilization of non-invasive, liquid biopsy-based analysis of cfDNA
   has emerged as a promising approach in disease detection. By
   integrating features derived from whole-genome sequencing including
   fragmentomics profiles, copy-number variation, and nucleosome
   positioning, Jiao et al. develop a robust model capable of accurately
   detecting esophageal squamous cell carcinoma.

Introduction

   Esophageal cancer (EC) is a prevalent malignancy worldwide, with its
   incidence rate having increased by 6-fold from 1999 to 2008.[60]^1
   Projections by Liu et al. indicate that the incidence rate is expected
   to continue rising over the next two decades and beyond.[61]^2 EC is
   recognized as one of the most aggressive malignancies globally and
   ranks as the sixth leading cause of cancer-related mortality.[62]^3 The
   prognosis for EC is generally poor, with a 5-year survival rate ranging
   from 15% to 25%.[63]^3 Squamous cell carcinoma (SCC) is the most
   predominant subtype of EC worldwide, arising from the squamous cells
   lining in the upper esophagus. Risk factors associated with the
   development of esophageal SCC (ESCC) include heavy tobacco and alcohol
   use, daily consumptions of hot foods, etc. The diagnosis of ESCC is
   often delayed due to the absence of early symptoms. Symptoms such as
   dysphagia, weight loss, and heartburn only show in the later stages of
   cancer development. Therefore, there is a pressing demand for accurate
   and effective early detection interventions.

   The utilization of non-invasive, liquid biopsy-based analysis of
   cell-free DNA (cfDNA) has emerged as a promising approach in disease
   detection. During cellular apoptosis and necrosis, DNA are fragmented
   and released into the bloodstream, forming cfDNA.[64]^4 These fragments
   carry distinct genetic and epigenetic information derived from their
   respective cell and tissue origins.[65]^5 Notably, a specific fraction
   of cfDNA, referred to as circulating tumor DNA (ctDNA), consists of DNA
   fragments originating from tumor cells that carry genetic and
   epigenetic alterations characteristic of tumor.[66]^6 The strong cancer
   signals due to the abundance of ctDNA contents make EC detectable
   through liquid biopsy. As demonstrated by Bettegowda, EC ranked third
   in terms of the highest proportion of patients with detectable
   ctDNA.[67]^7 To identify ctDNA from non-tumorous cfDNA, targeted
   sequencing has been commonly employed for tumor mutation
   detection.[68]^8^,[69]^9 However, mutation-based approaches using
   targeted panels are constrained by their limited sensitivities.
   Chidambaram et al. reported a pooled sensitivity of only 71% in four
   studies specifically focused on EC diagnosis.[70]^10 Klein et al.
   achieved a sensitivity of 85% using their targeted methylation
   assay.[71]^11 As an alternative, cfDNA fragmentation patterns including
   fragmentation size, which can be captured through whole-genome
   sequencing (WGS), has demonstrated diagnostic potential in previous
   studies.[72]^9^,[73]^12^,[74]^13^,[75]^14

   Recent studies have integrated multiple dimensions of fragmentation
   features, thereby enhancing the predictive capabilities of machine
   learning models for the early detection of cancers. For instance,
   Mathios et al. developed a model that incorporates fragmentation
   features, imaging data, clinical risk factors, and levels of
   carcinoembryonic antigen to differentiate patients with non-small cell
   lung cancer from healthy individuals.[76]^14 Similarly, Cristiano
   et al. examined fragmentation patterns across various cancer types,
   shedding light on screening, early detection, and cancer
   monitoring.[77]^12 Chen et al. have highlighted the utilization of
   genome-wide 5-hydroxymethylcytosine, nucleosome positioning (NP), 5′
   end motif, and fragmentation size profiles, emphasizing the potential
   of comprehensive features for early detection of cancer.[78]^15
   Furthermore, researchers have increasingly employed stacked ensemble
   models, which combine multiple machine learning algorithms, to achieve
   superior results. Wang et al. demonstrated the advantages of their
   cost-effective yet highly sensitive assay for patients with non-small
   cell lung cancer by implementing a stacked ensemble model comprising
   generalized linear model (GLM), gradient boosting machine, random
   forest, deep learning, and XGBoost.[79]^16 Similar approaches have been
   applied to colorectal cancer[80]^17 and breast cancer,[81]^18 both
   achieving superior performance compared to individual algorithms.
   However, further verification is required to assess the applicability
   of stacked ensemble models in patients with ESCC.

   In this study, we developed a stacked ensemble model that leverages
   cfDNA fragmentation for the early detection of ESCC. The model combined
   four fragmentomics features obtained from WGS and employed four machine
   learning algorithms. The performance of the model was sequentially
   evaluated on an external dataset. Additionally, we assessed the model’s
   robustness and repeatability across low coverage and repeated measured
   samples. The results underscore the promising potential of our model as
   an effective strategy for early diagnosis and management of ESCC in
   clinical settings.

Results

Participant characteristics of cohort

   The dataset consists of 499 participants; this includes 207
   participants in the training, 201 individuals in the validation
   ([82]Figure 1A), and 91 individuals in the external cohort
   ([83]Figure 1B). It is worth noting that the three cohorts were
   recruited separately and are independent of each other. The training
   set was then used to train the model exclusively and determine the
   cutoff, which was later applied to validation and external cohorts to
   assess the model’s performance.

Figure 1.

   [84]Figure 1
   [85]Open in a new tab

   Flowchart of cohort design and study method

   (A–C) Schematic diagram.

   The demographic characteristics of participants are comparable as
   displayed in [86]Table S1. The training cohort includes 17 participants
   at stage 0/I (17/102, [16.7%]), 58 participants with tumor size ≤3
   (58/102, [56.9%]), and 14 participants with lowly differentiated tumor
   (14/102, [13.7%]). Similarly, the validation cohort had 15 (15/97,
   [15.5%]) and 56 (56/97, [57.7%]) participants at 0/I stage and tumor
   size ≤3 and 14 participants with lowly differentiated tumor (19/102,
   [19.6%]), respectively. In external cohort, 5 (5/44, [11.4%]) patients
   were diagnosed at stage 0/I. In addition, we observed that 2 (2/199,
   [1.0%]) patients had distant metastatic disease, and 92 (92/199,
   [46.2%]) patients had lymph node involvement in training and
   validation. Among the 209 healthy individuals in training and
   validation cohorts, 48 reported symptoms of dyspepsia, with 5 of them
   also indicating gastroesophageal reflux, and 33 individuals reported
   experiencing loss of appetite. Additionally, 14 participants reported
   dyspepsia and 6 had loss of appetite in the external cohort. Finally,
   participants’ smoking and drinking status is also comparable across
   training and validation cohorts.

Evaluating fragmentomics features and stacked ensemble machine learning
algorithms

   We performed unsupervised principal component analysis with 2 principal
   components (PCs) to test the distinguishability of features such as
   fragmentation size coverage (FSC), fragmentation size distribution
   (FSD), NP, and copy-number variation (CNV) in the training cohort prior
   to training the machine learning algorithm. As demonstrated in
   [87]Figure S1, the features are distinct between healthy individuals
   represented in orange and patients with cancer represented in green by
   visually identifying the two clusters based on their differences in
   centers and spreads. Among the features, PC1 of FSD captures 65.87%
   variations, followed by PC1 of FSC which was responsible for 34.12%
   variations between clusters. We further examined the fragmentation
   profiles of features. Feature profiles were consistent in healthy
   individuals ([88]Figures 2B–2D, 2F; [89]Figure S2), while unstable
   behavior of feature profiles was observed in patients with cancer
   ([90]Figures 2A–2C, 2E; [91]Figure S2). Evident chromosomal arm gains
   are observed in 3q, 6q, 7p, 11q, 12p, 18q, and 19q where values greater
   than 0 indicate copy-number gains and values less than 0 signify
   copy-number losses ([92]Figures 2A and 2B). The long and short
   fragmentation profiles of patients with cancer were distinct from those
   of healthy individuals ([93]Figures 2C–2F). When both short and long
   DNA fragments are aggregated, their respective Z scores are calculated
   such that the mean is zero with Z scores centered around the mean.
   However, this central tendency changes when short and long fragments
   are considered separately. For long fragments, the standardized mean
   converge is around 1 in which values greater than 1 indicate a fragment
   size longer than the average and values less than 1 indicate a size
   shorter than the average for long fragments ([94]Figures 2C and 2D).
   Conversely, the standardized mean for short fragments is approximately
   −1; values greater than −1 indicate a longer fragment size, and values
   less than −1 indicate a shorter fragment size ([95]Figures 2E and 2F).
   Likewise, distinct NP patterns were seen between patients with cancer
   and healthy individuals ([96]Figure 2G), and 154 statistically
   significant transcription factor-binding sites (p < 0.1, Wilcoxon
   signed-rank test) selected from 854 transcription factors were found to
   be statistically associated with 22 pathways ([97]Figure 2H).

Figure 2.

   [98]Figure 2
   [99]Open in a new tab

   cfDNA fragmentation profiles in patients with EC and healthy
   individuals

   (A and B) Genome-wide copy-number variation (CNV) change in patients
   with EC and healthy individuals.

   (C and D) Genome-wide fragmentation profile of long fragments
   (151–220 bp) at 1 Mb resolution.

   (E and F) Genome-wide fragmentation profile of short fragments
   (100–150 bp) at 1 Mb resolution.

   (G) Heatmap of 154 significant nucleosome profiling around sites of
   interest in patients with EC and healthy individuals.

   (H) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment
   analysis of 154 significant nucleosome profiles.

   After identifying the presence of significant variation between
   individuals classified as healthy and those diagnosed with cancer, a
   stacked ensemble model was trained to effectively distinguish between
   the two groups. The machine learning model was constructed integrating
   XGBoost, GLM, distributed random forest (DRF), and deep learning based
   on the four distinct fragmentomic features: CNV, FSC, FSD, and NP
   ([100]Figure 1C). Utilizing the random grid search method for
   hyperparameter exploration allowed us to identify and select the top 10
   best-performing base models for our stacking approach, diverging from
   the traditional practice of using a fixed set of hyperparameters seen
   in previous studies.[101]^17^,[102]^18^,[103]^19 As presented in
   [104]Figure 3A, the model learned and extracted significant patterns
   from training, generating an area under the curve (AUC) of 0.995
   (confidence interval [CI]: 0.990–1.000). Its performance was then
   evaluated using the independent validation set, which consisted of data
   not previously encountered by the model. The predictive model yielded a
   slightly lower AUC in validation (0.986, CI: 0.974–0.998) as expected.
   With an AUC of 0.995, the performance of stacked model was superior to
   all single feature models, demonstrated outstanding robustness
   ([105]Figures 3C and 3D; [106]Tables S2 and [107]S3), and significantly
   differed from CNV and NP ([108]Table S3). The performance achieved
   during the training phase was subsequently utilized to determine the
   cutoff threshold for the validation process, enabling accurate
   classification of patients and healthy individuals. By carefully
   assessing the metrics, a cutoff of 0.6887 at specificity level of 98%
   was selected ([109]Figure 3B). At this specificity level, the model was
   capable of correctly identifying healthy individuals while minimizing
   the occurrence of false positives, which yielded a sensitivity of 91.8%
   (CI: 84.4%–96.4%) and a specificity of 98.1% (CI: 93.2%–99.8%) in the
   validation cohort ([110]Table 1). Among the high-scoring healthy
   individuals, one person was diagnosed with solid thyroid nodule during
   the follow-up period in September 2023. Notably, our stacked ensemble
   model was more sensitive in predicting for patients at later stages,
   and patients had lymph node involvement in the validation cohort
   ([111]Table S4). The results reinforce the validity of our model, as it
   accurately reflects the expected relationship between progression
   status and the strength of the cancer signal. The predictive model
   demonstrates robust generalizability in the external cohort,
   maintaining a strong performance with a slightly lower sensitivity of
   86.4% (CI: 72.6%–94.8%) and a specificity of 95.7% (CI: 85.5%–99.5%)
   ([112]Table 1). A notable trend is observed across different stages
   within the external cohort, indicating an increasing proficiency in
   distinguishing between cancer and non-cancer cases across varying
   stages of disease progression within the external dataset
   ([113]Table S4). We also examined our model’s performance in
   other gastrointestinal (GI) cancers including gastric cancer,
   colorectal cancer, and liver cancer. This retrospective cohort was
   composed of 30 individuals diagnosed with gastric cancer, 30 patients
   with colorectal cancer, and 30 patients with liver cancer
   ([114]Table S5). Despite observing low sensitivities in detecting other
   GI tract cancers with our model, we consider this outcome to be
   expected as our model was designed and trained to identify the
   characteristics distinguishing ESCC from healthy controls, with a
   primary focus on ESCC detection. All patients successfully met the
   criteria set for both sampling and sequencing quality control. The
   model achieved a sensitivity of 26.7% (CI: 12.3%–45.9%), 16.7% (CI:
   5.6%–34.7%), and 10.0% (CI: 2.1%–26.5%) in gastric cancer, liver
   cancer, and colorectal cancer, respectively. These results underscore
   the specificity of our predictive model, which has been finely tuned to
   detect signals specific to ESCC with a high degree of precision.
   Moreover, it suggests that while the model excels within its specific
   domain, its application may not extend with the same level of efficacy
   to other forms of GI cancers.

Figure 3.

   [115]Figure 3
   [116]Open in a new tab

   Evaluation of the predictive model in training and validation cohort

   (A) Receiver operating characteristic (ROC) curve evaluating the
   performance of the stacked ensemble model.

   (B) Boxplot showing the distribution of cancer scores in groups of
   patients with ESCC and healthy control groups. The 98% specificity
   cutoff score of training set is 0.69, represented in dotted line.

   (C) ROC curve evaluating performance of predictive model with all
   features and predictive models with each individual feature.

   (D) Oncoprint of combined feature predictive model and each individual
   model.

Table 1.

   Diagnostic performance of the predictive model in training, validation,
   and external cohorts
   Cancer Healthy
   Training cohort
     __________________________________________________________________

   Predicted
    Cancer 96 2
    Healthy 6 103
   Sensitivity (95% CI) 0.94 (0.88, 0.98)
   Specificity (95% CI) 0.98 (0.93, 1.00)
   Positive predictive value (95% CI) 0.98 (0.93, 1.00)
   Negative predictive value (95% CI) 0.94 (0.88, 0.98)
   Accuracy (95% CI) 0.97 (0.93, 0.99)
     __________________________________________________________________

   Validation cohort
     __________________________________________________________________

   Predicted
    Cancer 89 2
    Healthy 8 102
   Sensitivity (95% CI) 0.92 (0.84, 0.96)
   Specificity (95% CI) 0.98 (0.93, 1.00)
   Positive predictive value (95% CI) 0.98 (0.92, 0.97)
   Negative predictive value (95% CI) 0.93 (0.86, 0.97)
   Accuracy (95% CI) 0.95 (0.91, 0.98)
     __________________________________________________________________

   External cohort
     __________________________________________________________________

   Predicted
    Cancer 38 2
    Healthy 6 45
   Sensitivity (95% CI) 0.86 (0.73, 0.95)
   Specificity (95% CI) 0.96 (0.85, 0.99)
   Positive predictive value (95% CI) 0.95 (0.83, 0.99)
   Negative predictive value (95% CI) 0.88 (0.83, 0.99)
   Accuracy (95% CI) 0.91 (0.83, 0.96)
   [117]Open in a new tab

Stability and robustness assessments of the predictive model

   To assess the robustness of our approach, we performed down-sampling on
   the sequences, reducing their coverage depth to 4X, 3X, 2X, 1X, and
   0.5X. This evaluation is crucial as it simulates scenarios where
   sequencing data may be limited or incomplete, thereby testing the
   robustness and generalizability of our approach in real-world settings.
   By decreasing the coverage depth, we examined the performance and
   reliability of our methodology under conditions of shallow sequencing
   depth. The results demonstrated exceptional performance, yielding an
   AUC over 98.8% (CI: 97.7%–99.9%) when operating at 2X coverage depth
   and AUC above 97% at 1X and 0.5X sequencing depth ([118]Figure 4A). The
   sensitivities of coverage depth at 4X, 3X, 2X, 1X, and 0.5X were 96.7%
   (CI: 90.8%, 99.3%), 94.7% (CI: 88.1%, 98.3%), 91.7% (CI: 84.2%, 96.3%),
   91.8% (CI: 84.4%, 96.4%), and 87.6% (CI: 79.4%, 93.4%), respectively
   ([119]Table S6). Furthermore, the results showed a modest decrease
   trend in AUC as the coverage depth reduced. These results provided
   valuable insights into the resilience and stability of our methodology
   and ascertained its practical applicability in scenarios with limited
   sequencing resources or constraints.

Figure 4.

   [120]Figure 4
   [121]Open in a new tab

   Analytical validity assessment of the predictive model

   (A) ROC curve evaluating the robustness of the predictive model in
   different coverage depth samples.

   (B) Boxplot of five healthy controls repeatedly measured before and
   after meal and before and after exercise in one day. The 98%
   specificity cutoff score is based on the training cohort, represented
   in dotted line.

   (C) Boxplot of within-run and between-run tests of two patients with
   cancer and three healthy controls in order to evaluate the
   repeatability and reproducibility of the model. The 98% specificity
   cutoff score is based on the training cohort, represented in dotted
   line.

   (D) Boxplot of within-run and between-run tests of three patients with
   cancer and three healthy controls from external. The 98% specificity
   cutoff score is based on the training cohort, represented in dotted
   line.

   We performed between-sample and within-sample runs using five patients
   with cancer and six healthy patients to examine the model’s ability to
   consistently reproduce similar results undergoing technical repetition
   and batch repetition. The selection of patients with cancer and healthy
   controls was conducted through a random sampling process from the
   validation dataset and external dataset to mitigate any potential bias
   and to provide a representative subset that aligns with the overall
   characteristics of the cohort under investigation. As illustrated in
   [122]Figure 4C, the repeatability analysis revealed highly consistent
   results across the replicates by accurately classifying the patients
   with cancer as positive and the healthy patients as negative in the
   validation cohort. Within-sample replicates exhibited minimal variation
   (p > 0.05; Wilcox signed-rank test), indicating the stability of the
   extraction and analysis procedures. Additionally, 3 patients with ESCC
   and 3 healthy participants were randomly selected from the external
   cohort. The model exhibited accurate discrimination between
   participants with cancer and healthy participants, and no statistically
   significant differences were observed within the batches (p > 0.05;
   Wilcox signed-rank test) ([123]Figure 4D).

   Overall, the examination of the model’s repeatability highlights its
   capacity to generate consistent predictions and reinforces its utility
   as a dependable and reproducible tool for classification.

   To further evaluate the robustness of our model, we conducted a
   comprehensive examination involving 5 healthy individuals at different
   times of the day and under varying diet and exercise conditions.
   Samples were collected multiple times before and after meal or exercise
   throughout the day to capture the impact of different conditions and
   time points. As indicated by [124]Figure 4B (scores details shown in
   [125]Table S7), our predictive model could make an accurate
   classification with a minor variance.

   The results revealed that regardless of the time of day or the diet and
   exercise conditions, the model’s predictions remained reliable and
   aligned with the expected outcomes.

Performance of the predictive model in subgroups

   Additionally, we explored the performance of our predictive model in
   different subgroups to delve deeper into its predictive capabilities by
   considering key factors such as progression stage, tumor
   differentiation, and tumor size. [126]Figure 5 exhibited our model’s
   ability to generate stable and reliable predictions within each
   subgroup. It is worth mentioning that the results showcased the
   remarkable predictive power of our model, achieving high sensitivities
   across differentiation grades (low: 94.7% [CI, 74.0%–99.9%];
   low-medium: 90.0% [CI, 73.5%–97.9%]; medium: 96.0% [CI, 79.6%–99.9%];
   medium-high: 94.4% [CI, 72.7%–99.9%]; high: 100% [CI, 29.2%–100%]), and
   tumor sizes (≤3 cm: 91.1% [CI, 80.4%–97.0%]; > 3 cm: 92.7% [CI,
   80.1%–98.5%]) in validation ([127]Table S4). A decreasing variance of
   scores was observed as cancer stage progressed from stage I to stage
   III ([128]Figure 5C). These results highlight the model’s ability to
   effectively capture and discriminate between different subpopulations,
   further reinforcing its potential as a reliable tool for precise
   prediction within a variety of contexts.

Figure 5.

   [129]Figure 5
   [130]Open in a new tab

   Diagnostic sensitivities of the predictive model in different subgroups
   of patients with ESCC

   (A) Dot plot with error bar of patients with ESCC >65-year-old and
   ≤65-year-old. The 98% specificity cutoff score is 0.69.

   (B) Dot plot with error bars of female and male patients with ESCC.

   (C) Dot plot with error bars of patients with ESCC across stage 0/I to
   stage IV.

   (D) Dot plot with error bars of patients with ESCC in five
   differentiation grades.

   (E) Dot plot with error bars of patients with ESCC with tumor size
   >3 cm and ≤3 cm.

Benefit of the model in stage shift

   By effectively identifying cancer cases at an earlier stage, our model
   has the potential to revolutionize the standard of care and
   significantly improve survival rates. As presented in [131]Figure 6A,
   prior to utilizing our model, 50.5% of patients with ESCC were
   diagnosed at stage III and stage IV, with no more than 19% of patients
   detected at stage I. This led to 59% of death rate in 5 years. With the
   integration of our model, a notable increase in the number of patients
   diagnosed at early stages has been observed, and 99% of patients with
   ESCC could be recognized at stage I under ideal clinical conditions. As
   a result, the survival rate could be increased by 20%. However, it is
   essential to note that the outcomes are contingent upon the actual
   compliance rate observed in real-world scenarios. As the compliance
   rate varied across different thresholds, specifically at 85%, 70%, and
   55%, our model showed a decrease in the identification of patients at
   stage I, accompanied by an increase in the recognition of patients at
   later stages ([132]Figure 6B). This shift in diagnostic stage
   highlights the effectiveness of our model in enabling timely detection
   and improving patient outcomes through early intervention and
   treatment.

Figure 6.

   [133]Figure 6
   [134]Open in a new tab

   Modeled changes in stage-specific EC incidence

   (A) Sankey diagram showing cancer stage shift after intercepted by the
   predictive model. Percentage of incidence in each condition is shown in
   parentheses.

   (B) Pie charts showcasing the number of patients with ESCC diagnosed at
   different stages at 85%, 70%, and 55% compliance rate.

Discussion

   This study aimed to develop a stacked ensemble machine learning
   algorithm that combines cfDNA fragmentomic features for the
   classification of patients with ESCC and non-cancer individuals, with
   the goal of improving early detection of ESCC.

   The proposed assay incorporates cfDNA fragmentomic features and
   integrates them into a stacked ensemble algorithm. This integration is
   made possible by leveraging the non-random cleavage and fragmentation
   process of cfDNA. Certain genomic regions exhibit preferred cleavage
   patterns that are associated with specific conditions, such as the
   tissue origins and disease status of the cfDNA samples. These preferred
   cleavage patterns arise due to factors like chromatin accessibility and
   nuclease activities and are more prevalent and easily detectable
   compared to mutations.[135]^20^,[136]^21^,[137]^22 While previous
   studies have explored cfDNA somatic mutations and methylation for EC
   classification, the results have shown limited sensitivities. For
   instance, Ueda et al. identified four somatic mutations that achieved a
   sensitivity of 78.9% in distinguishing patients with EC.[138]^23
   Similarly, Qiao et al. employed cfDNA methylation profiles and observed
   a sensitivity of 74.4% with detection bias in stage 0–II patients
   (sensitivity, 58.8%).[139]^24 Therefore, our exceptional performance in
   this study highlights the promising role of cfDNA fragmentomics in
   cancer detection, surpassing the limitations observed in previous
   investigations. Furthermore, recent studies have demonstrated the
   potential of DNA from red blood cells as a diagnostic marker.[140]^25
   This could present an intriguing avenue for future research.

   Our stacked ensemble model demonstrates its applicability for early
   detection of ESCC in a clinical setting. Through independent
   validation, our stacked model achieved a sensitivity of 91.8% and a
   specificity of 98.1% ([141]Table 1), highlighting its superior
   predictive capacity. The stacked model, with an AUC of 0.995,
   outperformed all single feature models, and the performance of the
   stacked model significantly varied from that observed with CNV and NP
   features. Even though individual models based on FSC and FSD perform
   well on the validation set, they may be susceptible to overfitting
   particular characteristics of the training data. Consequently, the
   ensemble model, which combines or weights outputs from multiple models,
   generally offers better generalization to unseen data. The predictions
   of the model remained consistently high across different stages (stage
   I to IV) and demonstrated sensitivity toward early-stage
   characteristics. Specifically, the model achieved a sensitivity of
   88.9% in patients with highly differentiated cells and a sensitivity of
   91.4% in patients with tumor size less than 3 cm ([142]Table S4). It is
   also worthy of mentioning that our predictive model yielded the highest
   AUC when benchmarked with other published
   models[143]^12^,[144]^14^,[145]^17^,[146]^26^,[147]^27 ([148]Table S8).
   The three most important CNV features were found in regions on
   chromosomes 1, 9, and 20, with chromosome 1 showing significant
   amplifications in 184 tumors analyzed by GISTIC ([149]Table S9). The
   most notable FSD is noted for fragments 105–109 bp on chromosome 4q,
   175–179 bp on chromosome 6p, and 110–114 bp on chromosome 5q
   ([150]Table S9). However, research in the area of DNA fragmentomics is
   still evolving and studies focusing on the significance of specific
   fragment size bins are limited, leading to a gap in understanding their
   exact role in disease processes. Significant NP, like PRDM9, GFI1B, and
   CEBPZ, has shown overexpression in cancer samples in The Cancer Genome
   Atlas database, with PRDM9 linked to genomic instability
   ([151]Table S9). Recently, several studies have developed assays for
   the detection of an episodic modification 2′-O-methylation on different
   RNAs, revealing the role of this modification in tumor
   development[152]^28 Based on the findings of this study, further
   exploration of the synergistic relationship between EC segment
   characteristics and changes in the corresponding gene modifications may
   provide an alternative perspective in cancer research.

   Furthermore, stage shift analysis revealed that a significant
   proportion of patients would be diagnosed at stage I following
   intervention, and it was observed that 20% of patients would experience
   a positive outcome within a 5-year period under ideal clinical setting
   ([153]Figure 6). These findings underscore the model’s efficacy in the
   early diagnosis of ESCC. It is noteworthy that our model exhibited
   robustness across different diet and exercise conditions in healthy
   participants, as well as at shallow coverage depth. Notably, even at a
   coverage depth of 0.5X, our model achieved an impressive area AUC of
   0.978 ([154]Figure 4A). This robust performance ensures cost efficiency
   and given the high AUC at 0.5X coverage depth, there is potential for
   further reduction in sequencing costs by using coverage depths below
   0.5X. Under a clinical setting, our stacked ensemble model is
   recommended for elevated-risk individuals. For those have a positive
   screening result, an endoscopy and biopsy are recommended for
   diagnosis. For individuals with a negative early screening result or
   those with a positive early screening result but a negative biopsy
   result, it is recommended to pursue further diagnostic tests, including
   computed tomography scans and colonoscopy that comprehensively assess
   the potential presence of other GI tract cancers. Moreover, it is
   advisable to continue with annual screenings for ESCC using the early
   detection model, to maintain vigilant monitoring and ensure timely
   detection.

   This study has some constraints that must be acknowledged in the
   interpretation of the findings. Firstly, it is important to recognize
   that the cohort utilized in this study exclusively comprises
   individuals from the Chinese population, thereby potentially limiting
   the generalizability of the resulting model to other populations.
   Moreover, the sample size employed in this study is relatively limited,
   and certain subgroups exhibit uneven distribution, which may introduce
   potential biases and compromise the accuracy of the results in these
   specific subgroups. Finally, the model may show slight signs of
   overfitting, as evidenced by a subtle decrease in the performance
   observed in our external cohort results. In our subsequent research
   endeavors, we aim to expand the sample size and diversify the
   subgroups, while also conducting prospective studies at multiple
   hospitals to evaluate the feasibility of our model in real-world ESCC
   early detection scenarios. We also advocate for additional independent
   studies to verify the generalizability and robustness of our model
   across diverse patient populations prior to its integration into future
   clinical practice.

   In summary, we have developed a highly sensitive stacked ensemble model
   by integrating four fragmentomic features. Our method stems from
   identifying the most effective base models through extensively
   exploring various hyperparameter settings and subsequently integrating
   these models to develop the optimal final model. Moreover, we have
   demonstrated the robustness and repeatability of our assay, even at
   shallow coverage depth and across various conditions. As a result, our
   model presents an accurate and cost-effective approach for the early
   detection of ESCC, thus offering valuable insights into early cancer
   diagnosis in clinical settings.

Limitations of the study

   This study has several limitations that should be considered when
   interpreting the results. Firstly, it is critical to acknowledge that
   this study’s cohort is entirely made up of individuals from the Chinese
   population, which might restrict the model’s applicability to other
   demographic groups. Additionally, the study employs a relatively small
   sample size, with some subgroups not evenly represented, potentially
   leading to biases and affecting the precision of the findings within
   these groups. Lastly, there are indications of possible overfitting in
   the model, as suggested by a minor reduction in performance in the
   external cohort results.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Critical commercial assays
     __________________________________________________________________

   QIAamp Circulating Nucleic Acid Kit QIAGEN Cat# 55114
   Qubit dsDNA HS Assay Kit Thermo Fisher Scientific Cat# [155]Q32854
   KAPA HyperPrep Kit Roche Cat# KK8504
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   Human plasma samples of ESCC patients and healthy controls This paper
   HRA007522
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   Stacked Ensemble Model for ESCC Prediction This paper GitHub:
   [156]https://github.com/cancer-oncogenomics/cfDNA_fragmentomics_ESCC_pr
   ediction (Zenodo: [157]https://doi.org/10.5281/zenodo.11658224)
   R (v. 4.0.3) CRAN [158]https://www.r-project.org/
   Trimmomatic Bolger et al.[159]^29 GitHub:
   [160]https://github.com/usadellab/Trimmomatic
   Picard Broadinstitute GitHub:
   [161]http://broadinstitute.github.io/picard/
   BWA Li H.[162]^30 GitHub: [163]https://github.com/lh3/bwa
   ichorCNA Adalsteinsson et al.[164]^31 GitHub:
   [165]https://github.com/broadinstitute/ichorCNA
   gsignal Short et al. GitHub:
   [166]https://github.com/gjmvanboxtel/gsignal
   H2O AutoML H2O.ai GitHub: [167]https://github.com/h2oai/h2o-3
   Stage Shift Hubbell et al.[168]^32 GitHub:
   [169]https://github.com/grailbio-publications/Hubbell_CEBP_Interception
   Model
   pROC (v. 1.17.0.1) Robin et al.[170]^33 GitHub:
   [171]https://xrobin.github.io/pROC/
   ggplot2 Wickham et al.[172]^34 GitHub:
   [173]https://github.com/tidyverse/ggplot2
   [174]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the lead contact, Tao Wang
   (wangtao_pumc@live.cn).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       The raw sequence data reported in this paper have been deposited in
       the Genome Sequence Archive (GSA) (Genomics, Proteomics &
       Bioinformatics 2021) in National Genomics Data Center (Nucleic
       Acids Res 2021), China National Center for Bioinformation/Beijing
       Institute of Genomics, Chinese Academy of Sciences, under the
       accession number listed in the [175]key resources table. These data
       from this study are available under restricted access due to
       patient privacy concerns. To request access to the data, please
       contact the [176]lead contact, and submit a request for access at
       [177]https://ngdc.cncb.ac.cn/gsa-human/browse/HRA007522. For
       detailed guidance on making the data access request, see
       GSA-Human_Request_Guide_for_Users
       ([178]https://ngdc.cncb.ac.cn/gsa-human/document/GSA-Human_Request_
       Guide_for_Users_us.pdf). Upon receiving a request, the [179]lead
       contact reviews the request and, if approved, notifies GSA to
       release the data to the requestor. The requestor must describe the
       objectives of the research project for which the data will be used.
       Data access will be considered for research purposes and
       non-commercial use only. In order to ensure patient privacy, access
       to personally identifiable information or sensitive clinical
       information will not be provided, and requests for data access must
       rigorously adhere to the consent agreements established with study
       participants.
     * •
       All code has been deposited at Zenodo and is publicly available as
       of the date of publication. DOIs are listed in the [180]key
       resources table.
     * •
       Any additional information required to reanalyze the data reported
       in this paper is available from the [181]lead contact upon request.

Experimental model and study participant details

Participant enrollment

   This study included a total of 408 participants used for model training
   and validation ([182]Figure 1A) and 91 participants as external cohort
   for further validation of model ([183]Figure 1B). The independence of
   three cohorts is indicated by distinct recruitment timelines and
   locations. Specifically, the training cohort was collected between July
   2021 and March 2022 and the validation cohort recruitment occurred from
   July 2022 to May 2023 at Nanjing Drum Hospital. The recruitment for the
   external dataset began in June 2023 and terminated in December 2023 at
   a The Second Affiliated Hospital of Nanchang University. The training
   and validation cohorts consisted of healthy volunteers (n = 209) and
   previously untreated Esophageal Squamous Cell Carcinoma (ESCC) patients
   (n = 199) spanning from stage I to IV. The external cohort is composed
   of 47 healthy individuals and 44 ESCC patients.

   In the training cohort, 18 ESCC patients were excluded due to
   unconfirmed stage information, treatment history sample processing QC
   failure and sequencing QC failure, and 16 healthy individuals were
   excluded because of their history of cancer, acute inflammation, sample
   processing and sequencing QC failure. The ensemble stacked model was
   subsequently constructed with the resulting 102 ESCC samples and 105
   healthy samples. The validation cohort began sample collection only
   after the construction and training of our model. The exclusion process
   involved removing: 27 ineligible ESCC patients without confirmed stage
   information, treatment history and QC failure during sample processing
   or sequencing; 19 healthy individuals had acute inflammatory response,
   failed sample processing QC or sequencing QC, and withdrew from the
   study. The resulted validation cohort contains 97 ESCC patients and 104
   healthy individuals. Finally, 6 ESCC patients and 3 healthy
   participants were excluded in external cohort.

   The cancer and noncancer cohorts were matched for gender and age in all
   cohorts (Wilcoxon signed-rank test, Fisher test; p > 0.05 for all
   groups). The patients were mainly diagnosed in late stage, which
   reflects the challenges associated with the late detection of ESCC in
   real-world scenario, making it difficult to predominantly obtain
   early-stage cases. The findings were shared exclusively to the medical
   practitioners involved, with no disclosure to the patients, given that
   this study does not involve direct alterations to medical decisions or
   interventions with individual patients. This study was approved by the
   Ethics Committee of the Nanjing Drum Tower Hospital affiliated to
   Nanjing University School of Medicine, on the use of human samples for
   experimental studies. It adhered to the Declaration of Helsinki. All
   patients provided written informed consent prior to sample collection.

Method details

cfDNA extraction and WGS

   Venous blood samples were obtained during routine physical checks for
   healthy volunteers and preoperatively for cancer patients. Blood
   samples were collected upon patients’ admission to the hospital, which
   was after diagnosis of esophageal cancer through endoscopic biopsy
   staining and prior to the performance of surgery. All blood sampling
   occurred within a time frame of 1–3 weeks after diagnosis. The final
   pathological staging was determined after the surgical procedure. This
   standardized approach was consistent across all recruited ESCC patients
   to minimize variations in the time interval. The collection,
   processing, and sequencing of blood samples from all participants
   followed the same protocols. A venous full blood sample (8-10mL) from
   each participant was collected using an EDTA tube at Nanjing Drum Tower
   Hospital (Nanjing, China) or The Second Affiliated Hospital of Nanchang
   University (Nanchang, China). The plasma underwent a secondary
   centrifugation step at 18,000g to remove residual cellular debris in
   the hospital within 6 h of collection, and was subsequently stored at
   −80°C. The plasma sample was then shipped with dry ice to the central
   laboratory of Nanjing Geneseeq Technology Inc. (Nanjing, China), which
   is qualified by College of American Pathologists (CAP), Clinical
   Laboratory Improvement Amendments (CLIA) and ISO 15189. The following
   cfDNA extraction and sequencing processed were performed within 24 h
   cfDNA extraction from plasma was carried out using the QIAamp
   Circulating Nucleic Acid Kit (Qiagen) following the manufacturer’s
   instructions. The concentration and quality of cfDNA were assessed
   using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) in
   accordance with the provided guidelines. For each sample, 5–10 ng of
   plasma cfDNA was subjected to PCR-free WGS library construction using
   the KAPA HyperPrep Kit (Roche). 8 Samples (failure rate = 1.64%) with
   plasma cfDNA less than 5ng were excluded. For samples with extracted
   plasma cfDNA exceeding 10 ng, a maximum of 10 ng cfDNA was utilized for
   the construction of the whole-genome sequencing (WGS) library.

   The libraries were subjected to paired-end sequencing on the
   NovaSeq6000 platform (Illumina). Initial trimming of FASTQ file was
   conducted using Trimmomatic as part of the quality control (QC)
   protocol. Following the removal of PCR duplicates by the Picard toolkit
   ([184]http://broadinstitute.github.io/picard/), the qualified reads
   were subsequently aligned to the human reference genome (GRCh37/UCSC
   hg19) using the sequence aligner BWA with a mean sequencing depth of
   10.92× (range 5.35×-24.81×). The sequencing QC includes: Q30, GC
   content, proportion of reads aligned to reference genome, median insert
   and mean depth. All participants who were excluded from the study due
   to failing sequencing QC had a mean depth of less than 5×. To minimize
   bias, the sample operating team remained blinded to the case/control
   status of the samples throughout the entire process.

   To address variations in WGS depths among samples (range 5.35×-24.81×),
   we performed down-sampling to a uniform depth of 5× for model
   optimization. Thereby, the potential coverage differences were
   mitigated, and the maximum sample inclusion was kept. The optimized
   model was subsequently validated using raw WGS data with varying
   sequencing depths or down-sampling to 4×, 3×, 2×, 1×, and 0.5× depths
   using Picard toolkit, enabling a comprehensive evaluation of its
   performance.

Bioinformatic analysis and modeling

   For feature selection and model construction, we extracted four
   fragmentomic features from the WGS data. These features encompassed
   copy number variation (CNV), fragmentation size coverage (FSC),
   fragmentation size distribution (FSD), and nucleosome profiling (NP).
   Copy number variation (CNV) refers to the gains or losses of
   chromosomal region > 1kb, which could significantly associate with
   cancer when the amplifications or deletions occur on oncogenes or tumor
   suppressor genes. Fragmentation Size Coverage (FSC) and Fragmentation
   Size Distribution (FSD) capture the characteristic of the size of
   fragments. Finally, Nucleosome Profiling (NP) analyzes for chromatin
   structure and implies the gene expression level. By combining these
   features extracted from cfDNA, we could potentially capture the
   differences between healthy individuals and cancer patients.

   Copy number variation (CNV) feature extraction was adapted from Wan
   et al.[185]^35 using ichorCNA developed by them. The genome was tiled
   in to 1Mb disjoint bins, resulting in 2475 bins in total, with respect
   to the reference genome (GRCh37/UCSC hg19). The coverage depth of each
   bin was corrected by GC content and compared against the software
   baseline. The log2 ratio was then computed by ichorCNA for each bin.
   The procedure for extracting Fragmentation Size Coverage (FSC) was
   adapted from the approach introduced by D. Mathios et al.[186]^14 The
   genome was divided into 5 Mb non-overlapping bins, resulting in a total
   of 541 bins. Subsequently, reads with lengths less than 100 bp or
   greater than 220 bp were filtered out, the short (100–150 bp) and long
   fragments (151–220 bp) were standardized and transformed into z-scores,
   and the ratio of short fragments to long fragments was computed for
   each bin. The extraction of Fragmentation Size Distribution (FSD) was
   derived from S. Su et al.[187]^36 FSD captures the coverage of
   100-220bp cfDNA fragments in 5 bp increments across 39 chromosome arms.
   Notably, the short arms of 5 acrocentric chromosomes were excluded from
   the analysis due to largely unknown sequence,[188]^37 and two arms of
   the X chromosome were excluded due to a clear distinction observed in
   males and females. The raw coverages were then converted into z-scores,
   resulting in a total of 936 features across the 39 chromosomes. We
   adopted the methodology outlined by Doebley et al.[189]^38 for the
   extraction of Nucleosome Profiling (NP) which includes the selection of
   transcription factor binding sites (TFBS) and profiling associated
   nucleosome coverage patterns. Transcription factors (TFs) with over
   10,000 highly mappable sites were identified from GTRD database
   (v 21.12, [190]https://gtrd.biouml.org/downloads/21.12/chip-seq/), and
   854 TFs remained after excluding TFs beyond the list of known binding
   sites in the CIS-BP database (v2.00,
   [191]http://cisbp.ccbr.utoronto.ca/bulk.php). 10,000 mappable sites
   with the highest peak counts were selected for each TF. To profile
   nucleosome coverage, reads between 100 and 220bp around each TF binding
   site were extracted in a window (-5kbp to +5kbp) of each binding site
   and weighted by GC bias. Weights of the reads’ midpoints falling within
   each 15bp bin were summed as GC-corrected midpoint coverage profiles.
   The averaged GC-corrected midpoint coverage profile was normalized for
   each TF, and the resulting curve was smoothed using a Savitzky-Golay
   filter (150bp window, polynomial order = 3) to produce the final TF
   coverage profile. Three quantified features were extracted for each of
   the 854 TF: 1) mean coverage of the region flanking 1kbp upstream and
   1kbp downstream of the center; 2) coverage at the center; and 3)
   amplitude of coverage peaks surrounding the center through Fast Fourier
   Transform.

   The input data for CNV, FSC, FSD and NP consists of 2475, 1082, 936 and
   2562 (854∗3) columns, respectively, with each column represents
   distinct characteristics of the feature. Notably, all available
   characteristics were incorporated into the model and no feature
   selection was applied. To construct the models, each fragmentomic
   feature served as the basis for building individual base models using
   four different algorithms: generalized linear model (GLM), distributed
   random forest, deep learning, and XGBoost. A fully connected
   multi-layer artificial neural network was selected as our deep learning
   approach to capture the intricate and non-linear patterns present in
   the data. These algorithms generated a cancer score ranging from 0 to 1
   for each cancer and noncancer sample, with a score closer to 1
   indicating a higher probability of cancer. Subsequently, the cancer
   probability scores from all algorithms were aggregated and their means
   was calculated to create an ensemble model. The model training process
   exclusively utilized the H2O AutoML package.[192]^39 Specifically, the
   determination of hyperparameters was carried out by the H2O AutoML
   algorithm, employing a random grid search approach. In this process,
   various hyperparameters were randomly selected and applied to each
   model. Each model had a distinct set of hyperparameters to explore.
   Notably, for GLM, which has a non-default value set for a
   hyperparameter, AutoML conducted an internal grid search. It passed a
   list of alpha parameters to build a model, searching for the optimal
   combination of alpha-lambda hyperparameters. This optimized combination
   was then applied to construct the GLM model. During the grid search of
   XGBoost, AutoML explored values for specific hyperparameters, including
   booster, col_sample_rate, col_sample_rate_per_tree, max_depth,
   min_rows, ntrees, reg_alpha, reg_lambda, and sample_rate. For the deep
   learning model, AutoML selected values from a predefined list for
   parameters for following hyperparameters: activation, epochs, epsilon,
   hidden, hidden_dropout_ratios, input_dropout_ratio, and rho. Following
   the training process, a Leaderboard of trained models was generated. To
   identify the optimal model, AUC was specified as the sorting metric.
   The final score was determined by averaging the top 10 models from the
   Leaderboard.

   To prevent overfitting of the model, 5-fold cross-validation was
   employed based on the training dataset for both the individual base
   model and the ensemble model. The occurrence of overfitting can be
   detected by using this recognized approach which partitioning training
   dataset into 5 sets, iterate training process in the 4-folds of it and
   validate model on the remining 1-fold of the data. Additionally,
   runtime restriction, ensemble techniques and the incorporation of an
   external cohort were implemented to enhance model generalizability. The
   prediction performance of these models was evaluated using area under
   the curve (AUC) values in the validation cohorts, providing a
   comprehensive assessment of their effectiveness.

   The training of the model was entirely executed on training set, while
   keeping the validation set untouched during training process. After
   finalizing the parameters using the training cohort, the model was
   applied to the validation cohort to generate cancer prediction scores
   for each sample and evaluate the performance of the model. The cut-off
   score was determined at 98% specificity of the training cohort.

Model analytical validity assessment

   To access the efficacy of the stacked ensemble model, individual models
   were developed for each fragmentomic feature. Subsequently, a cancer
   score between 0 and 1 was predicted for each sample, and the models’
   performance was evaluated by examining AUC values.

   To evaluate the model’s repeatability and reproducibility, a total of
   eleven participants: three cancer patients and two healthy individuals
   were randomly selected from validation cohort, three cancer patients
   and three healthy individuals from the external cohort, to perform
   within-run and between-run tests. In the within-run tests,
   approximately 10 mL of venous blood sample was collected from each
   participant for plasma preparation (batch 1). Three technical
   replicates were created for each participant by equally dividing their
   blood sample into three. These replicates underwent cfDNA extraction,
   library preparation, and sequencing by a single group of lab
   technologists. To perform between-run tests, a second round of blood
   sample collection was conducted three weeks later on the participants
   (batch 2). The same procedures were followed but performed by a
   different group of lab technologists. The within-run and between-run
   tests yielded a total of 66 samples. From these samples, fragmentomic
   features were extracted and analyzed by the predictive model to obtain
   cancer scores following the assay described previously. The
   repeatability of the model was examined by comparing the results of
   each participant within the same batch, while the reproducibility was
   assessed by comparing the results between batch 1 and batch 2.

   In order to determine the model’s robustness, we evaluated the model’s
   sensitivity to cell free DNA (cfDNA) fractions with low coverage.
   Plasma samples from validation cohort were subjected to WGS and were
   randomly subsampled to 4×, 3×, 2×, 1×, and 0.5× fold coverage. These
   fractions were subsequently inputted into the predictive model to
   obtain cancer scores. Detection sensitivity and AUC value were used to
   quantify and evaluate the models’ performance. The cut-off determined
   by the training cohort at 98% specificity was employed to determine the
   cancer/noncancer status of these samples.

Stage shift analysis

   A stage shift analysis was performed using the stage shift model
   developed by Hubbell et al.[193]^32 This model was utilized to predict
   the diagnostic stage shift resulting from the application of the early
   detection model, in comparison to receiving standard care and
   screening. The standard care and screening data was based on China’s
   population cancer incidence and mortality data.

Quantification and statistical analyses

   We utilized the pROC package (v. 1.17.0.1) in R (v. 4.0.3) to generate
   receiver operating characteristic curves for the statistical analysis.
   Sensitivity (TP/[TP + FN]), specificity (TN/[TN + FP]), accuracy ([TP +
   TN]/[TP + FP + TN + FN]), along with their corresponding 95% confidence
   interval (CI), were calculated based on the true positive (TP), true
   negative (TN), false positive (FP), and false negative (FN) values of
   cancer prediction. To invesgtigate the differenciatiability of
   features, Principal Component Analysis (PCA) was implemented employing
   built-in stats (3.6.2) package in R. Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway analysis was conducted using the DAVID
   Bioinformatics Resources[194]^40 to further gain insights into the
   functional pathways and biological processes associated with distinct
   NP.

Acknowledgments