Graphical abstract graphic file with name fx1.jpg [59]Open in a new tab Highlights * • An assay employs machine learning models for early detection of esophageal cancer * • Integrates 4 cfDNA fragmentomics features extracted from whole-genome sequencing * • Achieves consistent sensitivity and specificity across low-coverage sequencing * • Shows potential in the application of cfDNA fragmentomics in cancer diagnostics __________________________________________________________________ The utilization of non-invasive, liquid biopsy-based analysis of cfDNA has emerged as a promising approach in disease detection. By integrating features derived from whole-genome sequencing including fragmentomics profiles, copy-number variation, and nucleosome positioning, Jiao et al. develop a robust model capable of accurately detecting esophageal squamous cell carcinoma. Introduction Esophageal cancer (EC) is a prevalent malignancy worldwide, with its incidence rate having increased by 6-fold from 1999 to 2008.[60]^1 Projections by Liu et al. indicate that the incidence rate is expected to continue rising over the next two decades and beyond.[61]^2 EC is recognized as one of the most aggressive malignancies globally and ranks as the sixth leading cause of cancer-related mortality.[62]^3 The prognosis for EC is generally poor, with a 5-year survival rate ranging from 15% to 25%.[63]^3 Squamous cell carcinoma (SCC) is the most predominant subtype of EC worldwide, arising from the squamous cells lining in the upper esophagus. Risk factors associated with the development of esophageal SCC (ESCC) include heavy tobacco and alcohol use, daily consumptions of hot foods, etc. The diagnosis of ESCC is often delayed due to the absence of early symptoms. Symptoms such as dysphagia, weight loss, and heartburn only show in the later stages of cancer development. Therefore, there is a pressing demand for accurate and effective early detection interventions. The utilization of non-invasive, liquid biopsy-based analysis of cell-free DNA (cfDNA) has emerged as a promising approach in disease detection. During cellular apoptosis and necrosis, DNA are fragmented and released into the bloodstream, forming cfDNA.[64]^4 These fragments carry distinct genetic and epigenetic information derived from their respective cell and tissue origins.[65]^5 Notably, a specific fraction of cfDNA, referred to as circulating tumor DNA (ctDNA), consists of DNA fragments originating from tumor cells that carry genetic and epigenetic alterations characteristic of tumor.[66]^6 The strong cancer signals due to the abundance of ctDNA contents make EC detectable through liquid biopsy. As demonstrated by Bettegowda, EC ranked third in terms of the highest proportion of patients with detectable ctDNA.[67]^7 To identify ctDNA from non-tumorous cfDNA, targeted sequencing has been commonly employed for tumor mutation detection.[68]^8^,[69]^9 However, mutation-based approaches using targeted panels are constrained by their limited sensitivities. Chidambaram et al. reported a pooled sensitivity of only 71% in four studies specifically focused on EC diagnosis.[70]^10 Klein et al. achieved a sensitivity of 85% using their targeted methylation assay.[71]^11 As an alternative, cfDNA fragmentation patterns including fragmentation size, which can be captured through whole-genome sequencing (WGS), has demonstrated diagnostic potential in previous studies.[72]^9^,[73]^12^,[74]^13^,[75]^14 Recent studies have integrated multiple dimensions of fragmentation features, thereby enhancing the predictive capabilities of machine learning models for the early detection of cancers. For instance, Mathios et al. developed a model that incorporates fragmentation features, imaging data, clinical risk factors, and levels of carcinoembryonic antigen to differentiate patients with non-small cell lung cancer from healthy individuals.[76]^14 Similarly, Cristiano et al. examined fragmentation patterns across various cancer types, shedding light on screening, early detection, and cancer monitoring.[77]^12 Chen et al. have highlighted the utilization of genome-wide 5-hydroxymethylcytosine, nucleosome positioning (NP), 5′ end motif, and fragmentation size profiles, emphasizing the potential of comprehensive features for early detection of cancer.[78]^15 Furthermore, researchers have increasingly employed stacked ensemble models, which combine multiple machine learning algorithms, to achieve superior results. Wang et al. demonstrated the advantages of their cost-effective yet highly sensitive assay for patients with non-small cell lung cancer by implementing a stacked ensemble model comprising generalized linear model (GLM), gradient boosting machine, random forest, deep learning, and XGBoost.[79]^16 Similar approaches have been applied to colorectal cancer[80]^17 and breast cancer,[81]^18 both achieving superior performance compared to individual algorithms. However, further verification is required to assess the applicability of stacked ensemble models in patients with ESCC. In this study, we developed a stacked ensemble model that leverages cfDNA fragmentation for the early detection of ESCC. The model combined four fragmentomics features obtained from WGS and employed four machine learning algorithms. The performance of the model was sequentially evaluated on an external dataset. Additionally, we assessed the model’s robustness and repeatability across low coverage and repeated measured samples. The results underscore the promising potential of our model as an effective strategy for early diagnosis and management of ESCC in clinical settings. Results Participant characteristics of cohort The dataset consists of 499 participants; this includes 207 participants in the training, 201 individuals in the validation ([82]Figure 1A), and 91 individuals in the external cohort ([83]Figure 1B). It is worth noting that the three cohorts were recruited separately and are independent of each other. The training set was then used to train the model exclusively and determine the cutoff, which was later applied to validation and external cohorts to assess the model’s performance. Figure 1. [84]Figure 1 [85]Open in a new tab Flowchart of cohort design and study method (A–C) Schematic diagram. The demographic characteristics of participants are comparable as displayed in [86]Table S1. The training cohort includes 17 participants at stage 0/I (17/102, [16.7%]), 58 participants with tumor size ≤3 (58/102, [56.9%]), and 14 participants with lowly differentiated tumor (14/102, [13.7%]). Similarly, the validation cohort had 15 (15/97, [15.5%]) and 56 (56/97, [57.7%]) participants at 0/I stage and tumor size ≤3 and 14 participants with lowly differentiated tumor (19/102, [19.6%]), respectively. In external cohort, 5 (5/44, [11.4%]) patients were diagnosed at stage 0/I. In addition, we observed that 2 (2/199, [1.0%]) patients had distant metastatic disease, and 92 (92/199, [46.2%]) patients had lymph node involvement in training and validation. Among the 209 healthy individuals in training and validation cohorts, 48 reported symptoms of dyspepsia, with 5 of them also indicating gastroesophageal reflux, and 33 individuals reported experiencing loss of appetite. Additionally, 14 participants reported dyspepsia and 6 had loss of appetite in the external cohort. Finally, participants’ smoking and drinking status is also comparable across training and validation cohorts. Evaluating fragmentomics features and stacked ensemble machine learning algorithms We performed unsupervised principal component analysis with 2 principal components (PCs) to test the distinguishability of features such as fragmentation size coverage (FSC), fragmentation size distribution (FSD), NP, and copy-number variation (CNV) in the training cohort prior to training the machine learning algorithm. As demonstrated in [87]Figure S1, the features are distinct between healthy individuals represented in orange and patients with cancer represented in green by visually identifying the two clusters based on their differences in centers and spreads. Among the features, PC1 of FSD captures 65.87% variations, followed by PC1 of FSC which was responsible for 34.12% variations between clusters. We further examined the fragmentation profiles of features. Feature profiles were consistent in healthy individuals ([88]Figures 2B–2D, 2F; [89]Figure S2), while unstable behavior of feature profiles was observed in patients with cancer ([90]Figures 2A–2C, 2E; [91]Figure S2). Evident chromosomal arm gains are observed in 3q, 6q, 7p, 11q, 12p, 18q, and 19q where values greater than 0 indicate copy-number gains and values less than 0 signify copy-number losses ([92]Figures 2A and 2B). The long and short fragmentation profiles of patients with cancer were distinct from those of healthy individuals ([93]Figures 2C–2F). When both short and long DNA fragments are aggregated, their respective Z scores are calculated such that the mean is zero with Z scores centered around the mean. However, this central tendency changes when short and long fragments are considered separately. For long fragments, the standardized mean converge is around 1 in which values greater than 1 indicate a fragment size longer than the average and values less than 1 indicate a size shorter than the average for long fragments ([94]Figures 2C and 2D). Conversely, the standardized mean for short fragments is approximately −1; values greater than −1 indicate a longer fragment size, and values less than −1 indicate a shorter fragment size ([95]Figures 2E and 2F). Likewise, distinct NP patterns were seen between patients with cancer and healthy individuals ([96]Figure 2G), and 154 statistically significant transcription factor-binding sites (p < 0.1, Wilcoxon signed-rank test) selected from 854 transcription factors were found to be statistically associated with 22 pathways ([97]Figure 2H). Figure 2. [98]Figure 2 [99]Open in a new tab cfDNA fragmentation profiles in patients with EC and healthy individuals (A and B) Genome-wide copy-number variation (CNV) change in patients with EC and healthy individuals. (C and D) Genome-wide fragmentation profile of long fragments (151–220 bp) at 1 Mb resolution. (E and F) Genome-wide fragmentation profile of short fragments (100–150 bp) at 1 Mb resolution. (G) Heatmap of 154 significant nucleosome profiling around sites of interest in patients with EC and healthy individuals. (H) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of 154 significant nucleosome profiles. After identifying the presence of significant variation between individuals classified as healthy and those diagnosed with cancer, a stacked ensemble model was trained to effectively distinguish between the two groups. The machine learning model was constructed integrating XGBoost, GLM, distributed random forest (DRF), and deep learning based on the four distinct fragmentomic features: CNV, FSC, FSD, and NP ([100]Figure 1C). Utilizing the random grid search method for hyperparameter exploration allowed us to identify and select the top 10 best-performing base models for our stacking approach, diverging from the traditional practice of using a fixed set of hyperparameters seen in previous studies.[101]^17^,[102]^18^,[103]^19 As presented in [104]Figure 3A, the model learned and extracted significant patterns from training, generating an area under the curve (AUC) of 0.995 (confidence interval [CI]: 0.990–1.000). Its performance was then evaluated using the independent validation set, which consisted of data not previously encountered by the model. The predictive model yielded a slightly lower AUC in validation (0.986, CI: 0.974–0.998) as expected. With an AUC of 0.995, the performance of stacked model was superior to all single feature models, demonstrated outstanding robustness ([105]Figures 3C and 3D; [106]Tables S2 and [107]S3), and significantly differed from CNV and NP ([108]Table S3). The performance achieved during the training phase was subsequently utilized to determine the cutoff threshold for the validation process, enabling accurate classification of patients and healthy individuals. By carefully assessing the metrics, a cutoff of 0.6887 at specificity level of 98% was selected ([109]Figure 3B). At this specificity level, the model was capable of correctly identifying healthy individuals while minimizing the occurrence of false positives, which yielded a sensitivity of 91.8% (CI: 84.4%–96.4%) and a specificity of 98.1% (CI: 93.2%–99.8%) in the validation cohort ([110]Table 1). Among the high-scoring healthy individuals, one person was diagnosed with solid thyroid nodule during the follow-up period in September 2023. Notably, our stacked ensemble model was more sensitive in predicting for patients at later stages, and patients had lymph node involvement in the validation cohort ([111]Table S4). The results reinforce the validity of our model, as it accurately reflects the expected relationship between progression status and the strength of the cancer signal. The predictive model demonstrates robust generalizability in the external cohort, maintaining a strong performance with a slightly lower sensitivity of 86.4% (CI: 72.6%–94.8%) and a specificity of 95.7% (CI: 85.5%–99.5%) ([112]Table 1). A notable trend is observed across different stages within the external cohort, indicating an increasing proficiency in distinguishing between cancer and non-cancer cases across varying stages of disease progression within the external dataset ([113]Table S4). We also examined our model’s performance in other gastrointestinal (GI) cancers including gastric cancer, colorectal cancer, and liver cancer. This retrospective cohort was composed of 30 individuals diagnosed with gastric cancer, 30 patients with colorectal cancer, and 30 patients with liver cancer ([114]Table S5). Despite observing low sensitivities in detecting other GI tract cancers with our model, we consider this outcome to be expected as our model was designed and trained to identify the characteristics distinguishing ESCC from healthy controls, with a primary focus on ESCC detection. All patients successfully met the criteria set for both sampling and sequencing quality control. The model achieved a sensitivity of 26.7% (CI: 12.3%–45.9%), 16.7% (CI: 5.6%–34.7%), and 10.0% (CI: 2.1%–26.5%) in gastric cancer, liver cancer, and colorectal cancer, respectively. These results underscore the specificity of our predictive model, which has been finely tuned to detect signals specific to ESCC with a high degree of precision. Moreover, it suggests that while the model excels within its specific domain, its application may not extend with the same level of efficacy to other forms of GI cancers. Figure 3. [115]Figure 3 [116]Open in a new tab Evaluation of the predictive model in training and validation cohort (A) Receiver operating characteristic (ROC) curve evaluating the performance of the stacked ensemble model. (B) Boxplot showing the distribution of cancer scores in groups of patients with ESCC and healthy control groups. The 98% specificity cutoff score of training set is 0.69, represented in dotted line. (C) ROC curve evaluating performance of predictive model with all features and predictive models with each individual feature. (D) Oncoprint of combined feature predictive model and each individual model. Table 1. Diagnostic performance of the predictive model in training, validation, and external cohorts Cancer Healthy Training cohort __________________________________________________________________ Predicted Cancer 96 2 Healthy 6 103 Sensitivity (95% CI) 0.94 (0.88, 0.98) Specificity (95% CI) 0.98 (0.93, 1.00) Positive predictive value (95% CI) 0.98 (0.93, 1.00) Negative predictive value (95% CI) 0.94 (0.88, 0.98) Accuracy (95% CI) 0.97 (0.93, 0.99) __________________________________________________________________ Validation cohort __________________________________________________________________ Predicted Cancer 89 2 Healthy 8 102 Sensitivity (95% CI) 0.92 (0.84, 0.96) Specificity (95% CI) 0.98 (0.93, 1.00) Positive predictive value (95% CI) 0.98 (0.92, 0.97) Negative predictive value (95% CI) 0.93 (0.86, 0.97) Accuracy (95% CI) 0.95 (0.91, 0.98) __________________________________________________________________ External cohort __________________________________________________________________ Predicted Cancer 38 2 Healthy 6 45 Sensitivity (95% CI) 0.86 (0.73, 0.95) Specificity (95% CI) 0.96 (0.85, 0.99) Positive predictive value (95% CI) 0.95 (0.83, 0.99) Negative predictive value (95% CI) 0.88 (0.83, 0.99) Accuracy (95% CI) 0.91 (0.83, 0.96) [117]Open in a new tab Stability and robustness assessments of the predictive model To assess the robustness of our approach, we performed down-sampling on the sequences, reducing their coverage depth to 4X, 3X, 2X, 1X, and 0.5X. This evaluation is crucial as it simulates scenarios where sequencing data may be limited or incomplete, thereby testing the robustness and generalizability of our approach in real-world settings. By decreasing the coverage depth, we examined the performance and reliability of our methodology under conditions of shallow sequencing depth. The results demonstrated exceptional performance, yielding an AUC over 98.8% (CI: 97.7%–99.9%) when operating at 2X coverage depth and AUC above 97% at 1X and 0.5X sequencing depth ([118]Figure 4A). The sensitivities of coverage depth at 4X, 3X, 2X, 1X, and 0.5X were 96.7% (CI: 90.8%, 99.3%), 94.7% (CI: 88.1%, 98.3%), 91.7% (CI: 84.2%, 96.3%), 91.8% (CI: 84.4%, 96.4%), and 87.6% (CI: 79.4%, 93.4%), respectively ([119]Table S6). Furthermore, the results showed a modest decrease trend in AUC as the coverage depth reduced. These results provided valuable insights into the resilience and stability of our methodology and ascertained its practical applicability in scenarios with limited sequencing resources or constraints. Figure 4. [120]Figure 4 [121]Open in a new tab Analytical validity assessment of the predictive model (A) ROC curve evaluating the robustness of the predictive model in different coverage depth samples. (B) Boxplot of five healthy controls repeatedly measured before and after meal and before and after exercise in one day. The 98% specificity cutoff score is based on the training cohort, represented in dotted line. (C) Boxplot of within-run and between-run tests of two patients with cancer and three healthy controls in order to evaluate the repeatability and reproducibility of the model. The 98% specificity cutoff score is based on the training cohort, represented in dotted line. (D) Boxplot of within-run and between-run tests of three patients with cancer and three healthy controls from external. The 98% specificity cutoff score is based on the training cohort, represented in dotted line. We performed between-sample and within-sample runs using five patients with cancer and six healthy patients to examine the model’s ability to consistently reproduce similar results undergoing technical repetition and batch repetition. The selection of patients with cancer and healthy controls was conducted through a random sampling process from the validation dataset and external dataset to mitigate any potential bias and to provide a representative subset that aligns with the overall characteristics of the cohort under investigation. As illustrated in [122]Figure 4C, the repeatability analysis revealed highly consistent results across the replicates by accurately classifying the patients with cancer as positive and the healthy patients as negative in the validation cohort. Within-sample replicates exhibited minimal variation (p > 0.05; Wilcox signed-rank test), indicating the stability of the extraction and analysis procedures. Additionally, 3 patients with ESCC and 3 healthy participants were randomly selected from the external cohort. The model exhibited accurate discrimination between participants with cancer and healthy participants, and no statistically significant differences were observed within the batches (p > 0.05; Wilcox signed-rank test) ([123]Figure 4D). Overall, the examination of the model’s repeatability highlights its capacity to generate consistent predictions and reinforces its utility as a dependable and reproducible tool for classification. To further evaluate the robustness of our model, we conducted a comprehensive examination involving 5 healthy individuals at different times of the day and under varying diet and exercise conditions. Samples were collected multiple times before and after meal or exercise throughout the day to capture the impact of different conditions and time points. As indicated by [124]Figure 4B (scores details shown in [125]Table S7), our predictive model could make an accurate classification with a minor variance. The results revealed that regardless of the time of day or the diet and exercise conditions, the model’s predictions remained reliable and aligned with the expected outcomes. Performance of the predictive model in subgroups Additionally, we explored the performance of our predictive model in different subgroups to delve deeper into its predictive capabilities by considering key factors such as progression stage, tumor differentiation, and tumor size. [126]Figure 5 exhibited our model’s ability to generate stable and reliable predictions within each subgroup. It is worth mentioning that the results showcased the remarkable predictive power of our model, achieving high sensitivities across differentiation grades (low: 94.7% [CI, 74.0%–99.9%]; low-medium: 90.0% [CI, 73.5%–97.9%]; medium: 96.0% [CI, 79.6%–99.9%]; medium-high: 94.4% [CI, 72.7%–99.9%]; high: 100% [CI, 29.2%–100%]), and tumor sizes (≤3 cm: 91.1% [CI, 80.4%–97.0%]; > 3 cm: 92.7% [CI, 80.1%–98.5%]) in validation ([127]Table S4). A decreasing variance of scores was observed as cancer stage progressed from stage I to stage III ([128]Figure 5C). These results highlight the model’s ability to effectively capture and discriminate between different subpopulations, further reinforcing its potential as a reliable tool for precise prediction within a variety of contexts. Figure 5. [129]Figure 5 [130]Open in a new tab Diagnostic sensitivities of the predictive model in different subgroups of patients with ESCC (A) Dot plot with error bar of patients with ESCC >65-year-old and ≤65-year-old. The 98% specificity cutoff score is 0.69. (B) Dot plot with error bars of female and male patients with ESCC. (C) Dot plot with error bars of patients with ESCC across stage 0/I to stage IV. (D) Dot plot with error bars of patients with ESCC in five differentiation grades. (E) Dot plot with error bars of patients with ESCC with tumor size >3 cm and ≤3 cm. Benefit of the model in stage shift By effectively identifying cancer cases at an earlier stage, our model has the potential to revolutionize the standard of care and significantly improve survival rates. As presented in [131]Figure 6A, prior to utilizing our model, 50.5% of patients with ESCC were diagnosed at stage III and stage IV, with no more than 19% of patients detected at stage I. This led to 59% of death rate in 5 years. With the integration of our model, a notable increase in the number of patients diagnosed at early stages has been observed, and 99% of patients with ESCC could be recognized at stage I under ideal clinical conditions. As a result, the survival rate could be increased by 20%. However, it is essential to note that the outcomes are contingent upon the actual compliance rate observed in real-world scenarios. As the compliance rate varied across different thresholds, specifically at 85%, 70%, and 55%, our model showed a decrease in the identification of patients at stage I, accompanied by an increase in the recognition of patients at later stages ([132]Figure 6B). This shift in diagnostic stage highlights the effectiveness of our model in enabling timely detection and improving patient outcomes through early intervention and treatment. Figure 6. [133]Figure 6 [134]Open in a new tab Modeled changes in stage-specific EC incidence (A) Sankey diagram showing cancer stage shift after intercepted by the predictive model. Percentage of incidence in each condition is shown in parentheses. (B) Pie charts showcasing the number of patients with ESCC diagnosed at different stages at 85%, 70%, and 55% compliance rate. Discussion This study aimed to develop a stacked ensemble machine learning algorithm that combines cfDNA fragmentomic features for the classification of patients with ESCC and non-cancer individuals, with the goal of improving early detection of ESCC. The proposed assay incorporates cfDNA fragmentomic features and integrates them into a stacked ensemble algorithm. This integration is made possible by leveraging the non-random cleavage and fragmentation process of cfDNA. Certain genomic regions exhibit preferred cleavage patterns that are associated with specific conditions, such as the tissue origins and disease status of the cfDNA samples. These preferred cleavage patterns arise due to factors like chromatin accessibility and nuclease activities and are more prevalent and easily detectable compared to mutations.[135]^20^,[136]^21^,[137]^22 While previous studies have explored cfDNA somatic mutations and methylation for EC classification, the results have shown limited sensitivities. For instance, Ueda et al. identified four somatic mutations that achieved a sensitivity of 78.9% in distinguishing patients with EC.[138]^23 Similarly, Qiao et al. employed cfDNA methylation profiles and observed a sensitivity of 74.4% with detection bias in stage 0–II patients (sensitivity, 58.8%).[139]^24 Therefore, our exceptional performance in this study highlights the promising role of cfDNA fragmentomics in cancer detection, surpassing the limitations observed in previous investigations. Furthermore, recent studies have demonstrated the potential of DNA from red blood cells as a diagnostic marker.[140]^25 This could present an intriguing avenue for future research. Our stacked ensemble model demonstrates its applicability for early detection of ESCC in a clinical setting. Through independent validation, our stacked model achieved a sensitivity of 91.8% and a specificity of 98.1% ([141]Table 1), highlighting its superior predictive capacity. The stacked model, with an AUC of 0.995, outperformed all single feature models, and the performance of the stacked model significantly varied from that observed with CNV and NP features. Even though individual models based on FSC and FSD perform well on the validation set, they may be susceptible to overfitting particular characteristics of the training data. Consequently, the ensemble model, which combines or weights outputs from multiple models, generally offers better generalization to unseen data. The predictions of the model remained consistently high across different stages (stage I to IV) and demonstrated sensitivity toward early-stage characteristics. Specifically, the model achieved a sensitivity of 88.9% in patients with highly differentiated cells and a sensitivity of 91.4% in patients with tumor size less than 3 cm ([142]Table S4). It is also worthy of mentioning that our predictive model yielded the highest AUC when benchmarked with other published models[143]^12^,[144]^14^,[145]^17^,[146]^26^,[147]^27 ([148]Table S8). The three most important CNV features were found in regions on chromosomes 1, 9, and 20, with chromosome 1 showing significant amplifications in 184 tumors analyzed by GISTIC ([149]Table S9). The most notable FSD is noted for fragments 105–109 bp on chromosome 4q, 175–179 bp on chromosome 6p, and 110–114 bp on chromosome 5q ([150]Table S9). However, research in the area of DNA fragmentomics is still evolving and studies focusing on the significance of specific fragment size bins are limited, leading to a gap in understanding their exact role in disease processes. Significant NP, like PRDM9, GFI1B, and CEBPZ, has shown overexpression in cancer samples in The Cancer Genome Atlas database, with PRDM9 linked to genomic instability ([151]Table S9). Recently, several studies have developed assays for the detection of an episodic modification 2′-O-methylation on different RNAs, revealing the role of this modification in tumor development[152]^28 Based on the findings of this study, further exploration of the synergistic relationship between EC segment characteristics and changes in the corresponding gene modifications may provide an alternative perspective in cancer research. Furthermore, stage shift analysis revealed that a significant proportion of patients would be diagnosed at stage I following intervention, and it was observed that 20% of patients would experience a positive outcome within a 5-year period under ideal clinical setting ([153]Figure 6). These findings underscore the model’s efficacy in the early diagnosis of ESCC. It is noteworthy that our model exhibited robustness across different diet and exercise conditions in healthy participants, as well as at shallow coverage depth. Notably, even at a coverage depth of 0.5X, our model achieved an impressive area AUC of 0.978 ([154]Figure 4A). This robust performance ensures cost efficiency and given the high AUC at 0.5X coverage depth, there is potential for further reduction in sequencing costs by using coverage depths below 0.5X. Under a clinical setting, our stacked ensemble model is recommended for elevated-risk individuals. For those have a positive screening result, an endoscopy and biopsy are recommended for diagnosis. For individuals with a negative early screening result or those with a positive early screening result but a negative biopsy result, it is recommended to pursue further diagnostic tests, including computed tomography scans and colonoscopy that comprehensively assess the potential presence of other GI tract cancers. Moreover, it is advisable to continue with annual screenings for ESCC using the early detection model, to maintain vigilant monitoring and ensure timely detection. This study has some constraints that must be acknowledged in the interpretation of the findings. Firstly, it is important to recognize that the cohort utilized in this study exclusively comprises individuals from the Chinese population, thereby potentially limiting the generalizability of the resulting model to other populations. Moreover, the sample size employed in this study is relatively limited, and certain subgroups exhibit uneven distribution, which may introduce potential biases and compromise the accuracy of the results in these specific subgroups. Finally, the model may show slight signs of overfitting, as evidenced by a subtle decrease in the performance observed in our external cohort results. In our subsequent research endeavors, we aim to expand the sample size and diversify the subgroups, while also conducting prospective studies at multiple hospitals to evaluate the feasibility of our model in real-world ESCC early detection scenarios. We also advocate for additional independent studies to verify the generalizability and robustness of our model across diverse patient populations prior to its integration into future clinical practice. In summary, we have developed a highly sensitive stacked ensemble model by integrating four fragmentomic features. Our method stems from identifying the most effective base models through extensively exploring various hyperparameter settings and subsequently integrating these models to develop the optimal final model. Moreover, we have demonstrated the robustness and repeatability of our assay, even at shallow coverage depth and across various conditions. As a result, our model presents an accurate and cost-effective approach for the early detection of ESCC, thus offering valuable insights into early cancer diagnosis in clinical settings. Limitations of the study This study has several limitations that should be considered when interpreting the results. Firstly, it is critical to acknowledge that this study’s cohort is entirely made up of individuals from the Chinese population, which might restrict the model’s applicability to other demographic groups. Additionally, the study employs a relatively small sample size, with some subgroups not evenly represented, potentially leading to biases and affecting the precision of the findings within these groups. Lastly, there are indications of possible overfitting in the model, as suggested by a minor reduction in performance in the external cohort results. STAR★Methods Key resources table REAGENT or RESOURCE SOURCE IDENTIFIER Critical commercial assays __________________________________________________________________ QIAamp Circulating Nucleic Acid Kit QIAGEN Cat# 55114 Qubit dsDNA HS Assay Kit Thermo Fisher Scientific Cat# [155]Q32854 KAPA HyperPrep Kit Roche Cat# KK8504 __________________________________________________________________ Deposited data __________________________________________________________________ Human plasma samples of ESCC patients and healthy controls This paper HRA007522 __________________________________________________________________ Software and algorithms __________________________________________________________________ Stacked Ensemble Model for ESCC Prediction This paper GitHub: [156]https://github.com/cancer-oncogenomics/cfDNA_fragmentomics_ESCC_pr ediction (Zenodo: [157]https://doi.org/10.5281/zenodo.11658224) R (v. 4.0.3) CRAN [158]https://www.r-project.org/ Trimmomatic Bolger et al.[159]^29 GitHub: [160]https://github.com/usadellab/Trimmomatic Picard Broadinstitute GitHub: [161]http://broadinstitute.github.io/picard/ BWA Li H.[162]^30 GitHub: [163]https://github.com/lh3/bwa ichorCNA Adalsteinsson et al.[164]^31 GitHub: [165]https://github.com/broadinstitute/ichorCNA gsignal Short et al. GitHub: [166]https://github.com/gjmvanboxtel/gsignal H2O AutoML H2O.ai GitHub: [167]https://github.com/h2oai/h2o-3 Stage Shift Hubbell et al.[168]^32 GitHub: [169]https://github.com/grailbio-publications/Hubbell_CEBP_Interception Model pROC (v. 1.17.0.1) Robin et al.[170]^33 GitHub: [171]https://xrobin.github.io/pROC/ ggplot2 Wickham et al.[172]^34 GitHub: [173]https://github.com/tidyverse/ggplot2 [174]Open in a new tab Resource availability Lead contact Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Tao Wang (wangtao_pumc@live.cn). Materials availability This study did not generate new unique reagents. Data and code availability * • The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive (GSA) (Genomics, Proteomics & Bioinformatics 2021) in National Genomics Data Center (Nucleic Acids Res 2021), China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences, under the accession number listed in the [175]key resources table. These data from this study are available under restricted access due to patient privacy concerns. To request access to the data, please contact the [176]lead contact, and submit a request for access at [177]https://ngdc.cncb.ac.cn/gsa-human/browse/HRA007522. For detailed guidance on making the data access request, see GSA-Human_Request_Guide_for_Users ([178]https://ngdc.cncb.ac.cn/gsa-human/document/GSA-Human_Request_ Guide_for_Users_us.pdf). Upon receiving a request, the [179]lead contact reviews the request and, if approved, notifies GSA to release the data to the requestor. The requestor must describe the objectives of the research project for which the data will be used. Data access will be considered for research purposes and non-commercial use only. In order to ensure patient privacy, access to personally identifiable information or sensitive clinical information will not be provided, and requests for data access must rigorously adhere to the consent agreements established with study participants. * • All code has been deposited at Zenodo and is publicly available as of the date of publication. DOIs are listed in the [180]key resources table. * • Any additional information required to reanalyze the data reported in this paper is available from the [181]lead contact upon request. Experimental model and study participant details Participant enrollment This study included a total of 408 participants used for model training and validation ([182]Figure 1A) and 91 participants as external cohort for further validation of model ([183]Figure 1B). The independence of three cohorts is indicated by distinct recruitment timelines and locations. Specifically, the training cohort was collected between July 2021 and March 2022 and the validation cohort recruitment occurred from July 2022 to May 2023 at Nanjing Drum Hospital. The recruitment for the external dataset began in June 2023 and terminated in December 2023 at a The Second Affiliated Hospital of Nanchang University. The training and validation cohorts consisted of healthy volunteers (n = 209) and previously untreated Esophageal Squamous Cell Carcinoma (ESCC) patients (n = 199) spanning from stage I to IV. The external cohort is composed of 47 healthy individuals and 44 ESCC patients. In the training cohort, 18 ESCC patients were excluded due to unconfirmed stage information, treatment history sample processing QC failure and sequencing QC failure, and 16 healthy individuals were excluded because of their history of cancer, acute inflammation, sample processing and sequencing QC failure. The ensemble stacked model was subsequently constructed with the resulting 102 ESCC samples and 105 healthy samples. The validation cohort began sample collection only after the construction and training of our model. The exclusion process involved removing: 27 ineligible ESCC patients without confirmed stage information, treatment history and QC failure during sample processing or sequencing; 19 healthy individuals had acute inflammatory response, failed sample processing QC or sequencing QC, and withdrew from the study. The resulted validation cohort contains 97 ESCC patients and 104 healthy individuals. Finally, 6 ESCC patients and 3 healthy participants were excluded in external cohort. The cancer and noncancer cohorts were matched for gender and age in all cohorts (Wilcoxon signed-rank test, Fisher test; p > 0.05 for all groups). The patients were mainly diagnosed in late stage, which reflects the challenges associated with the late detection of ESCC in real-world scenario, making it difficult to predominantly obtain early-stage cases. The findings were shared exclusively to the medical practitioners involved, with no disclosure to the patients, given that this study does not involve direct alterations to medical decisions or interventions with individual patients. This study was approved by the Ethics Committee of the Nanjing Drum Tower Hospital affiliated to Nanjing University School of Medicine, on the use of human samples for experimental studies. It adhered to the Declaration of Helsinki. All patients provided written informed consent prior to sample collection. Method details cfDNA extraction and WGS Venous blood samples were obtained during routine physical checks for healthy volunteers and preoperatively for cancer patients. Blood samples were collected upon patients’ admission to the hospital, which was after diagnosis of esophageal cancer through endoscopic biopsy staining and prior to the performance of surgery. All blood sampling occurred within a time frame of 1–3 weeks after diagnosis. The final pathological staging was determined after the surgical procedure. This standardized approach was consistent across all recruited ESCC patients to minimize variations in the time interval. The collection, processing, and sequencing of blood samples from all participants followed the same protocols. A venous full blood sample (8-10mL) from each participant was collected using an EDTA tube at Nanjing Drum Tower Hospital (Nanjing, China) or The Second Affiliated Hospital of Nanchang University (Nanchang, China). The plasma underwent a secondary centrifugation step at 18,000g to remove residual cellular debris in the hospital within 6 h of collection, and was subsequently stored at −80°C. The plasma sample was then shipped with dry ice to the central laboratory of Nanjing Geneseeq Technology Inc. (Nanjing, China), which is qualified by College of American Pathologists (CAP), Clinical Laboratory Improvement Amendments (CLIA) and ISO 15189. The following cfDNA extraction and sequencing processed were performed within 24 h cfDNA extraction from plasma was carried out using the QIAamp Circulating Nucleic Acid Kit (Qiagen) following the manufacturer’s instructions. The concentration and quality of cfDNA were assessed using the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific) in accordance with the provided guidelines. For each sample, 5–10 ng of plasma cfDNA was subjected to PCR-free WGS library construction using the KAPA HyperPrep Kit (Roche). 8 Samples (failure rate = 1.64%) with plasma cfDNA less than 5ng were excluded. For samples with extracted plasma cfDNA exceeding 10 ng, a maximum of 10 ng cfDNA was utilized for the construction of the whole-genome sequencing (WGS) library. The libraries were subjected to paired-end sequencing on the NovaSeq6000 platform (Illumina). Initial trimming of FASTQ file was conducted using Trimmomatic as part of the quality control (QC) protocol. Following the removal of PCR duplicates by the Picard toolkit ([184]http://broadinstitute.github.io/picard/), the qualified reads were subsequently aligned to the human reference genome (GRCh37/UCSC hg19) using the sequence aligner BWA with a mean sequencing depth of 10.92× (range 5.35×-24.81×). The sequencing QC includes: Q30, GC content, proportion of reads aligned to reference genome, median insert and mean depth. All participants who were excluded from the study due to failing sequencing QC had a mean depth of less than 5×. To minimize bias, the sample operating team remained blinded to the case/control status of the samples throughout the entire process. To address variations in WGS depths among samples (range 5.35×-24.81×), we performed down-sampling to a uniform depth of 5× for model optimization. Thereby, the potential coverage differences were mitigated, and the maximum sample inclusion was kept. The optimized model was subsequently validated using raw WGS data with varying sequencing depths or down-sampling to 4×, 3×, 2×, 1×, and 0.5× depths using Picard toolkit, enabling a comprehensive evaluation of its performance. Bioinformatic analysis and modeling For feature selection and model construction, we extracted four fragmentomic features from the WGS data. These features encompassed copy number variation (CNV), fragmentation size coverage (FSC), fragmentation size distribution (FSD), and nucleosome profiling (NP). Copy number variation (CNV) refers to the gains or losses of chromosomal region > 1kb, which could significantly associate with cancer when the amplifications or deletions occur on oncogenes or tumor suppressor genes. Fragmentation Size Coverage (FSC) and Fragmentation Size Distribution (FSD) capture the characteristic of the size of fragments. Finally, Nucleosome Profiling (NP) analyzes for chromatin structure and implies the gene expression level. By combining these features extracted from cfDNA, we could potentially capture the differences between healthy individuals and cancer patients. Copy number variation (CNV) feature extraction was adapted from Wan et al.[185]^35 using ichorCNA developed by them. The genome was tiled in to 1Mb disjoint bins, resulting in 2475 bins in total, with respect to the reference genome (GRCh37/UCSC hg19). The coverage depth of each bin was corrected by GC content and compared against the software baseline. The log2 ratio was then computed by ichorCNA for each bin. The procedure for extracting Fragmentation Size Coverage (FSC) was adapted from the approach introduced by D. Mathios et al.[186]^14 The genome was divided into 5 Mb non-overlapping bins, resulting in a total of 541 bins. Subsequently, reads with lengths less than 100 bp or greater than 220 bp were filtered out, the short (100–150 bp) and long fragments (151–220 bp) were standardized and transformed into z-scores, and the ratio of short fragments to long fragments was computed for each bin. The extraction of Fragmentation Size Distribution (FSD) was derived from S. Su et al.[187]^36 FSD captures the coverage of 100-220bp cfDNA fragments in 5 bp increments across 39 chromosome arms. Notably, the short arms of 5 acrocentric chromosomes were excluded from the analysis due to largely unknown sequence,[188]^37 and two arms of the X chromosome were excluded due to a clear distinction observed in males and females. The raw coverages were then converted into z-scores, resulting in a total of 936 features across the 39 chromosomes. We adopted the methodology outlined by Doebley et al.[189]^38 for the extraction of Nucleosome Profiling (NP) which includes the selection of transcription factor binding sites (TFBS) and profiling associated nucleosome coverage patterns. Transcription factors (TFs) with over 10,000 highly mappable sites were identified from GTRD database (v 21.12, [190]https://gtrd.biouml.org/downloads/21.12/chip-seq/), and 854 TFs remained after excluding TFs beyond the list of known binding sites in the CIS-BP database (v2.00, [191]http://cisbp.ccbr.utoronto.ca/bulk.php). 10,000 mappable sites with the highest peak counts were selected for each TF. To profile nucleosome coverage, reads between 100 and 220bp around each TF binding site were extracted in a window (-5kbp to +5kbp) of each binding site and weighted by GC bias. Weights of the reads’ midpoints falling within each 15bp bin were summed as GC-corrected midpoint coverage profiles. The averaged GC-corrected midpoint coverage profile was normalized for each TF, and the resulting curve was smoothed using a Savitzky-Golay filter (150bp window, polynomial order = 3) to produce the final TF coverage profile. Three quantified features were extracted for each of the 854 TF: 1) mean coverage of the region flanking 1kbp upstream and 1kbp downstream of the center; 2) coverage at the center; and 3) amplitude of coverage peaks surrounding the center through Fast Fourier Transform. The input data for CNV, FSC, FSD and NP consists of 2475, 1082, 936 and 2562 (854∗3) columns, respectively, with each column represents distinct characteristics of the feature. Notably, all available characteristics were incorporated into the model and no feature selection was applied. To construct the models, each fragmentomic feature served as the basis for building individual base models using four different algorithms: generalized linear model (GLM), distributed random forest, deep learning, and XGBoost. A fully connected multi-layer artificial neural network was selected as our deep learning approach to capture the intricate and non-linear patterns present in the data. These algorithms generated a cancer score ranging from 0 to 1 for each cancer and noncancer sample, with a score closer to 1 indicating a higher probability of cancer. Subsequently, the cancer probability scores from all algorithms were aggregated and their means was calculated to create an ensemble model. The model training process exclusively utilized the H2O AutoML package.[192]^39 Specifically, the determination of hyperparameters was carried out by the H2O AutoML algorithm, employing a random grid search approach. In this process, various hyperparameters were randomly selected and applied to each model. Each model had a distinct set of hyperparameters to explore. Notably, for GLM, which has a non-default value set for a hyperparameter, AutoML conducted an internal grid search. It passed a list of alpha parameters to build a model, searching for the optimal combination of alpha-lambda hyperparameters. This optimized combination was then applied to construct the GLM model. During the grid search of XGBoost, AutoML explored values for specific hyperparameters, including booster, col_sample_rate, col_sample_rate_per_tree, max_depth, min_rows, ntrees, reg_alpha, reg_lambda, and sample_rate. For the deep learning model, AutoML selected values from a predefined list for parameters for following hyperparameters: activation, epochs, epsilon, hidden, hidden_dropout_ratios, input_dropout_ratio, and rho. Following the training process, a Leaderboard of trained models was generated. To identify the optimal model, AUC was specified as the sorting metric. The final score was determined by averaging the top 10 models from the Leaderboard. To prevent overfitting of the model, 5-fold cross-validation was employed based on the training dataset for both the individual base model and the ensemble model. The occurrence of overfitting can be detected by using this recognized approach which partitioning training dataset into 5 sets, iterate training process in the 4-folds of it and validate model on the remining 1-fold of the data. Additionally, runtime restriction, ensemble techniques and the incorporation of an external cohort were implemented to enhance model generalizability. The prediction performance of these models was evaluated using area under the curve (AUC) values in the validation cohorts, providing a comprehensive assessment of their effectiveness. The training of the model was entirely executed on training set, while keeping the validation set untouched during training process. After finalizing the parameters using the training cohort, the model was applied to the validation cohort to generate cancer prediction scores for each sample and evaluate the performance of the model. The cut-off score was determined at 98% specificity of the training cohort. Model analytical validity assessment To access the efficacy of the stacked ensemble model, individual models were developed for each fragmentomic feature. Subsequently, a cancer score between 0 and 1 was predicted for each sample, and the models’ performance was evaluated by examining AUC values. To evaluate the model’s repeatability and reproducibility, a total of eleven participants: three cancer patients and two healthy individuals were randomly selected from validation cohort, three cancer patients and three healthy individuals from the external cohort, to perform within-run and between-run tests. In the within-run tests, approximately 10 mL of venous blood sample was collected from each participant for plasma preparation (batch 1). Three technical replicates were created for each participant by equally dividing their blood sample into three. These replicates underwent cfDNA extraction, library preparation, and sequencing by a single group of lab technologists. To perform between-run tests, a second round of blood sample collection was conducted three weeks later on the participants (batch 2). The same procedures were followed but performed by a different group of lab technologists. The within-run and between-run tests yielded a total of 66 samples. From these samples, fragmentomic features were extracted and analyzed by the predictive model to obtain cancer scores following the assay described previously. The repeatability of the model was examined by comparing the results of each participant within the same batch, while the reproducibility was assessed by comparing the results between batch 1 and batch 2. In order to determine the model’s robustness, we evaluated the model’s sensitivity to cell free DNA (cfDNA) fractions with low coverage. Plasma samples from validation cohort were subjected to WGS and were randomly subsampled to 4×, 3×, 2×, 1×, and 0.5× fold coverage. These fractions were subsequently inputted into the predictive model to obtain cancer scores. Detection sensitivity and AUC value were used to quantify and evaluate the models’ performance. The cut-off determined by the training cohort at 98% specificity was employed to determine the cancer/noncancer status of these samples. Stage shift analysis A stage shift analysis was performed using the stage shift model developed by Hubbell et al.[193]^32 This model was utilized to predict the diagnostic stage shift resulting from the application of the early detection model, in comparison to receiving standard care and screening. The standard care and screening data was based on China’s population cancer incidence and mortality data. Quantification and statistical analyses We utilized the pROC package (v. 1.17.0.1) in R (v. 4.0.3) to generate receiver operating characteristic curves for the statistical analysis. Sensitivity (TP/[TP + FN]), specificity (TN/[TN + FP]), accuracy ([TP + TN]/[TP + FP + TN + FN]), along with their corresponding 95% confidence interval (CI), were calculated based on the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values of cancer prediction. To invesgtigate the differenciatiability of features, Principal Component Analysis (PCA) was implemented employing built-in stats (3.6.2) package in R. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was conducted using the DAVID Bioinformatics Resources[194]^40 to further gain insights into the functional pathways and biological processes associated with distinct NP. Acknowledgments