Abstract Neoadjuvant therapy has been widely used in breast cancer, but treatment response varies among individuals. We conducted multiomic profiling on tumor samples from 149 Chinese patients with breast cancer across ER^−HER2^+, ER^+HER2^+, and ER^−HER2^− subtypes, categorizing outcomes as pathologic complete response (pCR; n = 81) or residual disease (RD; n = 68). We identified distinct molecular features linked to pCR in each subtype: elevated cell proliferation in patients with ER^−HER2^− pCR, higher CDKN2A methylation in patients with ER^−HER2^− RD, increased KIT methylation in patients with ER^−HER2^+ RD, and MAP4K1 hypermethylation in patients with ER^+HER2^+ RD. These findings were subsequently validated in independent datasets. By integrating clinical and multiomic data, we developed MOPCR, a subtype-specific machine learning model that outperformed single-omic approaches in predicting treatment response. MOPCR demonstrated potential generalizability across cohorts and provided preliminary stratification of patient subgroups with higher pCR probability, offering valuable insights for precision cancer management. __________________________________________________________________ Neoadjuvant therapy outcomes for breast cancer become more predictable with machine learning and multiomic profiling. INTRODUCTION As the most common tumor type in women, breast cancer caused 685,000 deaths in 2020 ([56]1). The age-standardized rates of breast cancer in China were estimated as 17.0 to 34.3 cases per 100,000 women ([57]2). Neoadjuvant chemotherapy (NACT) is a widely used method for treating breast cancer before surgery. NACT aims to shrink tumors and make locally advanced or inoperable breast cancer operable, avoiding axillary lymph node dissection and allowing breast-conserving surgery. Another substantial advantage lies in the potential to obtain additional prognostic and predictive data. Given that the residual cancer burden (RCB) is linked to the recurrence risk ([58]3), therapy can be customized according to the treatment response. For example, the KATHERINE trial showed that patients with human epidermal growth factor receptor 2 (HER2)–positive breast cancer and residual disease (RD) after HER2-targeted therapy benefited from trastuzumab emtansine (T-DM1), improving recurrence-free survival ([59]4). While achieving pathologic complete response (pCR) is associated with better event-free survival ([60]5), NACT may delay surgery and increase disease progression risk ([61]6). Early pCR prediction could mitigate these risks and improve survival. However, the molecular mechanisms underlying NACT response remain unclear, and accurate pCR prediction remains challenging. Breast cancer exhibits substantial heterogeneity in genomics, transcriptomics, proteomics, and treatment responses, underscoring the need for individualized, multiomic-driven management. The PAM50 model, for instance, associates luminal A tumors with the best relapse-free survival but the lowest pCR rates ([62]7). The Cancer Genome Atlas Program (TCGA)’s multiomic analysis of 825 invasive breast adenocarcinomas identified subtype-specific features, such as TP53 mutations and MYC gains in basal-like tumors, and distinct methylation patterns ([63]8). However, these studies did not focus on NACT, leaving a gap in understanding multiomic features linked to NACT outcomes. Recent studies highlight the role of clinical, transcriptomic, cellular, and genomic features in NACT response. High Ki67 levels (>20%) correlate with increased pCR rates ([64]9), and gene expression-based predictors have been developed for HER2^− breast cancer ([65]10–[66]12). Tumor-infiltrating lymphocytes are also associated with higher pCR rates ([67]13). Despite these insights, most studies examine features in isolation, leaving the predictive potential of integrative analyses underexplored. Integrative DNA and RNA analyses have improved pCR prediction in HER2^+ tumors, linking pCR to intrinsic subtypes and tumor microenvironment features ([68]14). Other multiomic approaches integrating DNA, RNA, and pathology images have improved pCR prediction accuracy compared to traditional clinical predictors, emphasizing the importance of data integration from multiple platforms ([69]15). However, these models were primarily developed in Western populations and lack validation in specific subtypes or diverse ethnic groups. This is particularly relevant as different receptor subtypes exhibit varying pCR rates in clinical practice, and studies indicate that Chinese patients with breast cancer have distinct molecular profiles compared to Western ones ([70]16, [71]17). Developing subtype-specific models is essential for enhanced predictive performance, and incorporating non-Western populations could help assess model generalizability. In addition, integrating epigenomics and proteomics—often overlooked in prior research ([72]10, [73]14, [74]15)—could provide critical diagnostic and prognostic insights, as demonstrated in other cancers ([75]18–[76]22). In this study, we performed multiomic profiling of 149 Chinese patients with breast cancer, analyzing somatic mutations, copy number alterations, DNA methylation, gene expression, protein expression, and phosphorylation. Patients were classified as pCR or RD based on postsurgery pathology. After identifying subtype-specific multiomic features associated with treatment response, we developed multi-omics pCR prediction models (MOPCR), a machine learning prototype model integrating clinical and multiomic data to predict pCR. The MOPCR prototype demonstrated promising accuracy and generalizability across independent cohorts, offering insights into treatment outcomes and survival. The model prototype was deployed on an online server at [77]http://www.wang-lab-hkust.com:8050 as a resource for the research community. To sum up, this study provided a comprehensive multiomic molecular portrait of Chinese breast cancer responders and nonresponders and a computational framework to predict NACT treatment outcomes. RESULTS Patient enrollment and treatment outcome evaluation To study breast cancer treatment outcomes, we enrolled 149 newly diagnosed patients (age range: 23 to 73 years) from the Guangdong Provincial People’s Hospital (GDPH) ([78]Fig. 1, A and B). All patients received NACT followed by surgical intervention. HER2 and estrogen receptor (ER) statuses were assessed at diagnosis using immunohistochemistry staining, and the enrolled patients were grouped into three subtypes: ER^+HER2^+ (n = 55), ER^−HER2^+ (n = 42), and ER^−HER2^− (n = 52). Anti-HER2 treatments were applied to all HER2^+ patients except for one case. As outlined in fig. S1A, chemotherapy regimens included docetaxel plus carboplatin (TCb; n = 74), epirubicin plus cyclophosphamide followed by docetaxel (EC-T; n = 61), and docetaxel plus pyrotinib (TY; n = 7). Seven additional patients received regimens, including docetaxel alone (T). Fig. 1. Summary of the cohort and clinical features associated with response to NACT. [79]Fig. 1. [80]Open in a new tab (A) Design of this study. Biopsy samples for multiomic data were collected at diagnosis, with treatments administered before surgery. Pathologic examination classified patients into pCR and RD groups. Biomarkers were identified by comparing these groups across different omic data and integrated using machine learning models to predict treatment outcomes. (B) Summary of clinical information and availability of four omic data types used in this study. T, tumor size; N, lymph node metastasis; grade, histological grade; chemo, the combination of chemotherapy; T, docetaxel; Cb, carboplatin; EC, epirubicin plus cyclophosphamide; Y, pyrotinib; anti-HER2. anti-HER2 treatment. (C) Multivariate analysis results by logistic regression, highlighting clinical features with significant independent association with NACT outcome (P < 0.1). (D and E) Association between pCR and subtypes (D), clinical tumor grade (E), and lymph node metastasis status (E) at diagnosis. P values were calculated by the chi-square test (D) and Fisher’s exact test (E). (F) Association between pCR and Ki67 index among subtypes. P values were calculated by Wilcoxon rank sum tests. To evaluate the treatment outcomes of the NACT, a pathologic examination was performed after the completion of surgical resection to assess whether there were residual tumors after the neoadjuvant treatments (fig. S1B) ([81]23, [82]24). Patients were classified as pCR if no residual cancer cells were detected in resected tissues or lymph nodes during pathologic evaluation. Those identified as RD were further classified into RCB classes, including RCB I, RCB II, and RCB III (Materials and Methods and fig. S1C). Of 149 patients, 81 patients achieved pCR (26 in ER^+HER2^+, 32 in ER^−HER2^+, and 23 in ER^−HER2^−), while 68 patients had RD (29 in ER^+HER2^+, 10 in ER^−HER2^+, and 29 in ER^−HER2^−) (table S1). We found that pCR is associated with improved overall survival in the ER^−HER2^− and favorable disease-free survival in both the ER^−HER2^− and ER^−HER2^+ subtypes (fig. S1, D and E). We next assessed the association between clinical features and pCR using multivariable and univariate logistic regression analysis, identifying significant correlations with several clinical characteristics ([83]Fig. 1C and fig. S2A). The HER2/ER-based subtypes had discriminative pCR rates of 47.3% for ER^+HER2^+, 76% for ER^−HER2^+, and 44% for ER^−HER2^− (P = 0.0035, chi-square test; [84]Fig. 1D). As tumor size increased, pCR rates decreased from 61% in T1 and T2 to 17% in T3 and T4 tumors ([85]Fig. 1E). Furthermore, the presence of lymph node metastasis was associated with resistance to NACT, resulting in a lower likelihood of pCR, with an odds ratio (OR) of 0.39 [confidence interval (CI): 0.18 to 0.80, P = 0.0075, Fisher’s exact test, confidence level of 0.95; [86]Fig. 1E]. Moreover, a higher tumor proliferation marker Ki67 value was associated with an increased pCR rate in the ER^−HER2^− subtype (P = 0.0079, Wilcoxon rank sum tests; [87]Fig. 1F). These findings align with observations from previous studies involving other cohorts ([88]9, [89]15, [90]25–[91]28). We next compared pCR rates in our cohort to those of the previously reported breast cancer cases who underwent NACT (anti-HER2 therapy and/or chemotherapy). We found that our cohort showed overall higher pCR rates compared to the British breast cancer cohort TransNEO by Sammut et al. ([92]15) in ER^−HER2^+ (P = 0.094) and ER^+HER2^+ (P = 0.040) subtypes (fig. S2B). A younger patient age in the pCR group reported by Sammut et al. ([93]15) was not found in our cohort (P = 0.108), although the age distribution was comparable between the two cohorts (P = 0.873) (fig. S2C). Subsequently, we integrated our cohort with the TransNEO cohort and performed a multivariable analysis with 243 patients from two cohorts and found that tumor size of T3/T4, lymph node metastasis, and ER-positive status were still significantly associated with RD, while histologic grade III was associated with pCR. Notably, the cohort factor emerged as an independent predictor of RD, with an OR of 3.94 (CI: 1.95 to 8.24, P = 1.7 × 10^−4). The confidence level for this analysis was set at 95% (fig. S2D). It has been recently reported that Chinese breast cancer had remarkable genomic disparities from the Caucasian cohort ([94]16, [95]17). Specifically, the frequency of PIK3R1 (phosphoinositide-3-kinase regulatory subunit 1) mutation was prevalent in Chinese breast cancer regardless of molecular subtypes ([96]16), while the frequency of PIK3CA mutation was reported to be lower in Chinese ER^−HER2^− ([97]17). A multiomic dataset focusing on well-labeled Chinese patients with breast cancer might reveal population-specific disease mechanisms and help to establish alternative models with better performance on East Asian patients. Multiomic profiling of breast cancer samples We collected multiomic data from pretreatment tumor biopsies ([98]Fig. 1B and table S1). Specifically, whole-exome sequencing (WES) was performed on 104 tumor-blood pairs to identify somatic single-nucleotide variants (SNVs), small insertions/deletions (INDELs), and somatic copy number variants (CNVs) from cancer genomes. Moreover, whole-genome bisulfite sequencing (WGBS), RNA sequencing (RNA-seq), and liquid chromatography–mass spectrometry were carried out to profile methylome, transcriptome, and proteome/phosphorylome on 145, 140, and 137 tumor samples, respectively. Among all the enrolled patients, 89 had data from all the above-mentioned omic platforms (fig. S2E). The average sequencing depth of tumor WES data was approximately 600× (fig. S3A). Somatic SNVs and small INDELs were called across all samples using GATK Mutect2 ([99]29), yielding a median load of 74 (ranging from 3 to 1078) nonsynonymous mutations in coding regions (Materials and Methods and table S2). These mutations included previously reported breast cancer driver genes, such as TP53 (n = 71, 75.5%), PIK3CA (n = 32, 34.0%), KMT2D (n = 8, 8.5%), KMT2A and KMT2C (n = 7, 7.4%), GATA3, MED13, and CSMD1 (n = 6, 6%) ([100]Fig. 2A) ([101]30–[102]33). The median variant allele frequency of these driver genes was 0.166 (fig. S3B). Mutational signature analysis demonstrated a predominance of age-related signature (SBS1 and SBS5) and APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) signature (SBS3 and SBS13), suggesting that these processes drove the most of the observed mutagenesis ([103]Fig. 2B) ([104]34). As expected, SBS1 and SBS5 were positively correlated with patient age at diagnosis (fig. S3C). Notably, the SBS3 signature was enriched in the ER^−HER2^− (P = 1.1 × 10^−5) but not the SBS13 (fig. S3D). Further, we inferred the CNV and purity based on the paired WES data and observed no significant difference in tumor purity between pCR and RD (fig. S3E and table S3). The CNV landscape revealed a distinct profile among subtypes (fig. S3F). Afterward, genomic identification of significant targets in cancer 2 (GISTIC2) ([105]35) was applied to find recurrent focal and arm-level copy number alterations in this cohort (fig. S3G). Consistent with previous results, ERBB2, MYC, AIM1, and CCND1 gain/amplification were abundant in the breast cancer cohort ([106]Fig. 2B) ([107]36). Fig. 2. Somatic mutation signature and landscape. [108]Fig. 2. [109]Open in a new tab (A) Somatic mutation landscape in pCR and RD groups, respectively. Only 94 samples with available purity by FACETS were shown. (B) Absolute and relative SNV signature distribution in 94 patients. Patients were clustered on the basis of the relative signature contribution. (C and D) Focal amplification and focal deletion comparison between pCR and RD groups in the ER^−HER2^− subtype. (E) Log[2] ratio of cytoband 6q27 in the ER^−HER2^− subtype. Somatic mutational landscape of pCR versus RD breast cancer genomes To interrogate whether somatic mutations contribute to treatment response, we compared the frequencies of nonsilent somatic alterations in the pCR group versus the RD group. No somatic gene SNV/INDEL was significantly associated with pCR after the false discovery rate (FDR) adjustment. For the CNV, by comparing the focal CNV landscape between the pCR and RD groups in each subtype, we found that the 6q27 copy number was significantly lower in the RD group in the ER^−HER2^− ([110]Fig. 2, C to E; Materials and Methods; and fig. S3H). Further inclusion of more samples might verify this observation and enable the identification of more genetic variants that can predict the treatment outcomes. Methylome comparison between pCR versus RD breast cancer samples Methylation of gene regulatory regions plays a crucial role in regulating gene expression and may influence treatment response. To investigate the relationship between pCR and the methylation landscape, we analyzed WGBS data from 145 tumor samples, focusing on predefined gene regulatory regions and their association with NACT outcome. Initial analysis revealed no significant differences in methylation levels between pCR and RD groups when subtypes were not stratified (fig. S4A). However, consistent with prior studies, we observed significantly higher methylation levels in promoter and enhancer regions in HER2^+ subtypes compared to ER^−HER2^− subtypes ([111]Fig. 3A) ([112]37). When comparing methylation levels between pCR and RD groups within each subtype, we identified elevated methylation in the distal enhancer-like signature region (P = 0.076) and the CCCTC-Binding Factor (CTCF)-only region (P = 0.060) in the ER^+HER2^+ subtype, although these differences did not reach statistical significance ([113]Fig. 3A). These findings suggest potential subtype-specific methylation patterns that may be associated with treatment response, warranting further investigation ([114]Fig. 3A). Fig. 3. Methylation features as predictors for pCR. [115]Fig. 3. [116]Open in a new tab (A) Mean methylation levels in cis-regulatory elements for pCR and RD in each subtype. P values were obtained by comparing pCR versus RD and among subtypes; only P < 0.1 is shown. DNase, deoxyribonuclease. (B) Framework for identifying methylation biomarkers. Genome-wide CpG methylation was segmented into regions, and differential methylation regions (DMRs) were identified using rank sum tests. A representative region was selected for each gene to represent the gene by considering methylation differences (pCR versus RD) and associations with RNA expression. Comparison of representative regions helps identify the differentially methylated genes (DMGs). Genes from the BRCANET list were prioritized. DMRs within the DMGs are prioritized as biomarkers for downstream analysis (C) DMGs for pCR versus RD in each subtype (P < 0.05). BRCANET genes are outlined in black. ∆β indicates mean methylation differences; larger points reflect more significant associations with gene expression. (D) CpG-level methylation for CDKN2A in ER^−HER2^− (pCR versus RD). Top: Mean β values for each CpG site; dashed lines mark ±1000 bp from the transcription start site (TSS). The second panel: Top two DMRs identified through segmentation. The third panel: Rank sum test P values for CpG sites (P < 0.05); larger points indicate greater significance. RefSeq transcripts are shown at the bottom, with red arrows marking the gene start and end. (E) The distribution of methylation level at chr9: 21,996,194 by pyrosequencing in the independent ER^−HER2^− validation cohort. P value was calculated by the Wilcoxon rank sum test. To uncover more specific methylation differences between pCR and RD, we segmented WGBS CpG methylation data into methylation regions for differential methylation analysis ([117]Fig. 3B). To address the high dimensionality of CpG methylation data and reduce false positives, we developed a network-based feature selection method. This approach integrated prior knowledge from biological pathways ([118]38), gene-disease association ([119]39), protein-protein interaction ([120]40), and gene regulation ([121]41) to construct a breast cancer–specific biological network, termed BRCANET ([122]Fig. 3B and Materials and Methods). We further correlated methylation regions with corresponding gene expression to identify methylation-regulated regions, prioritizing those with significant methylation expression anticorrelations and differential methylation between pCR and RD ([123]Fig. 3B and Materials and Methods). Using this framework, we highlighted methylation region candidates within the context of BRCANET. To explore the relationship between methylation and pCR across molecular subtypes, we conducted differential methylation gene analysis (Materials and Methods). Briefly, genome-wide CpG methylation data were segmented into regions, and the top-ranking differentially methylated region (DMR) for each gene was selected as its representative [differentially methylated gene (DMG)] based on RNA-methylation associations and methylation differences stratified by the NACT outcome (Materials and Methods). When analyzing all subtypes collectively, hypermethylation of the FGF1 and SNAI1 genes was associated with the pCR group, while hypermethylation of the MAP4K1 and CDKN2A genes was observed in the RD group (fig. S4B). Separately in the ER^−HER2^−, ER^−HER2^+, and ER^+HER2^+ subtypes, we identified 465, 576, and 2242 hypermethylated genes in the pCR group, compared to 1381, 708, and 208 hypermethylated genes in the RD group, respectively. Using the BRCANET framework, we further refined these findings to 37, 32, and 98 hypermethylated genes in the pCR group and 69, 15, and 11 hypermethylated genes in the RD group for the ER^−HER2^−, ER^−HER2^+, and ER^+HER2^+ subtypes, respectively ([124]Fig. 3C and fig. S4C). Notably, CDKN2A and BIRC5, involved in the cell cycle, were significantly hypermethylated in ER^−HER2^− RD tumors. XRCC2 in ER^−HER2^+ and MAP4K1 and MFNG in ER^+HER2^+ were hypermethylated in RD tumors ([125]Fig. 3C). Conversely, NFIA (nuclear factor I A), BRCA1 (BRCA1 DNA repair associated, involved in DNA repair), and SNAI1 (snail family transcriptional repressor 1, involved in epithelial-to-mesenchymal transition) hypermethylation was associated with pCR in ER^−HER2^− tumors; GADD45A (growth arrest and DNA damage–inducible α), NCOR2 (nuclear receptor corepressor 2), SNAI1, and KIT (KIT proto-oncogene, receptor tyrosine kinase) hypermethylation in ER^−HER2^+ tumors; and AQP1 (aquaporin 1), PPARA (Peroxisome Proliferator–Activated Receptor α), SPOP (speckle type BTB/POZ protein), and JUP (junction plakoglobin, a common junctional plaque protein) hypermethylation in ER^+HER2^+ tumors ([126]Fig. 3C). While these genes exhibited significant P values and relatively large methylation differences between pCR and RD, the limited sample size and the lack of FDR correction raise the possibility of high false positives, complicating their selection for downstream analysis. Nonetheless, among these, CDKN2A in the ER^−HER2^− subtype exhibited a strong β value difference and significant P value, prompting further investigation. CDKN2A methylation as a predictor of pCR in triple-negative breast cancer CDKN2A, a known tumor suppressor gene, is frequently inactivated in various cancers through mechanisms including promoter hypermethylation ([127]42, [128]43). In our analysis, CDKN2A methylation levels were significantly lower in the ER^−HER2^− subtype compared to HER2^+ subtypes (fig. S4D), which was also supported by the TCGA cohort (fig. S4E) ([129]8). Notably, higher CDKN2A methylation levels were specifically associated with RD in ER^−HER2^− tumors, with no such association observed in other subtypes (fig. S4D). This association persisted when stratified by treatment regimens and RCB classes (fig. S4, F and G). CDKN2A methylation exhibited a strong negative correlation with both gene expression (Pearson’s correlation coefficient R = −0.55, P = 7.1 × 10^−05) and protein expression (Pearson’s correlation coefficient R = −0.31, P = 0.042) (fig. S5, A to C). Consistent with these findings, CDKN2A expression was lower in patients with RD, particularly in the ER^−HER2^− subtype (P = 0.0499) (fig. S5D). When stratified by treatment regimen, reduced CDKN2A expression remained evident (fig. S5E). To validate the association between CDKN2A methylation and treatment response, we analyzed six external ER^−HER2^− cohorts with NACT and pCR labels. Given the scarcity of methylation data and the strong negative correlation between CDKN2A methylation and expression in our cohort, we focused on CDKN2A gene expression as a surrogate marker. In five of the six cohorts—Sammut2021 ([130]15), Wolf2022 ([131]44), Hatzis2011 ([132]10), Hoogstraat2022 ([133]45), and BrightNess ([134]46)—CDKN2A expression was consistently higher in patients with pCR compared to patients with RD, although some RD cases also exhibited elevated CDKN2A expression. The Anurag2022 ([135]47) cohort was the exception, showing no significant difference but only a trend (fig. S5, F to K). These findings further support the elevated CDKN2A expression in ER^−HER2^− pCR breast cancer. To further confirm the CDKN2A DMR methylation as a predictor of pCR in ER^−HER2^−, we collected an independent cohort of 75 patients with ER^−HER2^− treated with NACT. We designed a region for bisulfite pyrosequencing based on the methylation profiles in the discovery cohort, using sliding windows of 100 bp. This approach enabled us to identify the optimized CpG region that was most strongly associated with pCR in ER^−HER2^− tumors, which was found near one of the transcription start sites (TSSs) of CDKN2A ([136]Fig. 3D). Pyrosequencing was then performed on formalin-fixed, paraffin-embedded (FFPE) samples from the 75 additionally collected cases (table S4), and 10 cases with WGBS data were used for quality control. After quality control based on consistency between WGBS data and pyrosequencing data, as well as sequencing quality (Materials and Methods), one CpG site (chromosome 9: 21,996,194) was selected for downstream analysis (fig. S6, A and B). Despite the small sample size and FFPE samples, we observed a significant association between hypermethylation and RD in the validation cohort (P = 0.042; [137]Fig. 3E). In the ER^−HER2^+ subtype, highlighted DMGs such as GADD45A, SNAI1, and KIT, whose differential methylation regions were near TSSs (fig. S7A), did not exhibit significant differences in gene expression in independent cohorts (fig. S7B). However, KIT expression was significantly higher in the RD groups in this study cohort (fig. S7B). Upon further examination of the CpG level methylation pattern in this study and the expression in independent external datasets, MAP4K1 hypermethylation in the RD group of ER^+HER2^+ subtypes might be a potential biomarker for NACT outcomes ([138]Fig. 3C, Materials and Methods, and fig. S8A). While MAP4K1 methylation differed significantly between pCR and RD (P = 0.017; fig. S8B) and significantly correlated with its expression (P = 1.4 × 10^−09, Spearman R = −0.49; fig. S8C), no significant expression difference was observed in our cohort (P = 0.805; fig. S8D). Stratified by the treatment regimen, the MAP4K1 methylation difference remained significant and had the same trend in TCb and EC-T regimens, respectively (fig. S8E), suggesting minimal regimen-driven impact. Higher MAP4K1 methylation was observed in each RCB class compared to pCR (fig. S8F). In the Hurvitz2020 cohort ([139]48), the expression difference approached significance (P = 0.059) in the treatment regimen with the largest sample size (fig. S8G). Furthermore, significantly lower MAP4K1 expression in patients with RD from the Sammut2021 (P = 4.7 × 10^−4) and Wolf2022 (P = 3.5 × 10^−3) cohorts supported its association with treatment outcome, aligning with the hypermethylation observed in patients with RD (fig. S8, H and I). In summary, CDKN2A hypermethylation and reduced expression in ER^−HER2^− RD tumors suggest its potential as a predictive biomarker for NACT outcomes in this subtype. KIT and MAP4K1 hypermethylation in patients with RD in the ER^−HER2^+ and ER^+HER2^+ subtypes could also be a candidate for predicting pCR. Subtype-associated transcriptomic predictors of chemosensitivity Using the RNA-seq on 140 patients, we further investigated the association between pCR and gene or pathway expression. Initially, we examined the distribution of gene expression through t-distributed stochastic neighbor embedding (t-SNE) and found that samples belonging to the same subtype showed a trend to cluster together ([140]Fig. 4A). Examination of the t-SNE1 and t-SNE2 component distributions revealed significant differences among the various subtypes (fig. S9A). Stratified by the pCR and RD, only t-SNE2 exhibited a significant difference in the ER^−HER2^+ subtype (fig. S9B). Next, to create expression-based clusters linked to clinical phenotypes, we first conducted a three-way analysis of variance (ANOVA) to associate gene expression with HER2 status, ER status, and pCR status, followed by implementing hierarchical clustering using the genes identified by ANOVA ([141]Fig. 4B and fig. S9, C and D). Fig. 4. Transcriptomic features associated with NACT outcome. [142]Fig. 4. [143]Open in a new tab (A) t-SNE analysis of gene expression. Points are colored by molecular subtypes, with solid points indicating pCR and hollow points representing RD. (B) Analysis of variance (ANOVA)–based clustering of gene expression. The top 100 subtype-associated genes (ANOVA) were used for clustering, with rows showing functional annotations of enriched gene clusters. (C) Differentially expressed genes between pCR and RD in each subtype (DESeq2, P < 0.05). BRCANET genes are highlighted with black borders. Dot size represents P values; FC indicates fold change. (D) Gene set enrichment analysis (GSEA) on the hallmark pathway for each subtype. Pathways were clustered by −log[10] (adjusted P value). Dot size indicates significance; red denotes pCR enrichment, and blue denotes RD enrichment. Subtypes: −/− (ER^−HER2^−), −/+ (ER^−HER2^+), and +/+ (ER^+HER2^+). NES, normalized enrichment score. Next, we conducted differential expression analysis using DESeq2 ([144]49) and identified 1481 and 1431 overexpressed genes in the pCR and RD groups, respectively (fig. S9, E and F). BRCANET highlighted 84 pCR-associated and 45 RD-associated overexpressed genes, including ERBB2, which was overexpressed in pCR (fig. S9, E and F). Gene set enrichment analysis (GSEA) ([145]50) revealed that pCR-associated genes were enriched in proliferation-related pathways (e.g., MTORC1_SIGNALING and E2F_TARGETS) and immune-related pathways (e.g., ALLOGRAFT_REJECTION and INFLAMMATORY_RESPONSE) (fig. S9G). Subtype-specific analysis identified 911, 782, and 584 overexpressed genes in the pCR group and 905, 1056, and 564 in the RD group for ER^−HER2^−, ER^−HER2^+, and ER^+HER2^+ tumors, respectively ([146]Fig. 4C and fig. S10A). Wilcoxon rank sum tests and BRCANET highlighted 59, 38, and 19 pCR-associated overexpressed genes and 34, 39, and 20 RD-associated overexpressed genes in ER^−HER2^−, ER^−HER2^+, and ER^+HER2^+ tumors, respectively ([147]Fig. 4C and fig. S10A). In ER^−HER2^− tumors, pCR was associated with lower expression of KLK3 and AR but higher expression of proliferation-related genes such as CDK1 and MKI67. In ER^−HER2^+ tumors, patients with RD exhibited higher expression of GFRA3 and FOXO3, while in ER^+HER2^+ tumors, patients with RD showed elevated expression of ABCC3, an ABC transporter gene linked to multidrug resistance ([148]Fig. 4C) ([149]51). Pathway enrichment analysis revealed enrichment of cell cycle pathways (e.g., E2F_TARGETS, G2M_CHECKPOINT, MTORC1_SIGNALING) and immune-related pathways (e.g., MYC_TARGETS_V1 and INTERFERON_ALPHA_RESPONSE and INTERFERON_GAMMA_RESPONSE) across subtypes in pCR groups, with varying levels of significance depending on the subtype ([150]Fig. 4D). Accordingly, in the single-sample gene set enrichment analysis (ssGSEA) ([151]52), we extracted sample-level pathway expressions as features (table S5). In ER^−HER2^− tumors, we observed that E2F_TARGETS ssGSEA scores were higher in the pCR group (fig. S10B) and strongly correlated with the Ki67 index (fig. S11A), indicating consistency between data from different platforms in our cohort. In addition, we found that overexpression of the E2F_TARGETS pathway was an exclusive pCR predictor for ER^−HER2^− tumors in our cohort, which was also observed in the TransNEO cohort (fig. S11B). In ER^−HER2^+ tumors, apical surface and estrogen response pathways showed significantly higher expression in the pCR group, although this trend was not statistically significant in the TransNEO cohort (figs. S10B and S11C). Overall, transcriptomic analysis revealed elevated expression of proliferation-associated genes and pathways in ER^−HER2^− tumors achieving pCR. The differentially expressed genes and pathways between the pCR and RD groups vary across molecular subtypes, suggesting the presence of potential subtype-specific mechanisms of treatment resistance or sensitivity. TOP2A as a predictor in triple-negative tumors identified in proteomic analysis To uncover the posttranscriptional differences between patients with pCR and patients with RD, we analyzed protein and phosphoprotein expression data from 137 individuals. After filtering low-expression proteins and normalizing the data (Materials and Methods), we compared 5473 proteins, identifying 152 overexpressed in the pCR group and 156 in the RD group. BRCANET highlighted 13 pCR-associated proteins [e.g., MCM7 (minichromosome maintenance complex component 7), ERBB2 (erb-B2 receptor tyrosine kinase 2), and S100A8 (S100 calcium binding protein A8)] and 8 RD-associated proteins [e.g., VCAN (versican), PTPRG (protein tyrosine phosphatase receptor type G), and EGFR (epidermal growth factor receptor)] (fig. S12A). Among 1737 phosphoproteins, 42 were overexpressed in the pCR group [e.g., RPS6KB1 (ribosomal protein S6 kinase B1), FBLN2 (fibulin 2), MCM2 (minichromosome maintenance complex component 2), and ABCC1 (ATP binding cassette subfamily C member 1)], while 45 were overexpressed in the RD group [e.g., HIC1 (HIC ZBTB transcriptional repressor 1), FBXO4 (F-box only protein 4), and CREBBP (CREB binding protein)] (fig. S12B). Subtype-specific analysis revealed differentially expressed proteins associated with pCR. In ER^−HER2^− tumors, BRCANET highlighted 27 pCR-associated proteins [e.g., MCM7, CDK1 (cyclin dependent kinase 1), PCNA (proliferating cell nuclear antigen), MCM2, and TOP2A (DNA topoisomerase II alpha)] and 20 RD-associated proteins (e.g., HERC2, a DNA damage response regulator; [152]Fig. 5A and fig. S12, C and D) ([153]53). In ER^−HER2^+ tumors, CTR9 and HIF1AN were overexpressed in the pCR group, while STAT5B, essential for breast cancer cell migration, and MSH6, involved in DNA mismatch repair, were elevated in patients with RD. Specifically, MSH6 protein showed a trend of increasing expression corresponding to higher RCB status ([154]Fig. 5A and fig. S12D) ([155]54, [156]55). In ER^+HER2^+ tumors, eight pCR-associated and five RD-associated proteins were highlighted by BRCANET ([157]Fig. 5A and fig. S12C). Fig. 5. Proteomic features associated with NACT outcome. [158]Fig. 5. [159]Open in a new tab (A and B) Differentially expressed proteins (A) and phosphoproteins (B) between pCR and RD in each subtype. P values were calculated using the Wilcoxon rank sum test. Significantly differentially expressed proteins and phosphoproteins (P < 0.05) within BRCANET are outlined in black. Dot size represents P values. FC, fold change. (C) Proteins and phosphoprotein pathway differences between pCR and RD. Pathway activity was assessed using protein and phosphoprotein abundance based on the ssGSEA method. Only significant pathways are shown for the ER^−HER2^− subtype, while the top five pathways are displayed for HER^+ subtypes. P value was calculated using the Wilcoxon rank sum test. (D) Hallmark pathway enrichment analysis of differentially expressed genes in single-omic data (Meth., methylation; RNA, transcription; Pro., protein; Pho., phosphoprotein) and multiomic data between pCR and RD in each subtype using the multiomic integration method. P values less than 1 × 10^−4 are labeled as 1 × 10^−4, with significant results (P < 0.05) marked by the asterisk (*). UV, ultraviolet. For differentially expressed phosphoproteins (DEPPs), ER^−HER2^− tumors showed overexpression of cell cycle–related proteins (e.g., TOP2A, MKI67, MCM2, and RPS6KB1) in the pCR group, while patients with RD exhibited higher expression of HSBP1, HIC1 (linked to chromosomal stability), and MTOR (involved in stress responses) ([160]Fig. 5B and fig. S12, E and F) ([161]56, [162]57). In ER^−HER2^+ and ER^+HER2^+ tumors, only a few DEPPs were identified, such as FBXO4 and ADAM17 in ER^−HER2^+ and SMARCA2 and CTNND1 in patients with ER^+HER2^+ RD ([163]Fig. 5B and fig. S12, E and F). Hallmark pathway analysis using protein and phosphoprotein abundance revealed that patients with ER^−HER2^− pCR exhibited higher expression of cell cycle–related pathways (e.g., E2F_TARGETS and G2M_CHECKPOINT), while patients with RD showed elevated adipogenesis and myogenesis pathways, supported by both protein and phosphoprotein data ([164]Fig. 5C). Higher expression of the adipogenesis protein pathway was independently validated in an independent ER^−HER2^− cohort (fig. S13A). TOP2A, a DNA topoisomerase, was significantly up-regulated at both protein and phosphoprotein levels in patients with ER^−HER2^− pCR (fig. S13B), with a correlated expression pattern across RNA, protein, and phosphoprotein levels (fig. S13C). Its overexpression has been linked to enhanced anthracycline-based therapy response ([165]58). Given the TOP2A difference supported by multiomics, we investigated whether multiomic integration could help identify additional convergent differences between patients with pCR and patients with RD. We conducted multiomic integration and Hallmark pathway enrichment analysis on significantly differentially expressed genes (P < 0.1) across different breast cancer subtypes under pCR and RD conditions ([166]Fig. 5D). Pathway enrichment patterns varied substantially among subtypes. In the ER^−HER2^− subtype, both multiomic and single-omic analyses identified significantly enriched pathways (P < 0.05), primarily associated with cell cycle regulation (e.g., E2F_TARGETS, G2M_CHECKPOINT, and MITOTIC_SPINDLE). For example, we can observe that genes in the E2F_TARGETS pathway exhibit significant and directionally consistent changes across various single-omic datasets (fig. S13D). In contrast, ER^−HER2^+ and ER^+HER2^+ tumors showed more scattered pathway enrichment. Nonetheless, with multiomic integration, we observed potential biologically relevant signals, such as the XENOBIOTIC_METABOLISM pathway in ER^−HER2^+ tumors, which was not significant in single-omic analyses ([167]Fig. 5D). An integrated model to predict patient outcome By comparing patients with pCR and patients with RD across omic layers for each subtype, we identified candidate biomarkers associated with treatment outcomes. The initial prediction of the Chinese breast cancer cohort using the TransNEO machine learning model yielded modest performance [area under the receiver operating characteristic curve (AUC): 0.62 for clinical + RNA versus 0.57 for clinical only; fig. S13E]. To improve predictive power, we sought to develop subtype-specific pCR predictors by integrating multiomic features within a machine learning framework. However, high collinearity between omics and among features (fig. S14, A to C) and the risk of overfitting due to limited sample sizes and numerous features posed challenges for model generalizability. Furthermore, most publicly available datasets primarily contained RNA-seq or array data, limiting the generalizability of models to external validation datasets as well. To address this, we calculated hallmark pathway activity using single-sample GSEA for RNA, protein, and phosphoprotein data. These pathway-level features, combined with clinical information and highly reproducible methylation markers validated in external datasets (Materials and Methods and figs. S15, A to C, and S16A), were used to train a pCR prediction model, ensuring generalizability to independent validation datasets. We developed models using the LightGBM algorithm ([168]59) and evaluated the performance of the MOPCR against single-omic models through 30 repeated fivefold cross-validation ([169]Fig. 6A and Materials and Methods). MOPCR consistently outperformed single-omic models across all subtypes ([170]Fig. 6B and fig. S16, B and C). In the ER^−HER2^− subtype, MOPCR achieved an average accuracy of 0.82, an AUC of 0.98, and an area under the precision-recall curve of 0.97, significantly surpassing single-omic models. Similar improvements were observed in ER^−HER2^+ and ER^+HER2^+ subtypes ([171]Fig. 6B and fig. S16, B and C). Fig. 6. Machine learning predictor of clinical outcome. [172]Fig. 6. [173]Open in a new tab (A) Framework for training and validating the machine learning model. External datasets with missing omic features were imputed using multiomic data from this study before independent validation. (B) Cross-validation performance (accuracy) comparison between single-omic models and multiomic pCR prediction models (MOPCR) across three subtypes. P values (Wilcoxon rank sum tests) compare MOPCR to aggregated results from other methods. (Clin., clinical features; Meth., methylation; RNA, gene expression; Prot., protein expression; Pho., phosphoprotein expression; MOPCR, multiomic model with feature selection.) The mean accuracy for each model was labeled. (C) SHapley Additive exPlanations (SHAP) values for features in the optimized MOPCR model. The dot color represents the feature value. (Lymph, lymph node metastasis; grade 3, tumor grade 3 at diagnosis; tumor size, T3/T4 at diagnosis; Path., pathologic; Meth., methylation; RNA, gene expression; Prot., protein expression; Pho., phosphoprotein expression.) EMT, EPITHELIAL_MESENCHYMAL_TRANSITION. (D) Receiver operating characteristic curves for independent validation in external datasets for each subtype ([174]15, [175]45, [176]46). AUC values are shown in parentheses for each cohort. [FPR, false-positive rate; TPR, true-positive rate; Sammut Dis, discovery dataset from the TransNEO cohort ([177]15); Sammut Val, validation dataset from the TransNEO cohort ([178]15). Only cohorts with >10 cases were included.] (E) Summary, vision, and perspectives of the machine learning predictors in this study. To ensure interpretability, we calculated SHapley Additive exPlanations (SHAP) values ([179]60) to assess feature importance in MOPCR ([180]Fig. 6C). Key features for the ER^−HER2^− model included MTORC1_SIGNALING, E2F_TARGETS, and UV_RESPONSE_UP protein pathways, KI67 index, and CDKN2A methylation. In the ER^−HER2^+ model, KIT methylation status was the most critical feature, followed by tumor grade, tumor size, and RNA-level pathways including EPITHELIAL_MESENCHYMAL_TRANSITION and INFLAMMATORY_RESPONSE. For the ER^+HER2^+ model, the MAP4K1 methylation level was the top feature, followed by RNA-level MYC_TARGETS_V1, phosphoprotein-level UNFOLDED_PROTEIN_RESPONSE, and protein-level HEME_METABOLISM pathways. Notably, methylation features consistently ranked as the top significant predictors in ER^−HER2^+ and ER^+HER2^+ subtypes ([181]Fig. 6C), underscoring the importance of methylation profiling in treatment outcome prediction. Further validation in larger and more diverse cohorts is needed to confirm these findings and explore the causal relationship between methylation markers and treatment outcomes. MOPCR demonstrated that integrating CDKN2A methylation and Ki67 index outperformed using a single feature only in the pyrosequencing cohort, underscoring the value of multiomic integration (fig. S16D). To further evaluate robustness and generalizability, we applied MOPCR to publicly available datasets. For datasets lacking methylation data, we imputed methylation levels using a least absolute shrinkage and selection operator (LASSO) model based on RNA data ([182]Fig. 6A and Materials and Methods). In the TransNEO dataset, MOPCR exhibited better or comparable performance with imputed methylation than with RNA as a replacement of methylation (fig. S16, E to G). Across three independent RNA-seq datasets, AUC values ranged from 0.589 to 0.768 in the ER^−HER2^− subtype. A similar performance could also be observed for the ER^−HER2^+ and ER^+HER2^+ subtypes ([183]Fig. 6D). In the BrightNess study, which included three treatment regimens—(A) paclitaxel + carboplatin + veliparib; (B) paclitaxel + carboplatin + placebo; and (C) paclitaxel + placebo + placebo—MOPCR achieved AUC values of 0.61, 0.60, and 0.55 for arms A, B, and C, respectively. The improved performance in arms A and B, which included carboplatin (a DNA replication–targeting drug) and veliparib (a DNA repair–targeting drug), suggests that the MOPCR ER^−HER2^− model may be more informative for evaluating treatment regimens combining paclitaxel with other cell cycle–targeting drugs (fig. S17, A and B), although this model could not suggest different treatment regimens. Of all the independent RNA-seq cohorts subjected to MOPCR to predict the probability of pCR, the ER^−HER2^− subtype exhibited a significantly higher likelihood in the pCR group than the RD group (fig. S17C), although there were instances of high pCR probability scores in the RD group and low pCR probability scores in the pCR group. Future integration of more high-confident biomarkers using larger datasets might help improve the prediction performance. To facilitate the use of MOPCR by the research communities to achieve the long-term vision and perspective of NACT pCR prediction in breast cancer ([184]Fig. 6E), we developed a website for deploying the prototype models, which can be accessed at [185]http://www.wang-lab-hkust.com:8050. In sum, although further investigation is still needed, our study showed that the combination of multiomic data and clinical information using machine learning techniques can lead to the informative prediction of pCR. DISCUSSION The motivation for predicting the pCR of NACT stemmed from the observed correlation between pCR and improved prognosis ([186]5). In this study, patients achieving pCR generally experienced better outcomes than those with RD, underscoring the importance of predicting NACT response before treatment (fig. S1, D and E). Specifically, patients with RD in the ER^−HER2^− subtype showed poorer overall and disease-free survival, while in the ER^−HER2^+ subtype, RD did not affect overall survival but was associated with earlier disease progression (P = 0.024; fig. S1, D and E). For ER^+HER2^+ patients, where pCR is not directly linked to survival, additional data, such as drug sensitivity from NACT, would be invaluable. Nevertheless, accurate pCR prediction could still help optimize treatment regimens and reduce costs for ER^+HER2^+ patients, although our current sample size limits this capability. Accurate pCR prediction could help mitigate NACT limitations by identifying patients unlikely to benefit. By facilitating patient stratification and personalized treatment, pCR prediction is critical for enhancing survival, quality of life, and cost effectiveness. NACT response is influenced by subtypes, racial factors, and a tumor microenvironment, making prediction challenging ([187]15, [188]25, [189]61). To address this, we conducted comprehensive multiomic profiling of a Chinese breast cancer cohort. The model by Sammut et al. was trained using all subtypes, prompting us to develop subtype-specific predictors tailored to the Chinese population while ensuring generalizability to external cohorts. We identified subtype-specific biomarkers by comparing pCR and RD groups. In the ER^−HER2^− subtype, CDKN2A methylation and cell proliferation were associated with pCR and validated in an independent cohort. While cell proliferation and immune-related pathways have been linked to pCR ([190]15, [191]62, [192]63), our findings highlight their subtype-specific relevance, suggesting distinct response mechanisms across breast cancer subtypes. We developed MOPCR, a subtype-specific multiomic model for pCR prediction, which demonstrated promising accuracy and generalizability. While the model identifies patients unlikely to achieve pCR, it has the potential to inform future clinical decision-making ([193]10). Our study has several limitations. First, the diversity of treatment regimens, resistance mechanisms, and sample size constraints may limit the model’s generalizability to larger, more diverse datasets. Second, to avoid overfitting, not all identified biomarkers from each omic layer were included in MOPCR. Third, the limited sample size restricts subtype-specific stratification by treatment regimen, preventing the model from accounting for regimen-specific effects. In addition, in the machine learning–independent validation, the distribution of some key features in the external datasets was explored to ensure reliability before validation, which means that the validation process is not entirely independent. Last, while our model aims at the long-term vision of application, it remains a research prototype, and further investigation is required before it can be applied in clinical settings. Future directions include expanding sample sizes to uncover additional resistance mechanisms and improve model performance. Single-cell RNA-seq could provide deeper insights into tumor-infiltrating lymphocytes ([194]64–[195]66). Germline BRCA1/BRCA2 mutations and structural variants, detectable via long-read sequencing technologies, warrant further investigation ([196]67–[197]70). In conclusion, our study demonstrates that pCR prediction is enhanced by integrating subtype-specific multiomic features. We identified CDKN2A and MAP4K1 methylation as predictive biomarkers for ER^−HER2^− and ER^+HER2^+ patients and developed MOPCR, an accessible tool for evaluating breast cancer treatment outcomes. This work also provides a comprehensive multiomic dataset for the community. MATERIALS AND METHODS Declaration of generative AI and AI-assisted technologies in the writing process During the preparation of this work, the author(s) used (DeepSeek) to (correct grammar errors and polish the manuscript). After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication. Typical prompt: Below is an introduction/result/method/discussion section of my manuscript aiming at Science Advances. Please help correct the grammar errors and polish it more concisely, coherently, and logically. Patients and ethics statements In this study, we enrolled 149 women with early and locally advanced breast cancer who underwent neoadjuvant treatment according to NCCN (National Comprehensive Cancer Network) guidelines. All participants provided written informed consent before enrollment in the study. Ethical approval for the study was granted by the GDPH review board (research ethics reference: KY-Z-2020-680-02). Molecular subtypes were determined through immunohistochemistry according to the four molecular indicators: ER/PR/HER2/KI67. Eligible patients included those with triple-negative (ER^−/PR^−/HER2^−) and HER2^+ subtypes for further analysis. All the cases had a complete molecular/digital pathology dataset, received more than 4 cycles of chemotherapy and/or targeted therapy, and underwent efficacy assessment. The response was assessed on completion of NACT and surgery by two pathologists, with confirmation from clinicians. pCR was defined as the absence of any residual cancer cell within the resected surgical specimen, including both breast and axillary lymph node specimens, irrespective of ductal carcinoma in situ (ypT0/ is ypN0). Patients with any residual cancer cells within the breast or lymph nodes were classified as RD. Specifically, one patient showing vascular tumor emboli and with evidence of cancer cells in the adipose tissue was classified as RD. For patients with RD, the RCB class was categorized as RCB I, II, or III, as previously reported ([198]15, [199]23, [200]24). In this study, the MD Anderson RCB calculator ([201]www.mdanderson.org/breastcancer_RCB) was used for calculation. Sequence data processing Fresh-frozen pretreatment core needle tumor biopsies were collected from these cases using ultrasound guidance and sent to Novogene Co. Ltd. Sequencing data and proteomic data were generated by Novogene Co. Ltd. WES data were aligned to the human hg38 reference genome using BWA v0.7.8-455 ([202]71) and was subject to marking duplication using Picard. WGBS data were aligned to GRCh38_Ensemble_94 human reference genome obtained from [203]ftp://ftp.ensembl.org/pub/release-94/fasta/homo_sapiens/dna/ using BSMAP v2.90 (with parameters: -n 1 -r 0 -p 4 -S 1234) ([204]72). After sorting the data with Samtools (1.3.1), Sambamba was used to mark duplicates using the command markdup -l 1 -t 10 --sort-buffer-size 16000 --overflow-list-size 10000000. Samples with an average sequencing depth below 20 were subsequently excluded. The remaining BAM files were then analyzed with wgbstools ([205]73) and Bedtools ([206]74) to generate a β value matrix comprising 28,292,717 CpG features. Positions with a depth lower than 10 were filtered out. RNA-seq data were aligned to GRCh38_Ensembl_97 human reference genome obtained from [207]ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/ using STAR v2.0.5 ([208]75), and read counts, fragments per kilobase per million (FPKM), and transcripts per million (TPM) values were calculated with RNA-Seq by Expectation Maximization (RSEM) ([209]76). SNP pairing tests were performed to ensure the paired samples from the same individual were matched using NGSCheckMate ([210]77). Somatic variant calling Somatic mutations were identified from tumor and normal paired WES BAM files using the Mutect2 pipeline and were annotated using the Funcotator embedded in GATK4 ([211]29). Somatic copy numbers and tumor purity were calculated using FACETS with default hg19 version of bam files and parameters ([212]78). Tumor purity was set to not available (NA) if not callable by FACETS. The thresholds of CNAs were 0.26 for gain, 0.80 for amplification, −0.32 for loss, and −0.73 for deletion. Focal copy number alterations were identified using GISTIC2 ([213]35) with hg19 parameters: -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme. Focal somatic copy number difference First, cytobands demonstrating focal amplification and/or deletion from GISTIC2 ([214]35) were aggregated for each subtype, stratified by pCR or RD status. Segment files were annotated using oncotator to extract coding gene-level log[2] ratio. Only genes annotated in more than 80% of samples were kept. The median log[2] ratio of all genes within the relevant cytoband was calculated to represent that cytoband’s log[2] ratio. Specifically, the log[2] ratio, which indicates copy number changes, was determined for each individual using tumor alongside matched blood control WES data via the FACETS tool ([215]78). Last, for each cytoband, the difference in the log[2] ratio between the RD and pCR groups was assessed using the rank sum test for each subtype. Mutation signature analysis Somatic SNVs within coding regions of 94 samples with callable tumor purity were subjected to SigProfilerExtractor to extract the 96 context mutation matrix and perform de novo identification of the mutation signature ([216]79). Then, the de novo signature was aligned and decomposed to the Catalogue Of Somatic Mutations In Cancer (COSMIC) signature to obtain the contribution of each COSMIC signature in each patient ([217]80). Methylation of the gene regulatory region The Encyclopedia of DNA Elements (ENCODE) registry of candidate cis-regulatory elements (CCRE), such as CTCF, trimethylation of histone H3 at lysine 4 (H3K4me3), enhancer domain and enhancer-promoter interaction, and promoter regions ([218]81), were obtained through the UCSC genome browser [219]https://genome.ucsc.edu/. Promoter positions were defined as regions located from 1000 bp upstream of the TSSs to 1000 bp downstream of the TSSs. TSSs were obtained from the UCSC Table Browser ([220]82). Methylation levels of defined gene regulatory regions were calculated as the average methylation β value of CpG sites within regions. Differential methylation analysis CpG methylation level data were segmented into 1,356,835 regions using the wgbstools segment command ([221]73). Regions containing fewer than five CpG sites were excluded. The mean methylation level for each region was calculated per sample using the wgbstools beta_to_table -c 10 command. DMRs were identified for each subtype using the Wilcoxon rank sum test on the resulting methylation matrix. The genomic annotation of DMRs was performed using the ChIPseeker R package ([222]83) with the TxDb.Hsapiens.UCSC.hg38.knownGene database. Regions within the CDKN2A gene were specifically annotated to the CDKN2A locus. For segments annotated to a single gene, prioritization was performed using a two-step approach. First, segments were ranked on the basis of the Spearman correlation P value between segment methylation levels and corresponding RNA expression across all samples, with a focus on identifying significant negative associations. Subsequently, segments were further prioritized by evaluating the methylation difference between pCR and RD groups using the Wilcoxon rank sum test. Representative segments were selected on the basis of the significance level and mean β value differences between pCR and RD to form DMGs. Only DMGs exhibiting high variance (>50%) across all samples were retained for downstream analysis. Last, for verification, CpG-level information and genomic positions of both the representative DMRs and other segments showing significant methylation differences between pCR and RD, but not selected as representatives, were visualized and examined. Biological network BRCANET A list of breast cancer drivers, including genes in the collection named Breast Cancer with ID [223]H00031 in the Kyoto Encyclopedia of Genes and Genomes human diseases database ([224]38), and the elite genes in the collection named Breast Cancer with ID BRS047 in the Malacards database ([225]39) were extracted. The neighbors of these genes from the protein-protein interaction database Search Tool for Retrieval of Interacting Genes/Proteins (STRING) ([226]40) filtered by “experiments scores greater than 700” and the gene regulation database RegNetwork ([227]41) filtered by “organism: human; evidence: experiment; confidence: high” were also extracted as breast cancer–related genes (table S6). TCGA-independent methylation data The TCGA breast cancer subtype information was obtained as previously defined ([228]84), and only three receptor subtypes pertinent to this study were kept. The TCGA CpG site-level methylation data sequenced using the Illumina HM450K platform were downloaded from the LinkedOmics website [229]https://linkedomics.org/login.php ([230]85). For validation of CDKN2A methylation data difference regarding the DMR identified in the ER^−HER2^−, CpG site, cg14069088 (hg38, chr9: 21,996,207 to 21,996,209), overlapping with the CDKN2A DMR was selected. Bisulfite pyrosequencing of CDKN2A promoter Two regions, S1 with five CpGs and S2 with six CpGs, were analyzed by bisulfite pyrosequencing using PyroMark Q96/48 ID. For region S1 with sequence to be analyzed: GGYGGGGGATGGGGAAAGAGAAGTTTGTYGTTTTTTTAGGTTATYGTATYGTTTTAAGTGGGTGGTTYGGT TAGTGTAGGTTTTATTTTTTTT, the forward polymerase chain reaction (PCR) primer was AGGGTTTAGTTTTGTTTTGGGTTTGT, the reverse PCR primer was ACCCCCTTCTCAATAACTTCCTATTCA, and the sequencing primer was GGTAGTAAAGTTAAGTGATGG. For sequencing region S2 with sequence to be analyzed: YGTAAGAGTGATYGTTAAATTTTATTTTTTTTAAAGGGTTTYGTTTTGTTTTGGGTTTGTATYGAGGTTYG GTAGTAAYGTTAAGTGATGGGG, the forward PCR primer was AGTAAGGTGTTGGGTAAATTTTGAGTGAA, and the reverse PCR primer was ACAAACTTCTCTTTCCCCATCC. Ten samples in the discovery cohort were also prepared for bisulfite pyrosequencing to perform quality control, of which nine samples were successfully sequenced for S1 and all samples were successfully sequenced for S2. Sequencing qualities were first used to filter CpG, and sites with more than half of the samples having good quality were kept. Methylation ratios calculated from bisulfite pyrosequencing results were compared with WGBS results, and successful sites were defined as those with Spearman correlation coefficients R > 0.3 between the two platforms. Gene expression–based unsupervised clustering TPM was log-transformed with base 2 after adding a pseudo-count of 0.01. The three-way ANOVA was performed to select genes for clustering, using the Python package statsmodels v0.12.2. Genes ranked top 100 for each term in ANOVA were used for clustering. Principal components analysis was implemented with the Python package Scipy v1.5.2 ([231]86). Hierarchical clustering was performed using the Ward method and Pearson correlation as the distance metric and implemented with the Python package Scipy v1.5.2 ([232]86). Proteomic data processing For the 137 patients who underwent proteomic quantification, proteins or phosphoproteins that were not expressed in more than 90% of samples were filtered out. Expression values were log-transformed with base 2 after adding a pseudo-count of 1. Quantile normalization was applied to the whole cohort to remove batch effects and ensure that all samples share the same distribution of protein/phosphoprotein expression. Differential gene expression, protein, and phosphoprotein analysis Differential analyses were conducted between patients with pCR and patients with RD within each subtype. For protein and phosphoprotein data, Wilcoxon rank sum tests were performed using the Python package Scipy v1.5.2, with fold change quantified by reversing the normalized values of proteins and phosphoproteins from log[2]-transformed format back to their original scale ([233]86). For proteins, a P value smaller than 0.05 in normalized data was considered significant. For phosphoproteins, the difference in mean values between pCR and RD was assessed, with P < 0.05 for normalized data regarded as significant. For gene expression data, DESeq2 ([234]49) was applied to find differentially expressed genes. GSEA for the preranked gene list derived from DESeq2 was conducted using fgsea ([235]87). Single-sample gene set enrichment analysis Gene expression TPM values were adjusted by adding a pseudo-count of 0.01 and then log[2]-transformed. After performing a z-score by gene, ssGSEA scores for Hallmark gene sets ([236]88) were calculated for each sample using GSEApy ([237]52). For public datasets with normalized z-score values, pathway scores were calculated directly for each sample. Integration analysis of multiomic data Differential analysis results for methylation, transcription, protein, and phosphoprotein datasets were obtained as described in the above sections. These results were integrated using the ActivePathways R package (v2.0.5) with default parameters ([238]89). Pathway enrichment analysis was conducted for each omic dataset as well as for the integrated multiomic data using the human Hallmark pathway collection (v2024.1). For the directional integration approach, the relationships between omic datasets were modeled as follows: Methylation was negatively correlated with transcription and protein, transcription was positively correlated with protein, and phosphoproteins were assumed to have no directional correlation with the other omic datasets. Machine learning feature preselection Feature selection was performed using study data to prioritize high-confidence features and pathway-level features to mitigate the potential model overfitting problem. For clinical features, those significantly associated with pCR in multivariate analysis (P < 0.05, logistic regression) were included across all subtypes. The Ki67 feature was significantly associated with pCR in the ER^−HER2^− subtype and was retained. For methylation data, only high-confidence DMGs were selected for each subtype to avoid model overfitting. Transcriptomic features were restricted to the top three significant ssGSEA pathways based on GSEA enrichment results to minimize false positives. For protein and phosphoprotein data, the top three ssGSEA hallmark pathways associated with pCR were selected on the basis of the sample-level comparison. Machine learning model training All discrete features were one-hot–encoded (tumor grade, 0 for T1/T2 and 1 for T3/T4; lymph node metastasis, 0 for no and 1 for yes). All continuous features were transformed to z-scores on the corresponding subtype. A boosted tree algorithm was used and implemented with LightGBM ([239]59). The hyperparameters, the number of estimators T and learning rate α, were tuned in the 30-times-repeated five-fold cross-validation using Scikit-learn ([240]90). The scope for T was set to {50, 100, 200}, and the scope for α was set to {0.05, 0.1, 0.2}. The optimized models were selected with the highest average AUC in repeated cross-validation. The threshold for feature selection was set to 0.3 for all three subtypes. SHAP values ([241]60) were calculated for the model explanation. We implemented a systematic feature selection approach during cross-validation to optimize model stability and predictive performance in our multiomic integration framework. The strategy began with the segregation of clinical features, which were maintained as essential baseline predictors, while nonclinical features were subjected to comprehensive combinatorial analysis. We generated an exhaustive set of feature combinations through power set computation, systematically evaluating each possible combination ranging from individual features to the complete set of nonclinical predictors. Each feature combination was assessed using a rigorous cross-validation framework, where model performance was evaluated on both training and validation sets. The selection criteria prioritized feature combinations that minimized the absolute difference between training and validation set AUC while maintaining validation performance superior to the best single-omic model. To ensure robustness, we tracked the selection frequency of each feature across all cross-validation iterations, ultimately retaining features that demonstrated consistent selection in a certain threshold. This approach effectively balanced the trade-off between model complexity and generalizability while maintaining the biological relevance of selected features. Feature imputation for independent cohorts A LASSO-based imputation method was developed to address missing features during independent validation. The GDPH cohort gene expression matrix was used as the initial dataset. After quantile normalization and log[2] transformation, gene filtering was performed on the basis of mean expression values, and the resulting set was intersected with breast cancer–associated genes. This process yielded several hundred candidate genes, whose expression matrix was subsequently used to construct a LASSO cross-validation model. The model was designed to predict various missing target features, such as CDKN2A methylation status. To assess potential overfitting during cross-validation, ~30% of patient samples were randomly allocated to an independent validation cohort. Model performance was evaluated using Spearman correlation coefficients and their corresponding P values. For those imputed features whose Spearman correlation is between 0 and 0.1 during independent validation, we manually set them NA. Statistical analysis The Wilcoxon rank sum and Student’s t test were used for quantitative pair-wise comparisons. A Kruskal-Wallis test was used for multiple-group quantitative comparision. A chi-square test was used for comparing categorical variables. Pearson or Spearman methods were used for calculating the correlation. The Kaplan-Meier method was used for survival data comparisons. An AUC value was calculated using pROC ([242]87) and plotROC ([243]88) R packages. Acknowledgments