Abstract Background Endometriosis and breast cancer are significant global health burdens affecting women worldwide. Both conditions share notable characteristics including estrogen dependence, progressive growth patterns, recurrence tendencies, and metastatic potential. Despite these biological parallels, the molecular mechanisms connecting these conditions remain incompletely characterized. This study aimed to identify shared gene signatures and underlying molecular processes in breast cancer and endometriosis. Methods Expression matrices for both conditions were obtained from the Gene Expression Omnibus (GEO), UCSC Xena, and the Molecular Taxonomy of Breast Cancer International Consortium. Common differentially expressed genes (DEGs) were identified using the limma package. Comprehensive analyses included Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, machine learning-based diagnostic and prognostic model development, potential therapeutic compound screening, tumor immune microenvironment (TIME) characterization, and hub gene identification with subsequent validation. Results The analysis identified 47 common DEGs between breast cancer and endometriosis. Functional assessment of these genes revealed their involvement in critical biological processes including cell cycle regulation, oxidative stress response, and secretory granule and recycling endosome dynamics. Integration of comprehensive genomic and clinical data led to the development of a prognostic model for breast cancer and a diagnostic model for endometriosis. Conclusion This study provides molecular insights into shared pathogenic mechanisms underlying breast cancer and endometriosis, highlighting common physiological pathways and key regulatory genes. These findings offer novel perspectives for understanding disease pathogenesis and potential therapeutic interventions for both conditions. 
Supplementary Information
The online version contains supplementary material available at 10.1007/s12672-025-02887-4.
Keywords: Breast cancer, Endometriosis, Multi-omics analysis, Machine learning, Hub genes
Introduction
Endometriosis and breast cancer are major global health challenges for women. Endometriosis affects 5–10% of reproductive-age women, with over 176 million cases worldwide [1, 2]. Characterized by ectopic endometrial tissue growth, this condition manifests as pelvic pain, dysmenorrhea, and infertility [3]. It is found in 50–80% of women with pelvic pain and in up to 50% of those experiencing fertility difficulties [2, 4]. The pathogenesis primarily involves retrograde menstruation, wherein endometrial fragments flow into the peritoneal cavity, where they implant and infiltrate pelvic structures [5, 6]. Additional contributing factors include obstructed menstrual flow, extended estrogen exposure (from early menarche or late menopause), genetic predisposition, immune dysfunction, and lifestyle factors. Endometriosis is an estrogen-dependent chronic inflammatory disorder in which molecular alterations in estrogen signaling and inflammatory pathways facilitate both implantation and proliferation of abnormal endometrial tissue [7]. Diagnosis of endometriosis typically involves pelvic examination and ultrasound imaging, though laparoscopy with histopathological confirmation remains the gold standard despite risks including trauma, adhesion formation, and potential impacts on fertility [8]. The biomarker CA125, while elevated in advanced disease, lacks sensitivity for early detection. The absence of reliable peripheral blood or endometrial tissue biomarkers, coupled with the requirement for invasive surgical procedures, often delays diagnosis by 7–11 years, hampering timely intervention [9].
Addressing these limitations is essential for developing non-invasive diagnostic approaches and elucidating the fundamental mechanisms of endometriosis.
Breast cancer accounted for 11.7% of all global cancer cases in 2020, with approximately 2.3 million new diagnoses, and is a leading cause of mortality among women [10]. Risk factors include advancing age, genetic predisposition, history of benign breast disease, endogenous hormone exposure, fertility issues, obesity, and radiation exposure [11]. Diagnostic evaluation comprises comprehensive clinical assessment and detailed imaging (mammography, breast ultrasound), typically confirmed by core biopsy before treatment planning [12]. Research has classified breast cancer into four major molecular subtypes through gene clustering analysis [13]: luminal, human epidermal growth factor receptor 2 (HER2)-enriched, basal-like, and normal breast-like. At the RNA level, subtype differentiation primarily depends on estrogen receptor (ER) activity, ER-associated genes, proliferation drivers, and to a lesser extent, HER2 and genes within the HER2 amplicon on chromosome 17 [14]. Treatment strategies based on diagnostic findings typically include surgery, radiotherapy, chemotherapy, targeted therapy, and endocrine treatment [12, 15]. The heterogeneity of breast cancer is reflected in its multiple clinically relevant mutations; molecular characterization of metastatic disease by next-generation sequencing and mutation analysis can guide targeted therapy, potentially improving prognosis and survival.
Endometriosis and breast cancer share several significant characteristics and risk factors, including estrogen dependence, progressive growth patterns, invasiveness, recurrence, and metastatic potential [16]. Elevated estrogen levels in ectopic lesions of endometriosis patients [17] and endogenous hormone exposure both contribute to increased breast cancer risk.
The infertility associated with endometriosis often results in nulliparity or delayed childbearing, established risk factors for breast cancer [18]. Moreover, common treatments for endometriosis, such as progestins and oral contraceptives, may influence breast health [19]. While research has established a significant association between endometriosis and increased risk of epithelial ovarian cancer [5], evidence linking endometriosis to breast cancer progression remains inconclusive. Further investigation is needed to elucidate the underlying pathological connections and identify shared genetic markers between these conditions, potentially revealing common drug targets and improving treatment strategies for both diseases.
The development of biomarkers for endometriosis and breast cancer that combine high sensitivity with precise specificity remains inadequate. Understanding the biological pathways and molecular networks underlying these diseases is essential for effective screening, prevention, diagnosis, and treatment. In this study, we analyzed datasets from the Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) to investigate the relationship between shared differentially expressed genes in both diseases and their impact on endometriosis diagnosis and breast cancer prognosis. Using machine learning algorithms, we identified 11 signature genes predictive of endometriosis and constructed a three-gene model for breast cancer prognosis. This model was validated with both internal and external datasets, confirming its stability and reliability in predicting outcomes for breast cancer patients. Our findings suggest potential novel biomarkers for endometriosis diagnosis and breast cancer prognostication, while also highlighting possible therapeutic targets.
Materials and methods
Data acquisition
Datasets for endometriosis and breast cancer were obtained from multiple platforms. Two endometriosis datasets, GSE51981 [20] and GSE35287 [21], were acquired from the NCBI GEO. The GSE51981 dataset, generated using the Affymetrix Human Genome U133 Plus 2.0 array (GPL570), contained 77 samples from endometriosis patients and 71 samples from healthy controls. The GSE35287 dataset, used for external validation, was produced with the Affymetrix Human Gene 1.0 ST Array (GPL6244) and included 40 endometriosis and 40 normal samples. Breast cancer datasets from TCGA and METABRIC were obtained from cBioPortal [22] and UCSC Xena [23]. These datasets were generated using the Illumina platform. TCGA comprised 1050 tumor and 98 normal samples, and METABRIC, which served as the external validation cohort, contained 1980 tumor samples.
Data preprocessing
Data from GEO were processed according to previously described methods using the “GEOquery” R package [24]. Gene probes were annotated with gene symbols, and probes lacking symbols or matching multiple symbols were excluded. For duplicate gene symbols, the maximum expression value was retained.
DEGs screening and functional analysis
DEGs were identified in the TCGA-breast cancer and GSE51981 datasets using the “limma” package [25]. Genes with an absolute log fold change (|logFC|) greater than 1 and an adjusted P-value below 0.05 were considered statistically significant. Common DEGs were visualized with a Venn diagram, and their expression patterns were displayed in a heatmap generated using R. Functional enrichment of these genes was analyzed through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways using Metascape [26], with a minimum overlap of 3 and enrichment factor of 1.5. Enrichment results with a P-value below 0.01 were considered statistically significant.
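The DEG thresholds and the intersection step described above can be illustrated with a short sketch. This is not the authors' pipeline (the study used limma in R); it is a minimal Python illustration that assumes each cohort's results are already available as a table of log fold changes and adjusted P-values. The gene names (`GENE_A` and so on) and all numbers are hypothetical placeholders, and the sketch does not check whether a gene changes in the same direction in both diseases.

```python
from typing import Dict, Set, Tuple

# Hypothetical limma-style results: gene symbol -> (logFC, adjusted P-value).
# Gene names and numbers are placeholders, not values from the study.
DegTable = Dict[str, Tuple[float, float]]


def select_degs(table: DegTable,
                lfc_cutoff: float = 1.0,
                padj_cutoff: float = 0.05) -> Set[str]:
    """Keep genes with |logFC| above the cutoff and adjusted P below the cutoff."""
    return {
        gene
        for gene, (log_fc, padj) in table.items()
        if abs(log_fc) > lfc_cutoff and padj < padj_cutoff
    }


breast_cancer: DegTable = {
    "GENE_A": (2.3, 0.001),
    "GENE_B": (-1.4, 0.010),
    "GENE_C": (0.6, 0.001),   # fails the |logFC| threshold
    "GENE_D": (1.8, 0.200),   # fails the adjusted P-value threshold
}
endometriosis: DegTable = {
    "GENE_A": (1.2, 0.030),
    "GENE_B": (-0.4, 0.010),  # fails the |logFC| threshold
    "GENE_C": (1.5, 0.001),
    "GENE_E": (-2.0, 0.004),
}

breast_degs = select_degs(breast_cancer)
endo_degs = select_degs(endometriosis)

# Common DEGs are the set intersection, i.e. the overlap of the Venn diagram.
common_degs = breast_degs & endo_degs
print(sorted(common_degs))
```

Only genes passing both thresholds in both cohorts survive the intersection, which is why the common set is much smaller than either DEG list.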
Characteristic genes in endometriosis
To identify distinctive genes associated with endometriosis, three complementary machine learning techniques were employed: Random Forests (RF), Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression, and Support Vector Machine-Recursive Feature Elimination (SVM-RFE). These methods were selected for their distinctive strengths: LASSO for feature selection and regularization to prevent overfitting, SVM-RFE for effective ranking of gene features, and RF for robust handling of complex interactions. The RF technique was implemented using the “randomForest” package [27]. LASSO logistic regression was conducted with the “glmnet” package [28]; the penalty parameter lambda was tuned by tenfold cross-validation, and the value that minimized the cross-validated partial likelihood deviance was taken as optimal. Genes commonly identified across all models were selected for further analysis. A diagnostic nomogram predicting endometriosis occurrence was generated using the “rms” package. The GSE35287 dataset served as the validation set, with model effectiveness evaluated through receiver operating characteristic (ROC) curves and area under the curve (AUC). The predictive power and clinical utility of the model were further assessed using the concordance index (C-index), calibration curves, and decision curve analysis (DCA).
Establishing prognostic markers in breast cancer
The prognostic relevance of common DEGs was initially assessed through univariate Cox regression analysis, with significance defined at p < 0.05. The prognostic gene set was refined using the stepwise Akaike information criterion (stepAIC) method implemented in the “MASS” package.
Individual patient risk scores were derived using the following equation:
$$\text{Risk score} = \sum_{i=1}^{n} \text{Exp}_i \times \text{Coef}_i$$
where $\text{Exp}_i$ and $\text{Coef}_i$ are the normalized expression level and corresponding regression coefficient of candidate gene $i$, respectively, and $n$ is the number of genes in the signature. Patients were stratified into high- and low-risk categories based on the median risk score as the cutoff value. The efficacy of the gene signature was evaluated through Kaplan–Meier survival plots and ROC curve analyses using the “survminer”, “survival”, and “survivalROC” packages. The prognostic independence of the risk score from other clinical variables in breast cancer patients was determined through both univariate and multivariate Cox regression analyses.
Prognostic characteristics of the tumor microenvironment
This study compared genomic alterations, gene expression patterns, immune microenvironment composition, hypoxia status, tumor stemness scores, and biological functions between risk groups. We used “maftools” and cBioPortal to analyze gene mutations. The abundance of immune cells in each patient sample was determined by single-sample gene set enrichment analysis (ssGSEA), using marker genes for 28 distinct immune cell types as reference [29]. Hypoxia scores were obtained from cBioPortal, and drug sensitivities were predicted using “oncoPredict” [30]. Tumor stemness was evaluated using 26 gene sets from StemChecker [31], employing ssGSEA via the GSVA method to derive stemness enrichment scores. Differential gene expression analysis was performed to compare high- and low-risk groups. Gene set variation analysis (GSVA) was performed using hallmark gene sets from MSigDB v7.5. The resulting enrichment scores, reflecting pathway activity in individual samples, were compared between risk groups using the Wilcoxon rank-sum test. DEGs were identified using thresholds of |logFC| > 1 and FDR < 0.05. These genes underwent GO/KEGG pathway analysis using Metascape.
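The risk score and median-based stratification from the prognostic marker subsection can be sketched as follows. This is an illustrative Python sketch, not the authors' R code. The coefficients are those reported for the three-gene signature in the Results; the patient identifiers and expression values are hypothetical, and assigning a score exactly equal to the median to the low-risk group is an assumption, since the text does not specify how ties are handled.

```python
from statistics import median
from typing import Dict

# Regression coefficients of the three-gene signature as reported in the Results.
COEFFICIENTS: Dict[str, float] = {
    "SHCBP1": 0.2857,
    "PMAIP1": -0.1610,
    "LTF": -0.0534,
}


def risk_score(expression: Dict[str, float]) -> float:
    """Weighted sum of normalized expression levels and their coefficients."""
    return sum(coef * expression[gene] for gene, coef in COEFFICIENTS.items())


def stratify(scores: Dict[str, float]) -> Dict[str, str]:
    """Split patients at the median risk score.

    Assumption: scores strictly above the median are 'high'; ties go to 'low'.
    """
    cutoff = median(scores.values())
    return {pid: ("high" if score > cutoff else "low")
            for pid, score in scores.items()}


# Hypothetical normalized expression values for four made-up patients.
patients = {
    "P1": {"SHCBP1": 2.0, "PMAIP1": 1.0, "LTF": 0.5},
    "P2": {"SHCBP1": 0.5, "PMAIP1": 2.0, "LTF": 3.0},
    "P3": {"SHCBP1": 1.0, "PMAIP1": 1.0, "LTF": 1.0},
    "P4": {"SHCBP1": 3.0, "PMAIP1": 0.5, "LTF": 0.2},
}

scores = {pid: risk_score(expr) for pid, expr in patients.items()}
groups = stratify(scores)
print(groups)
```

Because SHCBP1 carries a positive coefficient and PMAIP1 and LTF carry negative ones, higher SHCBP1 expression raises the score while higher PMAIP1 or LTF expression lowers it.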
Statistical evaluation methods
All statistical analyses were performed using R (version 4.3.1). Prognostic outcomes and survival rates across patient subgroups were analyzed using Kaplan–Meier survival plots and the log-rank test. Normality of data distribution was evaluated using the Shapiro–Wilk test. Due to significant deviation from normal distribution in most variables, non-parametric statistical methods were selected for between-group comparisons. The Wilcoxon rank-sum test was used for two-group comparisons, while the Kruskal–Wallis test was applied for analyses involving multiple groups. The prognostic significance of clinical characteristics within high- and low-risk groups was determined using both univariate and multivariate Cox regression analyses, conducted via the “survival” package in R.
qRT-PCR methodology
Total mRNA was isolated from cellular samples using TRIpure reagent (ELK Biotechnology). Reverse transcription was performed using EntiLink™ 1st Strand cDNA Synthesis Super Mix with the following temperature profile: 5 min at 25 °C, 30 min at 42 °C, and 5 min at 85 °C. Quantitative real-time PCR (qRT-PCR) was conducted using a real-time PCR system (Applied Life Technologies, USA), with relative expression levels calculated using the 2^-ΔΔCT method. Specific primers were used for targeted gene amplification:
H-ACTIN
* Forward: GTCCACCGCAAATGCTTCTA
* Reverse: TGCTGTCACCTTCACCGTTC
H-SHCBP1
* Forward: GGTGCTGGTATAGAAATCTACCCT
* Reverse: GTTTCACCAAGACAACACCATAAC
H-PMAIP1
* Forward: GTGCTACTCAACTCAGGAGATTTG
* Reverse: TCTTTCTTCAAATTGATGAAACGT
H-LTF
* Forward: TGCAAATTTGATGAATATTTCAGTC
* Reverse: CATTGTTATTTCCATCAGTGTTCTG
Western blotting
Cells were lysed using Aspen buffer for total protein extraction. Proteins were separated by SDS-PAGE and transferred to PVDF membranes.
Membranes were blocked with 5% skim milk and incubated with primary antibodies: SHCBP1 (Cat. No. 12672-1-AP, 1:1000, Proteintech), PMAIP1 (Cat. No. PA5-19977, 1:500, Thermo Fisher), LTF (Cat. No. 10933-1-AP, 1:1000, Proteintech), and GAPDH (Cat. No. ab181602, 1:10000, Abcam). After washing, the membranes were incubated with secondary antibodies (1:10000, Aspen). Protein bands were visualized, scanned, and documented. For both PCR and western blotting (WB) experiments, each gene was analyzed in duplicate and all experiments were performed in triplicate. Neither PCR nor WB procedures were conducted under blind conditions. Statistical analysis was performed using SPSS. Differences between groups were assessed using one-way ANOVA and Student’s t-test, with P < 0.05 considered statistically significant.
Results
Identification of common genes associated with endometriosis and breast cancer
Differential expression analysis identified 1,600 DEGs between breast cancer and normal tissue samples in the TCGA-breast cancer cohort, and 179 DEGs between endometriosis and normal tissues in the GSE51981 cohort (Fig. 1A, B). Further analysis revealed 47 common genes associated with both endometriosis and breast cancer in these cohorts (Fig. 1C). Expression profiles of these 47 genes were characterized for both cohorts (Fig. 1D, E). GO/KEGG pathway analysis demonstrated enrichment in biological processes including chromosome segregation, cell cycle regulation, positive regulation of cell cycle phase transition, oxidative stress response, muscle cell development, and secretory granule and recycling endosome dynamics (Fig. 1F, G).
Fig. 1. Differential expression analysis. A Volcano plot of the differential analysis between the normal group and the breast cancer group. B Volcano plot of the differential analysis between the normal group and the endometriosis group. C Venn diagram of the intersecting genes among the differentially expressed genes of breast cancer and endometriosis.
D Heat map of differential analysis between breast cancer and normal group. E Heat map of differential analysis between endometriosis and normal group. F, G The GO terms and KEGG pathway enrichment analysis of common DEGs. GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes
Selection of endometriosis signature genes using machine learning algorithms
Endometriosis biomarkers were identified using three machine learning algorithms: RF, SVM-RFE, and LASSO regression. The RF model identified 22 genes (Fig. 2A), SVM-RFE identified 43 genes (Fig. 2B), and LASSO analysis yielded 18 genes (Fig. 2C, D). Intersection of these results revealed 11 robust core biomarkers (OLFM4, APOBEC3B, BPIFB1, CPM, MSRB3, EZH2, SCGB3A1, F13A1, PTGER3, FOS, and RCAN1) (Fig. 2E). Using the “rms” package, we constructed a diagnostic nomogram for endometriosis (Fig. 2F). A calibration curve showed minimal deviation between predicted and actual risk, supporting the model’s accuracy (Fig. 3A, B). DCA demonstrated that this model provided significant net benefit compared to alternative strategies (Fig. 3C, D). The model exhibited high AUC values in both the training (GSE51981) and external validation (GSE35287) sets, with scores of 0.896 and 0.988, respectively (Fig. 3E, F). These findings supported the strong predictive performance of the diagnostic model.
Fig. 2. Detection of diagnostic markers using machine-learning algorithms in endometriosis. A Biomarker screening based on the RF algorithm. B Biomarker screening based on SVM-RFE. C, D Diagnostic marker screening with the LASSO logistic regression algorithm. E Venn diagram showing the intersection of diagnostic markers obtained by the three algorithms. F Nomogram used to predict the occurrence of endometriosis
Fig. 3. Verification of nomogram model for endometriosis.
A, B Construction of the calibration curve for assessing the predictive efficiency of the nomogram model in both A GSE51981 and B GSE35287. C, D Decision curve analysis of risk prediction nomogram for endometriosis in both C GSE51981 and D GSE35287. E, F ROC curve validation of risk prediction nomogram for endometriosis in both E GSE51981 and F GSE35287
Development and evaluation of breast cancer prognostic models
We developed a prognostic model for breast cancer using univariate Cox regression analysis, which initially identified five genes with significant prognostic impact (p < 0.05). Further refinement through stepAIC analysis yielded three key prognostic genes. The risk score was calculated as: Risk score = (0.2857) × SHCBP1 + (−0.1610) × PMAIP1 + (−0.0534) × LTF (Supplementary Fig. 1). The median risk score served as the cutoff point to stratify patients into high- and low-risk groups, and was applied consistently in the external validation cohort (METABRIC) to assess model generalizability. Based on median signature values, 525 patients were categorized into high- or low-risk groups. In the TCGA cohort, the low-risk (LR) group demonstrated significantly longer overall survival (OS) than the high-risk (HR) group (median duration 215.0 months vs. 115.0 months, p < 0.0001, Fig. 4A). Lower risk scores consistently correlated with improved survival (Fig. 4C). The model’s robustness was confirmed in the independent METABRIC cohort, where LR patients also exhibited superior OS (median time = 167.0 months vs. 145.0 months, P = 0.02, Fig. 4B). These validation findings confirmed the efficacy of the model across multiple datasets. The distribution of risk scores and survival status in the METABRIC cohort is shown in Fig. 4D.
Both univariate and multivariate Cox regression analyses confirmed that the prognostic risk score was independent of other clinical characteristics including age, stage, TNM classification, and radiation therapy in the TCGA-breast cancer cohort (Fig. 4E, F).
Fig. 4. Construction and validation of a prognosis signature for breast cancer. A, B Overall survival in the low- and high-risk score group patients in A TCGA-breast cancer and B METABRIC. C, D Distribution of risk score according to the survival status and time in C TCGA-breast cancer and D METABRIC. E Univariate analysis for the clinicopathologic characteristics and risk score in TCGA-breast cancer. F Multivariate analysis for the clinicopathologic characteristics and risk score in TCGA-breast cancer. StepAIC: stepwise Akaike information criterion
Association between cancer hallmarks and risk groups
We examined correlations between risk scores and immune responses by measuring enrichment scores for immune cell subsets and their associated activities through ssGSEA. The LR group showed greater infiltration by eosinophils, mast cells, natural killer (NK) cells, neutrophils, and plasmacytoid dendritic cells (Fig. 5A). In contrast, the HR group displayed elevated levels of activated CD4 and CD8 T cells, effector memory CD4 T cells, γδ T cells, and regulatory T cells (Fig. 5A). Expression of immune checkpoint genes varied significantly with risk scores. Patients in the LR category exhibited increased expression of NRP1, CD200, and CD44, while those in the HR group showed elevated levels of CD276, IDO1, PDCD1LG2, and TNFRSF9 (Fig. 5B). Cancer stem cell assessment using 26 stemness gene sets revealed higher enrichment scores in the HR group (Fig. 5C). Additionally, HR patients demonstrated elevated hypoxia scores (Fig. 5D) and higher non-synonymous tumor mutation burden (TMB) (Fig. 5E).
Analysis of the 15 most frequently mutated genes revealed distinct mutation patterns between risk groups (Fig. 5F), with significant differences observed for PIK3CA (22% in HR vs. 44% in LR) and TP53 (50% in HR vs. 17% in LR) (Fig. 5G). Further genomic analyses showed that the HR group had significantly higher fraction genome altered (FGA) and distinctive copy number variation (CNV) patterns compared to the LR group (Fig. 5H, I).
Fig. 5. Dissection of tumor microenvironment based on prognosis signature. A Box plot of 28 infiltrated immune cell types calculated by ssGSEA. B Box plot of expression levels of immune checkpoint-associated genes. C Box plot displaying the differences in 26 ssGSEA stemness scores between the low-risk and high-risk groups. D Violin plot of the significantly increased hypoxia score in high-risk patients. E Comparison of tumor mutation burden (TMB). F Oncoplot of mutation, deletion, insertion, and frameshift. G Comparison of different mutation sites of TP53 and PIK3CA. H The score of fraction of genome altered (FGA) in different risk groups. I Copy number variation (CNV) patterns in different risk cohorts. * p < 0.05; ** p < 0.01; *** p < 0.001; **** p < 0.0001
Efficacy of prognostic signature in predicting drug sensitivity
We evaluated associations between our prognostic model and drug responsiveness by measuring IC50 values for various therapeutic agents in breast cancer samples. Differences in IC50 values indicated varying drug sensitivities correlated with risk groups (Fig. 6A). Higher IC50 values for Lapatinib, Temsirolimus, and Vinorelbine in the HR group indicated resistance to these agents, whereas lower IC50 values for Cisplatin, Paclitaxel, and Rapamycin suggested sensitivity (Fig. 6B-G). These findings suggest that Cisplatin, Paclitaxel, and Rapamycin may be useful in treating chemotherapy-resistant breast cancer.
Fig. 6. Efficacy of prognosis signature in predicting drug sensitivity. A Bubble plot of the relationship between drugs and model genes. B-G Boxplots comparing the IC50 of drugs between high- and low-risk groups, and the correlation between IC50 and risk score in the TCGA-breast cancer cohort: B Lapatinib; C Temsirolimus; D Vinorelbine; E Cisplatin; F Paclitaxel; G Rapamycin
Biological characteristics between risk groups
Analysis of the prognostic gene model revealed distinct biological characteristics between risk groups. Differential expression analysis identified 91 genes, visualized in a volcano plot (Fig. 7A). A protein-protein interaction (PPI) network constructed using the Metascape database with the MCODE plug-in (minimum interaction score of 0.7) identified two critical functional modules (Fig. 7B). GO/KEGG pathway analysis linked these genes to diverse biological processes including cell cycle phase transition, mitotic cell cycle regulation, immune response, epithelial cell differentiation, inflammatory response, neuronal apoptotic regulation, supramolecular fiber organization, and cortical actin cytoskeleton dynamics (Fig. 7C-D). GSVA demonstrated significant associations between the HR group and DNA damage repair and cell cycle-related functions (Fig. 7E).
Fig. 7. Biologic functions underlying the breast cancer prognostic model. A Volcano plot showing DEGs (FDR < 0.05 and |log2FC| > 1) between the high-risk and low-risk groups. B PPI network of differentially expressed genes between the high-risk and low-risk groups based on the Metascape website. C, D The GO terms and KEGG pathway enrichment analysis of differentially expressed genes. E Heatmap of GSVA analysis showing different biological functions between the high-risk and low-risk groups.
GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes
Validation of breast cancer prognostic gene expression through qRT-PCR and WB
Expression levels of key prognostic genes were validated using both qRT-PCR and WB in breast cancer and control samples. Results confirmed significantly higher expression of SHCBP1 and PMAIP1 in breast cancer samples, while LTF expression was markedly decreased (Fig. 8A-E). These findings reinforced the potential utility of these genes as biomarkers for predicting breast cancer outcomes.
Fig. 8. The expression of genes was verified by qRT-PCR and Western blotting. A The expression of SHCBP1 between breast cancer group and control group. B The expression of PMAIP1 between breast cancer group and control group. C The expression of LTF between breast cancer group and control group. D Protein expression levels of SHCBP1, PMAIP1 and LTF in breast cancer group 1 and control group. E Protein expression levels of SHCBP1, PMAIP1 and LTF in breast cancer group 2 and control group. * p < 0.05; ** p < 0.01; *** p < 0.001; **** p < 0.0001
Discussion
Endometriosis, a chronic gynecological disorder dependent on estrogen, exhibits traits similar to malignant cells despite its benign classification, including local and distant metastasis with resultant tissue damage [2]. This condition shares several risk factors with breast cancer, including endogenous estrogen exposure, reproductive characteristics, obesity, and hormone replacement therapy. Our study explored these associations, suggesting that identification of common differential genes and construction of prognostic risk models for breast cancer could elucidate shared underlying mechanisms and potentially reveal novel biomarkers for breast cancer prognosis.
Accurate diagnosis of endometriosis remains challenging, often resulting in delays and misdiagnoses [32], highlighting the need for precise clinical diagnostic tools to initiate timely treatment. This investigation employed three machine learning algorithms (RF, LASSO logistic regression, and SVM-RFE) to identify eleven robust core biomarkers: OLFM4, APOBEC3B, BPIFB1, CPM, MSRB3, EZH2, SCGB3A1, F13A1, PTGER3, FOS, and RCAN1. These biomarkers demonstrated high diagnostic accuracy for endometriosis in a diagnostic nomogram, outperforming other strategies and indicating significant clinical utility. OLFM4, an extracellular matrix protein highly expressed in human endometrium [33], is downregulated in endometriosis compared to controls [9]. This protein may stabilize the endometrium and modulate inflammation through negative regulation of M2 macrophages [34]. APOBEC3B, a member of the cytidine deaminase superfamily [35], contributes to DNA mutation by converting cytosine to uracil, potentially increasing the mutational burden in endometriosis [36, 37]; its elevated expression is associated with poorer outcomes in ER-positive breast cancer [38–40]. EZH2, a component of the polycomb repressive complex 2 (PRC2), mediates transcriptional silencing through histone H3 methylation [41, 42]. Hypoxic conditions enhance EZH2 expression, amplifying activity in pathways such as Wnt/β-catenin that are critical in the epithelial-to-mesenchymal transition observed in both breast cancer [43] and endometriosis [44]. BPIFB1 expression is stimulated by estrogen, and elevated levels correlate with negative prognosis in luminal A breast cancer [45, 46]. MSRB3, a protein repair enzyme, is associated with apoptotic cell death in various cancers, including breast cancer [47]. FOS, an immediate response gene, plays a crucial role in estrogen-driven proliferation of endometrial cells [48].
PTGER3, a receptor with high affinity for prostaglandin E2 (PGE2), is upregulated in endometriosis and implicated in tumor-associated angiogenesis, influencing clinical outcomes in various cancers [8, 49]. RCAN1 functions as a tumor suppressor, inhibiting cellular growth and angiogenesis in breast cancer [50]. Secretoglobin family 3A member 1 (SCGB3A1) enhances stem cell characteristics and aggressiveness in breast cancer cells [51]. Carboxypeptidase M (CPM), found on tumor-associated macrophages, may serve as a cancer biomarker [52]. Factor XIII A chain (F13A1) participates in fibrin network stabilization and potentially facilitates tumor matrix formation and progression [53]. These genes may play key roles in the development of both diseases and could serve as targets for future therapies.
Through univariate Cox regression analysis combined with stepAIC, we constructed a prognostic model incorporating three key genes: SHCBP1, PMAIP1, and LTF. This model effectively stratified breast cancer patients into high- and low-risk groups. The HR group demonstrated significantly reduced OS compared to the LR group in both the TCGA-breast cancer cohort and the independent METABRIC validation cohort. Within the TCGA-breast cancer cohort, model-derived risk scores emerged as independent prognostic factors, remaining significant regardless of age, stage, TNM classification, or radiation treatment status. SHCBP1, a member of the SHC protein family, plays vital roles in cell proliferation, migration, adhesion, and cell cycle regulation, contributing significantly to carcinogenesis [54, 55]. In breast cancer, elevated SHCBP1 expression correlates with advanced clinical stages and shorter survival times [54, 56–58]. PMAIP1, a pro-apoptotic member of the BCL-2 protein family, interacts with the p53 pathway to enhance apoptosis [59–62].
It functions as a tumor suppressor, shows elevated expression in breast cancer samples [63], and is critical for paclitaxel response in triple-negative breast cancer [64]. High PMAIP1 mRNA expression represents a positive prognostic marker for relapse-free survival and OS across diverse breast cancer molecular subtypes [64]. LTF, a multifunctional glycoprotein belonging to the transferrin family, exhibits significant anti-tumor properties through mechanisms including inhibition of tumor cell proliferation and promotion of apoptosis or necrosis [65–68]. Pan-cancer analysis shows low LTF expression in tumors, supporting its classification as a tumor suppressor gene [69].
Further analysis revealed distinct patterns in immune cell infiltration and immune checkpoint expression between risk groups. The LR group exhibited increased infiltration by eosinophils, mast cells, and NK cells. Conversely, the HR group showed greater presence of activated CD4 and CD8 T cells, alongside elevated stemness enrichment scores and hypoxia scores, suggesting more aggressive tumor characteristics. Despite this activation pattern, the LR group maintained higher total CD8+ T cell levels with reduced immunosuppressive M2 macrophage presence, potentially explaining enhanced immunotherapy responsiveness. Pharmacogenomic analyses revealed higher predicted Lapatinib IC50 values in the HR group, indicating potential HER2-targeted therapy resistance. The HR group also demonstrated increased non-synonymous mutation burden and aneuploidy, reflecting underlying genomic instability.
This study has several limitations regarding sample size and clinical annotation depth. Future investigations require larger cohorts with comprehensive clinical and longitudinal data to enhance model generalizability and better account for clinical heterogeneity. Collaborations are being established to access well-annotated prospective datasets.
Subsequent studies will implement network-based analyses with experimental validation to elucidate shared gene functions between pathologies. Advanced statistical approaches, including causal inference and propensity score matching, will address potential confounders. While METABRIC provided valuable validation, cohort heterogeneity, processing variations, and treatment history differences necessitate further validation through prospective multi-center studies.
This study identified common genes between endometriosis and breast cancer, facilitating the development of diagnostic and prognostic models. Our diagnostic model, based on 11 core biomarkers, accurately predicted endometriosis onset. The prognostic model, utilizing three genes, effectively stratified breast cancer patients into distinct risk categories that correlated with specific clinical outcomes and biological behaviors. These risk groups exhibited unique immune cell profiles and genomic features, enhancing our understanding of the molecular dynamics underlying both conditions. These insights are essential for advancing personalized diagnostic and treatment approaches.
Electronic supplementary material
Supplementary Material 1 (7MB, docx)
Acknowledgements